Adversarial and adaptive tone mapping operator: multi-scheme generation and multi-metric evaluation
Xingdong Cao, Kenneth K. F. Lai, Michael R. Smith, Svetlana Yanushkevich
Abstract

Tone mapping is one of the main techniques to convert high-dynamic range (HDR) images into low-dynamic range (LDR) images. We propose to use a variant of generative adversarial networks to adaptively tone map images. We designed a conditional adversarial generative network composed of a U-Net generator and patchGAN discriminator to adaptively convert HDR images into LDR images. We extended previous work to include additional metrics such as tone-mapped image quality index (TMQI), structural similarity index measure, Fréchet inception distance, and perceptual path length. In addition, we applied face detection on the Kalantari dataset and showed that our proposed adversarial tone mapping operator generates the best LDR image for the detection of faces. One of our training schemes, trained via 256×256 resolution HDR–LDR image pairs, results in a model that can generate high TMQI low-resolution 256×256 and high-resolution 1024×2048 LDR images. Given 1024×2048 resolution HDR images, the TMQI of the generated LDR images reaches a value of 0.90, which outperforms all other contemporary tone mapping operators.

1. Introduction

The dynamic range of an image is described as the variation of luminance in different parts of the image.1 The majority of real-life images are of low dynamic range (LDR) and are generally represented by an 8-bit integer per pixel format.2 In contrast, high dynamic range (HDR) uses more bits (16/32) to quantify the pixel values. Even though HDR images can better describe a scene, most common 8-bit display methods are not compatible with HDR images. A cost-effective method of displaying HDR images is to convert them into LDR images as opposed to using a 16-bit display setting.

Many tone mapping operators (TMOs) have been proposed and have shown incredible progress in many scenarios. Even though tone mapping is one of the most common ways to perform HDR to LDR conversion, TMOs have many limitations, such as generalization, parameter tuning, expert knowledge, and model instability.

The main research question of this work is: Is it possible to propose a TMO that can adaptively tone-map all HDR images with different contents? In this paper, we seek to answer this question by exploring deep learning techniques. We propose a specific deep learning network, a conditional generative adversarial network (cGAN),3 to adaptively convert an HDR image into an LDR image. Our proposed model is trained on HDR–LDR image pairs containing assorted content, including natural scenarios, indoor/outdoor scenes, regular/irregular geometric shapes, colorful/monochrome objects, and drastic luminance changes.

In general, the implementation of any generative adversarial network (GAN) requires an objective loss function. In deep learning networks, the loss function measures the difference between the output and target images. Common loss functions are the absolute (L1) and squared (L2) errors. In this work, we implement a composite objective consisting of the general cGAN loss, a feature matching loss, and a perceptual loss. Combining these losses allows the proposed adversarial tone mapping operator (adTMO) to learn the distribution of ideally tone-mapped images.

For low-resolution image-to-image translation tasks, cGAN has shown great success in generating high-quality target images.4 However, for high-resolution image-to-image translation tasks, many problems exist. These problems require complex models to combat tiling patterns, local blurring, and saturated artifacts.5,6 One of the main deterrents to using high-resolution images is the amount of resources required for training, specifically the amount of time required for convergence. In our work, we explore the possibility of using low-resolution images to train a cGAN model (“U-Net” G and PatchGAN D). We extended the work on adTMO7 to include additional metrics such as structural similarity index measure (SSIM), perceptual path length (PPL), Fréchet inception distance (FID), and multi-scale structural similarity index measure (MS-SSIM), as well as the performance metrics for face detection. We show that adTMO outperforms most other TMOs when testing on low- and high-resolution HDR images.

This paper aims to design a smart TMO that can adaptively convert complex scenic HDR images into LDR images. The main contributions of our work are listed as follows.

  • 1. We propose adTMO, a variant of cGAN capable of adaptively generating high-resolution and high-quality LDR images.

  • 2. We explore different training and testing schemes, in order to find the best possible combination to generate the highest quality images.

  • 3. We evaluate the performance of adTMO and other TMOs using metrics such as SSIM and FID. In addition, we look at the performance of face detection applied to the different tone-mapped images.

This paper is organized as follows: Section 2 provides a literature review related to TMOs, cGAN, and metrics used for evaluating image-to-image translation tasks. Section 3 describes the architecture of adTMO and the different training/testing schemes we apply. Section 4 details the databases used for training and the preprocessing and postprocessing steps applied to the images. Section 5 summarizes the results of adTMO. Section 6 concludes our paper.

2. Related Work

In this section, we provide a short review of tone-mapping literature, cGAN, and metrics used for evaluating image-to-image translation tasks.

2.1. TMOs

Over the past 20 years, different TMOs have been designed to convert HDR images into LDR images. They can be divided into two categories, global TMOs and local TMOs, based on how they work on image pixels. Global TMOs, such as Larson et al.8 and Drago et al.,9 apply the same function on all pixels of an image. Global TMOs take less time to convert HDR images, but the output LDR images have reduced contrast. Local TMOs, e.g., Chiu et al.10 and Tumblin et al.,11 calculate the output pixel value based on the input and its neighboring pixels. Local TMOs can preserve the local structure and generate good contrast but at a cost of more computation time. In addition, most TMOs can only deal with some specific scenarios and do not generalize well with regard to image content.

2.2. Generative Adversarial Networks

First proposed by Goodfellow et al. in 2014,12 GANs have shown great success in many fields. A GAN consists of a generator model (G) and a discriminator model (D). The goal of G is to generate fake samples that are realistic enough to fool D. The goal of D is to distinguish real samples, drawn from the collected databases, from the fake samples generated by G. By training G and D simultaneously, they compete with each other and reach an equilibrium that allows G to implicitly learn the distribution of real samples from the collected databases, without the need for complex loss functions.

In this paper, we adopt cGAN,3 so that the goal of G changes to generating fake samples under given conditions. Many low-resolution image-to-image translation tasks, such as semantic labels to photos and architectural labels to photos, adopt cGAN to generate target images and achieve satisfactory results.4 Patel et al.13 conducted similar work using cGAN to convert HDR images into LDR images, but they only tested with 256×256 resolution image crops. Complex multi-scale architectures for high-resolution image-to-image tasks were proposed by Wang et al.5 and Rana et al.6 These networks require high-resolution training images and consume substantial resources, including memory and time, to train: it took a week to train the multi-scale network6 using a 12-GB NVIDIA Titan-X GPU on an Intel Xeon e7 core i7 machine.

Due to the downsampling process in the generator of cGAN, it is challenging to preserve the fine details of the input images. A bilateral filter is a common method for edge-preserving and noise-reducing operations and can be adopted to preserve the finer details of an image.14 Porikli15 proposed an optimization of bilateral filtering that runs in constant time O(1). Other proposed methods for preserving edges in images include global image smoothing based on weighted least squares (WLS)16 and the guided image filter.17 Extended work on WLS was conducted by Min et al.18 to create a fast variant that achieves comparable results while requiring much less computational time. The guided image filtering technique was further optimized by incorporating an edge-aware weighting into the guided filter, which greatly reduces halo artifacts in images.19

Zheng et al.20 proposed a hybrid model that combines a model-driven and a data-driven approach to generate higher-quality images. In this paper, we mainly focus on the data-driven approach via the use of cGAN. However, there is immense value in a hybrid model; thus, in future work, we plan to build such a hybrid model by integrating a model-driven portion into our data-driven model.

2.3. Evaluation for Image-to-Image Translation Task

Evaluation of image-to-image translation tasks remains an open question. SSIM was proposed by Wang et al.21 to compare the structural information based on the human visual system. SSIM is commonly used to compare the similarity between the generated images and the ground-truth images. It is defined by Wang et al.21 as follows:

Eq. (1)

\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)},
where μ_x and μ_y are the means of x and y, σ_x^2 and σ_y^2 are the corresponding variances, σ_xy is the covariance of x and y, and C_1, C_2 are constants defined as (0.01L)^2 and (0.03L)^2, respectively (L is the dynamic range of the pixels).
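For illustration, the following is a minimal NumPy sketch of Eq. (1) evaluated globally over two grayscale images; reference SSIM implementations (including Ref. 21) apply the same formula within a sliding Gaussian window, which this sketch omits.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, L: float = 255.0) -> float:
    """Global SSIM of Eq. (1); reference implementations use a local sliding window."""
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
```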

Based on SSIM, a metric called multi-scale structural similarity (MS-SSIM)22 was designed to incorporate the variations of viewing conditions.

FID23 was proposed to capture the similarity between the generated and ground-truth images. To compute FID, both the generated and real images are propagated through a pretrained Inception V3 model,24 and the activations from the last pooling layer are compared. A smaller FID represents higher similarity; an FID of 0 indicates identical feature statistics. The FID is defined as follows:

Eq. (2)

\mathrm{FID} = \lVert\mu_r - \mu_g\rVert^2 + \mathrm{tr}\!\left(\Sigma_r + \Sigma_g - 2\sqrt{\Sigma_r \Sigma_g}\right),
where μ_r and μ_g are the feature means for the real (r) and generated (g) images, Σ_r and Σ_g are the corresponding covariance matrices, and tr is the matrix trace.
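As a hedged illustration, the sketch below evaluates Eq. (2) on Inception V3 pooling features that are assumed to have been extracted already; `real_feats` and `fake_feats` are hypothetical (N, D) arrays of such features.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    """Eq. (2) on pre-extracted Inception V3 pooling features of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)      # matrix square root of Σr·Σg
    if np.iscomplexobj(covmean):            # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2 * covmean))
```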

Similar to FID, PPL25 uses the pretrained VGG1626 as embeddings to calculate the perceptual similarity between two images. As with FID, a smaller PPL means that two images have a greater perceptual similarity.

Evaluating the performance of TMOs is also an open issue for tone mapping. One intuitive solution is subjective evaluation, in which human participants rank LDR images generated by different TMOs according to their subjective preference. Such subjective evaluation takes a lot of time and effort, and the results are unstable across different participant groups.27 Another solution is objective metrics, e.g., the tone-mapped image quality index (TMQI)28 and TMQI-II,29 widely used in tone-mapping optimization studies.6,30 TMQI is an index that combines the naturalness of the tone-mapped LDR image with the structural fidelity between the HDR and tone-mapped LDR images; it is expressed as28

Eq. (3)

\mathrm{TMQI}(H,L) = a\,[S(H,L)]^{\alpha} + (1-a)\,[N(L)]^{\beta},
where H and L denote the original HDR image and the tone-mapped LDR image, and S and N denote the structural fidelity and statistical naturalness measures, respectively. The exponents α and β control the sensitivities of S and N, and 0 ≤ a ≤ 1 adjusts the relative weights between S and N. In this paper, we use the default α, β, and a recommended by Yeganeh and Wang.28
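A minimal sketch of how Eq. (3) combines the two components, assuming the structural fidelity S and statistical naturalness N have already been computed by a TMQI implementation such as that of Ref. 28:

```python
def tmqi_score(S: float, N: float, a: float, alpha: float, beta: float) -> float:
    """Eq. (3): combine structural fidelity S(H, L) and statistical naturalness N(L).
    The default a, alpha, and beta values are those recommended in Ref. 28."""
    assert 0.0 <= a <= 1.0
    return a * (S ** alpha) + (1.0 - a) * (N ** beta)
```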

3. Proposed Method

In this section, we will detail our proposed adTMO to convert HDR images into LDR images, the architecture of our G and D, the objective function we use, and the different training/testing schemes we deploy.

3.1. cGAN-Based adTMO

In this paper, we construct adTMO based on the principle of cGAN3 to translate HDR images into LDR images. Figure 1 shows the training pipeline of our proposed adTMO. We train D using (HDR, LDR) pairs, where D learns to classify the (HDR, RealLDR) pair as real and the (HDR, FakeLDR) pair as fake. G tries to generate a FakeLDR that is realistic enough that D cannot distinguish it from RealLDR. We train G and D simultaneously; specifically, in each iteration we train D twice with the loss weight set to 0.5, once on the (HDR, RealLDR) pair and once on the (HDR, FakeLDR) pair.

Fig. 1. Training pipeline of cGAN. D is trained to distinguish the ground-truth LDR image from the generated LDR image. G is trained to generate an LDR image that is real enough to fool D.
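The following PyTorch sketch outlines one training iteration of this pipeline under stated assumptions: `G`, `D`, and their optimizers are already constructed, D ends with a sigmoid, the non-saturating form of the generator loss is used, and only the adversarial terms are shown (the feature matching and perceptual losses of Sec. 3.3 would be added to the generator loss).

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, hdr, real_ldr):
    """One cGAN iteration: D is updated on the real and fake pairs (0.5 weight each), then G."""
    # --- update D on (HDR, RealLDR) and (HDR, FakeLDR) ---
    fake_ldr = G(hdr).detach()                 # detach so only D is updated here
    pred_real = D(hdr, real_ldr)
    pred_fake = D(hdr, fake_ldr)
    loss_D = 0.5 * F.binary_cross_entropy(pred_real, torch.ones_like(pred_real)) + \
             0.5 * F.binary_cross_entropy(pred_fake, torch.zeros_like(pred_fake))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # --- update G so that D labels the fake pair as real ---
    fake_ldr = G(hdr)
    pred_fake = D(hdr, fake_ldr)
    loss_G = F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```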

3.2. Network Architectures

We adopt the network architectures from Isola et al.,4 where G is a U-Net31 and D is a 70×70 PatchGAN,32 both built from convolution-BatchNorm-LeakyReLU33 blocks with a negative slope of α=0.2.

3.2.1. Generator architecture

Figure 2 shows the architecture of our G, which is a U-Net consisting of one input block, seven encoding blocks, one bottleneck, seven decoding blocks, and one output block. Each encoding block downsamples the feature map to 1/4 of the previous block's size (1/2 of the width and 1/2 of the height) using stride 2, and each decoding block upsamples the previous block by a factor of 4. We added direct connections between the encoding and decoding blocks to preserve some of the finer details that may otherwise be lost during downsampling. This direct connection, also called a skip connection, allows the gradient of later layers to propagate back to earlier layers. Such propagation helps the model learn the mapping between the input and output more efficiently, allowing finer details to be recovered despite the downsampling. For the i'th decoding block, we add a direct skip from the i'th-to-last encoding block and concatenate the two blocks along the channel dimension before applying the LeakyReLU activation function. The filter size is set to 4×4 for all blocks. The number of filters is 64 for the first encoding block and doubles with each subsequent encoding block until it reaches 512, after which it remains unchanged. The number of filters in each decoding block matches that of the encoding block to which it connects. The bottleneck block has 512 filters and a ReLU activation. The output block has one filter and a sigmoid activation. Because G is fully convolutional, it can accept images of different sizes.
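A condensed PyTorch sketch of a U-Net generator of this shape is given below; the input block and bottleneck are folded into the encoder/decoder stacks and the activation ordering is simplified, so it is indicative rather than a faithful reimplementation of Fig. 2.

```python
import torch
import torch.nn as nn

def enc_block(in_ch, out_ch, norm=True):
    """Convolution-BatchNorm-LeakyReLU encoding block (4x4 filters, stride 2)."""
    layers = [nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1, bias=not norm)]
    if norm:
        layers.append(nn.BatchNorm2d(out_ch))
    layers.append(nn.LeakyReLU(0.2, inplace=True))
    return nn.Sequential(*layers)

def dec_block(in_ch, out_ch):
    """Transposed-convolution-BatchNorm-ReLU decoding block (4x4 filters, stride 2)."""
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True))

class UNetGenerator(nn.Module):
    """Fully convolutional U-Net sketch operating on a single luminance channel."""
    def __init__(self):
        super().__init__()
        self.downs = nn.ModuleList([
            enc_block(1, 64, norm=False),      # input block
            enc_block(64, 128), enc_block(128, 256), enc_block(256, 512),
            enc_block(512, 512), enc_block(512, 512), enc_block(512, 512),
            enc_block(512, 512, norm=False),   # innermost block acts as the bottleneck
        ])
        self.ups = nn.ModuleList([
            dec_block(512, 512), dec_block(1024, 512), dec_block(1024, 512),
            dec_block(1024, 512), dec_block(1024, 256), dec_block(512, 128),
            dec_block(256, 64),
        ])
        self.out = nn.Sequential(nn.ConvTranspose2d(128, 1, 4, stride=2, padding=1),
                                 nn.Sigmoid())  # output block: one filter, sigmoid

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)
        skips = skips[:-1][::-1]  # skip connections, innermost first (exclude bottleneck)
        for up, skip in zip(self.ups, skips):
            x = torch.cat([up(x), skip], dim=1)  # concatenate along the channel dimension
        return self.out(x)
```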

Fig. 2. Architecture of the U-Net generator with one input block, seven encoding blocks, one bottleneck block, seven decoding blocks, and one output block. There is a direct skip connecting each encoding–decoding pair.

3.2.2. Discriminator architecture

Figure 3 shows the architecture of our D, a 70×70 PatchGAN consisting of one input layer, five encoding blocks, and one output block. The input layer concatenates the input HDR and LDR images along the color channel. Each of the first four encoding blocks downsamples the feature map to 1/4 of the previous block's size using stride 2. The last encoding block uses stride 1, leaving the spatial size unchanged. The numbers of filters in the five encoding blocks are 64, 128, 256, 512, and 512, respectively. The output block has one filter with stride 1 and a sigmoid activation, and outputs a 16×16 matrix. Each value in the output matrix maps to a 70×70 receptive field in the input layer, identifying that patch as either real or fake.
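A corresponding PatchGAN discriminator sketch in PyTorch follows; single-channel (luminance) HDR and LDR inputs are assumed, and the exact output grid size (roughly 16×16 for a 256×256 input) depends on the padding choices.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN discriminator sketch: the HDR and LDR luminance maps are concatenated
    along the channel axis; each output value classifies one receptive-field patch."""
    def __init__(self):
        super().__init__()
        def block(in_ch, out_ch, stride, norm=True):
            layers = [nn.Conv2d(in_ch, out_ch, 4, stride=stride, padding=1, bias=not norm)]
            if norm:
                layers.append(nn.BatchNorm2d(out_ch))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return nn.Sequential(*layers)
        self.model = nn.Sequential(
            block(2, 64, 2, norm=False),   # input layer: concatenated (HDR, LDR)
            block(64, 128, 2),
            block(128, 256, 2),
            block(256, 512, 2),
            block(512, 512, 1),            # last encoding block keeps the spatial size
            nn.Conv2d(512, 1, 4, stride=1, padding=1),
            nn.Sigmoid())                  # per-patch real/fake probability map

    def forward(self, hdr, ldr):
        return self.model(torch.cat([hdr, ldr], dim=1))
```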

Fig. 3. Architecture of the PatchGAN discriminator. Each value in the output matrix identifies a 70×70 receptive field in the input layer as either real or fake.

3.3. Objective Function

As discussed earlier, the goal of G is to convert an HDR image into its tone-mapped LDR version, and the goal of D is to distinguish the generated LDR image from the ground-truth LDR image. The objective of cGAN3 can therefore be written as

Eq. (4)

L_G(G,D) = \mathbb{E}_{x}\,\log\big(1 - D(x, G(x))\big),
L_D(G,D) = -\mathbb{E}_{(x,y)}\,\log D(x,y) - \mathbb{E}_{x}\,\log\big(1 - D(x, G(x))\big),
where G tries to minimize L_G(G,D), and D tries to minimize L_D(G,D).

In addition to the cGAN loss, we incorporated a feature matching loss L_FM based on D. We extract features from multiple layers of D and match these intermediate representations between the real and generated LDR images, i.e., we minimize the difference between the features via the L1 norm:

Eq. (5)

L_{FM}(G,D) = \mathbb{E}_{(x,y)} \sum_{i=1}^{M} \frac{1}{U_i}\,\big\lVert D^{(i)}(x,y) - D^{(i)}(x, G(x)) \big\rVert_1,
where D^{(i)} denotes the i'th layer of D with U_i activations, and M is the number of selected layers of D. In this experiment, we chose the five convolution layers in the five encoding blocks of D.

Additionally, we appended the perceptual loss L_prp used by Johnson et al.,34 which compares features computed from layers of the pretrained Inception V3 network,24 given by

Eq. (6)

L_{prp}(G) = \mathbb{E}_{(x,y)} \sum_{i=1}^{N} \frac{1}{V_i}\,\big\lVert F^{(i)}(y) - F^{(i)}(G(x)) \big\rVert_1,
where F^{(i)} denotes the i'th layer of the Inception V3 network with V_i activations, and N is the number of selected layers. In this experiment, we empirically chose five activation layers of the Inception V3 network as F to calculate L_prp.

With L_FM and L_prp, we are able to keep both low-level image characteristics and high-level perceptual information. Combining these losses, our final objective is expressed as

Eq. (7)

G_{loss} = L_G(G,D) + \alpha\,L_{FM}(G,D) + \beta\,L_{prp}(G),
D_{loss} = L_D(G,D),
where α and β control the weights of L_FM and L_prp relative to the cGAN loss. Here we set α=10 and β=10, as recommended by Rana et al.6
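The sketch below shows one way the generator objective of Eq. (7) could be assembled; `d_features(x, y)` and `percep_features(img)` are hypothetical helpers returning lists of intermediate feature maps from D and from the pretrained perceptual network, and the per-layer 1/U_i and 1/V_i normalizations are absorbed into the mean-reduced L1 losses.

```python
import torch
import torch.nn.functional as F

def generator_loss(D, hdr, real_ldr, fake_ldr, d_features, percep_features,
                   alpha: float = 10.0, beta: float = 10.0):
    """Eq. (7): adversarial term + alpha * feature matching (Eq. 5) + beta * perceptual (Eq. 6)."""
    pred_fake = D(hdr, fake_ldr)
    adv = F.binary_cross_entropy(pred_fake, torch.ones_like(pred_fake))  # non-saturating form
    fm = sum(F.l1_loss(fr, ff) for fr, ff in
             zip(d_features(hdr, real_ldr), d_features(hdr, fake_ldr)))
    prp = sum(F.l1_loss(fr, ff) for fr, ff in
              zip(percep_features(real_ldr), percep_features(fake_ldr)))
    return adv + alpha * fm + beta * prp
```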

3.4. Training and Testing

We deploy different training and testing scheme combinations to achieve better performance.

3.4.1. Training

We adopt three training schemes.

  • Training scheme A (see the purple box in Fig. 4). All HDR images were resized to 256×256 resolution, and TMOs were used to generate tone-mapped LDR images. The resulting 748 HDR–LDR image pairs were used to train our adTMO.

  • Training scheme B (see the blue box in Fig. 4). This scheme required resizing the HDR images to 1024×1024 resolution and using TMOs to generate tone-mapped LDR images. The next step was to randomly crop corresponding 256×256 resolution regions from the HDR and LDR images. We generated 23,936 HDR–LDR image pairs to train adTMO.

  • Training scheme C. The resized and cropped 256×256 resolution images from training schemes A and B were combined to provide altogether 24,684 training pairs.

Fig. 4. The purple and blue boxes, respectively, show how we generate training pairs for training schemes A and B.

All training schemes used 256×256 resolution images as the training database, so the training process took less time and fewer resources than using high-resolution images. The Adam optimizer35 was used for all three schemes, with learning rate = 0.0002, β1 = 0.5, and β2 = 0.999. We set the batch size to 1 and trained until the loss converged. Training was performed on an NVIDIA GeForce RTX 2080, and each training run finished within 30 h, which is much shorter than the one-week training time of the multi-scale network proposed by Rana et al.6

3.4.2. Testing

We deploy different testing schemes to evaluate the performance of our proposed adTMO.

  • Testing scheme W (see the red box of Fig. 5). Test with resized 256×256 resolution images: we resized the original HDR images to 256×256 resolution, fed them into G, and generated the target LDR images.

  • Testing scheme X (see the blue box of Fig. 5). Test with resized 1024×2048 resolution images: we resized the original HDR images to 1024×2048 resolution, fed them into G, and generated the target LDR images.

  • Testing scheme Y (see the brown box of Fig. 5). Test with cropped 256×256 resolution images: we cropped the 1024×2048 resolution HDR images into 256×256 resolution pieces, fed them into G, and generated the target LDR pieces.

  • Testing scheme Z (see the purple box of Fig. 5). Test with 4×8 concatenated cropped 256×256 resolution images: we cropped the 1024×2048 resolution HDR images into 32 pieces of 256×256 resolution, fed them into G to generate the target LDR pieces, and then concatenated the pieces back into the complete 1024×2048 resolution images (a tiling sketch is given below).
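A minimal sketch of the tiling used in testing scheme Z, assuming a hypothetical `generator` callable that tone-maps a single 256×256 luminance crop:

```python
import numpy as np

def tone_map_by_tiles(hdr_lum: np.ndarray, generator, tile: int = 256) -> np.ndarray:
    """Tile a 1024x2048 luminance map into 256x256 crops, tone-map each crop
    independently, then concatenate the results back into the full image."""
    h, w = hdr_lum.shape
    assert h % tile == 0 and w % tile == 0, "dimensions must be multiples of the tile size"
    out = np.zeros_like(hdr_lum)
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            out[r:r + tile, c:c + tile] = generator(hdr_lum[r:r + tile, c:c + tile])
    return out
```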

Fig. 5. The red, blue, brown, and purple boxes, respectively, show the process of testing schemes W, X, Y, and Z.

4. Experimental Setup

In this section, we detail the HDR image databases collected and how we pre- and postprocessed them.

4.1. Databases

From the many open-source HDR image databases accessible online, we selected ours based on content diversity, usability, resolution, and quality. Table 1 summarizes the HDR image databases we used, the majority of which are high resolution. We used 105 images from Kalantari and Ramamoorthi45 to test adTMO and 748 images from the other 10 databases in Table 1 to train adTMO.

Table 1. HDR image databases.

Database | # Images | # Pixels per image (×10^6)
Ref. 36 | 88 | 0.5
Ref. 37 | 92 | 1.8
Ref. 28 | 26 | 0.6
Ref. 38 | 44 | 14.5
Ref. 39 | 224 | 3.2
Ref. 40 | 15 | 0.3
Ref. 41 | 64 | 1.5
Ref. 42 | 8 | 2.9
Ref. 43 | 7 | 2.4
Ref. 44 | 180 | 12.9
Ref. 45 | 105 | 11.1

4.2. Resizing

We used two collections of 256×256 resolution images for training. The first set consisted of the original images resized to 256×256 resolution (training scheme A), whereas the second set was randomly cropped from resized 1024×1024 images (training scheme B); a preprocessing sketch is given below. For testing purposes, we resized HDR images to two resolutions: 256×256 and 1024×2048.
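A sketch of how the two kinds of 256×256 training inputs could be produced with OpenCV is shown below; the interpolation settings are an assumption rather than the exact ones used.

```python
import cv2
import numpy as np

def resized_pair(hdr: np.ndarray, ldr: np.ndarray, size: int = 256):
    """Training scheme A: resize the full-resolution pair to size x size."""
    return (cv2.resize(hdr, (size, size), interpolation=cv2.INTER_AREA),
            cv2.resize(ldr, (size, size), interpolation=cv2.INTER_AREA))

def random_crop_pair(hdr: np.ndarray, ldr: np.ndarray, size: int = 256, big: int = 1024):
    """Training scheme B: resize to big x big, then take the same random size x size crop."""
    hdr_b = cv2.resize(hdr, (big, big), interpolation=cv2.INTER_AREA)
    ldr_b = cv2.resize(ldr, (big, big), interpolation=cv2.INTER_AREA)
    r = np.random.randint(0, big - size + 1)
    c = np.random.randint(0, big - size + 1)
    return hdr_b[r:r + size, c:c + size], ldr_b[r:r + size, c:c + size]
```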

4.3. Target LDR Images Generation

All the collected HDR images were unlabeled, i.e., the ground-truth LDR images were unknown. To solve this problem, for each HDR image we applied 30 different TMOs from the MATLAB HDR TOOLBOX46 to obtain 30 LDR image candidates, and followed the toolbox's suggestion to apply GammaTMO after tone mapping because some specific TMOs require gamma encoding. From these 30 candidates, we selected the one with the highest TMQI as the ground-truth LDR image (a selection sketch follows Table 2). Table 2 summarizes the performance of each TMO when applied to the resized 256×256 HDR images: the average TMQI over the whole training set and the number of LDR images with the highest TMQI among the 30 candidates. The last row tabulates the average TMQI of the selected 748 target LDR images. Among the TMOs provided by the MATLAB HDR TOOLBOX, WardHistAdjTMO reaches the highest average TMQI and provides the most ground-truth LDR images (124 images). Apart from RamanTMO, which contributed 0 ground-truth images, all other TMOs provided at least one image for the ground-truth set.

Table 2. TMO performance in tone-mapping 256×256 HDR images.

TMO | TMQI | # LDR images with highest TMQI
AshikhminTMO | 0.83±0.07 | 23
BanterleTMO | 0.89±0.04 | 24
BestExposureTMO | 0.88±0.05 | 12
BruceExpoBlendTMO | 0.85±0.06 | 9
ChiuTMO | 0.86±0.06 | 28
DragoTMO | 0.89±0.04 | 18
DurandTMO | 0.87±0.07 | 39
ExponentialTMO | 0.84±0.03 | 1
FerwerdaTMO | 0.80±0.11 | 13
GammaTMO | 0.75±0.15 | 15
KimKautzConsistentTMO | 0.90±0.05 | 37
KrawczykTMO | 0.88±0.07 | 30
KuangTMO | 0.90±0.05 | 25
LischinskiTMO | 0.93±0.04 | 89
LogarithmicTMO | 0.82±0.07 | 18
MertensTMO | 0.83±0.06 | 5
NormalizeTMO | 0.88±0.07 | 19
PattanaikTMO | 0.73±0.09 | 1
RamanTMO | 0.80±0.05 | 0
ReinhardDevlinTMO | 0.86±0.07 | 17
ReinhardTMO | 0.92±0.04 | 60
SchlickTMO | 0.77±0.10 | 2
TumblinTMO | 0.83±0.08 | 16
VanHaterenTMO | 0.76±0.09 | 2
WardGlobalTMO | 0.81±0.08 | 6
WardHistAdjTMO | 0.92±0.04 | 124
YPFerwerdaTMO | 0.86±0.06 | 38
YPTumblinTMO | 0.80±0.07 | 6
MATLAB tonemap function | 0.89±0.05 | 75
Target LDR images | 0.96±0.02 | 748
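A sketch of the ground-truth selection step, assuming hypothetical helpers `apply_tmo(hdr, name)` and `tmqi(hdr, ldr)` that wrap the MATLAB HDR Toolbox operators and a TMQI implementation:

```python
import numpy as np

def select_ground_truth(hdr: np.ndarray, tmo_names, apply_tmo, tmqi):
    """For one HDR image, tone-map with every candidate TMO and keep the LDR
    image with the highest TMQI as the training target."""
    best_ldr, best_score = None, -np.inf
    for name in tmo_names:
        ldr = apply_tmo(hdr, name)          # hypothetical wrapper around the HDR Toolbox
        score = tmqi(hdr, ldr)              # hypothetical TMQI implementation
        if score > best_score:
            best_ldr, best_score = ldr, score
    return best_ldr, best_score
```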

This approach to generating target LDR images is similar to the one proposed by Cai et al.47 for generating high-contrast images. Both works aim to reproduce satisfactory, natural-looking LDR images. However, whereas we focus on keeping the structural similarity to the HDR images and retaining color naturalness, Cai et al. aimed to produce a high-contrast image from an under-/over-exposed image. The two approaches also differ in how the “ground-truth” target image is selected: we use an objective metric, TMQI, to select the ground-truth LDR image, whereas Cai et al. used subjective ranking to select the ground-truth high-contrast image.

4.4. Normalization

We linearly normalized the pixel values of the input HDR and LDR images to [0, 1]. For input HDR images, min/max normalization was applied:

Eq. (8)

v_{out} = \frac{v_{in} - v_{min}}{v_{max} - v_{min}},
where v_max and v_min are the maximum and minimum pixel values of the input HDR image, respectively. For input LDR images, we applied v_out = v_in/255 so that their pixel values are also in the range [0, 1].
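A minimal sketch of both normalizations:

```python
import numpy as np

def normalize_hdr(hdr: np.ndarray) -> np.ndarray:
    """Eq. (8): min/max normalization of an HDR luminance map to [0, 1]."""
    v_min, v_max = hdr.min(), hdr.max()
    return (hdr - v_min) / (v_max - v_min)

def normalize_ldr(ldr: np.ndarray) -> np.ndarray:
    """Scale an 8-bit LDR image to [0, 1]."""
    return ldr.astype(np.float32) / 255.0
```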

4.5. Luminance Extraction and Color Reproduction

When training and testing our proposed adTMO, we used the luminance channel rather than the RGB channels of the input images to reduce the computational complexity and memory requirements. Before training, we extracted the luminance channel as a weighted sum of the RGB channels, with weights taken from Ref. 6:

Eq. (9)

L = 0.2959\,C_R + 0.5870\,C_G + 0.1140\,C_B.

After generating the luminance channel with G, we used C_out = C_in · L_out/L_in to reproduce the RGB channels, where L_in and L_out are the input and output luminance channels, respectively, and C_in and C_out are the RGB channels of the original HDR image and of the generated LDR image after color reproduction. After color reproduction, any pixel values larger than 255 were clipped to 255 to maintain the 8-bit RGB range.
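A sketch of the luminance extraction of Eq. (9) and the color reproduction step, assuming RGB channel ordering and that the generated luminance has already been scaled back to the 8-bit range:

```python
import numpy as np

def luminance(rgb: np.ndarray) -> np.ndarray:
    """Eq. (9): weighted sum of the RGB channels (rgb has shape (H, W, 3), R-G-B order assumed)."""
    return 0.2959 * rgb[..., 0] + 0.5870 * rgb[..., 1] + 0.1140 * rgb[..., 2]

def reproduce_color(hdr_rgb: np.ndarray, lum_in: np.ndarray, lum_out: np.ndarray) -> np.ndarray:
    """Scale the original RGB channels by the luminance ratio and clip to the 8-bit range."""
    eps = 1e-8                                      # avoid division by zero (assumption)
    ratio = (lum_out / (lum_in + eps))[..., None]
    out = hdr_rgb * ratio
    return np.clip(out, 0, 255).astype(np.uint8)
```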

5. Results

In this section, we discuss the results of our proposed adTMO, in terms of multiple metrics of the generated LDR images in different training/testing schemes.

Figure 6 demonstrates one scenario of LDR content in the RGB channels after color reproduction, under the different training/testing schemes. We omit the generated LDR content for testing scheme Y because those images were used to construct the images of testing scheme Z. The LDR images of testing scheme W [(a), (d), and (g)] have higher TMQI, but such a conversion is meaningless, as many details are lost in the resizing operation. The LDR images of testing schemes X and Z under training scheme A [(b), (c)] have lower TMQI, with shadows around the flowers, because adTMO was trained only on resized 256×256 images, so many fine details of the original images were lost. After we added cropped images to the training database, adTMO was able to learn how to keep the details of the original images. Therefore, the LDR images of testing scheme X under training schemes B and C [(e) and (h)] look more natural and have higher TMQI. The LDR images of testing scheme Z [(c), (f), and (i)] show “concatenated” edges, because cropping a complete image into pieces and generating their tone-mapped LDR images individually breaks the internal connections between these pieces. Future work is required to generate these individual images and combine them in such a way that the edges are removed while maintaining the high contrast of each individual piece. Some finer details are also not well preserved by the proposed adTMO. It should be noted that edge-preserving techniques such as bilateral filtering or guided image filtering have shown great promise in alleviating this problem. Further experimentation is required, and in the future we plan to incorporate these techniques into a deep learning-based TMO to create a more robust operator.

Fig. 6. The RGB channels of LDR images generated by adTMO after color reproduction. (a)–(c) are based on training scheme A; (d)–(f) are based on training scheme B; and (g)–(i) are based on training scheme C. (a), (d), and (g) are based on testing scheme W; (b), (e), and (h) are based on testing scheme X; and (c), (f), and (i) are based on testing scheme Z.

We chose training scheme C to train the proposed adTMO, testing scheme W to tone-map 256×256 resolution images, and testing scheme X to tone-map 1024×2048 resolution images, given that training scheme C has the largest training set and the resulting LDR images [(g) and (h)] have higher TMQI.

In Fig. 7, we demonstrate qualitative comparisons of adTMO against the top-9-ranked TMOs (those producing the highest TMQI) for four different scenarios when generating 1024×2048 resolution images. In most scenarios, including indoor/outdoor scenes, irregular geometric shapes, large color ranges, and drastic luminance changes, our adTMO outperforms all other TMOs on the TMQI metric. Moreover, the LDR images generated by adTMO do not suffer from the contrast problems seen in the other LDR images. Tables 3 and 4 list the metrics described in Sec. 2 for the test dataset tone-mapped by the 30 TMOs and the proposed adTMO. We modified the PPL so that it can be used to evaluate TMOs. Specifically, the PPL is calculated as follows:

Eq. (10)

\mathrm{PPL} = \mathbb{E}\left[\frac{1}{\epsilon^2}\, d\Big(g\big(\mathrm{lerp}[f(z_1), f(z_2); t]\big),\; g\big(\mathrm{lerp}[f(z_1), f(z_2); t + \epsilon]\big)\Big)\right],
where f(z) represents the function mapping the latent space to the style vector in adTMO, t is uniformly distributed between 0 and 1, lerp denotes linear interpolation, g is the generator function that creates the image, d measures the perceptual distance between the images, and ϵ is set to 10^{-4} here. In generating 256×256 resolution images, our proposed adTMO outperforms all other TMOs on the FID metric and outperforms most TMOs on the other metrics. In generating 1024×2048 resolution images, our proposed adTMO outperforms all other TMOs on the TMQI, SSIM, and MS-SSIM metrics and outperforms most other TMOs on FID and PPL. We also divided the images into two sets, one of indoor scenes and another of outdoor scenes; both reach high TMQI (0.89 and 0.90) for 1024×2048 resolution images. Our deep learning-based tone mapping algorithm effectively combines the best features of other TMOs. In the absence of interactive parameter adjustment, which is not always available, our approach offers the best TMQI.
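A sketch of how this modified PPL could be estimated by Monte Carlo sampling is given below; `f` (latent-to-style mapping), `g` (image synthesis), and `perceptual_distance` (the VGG16-based distance) are hypothetical callables.

```python
import numpy as np

def lerp(a: np.ndarray, b: np.ndarray, t: float) -> np.ndarray:
    """Linear interpolation between two style vectors."""
    return a + t * (b - a)

def estimate_ppl(f, g, perceptual_distance, latent_dim: int,
                 n_samples: int = 1000, eps: float = 1e-4) -> float:
    """Monte Carlo estimate of Eq. (10)."""
    scores = []
    for _ in range(n_samples):
        z1, z2 = np.random.randn(latent_dim), np.random.randn(latent_dim)
        t = np.random.uniform(0.0, 1.0)
        img_a = g(lerp(f(z1), f(z2), t))
        img_b = g(lerp(f(z1), f(z2), t + eps))
        scores.append(perceptual_distance(img_a, img_b) / eps ** 2)
    return float(np.mean(scores))
```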

Fig. 7. Qualitative comparisons of adTMO and the top-9-ranked TMOs for outdoor and indoor scenes on the TMQI metric.

Table 3. Quantitative comparison of adTMO and all other TMOs for 256×256 resolution images on the TMQI, SSIM, MS-SSIM, FID, and PPL metrics. The bold values indicate the metric where adTMO performs the best amongst all other TMOs.

TMO | TMQI | SSIM | MS-SSIM | FID | PPL
AshikhminTMO | 0.85±0.07 | 0.71 | 0.73 | 103.2 | 327.5
BanterleTMO | 0.89±0.05 | 0.72 | 0.73 | 91.3 | 242.6
BestExposureTMO | 0.90±0.05 | 0.81 | 0.82 | 92.7 | 210.4
BruceExpoBlendTMO | 0.88±0.07 | 0.78 | 0.81 | 87.4 | 154.5
ChiuTMO | 0.86±0.06 | 0.71 | 0.75 | 98.0 | 201.7
DragoTMO | 0.89±0.05 | 0.76 | 0.78 | 93.4 | 296.1
DurandTMO | 0.90±0.06 | 0.78 | 0.79 | 88.3 | 164.1
ExponentialTMO | 0.84±0.04 | 0.73 | 0.76 | 121.9 | 219.5
FerwerdaTMO | 0.84±0.09 | 0.75 | 0.77 | 108.3 | 285.1
GammaTMO | 0.80±0.07 | 0.62 | 0.68 | 118.4 | 439.5
KimKautzConsistentTMO | 0.90±0.05 | 0.78 | 0.78 | 84.2 | 138.6
KrawczykTMO | 0.86±0.08 | 0.70 | 0.72 | 104.7 | 248.6
KuangTMO | 0.89±0.06 | 0.78 | 0.79 | 94.5 | 238.5
LischinskiTMO | 0.93±0.05 | 0.82 | 0.83 | 74.3 | 159.2
LogarithmicTMO | 0.88±0.07 | 0.76 | 0.78 | 98.2 | 223.8
MertensTMO | 0.87±0.06 | 0.71 | 0.73 | 96.2 | 194.4
NormalizeTMO | 0.87±0.08 | 0.73 | 0.76 | 101.4 | 245.8
PattanaikTMO | 0.77±0.02 | 0.60 | 0.63 | 164.9 | 468.1
RamanTMO | 0.85±0.07 | 0.69 | 0.71 | 116.7 | 280.2
ReinhardDevlinTMO | 0.84±0.04 | 0.71 | 0.72 | 113.8 | 202.7
ReinhardTMO | 0.92±0.05 | 0.80 | 0.81 | 80.5 | 143.8
SchlickTMO | 0.84±0.09 | 0.70 | 0.72 | 104.6 | 257.3
TumblinTMO | 0.86±0.04 | 0.70 | 0.72 | 108.5 | 236.1
VanHaterenTMO | 0.82±0.04 | 0.68 | 0.70 | 115.7 | 275.9
WardGlobalTMO | 0.89±0.06 | 0.80 | 0.81 | 92.5 | 193.6
WardHistAdjTMO | 0.93±0.04 | 0.80 | 0.81 | 70.3 | 152.9
YPFerwerdaTMO | 0.86±0.06 | 0.72 | 0.74 | 98.2 | 204.2
YPTumblinTMO | 0.81±0.03 | 0.68 | 0.71 | 102.5 | 257.4
YPWardGlobalTMO | 0.87±0.06 | 0.71 | 0.74 | 98.4 | 201.5
MATLAB tonemap function | 0.87±0.04 | 0.74 | 0.76 | 129.5 | 286.3
Proposed adTMO | 0.92±0.05 | 0.80 | 0.82 | 68.2 | 163.2

Table 4. Quantitative comparison of adTMO and all other TMOs for 1024×2048 resolution images on the TMQI, SSIM, MS-SSIM, FID, PPL, and face detection accuracy metrics. The bold values indicate the metric where adTMO performs the best amongst all other TMOs.

TMO | TMQI | SSIM | MS-SSIM | FID | PPL | Face detection acc. (%)
AshikhminTMO | 0.82±0.09 | 0.69 | 0.70 | 114.6 | 254.8 | 70.5
BanterleTMO | 0.84±0.08 | 0.67 | 0.69 | 104.5 | 239.6 | 87.6
BestExposureTMO | 0.85±0.07 | 0.73 | 0.73 | 102.4 | 218.5 | 88.6
BruceExpoBlendTMO | 0.81±0.07 | 0.70 | 0.71 | 96.5 | 204.5 | 83.8
ChiuTMO | 0.78±0.06 | 0.64 | 0.68 | 104.9 | 208.7 | 78.1
DragoTMO | 0.84±0.07 | 0.69 | 0.71 | 98.6 | 175.3 | 85.7
DurandTMO | 0.89±0.07 | 0.75 | 0.77 | 104.7 | 264.9 | 87.6
ExponentialTMO | 0.83±0.05 | 0.70 | 0.71 | 142.7 | 304.6 | 73.3
FerwerdaTMO | 0.76±0.11 | 0.70 | 0.72 | 123.8 | 175.0 | 70.5
GammaTMO | 0.78±0.08 | 0.61 | 0.66 | 121.5 | 275.1 | 73.3
KimKautzConsistentTMO | 0.85±0.07 | 0.75 | 0.76 | 97.4 | 204.6 | 81.9
KrawczykTMO | 0.81±0.10 | 0.68 | 0.69 | 119.5 | 259.0 | 80.0
KuangTMO | 0.85±0.08 | 0.72 | 0.74 | 101.3 | 237.1 | 81.0
LischinskiTMO | 0.89±0.07 | 0.80 | 0.81 | 87.5 | 174.2 | 88.6
LogarithmicTMO | 0.82±0.08 | 0.72 | 0.74 | 103.9 | 222.5 | 80.0
MertensTMO | 0.84±0.08 | 0.68 | 0.71 | 99.4 | 194.9 | 77.1
NormalizeTMO | 0.82±0.09 | 0.68 | 0.70 | 105.4 | 223.7 | 78.1
PattanaikTMO | 0.70±0.06 | 0.58 | 0.61 | 195.8 | 479.1 | 4.8
RamanTMO | 0.82±0.08 | 0.64 | 0.66 | 124.7 | 275.9 | 77.1
ReinhardDevlinTMO | 0.79±0.05 | 0.69 | 0.70 | 115.3 | 246.1 | 76.2
ReinhardTMO | 0.86±0.07 | 0.75 | 0.76 | 87.1 | 196.4 | 87.6
SchlickTMO | 0.79±0.08 | 0.66 | 0.68 | 119.6 | 219.4 | 75.2
TumblinTMO | 0.80±0.06 | 0.67 | 0.69 | 125.2 | 231.8 | 78.1
VanHaterenTMO | 0.77±0.06 | 0.62 | 0.65 | 128.5 | 274.0 | 76.2
WardGlobalTMO | 0.82±0.07 | 0.75 | 0.76 | 96.3 | 162.4 | 81.9
WardHistAdjTMO | 0.89±0.06 | 0.76 | 0.76 | 77.5 | 186.5 | 83.8
YPFerwerdaTMO | 0.86±0.08 | 0.71 | 0.73 | 107.5 | 201.4 | 76.2
YPTumblinTMO | 0.75±0.05 | 0.66 | 0.67 | 111.8 | 214.8 | 70.5
YPWardGlobalTMO | 0.80±0.06 | 0.64 | 0.67 | 105.7 | 197.5 | 81.0
MATLAB tonemap function | 0.84±0.06 | 0.70 | 0.72 | 141.6 | 308.1 | 88.6
Proposed adTMO | 0.90±0.06 | 0.80 | 0.81 | 79.5 | 187.4 | 90.5

In addition to the metrics above, we also applied face detection to the generated 1024×2048 LDR images to measure face detection accuracy, as HDR-to-LDR translation is often used in security and healthcare applications. The face detection accuracy is defined as acc = TP/(TP + FN), where TP and FN represent the numbers of faces that are detected and not detected, respectively. The face detector used in this paper is the Haar cascade face detector,48 and the test set used for evaluation is that of Kalantari and Ramamoorthi,45 which consists of HDR images containing human faces. Our proposed adTMO reaches the highest face detection accuracy compared with the other TMOs. The main reason is that we use the pretrained Inception V3 network24 to derive the perceptual loss, so our generated LDR images look more natural, and a face detector trained on natural images achieves higher accuracy on LDR images generated by our adTMO. Overall, adTMO produces the highest-quality output for high-resolution 1024×2048 images, and its results for 256×256 images are comparable.
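A sketch of this accuracy measurement with OpenCV's Haar cascade detector is shown below; `face_counts` is a hypothetical list of ground-truth face counts per image, and detections are matched to ground truth only by count, not by bounding-box overlap.

```python
import cv2

def detection_accuracy(ldr_images, face_counts) -> float:
    """acc = TP / (TP + FN): count detected faces against the known number of faces per image."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    tp, total = 0, 0
    for img, n_faces in zip(ldr_images, face_counts):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)      # BGR input assumed
        detections = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        tp += min(len(detections), n_faces)   # detections beyond the known faces are ignored
        total += n_faces
    return tp / total if total else 0.0
```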

6. Conclusion

We propose adTMO, a tone mapping operator that can adaptively generate high-resolution and high-quality LDR images. We explore different training and testing schemes and find the best combination for generating the highest quality images. We use multiple metrics, including TMQI, SSIM, MS-SSIM, and face detection accuracy, to measure the performance of the proposed adTMO. When testing on low-resolution LDR images, our adTMO has the highest performance on the FID metric across all other TMOs. When testing on high-resolution LDR images, our adTMO has the highest performance on TMQI, SSIM, MS-SSIM, and face detection accuracy over all other TMOs. Looking specifically at the TMQI metric, the proposed adTMO achieves a TMQI of 0.90±0.06, which is superior to DeepTMO's6 0.88±0.06. In addition, we have an advantage in training time: our model trains in 30 h, which is much shorter than DeepTMO's one week.

Acknowledgments

This project was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the grant “Biometric-enabled Identity management and Risk Assessment for Smart Cities,” and the Mitacs Globalink Graduate Fellowship, Canada.

References

1. F. Mccollough, Complete Guide to High Dynamic Range Digital Photography, Lark Books (2008).
2. E. Reinhard et al., High Dynamic Range Imaging: Acquisition, Display, and Image-Based Lighting, Morgan Kaufmann (2010).
3. M. Mirza and S. Osindero, "Conditional generative adversarial nets," (2014).
4. P. Isola et al., "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 1125–1134 (2017). https://doi.org/10.1109/CVPR.2017.632
5. T.-C. Wang et al., "High-resolution image synthesis and semantic manipulation with conditional GANs," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 8798–8807 (2018). https://doi.org/10.1109/CVPR.2018.00917
6. A. Rana et al., "Deep tone mapping operator for high dynamic range images," IEEE Trans. Image Process., 29, 1285–1298 (2019). https://doi.org/10.1109/TIP.2019.2936649
7. X. Cao et al., "Adversarial and adaptive tone mapping operator for high dynamic range images," in Proc. IEEE Symp. Ser. Comput. Intell. (SSCI), 1814–1821 (2020). https://doi.org/10.1109/SSCI47803.2020.9308535
8. G. W. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes," IEEE Trans. Vis. Comput. Graphics, 3(4), 291–306 (1997). https://doi.org/10.1109/2945.646233
9. F. Drago et al., "Adaptive logarithmic mapping for displaying high contrast scenes," Comput. Graphics Forum, 22(3), 419–426 (2003). https://doi.org/10.1111/1467-8659.00689
10. K. Chiu et al., "Spatially nonuniform scaling functions for high contrast images," in Proc. Graphics Interface, 245–245 (1993).
11. J. Tumblin, J. K. Hodgins, and B. K. Guenter, "Two methods for display of high contrast images," ACM Trans. Graphics, 18(1), 56–94 (1999). https://doi.org/10.1145/300776.300783
12. I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2672–2680 (2014).
13. V. A. Patel, P. Shah, and S. Raman, "A generative adversarial network for tone mapping HDR images," in Proc. Conf. Comput. Vision, Pattern Recognit., Image Process. and Graphics, 220–231 (2017). https://doi.org/10.1007/978-981-13-0020-2_20
14. C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," in Sixth Int. Conf. Comput. Vision (IEEE Cat. No. 98CH36271), 839–846 (1998). https://doi.org/10.1109/ICCV.1998.710815
15. F. Porikli, "Constant time O(1) bilateral filtering," in IEEE Conf. Comput. Vision and Pattern Recognit., 1–8 (2008). https://doi.org/10.1109/CVPR.2008.4587843
16. Z. Farbman et al., "Edge-preserving decompositions for multi-scale tone and detail manipulation," ACM Trans. Graphics, 27(3), 1–10 (2008). https://doi.org/10.1145/1360612.1360666
17. K. He, J. Sun, and X. Tang, "Guided image filtering," in Comput. Vision–ECCV 2010, 1–14 (2010).
18. D. Min et al., "Fast global image smoothing based on weighted least squares," IEEE Trans. Image Process., 23(12), 5638–5653 (2014). https://doi.org/10.1109/TIP.2014.2366600
19. Z. Li et al., "Weighted guided image filtering," IEEE Trans. Image Process., 24, 120–129 (2015). https://doi.org/10.1109/TIP.2014.2371234
20. C. Zheng et al., "Single image brightening via multi-scale exposure fusion with hybrid learning," IEEE Trans. Circuits Syst. Video Technol., 31(4), 1425–1435 (2020). https://doi.org/10.1109/TCSVT.2020.3009235
21. Z. Wang et al., "Image quality assessment: from error visibility to structural similarity," IEEE Trans. Image Process., 13(4), 600–612 (2004). https://doi.org/10.1109/TIP.2003.819861
22. Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Proc. Thirty-Seventh Asilomar Conf. Signals, Syst. and Comput., 1398–1402 (2003).
23. M. Heusel et al., "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Proc. Adv. Neural Inf. Process. Syst., 6626–6637 (2017).
24. C. Szegedy et al., "Rethinking the inception architecture for computer vision," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 2818–2826 (2016). https://doi.org/10.1109/CVPR.2016.308
25. T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proc. IEEE Conf. Comput. Vision and Pattern Recognit., 4401–4410 (2019). https://doi.org/10.1109/CVPR.2019.00453
26. K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," (2014).
27. P. Ledda et al., "Evaluation of tone mapping operators using a high dynamic range display," ACM Trans. Graphics, 24(3), 640–648 (2005). https://doi.org/10.1145/1073204.1073242
28. H. Yeganeh and Z. Wang, "Objective quality assessment of tone-mapped images," IEEE Trans. Image Process., 22(2), 657–667 (2012). https://doi.org/10.1109/TIP.2012.2221725
29. K. Ma et al., "High dynamic range image compression by optimizing tone mapped image quality index," IEEE Trans. Image Process., 24(10), 3086–3097 (2015). https://doi.org/10.1109/TIP.2015.2436340
30. K. Debattista, "Application-specific tone mapping via genetic programming," Comput. Graphics Forum, 37(1), 439–450 (2018). https://doi.org/10.1111/cgf.13307
31. O. Ronneberger, P. Fischer, and T. Brox, "U-net: convolutional networks for biomedical image segmentation," Lect. Notes Comput. Sci., 9351, 234–241 (2015). https://doi.org/10.1007/978-3-319-24574-4_28
32. C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in Proc. Eur. Conf. Comput. Vision, 702–716 (2016).
33. A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," (2015).
34. J. Johnson, A. Alahi, and L. Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution," in Proc. Eur. Conf. Comput. Vision, 694–711 (2016).
35. D. P. Kingma and J. Ba, "Adam: a method for stochastic optimization," (2014).
36. F. Xiao et al., "High dynamic range imaging of natural scenes," in Proc. Color and Imaging Conf., 337–342 (2002).
37. M.-A. Gardner et al., "Learning to predict indoor illumination from a single image," (2017).
38. P. Stanczyk and C. Phillips, "OpenEXR images," (2020). https://github.com/AcademySoftwareFoundation/openexr-images
39. B. Funt and L. Shi, "The rehabilitation of maxRGB," in Proc. Color and Imaging Conf., 256–259 (2010).
40. P. Modin, "HDR Vault Image Set (Version 1.0.0), Zenodo," (2018). https://zenodo.org/record/1245790#.YRazFYgzY2w
41. M. D. Fairchild, "The HDR photographic survey," in Proc. Color and Imaging Conf., 233–238 (2007).
42. Pfstools Google Group, "Pfstools HDR image gallery," http://pfstools.sourceforge.net
43. "HDR source image gallery," (2018). http://resources.mpi-inf.mpg.de/hdr/gallery.html
44. W. J. Adams et al., "The Southampton-York natural scenes (SYNS) dataset: statistics of surface attitude," Sci. Rep., 6, 35805 (2016). https://doi.org/10.1038/srep35805
45. N. K. Kalantari and R. Ramamoorthi, "Deep high dynamic range imaging of dynamic scenes," ACM Trans. Graphics, 36(4), 1–12 (2017). https://doi.org/10.1145/3072959.3073609
46. F. Banterle et al., Advanced High Dynamic Range Imaging, CRC Press (2017).
47. J. Cai, S. Gu, and L. Zhang, "Learning a deep single image contrast enhancer from multi-exposure images," IEEE Trans. Image Process., 27(4), 2049–2062 (2018). https://doi.org/10.1109/TIP.2018.2794218
48. P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proc. IEEE Comput. Soc. Conf. Comput. Vision and Pattern Recognit., I–I (2001). https://doi.org/10.1109/CVPR.2001.990517

Biography

Xingdong Cao received his BSc degree in electrical engineering from Zhejiang University, Zhejiang, China, in 2019. He is currently an MSc student under the supervision of professor Svetlana Yanushkevich at the Biometric Technologies Laboratory, Department of Electrical and Software Engineering, University of Calgary, Calgary, Alberta, Canada. His research interests include applying machine learning technologies to the biometrics field.

Kenneth Lai received his BSc and MSc degrees from the University of Calgary, Calgary, Alberta, Canada, in 2012 and 2015, respectively, where he is currently pursuing his PhD in the Department of Electrical and Software Engineering. His areas of interest include biometrics and its application to security and health care systems.

Michael Smith is a professor emeritus in electrical and software engineering at Schulich School of Engineering, University of Calgary, Calgary, Canada, with research interests in software engineering and customized real-time digital signal processing algorithms in the context of mobile embedded systems and biomedical instrumentation. He is a senior member of IEEE.

Svetlana Yanushkevich received her Dr.Tech.Sc. (Dr. Habilitated) degree from the Warsaw University of Technology in 1999. She is currently a professor in the Department of Electrical and Software Engineering at the University of Calgary. She is directing the Biometric Technologies Laboratory and conducting research in the area of biometric-based authentication technologies. She is a senior member of IEEE.

CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Xingdong Cao, Kenneth K. F. Lai, Michael R. Smith, and Svetlana Yanushkevich "Adversarial and adaptive tone mapping operator: multi-scheme generation and multi-metric evaluation," Journal of Electronic Imaging 30(4), 043020 (19 August 2021). https://doi.org/10.1117/1.JEI.30.4.043020
Received: 12 February 2021; Accepted: 3 August 2021; Published: 19 August 2021