Article

EFCANet: Exposure Fusion Cross-Attention Network for Low-Light Image Enhancement

1 School of Computer Science and Technology, Shandong Technology and Business University, Yantai 264005, China
2 Information Technology Department, Qingdao Vocational and Technical College of Hotel Management, Qingdao 266100, China
3 School of Information and Electronic Engineering, Shandong Technology and Business University, Yantai 264005, China
4 Co-Innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Shandong Technology and Business University, Yantai 264005, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(1), 380; https://doi.org/10.3390/app13010380
Submission received: 15 November 2022 / Revised: 12 December 2022 / Accepted: 19 December 2022 / Published: 28 December 2022

Abstract

Image capture devices produce poor-quality images under low-light conditions, and the resulting images contain dark areas due to insufficient exposure. Traditional Multiple Exposure Fusion (MEF) methods fuse images with different exposure levels from a global perspective, which often leads to secondary exposure in well-exposed areas of the original image. In addition, image sequences with different exposure levels are scarce, so MEF methods are limited by the available training data and benchmark labels. To address these problems, this paper proposes an Exposure Fusion Cross-Attention Network (EFCANet) for low-light image enhancement. EFCANet recovers a normal-light image from a single exposure-corrected image. First, an Exposure Image Generator (EIG) is used to estimate the single exposure-corrected image corresponding to the original input image. Then, the exposure-corrected image and the original input image are converted from the RGB color space to the YCbCr color space in order to maintain the balance of brightness and color. Finally, a Cross-Attention Fusion Module (CAFM) is used to fuse the images in the YCbCr color space to achieve image enhancement. We use a single CAFM as a recursive unit, and EFCANet applies four recursive units progressively. The intermediate enhancement result generated by the first recursive unit and the exposure-corrected image of the original input, both in the YCbCr color space, are used as the inputs of the second recursive unit. We conducted comparison experiments with 14 state-of-the-art methods on eight publicly available datasets. The experimental results demonstrate that the image quality of EFCANet enhancement is better than that of the other methods.

1. Introduction

With the rapid development of computer vision technology, images are widely used in surveillance equipment, satellite remote sensing, and medical imaging. However, during image acquisition, the imaging device cannot collect a sufficient number of photons under poor illumination, which leads to underexposure. In underexposed images, the contrast of the scene is low, dark areas occupy most of the image, and the visual quality is poor. If the visibility of the image is improved only by extending the exposure time of the device, the movement of objects within that short period may cause motion blur. As shown in Figure 1, low-light images not only suffer from uneven lighting and low definition, but they also contain high noise, which is the most difficult problem to solve. These problems seriously affect human visual perception and greatly limit the normal function of other computer vision algorithms and vision devices. Therefore, the enhancement of low-light images is an essential task.
A great deal of research has been conducted on low-light image enhancement, with researchers using histogram-based methods and various methods based on Retinex. Histogram-based methods, such as [5], attempt to achieve contrast enhancement by redistributing the most frequently occurring gray levels of the original image over the global or local intensity range. However, such methods can lead to insufficient or excessive enhancement. Alongside algorithms based on histogram equalization, low-light image enhancement algorithms based on Retinex theory and on convolutional neural networks have also developed rapidly. The Retinex theory proposed by Land et al. [6] suggests that the human visual system has color constancy, which means that when the illumination on the surface of an object changes, the human eye's perception of the color of the object is not affected. Most algorithms based on Retinex theory first estimate the illumination component, then remove it, and finally keep the reflection component and use it as the enhancement result of the image.
To mine the structural information in low-light images, the SSR (Single-Scale Retinex) algorithm [7] uses a Gaussian low-pass filter to estimate the light component, but the color of the object image is prone to distortion, and there is a halo effect near the edges of the object. To solve the problems of SSR, the MSR (Multi-Scale Retinex) [8] algorithm uses a color balance factor to reduce the loss of color information. Although the Retinex-based algorithm is effective in enhancing the contrast and reducing the loss of detail information in dark areas, there is a shortcoming: when the illumination and reflection components cannot be completely separated, the information of the reflection component will be captured by the illumination component. However, most Retinex-based methods only use the reflectance component to enhance the image, which leads to less than optimal enhancement results.
A characteristic of low-light images is the presence of various kinds of noise and speckles, and removing them adaptively is the biggest difficulty in image enhancement. Although noise can be dealt with using image noise reduction algorithms, noise reduction itself is a smoothing process that causes severe loss of detail. The JED (Joint Enhancement and Denoising) algorithm [9] introduces denoising into enhancement by using a weight matrix to suppress noise during the decomposition of the illumination and reflection components and by smoothing the noise in the spatial localization of the feature information. Researchers have also conducted extensive studies on how to remove speckles from images without causing loss of detail information [10,11].
Multiexposure image fusion is a technique that fuses multiple images of different exposures to produce a single well-exposed image. The algorithm proposed in [12] requires a sequence of multiexposure images as input; it decomposes each input image into a base layer and a detail layer and fuses the information from both layers to generate the final enhanced image. To reduce image distortion, EFF (Exposure Fusion Framework) [13] studies the relationship between two images with different exposure levels, uses illumination estimation to estimate the exposure map, and adjusts each pixel to normal exposure according to the estimated exposure map. BIMEF (Bio-Inspired Multi-Exposure Fusion Framework) [14] is a low-light image enhancement algorithm based on multiexposure fusion, which first simulates the human eye to adjust the exposure and generate a sequence of multiexposure images, and then simulates the human brain to fuse the generated exposure images into the final enhanced image.
The overall brightness of the image enhanced by the algorithm based on multiexposure fusion is uniform. However, the limitations of MEF are mainly threefold: (1) It requires more LDR (Low Dynamic Range) images (usually more than two) in the exposure stack to capture the entire dynamic range of the scene, resulting in more storage requirements, processing time, and power. (2) The stumbling block to using deep learning in MEF is the lack of sufficient training data and labels that provide the basis for supervised learning. (3) Fusion of multiple input images can produce a better HDR (High Dynamic Range) image, but in the fusion process, it is easy to fuse the well-exposed area of the original image with the multiexposure image sequence again, and there is a risk of overexposure.
To alleviate the problems of overexposure, underexposure, and data limitation, in this study, we propose an exposure fusion cross-attention network (EFCANet) for low-light image enhancement, whose algorithm flow is shown in Figure 2. The Exposure Image Generator (EIG) estimates the exposure factor of each pixel and then adjusts the exposure of each pixel accordingly to obtain an exposure-corrected image. As is well known, the YCbCr color space contains the luminance component Y and the chrominance components Cb and Cr. The structural details of an image exist in the luminance channel, and luminance variations are more prominent in the luminance channel than in the chrominance channels. To better maintain the balance between image color and luminance, we convert the original image and the corresponding exposure-corrected image from the RGB color space to the YCbCr color space. Then, the Cross-Attention Fusion Module (CAFM) is invoked in the image fusion stage. CAFM can effectively avoid fusion artifacts and color inconsistencies in fused images. CAFM consists of two Cross-Attention Fusion (XAF) blocks and one Channel Attention Fusion (CAF) block. XAF compensates the low-light image of the primary input with the exposure-corrected image of the secondary input. CAF implements a two-branch fusion to produce better fusion results. The first CAFM reconstructs the first intermediate enhancement result, which is still in the YCbCr color space. Four CAFM iterations are used in the network, and the input of each subsequent CAFM is the output of the previous CAFM together with the exposure-corrected image corresponding to the original low-light image. Finally, the YCbCr image is converted back to RGB to obtain the final enhancement result.
The main contributions of this paper are summarized as follows:
  • An exposure image generator is designed to reconstruct an exposure-corrected image for any single low-light image. This generator generates exposure-corrected images corresponding to low-illumination images by calculating the exposure factor of each pixel, which effectively improves the global brightness of the image. The exposure-corrected image is used as part of the input for image fusion.
  • In order to associate and emphasize local and global features differently, we invoke the Cross-Attention Fusion Module (CAFM). CAFM includes both Cross-Attention Fusion (XAF) and Channel Attention Fusion (CAF). XAF enables exposure-corrected images to compensate for low-light images. CAF fuses the feature information of the two branches. CAFM achieves a better fusion effect, avoiding fusion artifacts and color inconsistency problems.
  • The algorithm completes image enhancement in a progressive recursive manner, and the feature information produced at each stage serves as prior information that guides feature learning in the next stage. As the number of recursions increases, the algorithm integrates the feature information of the different stages to improve the performance of the network.
The main content of this paper is organized as follows: In the second part, we review low-light image enhancement and image-fusion-based low-light image enhancement algorithms. In the third part, we describe our method in detail. In the fourth part, we conduct subjective and objective comparison tests and ablation experiments against currently popular methods. The fifth part is the summary.

2. Related Work

In the field of computer vision, low-light image enhancement is a prerequisite for many studies. Low-light images can be enhanced through both hardware and software. If hardware is used, it is common practice to use an infrared camera or to increase the aperture of the camera to obtain more photons. However, the image quality achieved by hardware devices is not high, and such devices are often too expensive. Therefore, the mainstream solution nowadays is software-based. Software methods for enhancing low-light images can be mainly divided into conventional algorithms and deep-learning-based algorithms. In the following sections, we discuss the existing algorithms for low-light image enhancement.

2.1. Low-Light Image Enhancement

Low-light image enhancement aims to enhance the brightness and details of an image to obtain a color-saturated image with clear details. In recent years, more and more low-illumination image enhancement algorithms have been proposed. Land [6] proposed the Retinex model, which treats each image as the product of a reflectance component R and an illumination component L. After that, many Retinex-based methods were proposed, such as MSR [8], SSR [7], and SRIE [15], which were also used for image enhancement. However, due to the incomplete separation of the illumination and reflection components, the enhanced images often suffer from excessive edge enhancement and color distortion, resulting in visual degradation. Retinex theory is also widely used in other image enhancement tasks: the underwater image enhancement method of [16] uses a color correction scheme to eliminate chromatic aberrations, imposes a multiorder gradient prior on the reflectance and illumination maps, and finally establishes a maximum a posteriori formulation for the target image on the color-corrected image. IB (Illumination Boost algorithm) [17] targets the enhancement of nighttime images; the algorithm applies logarithmic and exponential functions to enhance local contrast, uses the LIP method to obtain the features of both images, uses S-curves to improve the overall brightness, and uses linear scaling functions to readjust pixel intensities to the standard dynamic range. The IB algorithm keeps the well-exposed areas of the original image from being over-amplified and generates an image with better perceptual quality. Noise may be amplified during the image enhancement process; therefore, image enhancement should preserve details and remove noise at the same time.
With the development of deep learning, some excellent traditional algorithms and theories have been introduced into deep learning to map low-light images to normal-light images. RetinexNet [1] uses Decom-Net to learn, under smoothness constraints, the reflection and illumination maps, Enhance-Net to enhance the illumination map, and a joint denoising method to remove the noise from the reflection map; this algorithm can effectively enhance detail information. The KinD (Kindling the Darkness) [2] algorithm, also inspired by Retinex theory, decouples the original image into two components, which are then responsible for adjusting the light and removing degradation, respectively, to achieve better regularized learning.
Recently, RetinexDIP [18] proposed a new Retinex decomposition "generation" strategy that transforms the decomposition problem into a generation problem. The DLN [19] algorithm treats low-light image enhancement as a problem of estimating the residual between low-light and normal-light images. The GLADNet [20] algorithm uses an encoder-decoder network to generate the global prior information of the illumination map, which is then combined with the original image to reconstruct detail information. GLADNet uses a nearest-neighbor interpolation operation and is able to avoid object vignetting. EnlightenGAN (EnGAN) [3] can be trained in the absence of low/normal-light image pairs, using an attention-guided U-Net as a generator and a double discriminator to guide global and local information; a self-feature retention loss is used to guide the training process. DRBN [21] enhances images in two phases: the first phase learns to recover image details on a paired dataset, and the second phase uses a GAN network to improve the visual quality of the image. Zero-DCE [22] and Zero-DCE++ [4] are unsupervised learning networks that treat the image enhancement task as a higher-order curve estimation problem. The image is used as input and the curve as output, and the network learns by adjusting the parameters of the curve, constraining its learning to the dynamic range of the input at the pixel level. This class of methods opens up a new learning strategy that eliminates the need for paired datasets. Ref. [23] inputs low-light images into a dual-attention model for global feature extraction and introduces a recurrent layer into the low-light enhancement task to improve the subjective quality of the enhanced image.

2.2. Low-Light Image Enhancement Algorithm Based on Image Fusion

In recent years, many Multiple Exposure Fusion (MEF) algorithms have been proposed. The main idea is to compute a weight for each image in the multisource image sequence, with the fused image being the weighted sum of the individual images. Various MEF methods use different techniques to find the best weight map. The process in [24] is very simple: each pixel is scored based on three metrics, namely local contrast, color saturation, and exposure quality. The normalized weight map is obtained by normalizing the metric scores, and the weight map guides the multiscale image fusion. However, since the metrics are computed per pixel, the brightness of the fused image is susceptible to color mismatch and edge artifacts. In [25], to make the fused image as close as possible to a single image of the original scene, the algorithm uses the intensity of the image details extracted by a bilateral filter as weights to guide the fusion. The algorithm fuses well for images with large exposure differences.
In this study, the process of multiple exposure image generation is avoided, reducing the consumption of computer resources and time. We use an exposure image generator to generate exposure-corrected images corresponding to a single low-light image. To achieve a more comprehensive image quality enhancement, the details of the two images are fused together by a cross-fusion attention mechanism to generate the final enhanced image.

3. Proposed Method

EFCANet is applicable to a single image, unlike several other methods that require multiple images of the same scene to be fused. First, we use the Exposure Image Generator (EIG) to generate the exposure-corrected image corresponding to the low-light image; the EIG increases the global brightness of the image. To maintain the balance of color and luminance, the exposure-corrected image and the original low-light image are converted from the RGB color space to the YCbCr color space, so that fusion takes place in the YCbCr color space. Then, we use the Cross-Attention Fusion Module (CAFM) to fuse the original low-light image with the exposure-corrected image. CAFM yields an intermediate enhancement result in the YCbCr color space. CAFM is then applied iteratively three more times, and the input of each subsequent CAFM is the output of the previous CAFM together with the exposure-corrected image in the YCbCr color space. Finally, the enhanced image is converted from the YCbCr color space back to the RGB color space to obtain the final enhanced image. The flow chart of the method is shown in Figure 2.
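To make the recursive structure concrete, the sketch below outlines the pipeline just described. The callables eig, cafm_blocks, rgb2ycbcr, and ycbcr2rgb are hypothetical placeholders standing in for the components defined in the rest of this section; this is not the authors' implementation.

```python
def efcanet_forward(low_rgb, eig, cafm_blocks, rgb2ycbcr, ycbcr2rgb):
    """High-level sketch of the EFCANet pipeline (all arguments are placeholders)."""
    # 1. Estimate the exposure-corrected image for the single input image.
    exposure_rgb = eig(low_rgb)
    # 2. Convert both images to YCbCr to balance luminance and color.
    low_ycc = rgb2ycbcr(low_rgb)
    exp_ycc = rgb2ycbcr(exposure_rgb)
    # 3. Four recursive CAFM units: the first fuses the low-light image with
    #    the exposure-corrected image; each later unit fuses the previous
    #    output with the same exposure-corrected image.
    out = cafm_blocks[0](low_ycc, exp_ycc)
    for cafm in cafm_blocks[1:]:
        out = cafm(out, exp_ycc)
    # 4. Convert the enhanced result back to RGB.
    return ycbcr2rgb(out)
```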

3.1. EIG (Exposure Image Generator)

For static scenes, multiple low dynamic range images with different exposure settings can be combined to produce a single high dynamic range image. How to obtain the exposed images involved in the fusion then becomes a critical question. If the exposure setting of the image is changed on a pixel-by-pixel basis, the exposure of each pixel can be adjusted independently, allowing direct tone mapping without the need to construct intermediate HDR images. This inspired us to design the Exposure Image Generator (EIG). The EIG consists of three main steps. The first is to determine the optimal exposure factor for each pixel; the second is to readjust pixel intensities with a spatial bilateral filter, estimating the contribution of each pixel from the difference between its intensity and that of the central pixel; the third is tone mapping, which recombines the luminance and hue information. As an image processing algorithm, the EIG greatly saves computational resources and avoids the deterioration of fusion results caused by the unavailability of information in some images of a multiexposure sequence. The details of each step are described next.
Exposure factor: A prerequisite for acquiring exposure-corrected images is to determine the dimensions of the low-light image and the exposure factor corresponding to each pixel. Compared with the multiexposure image acquisition process, the EIG can select the optimal exposure factor for each pixel, expanding the dynamic range and reducing noise within the spatial neighborhood at the pixel level. Moreover, it exposes hidden details in the scene that are barely visible. How is the exposure factor matrix $K \in \mathbb{R}^{H \times W}$ corresponding to the original input image obtained? First, spatial Gaussian blur is used to reduce the noise in the luminance component of the low-light image. The spatially uniform tone mapping function $T(x,\mu)$ is then applied to the Gaussian-blurred image to obtain the exposure factor $\lambda$ for each pixel. The role of $\lambda$ is to be multiplied with the spatial domain kernel, the pixel domain kernel, and the pixel values in the bilateral filter to obtain the contribution of each pixel. $T(x,\mu)$ is a nonlinear mapping function whose parameters are used in the tone-mapping process to attenuate the image detail layer and adjust the contrast of the global layer features.
$$T(x,\mu)=\frac{\log\left(\frac{x}{x_{Max}}(\mu-1)+1\right)}{\log(\mu)} \tag{1}$$
where $x_{Max}$ is the parameter that controls the input brightness level and $\mu$ is the parameter that controls the decay curve. The tone mapping function $T(x,\mu)$ is similar to the conventional gamma function, but at pixel intensities close to 0 it does not suffer from the severe slope vanishing of gamma correction, nor does it overemphasize dark areas.
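As a worked illustration of Equation (1), the NumPy sketch below evaluates $T(x,\mu)$; the assumption that intensities are normalized to $[0, x_{Max}]$ and the example parameter values are ours, not the paper's.

```python
import numpy as np

def tone_map(x, mu, x_max=1.0):
    """Spatially uniform tone mapping T(x, mu) from Equation (1).

    x     : pixel intensities, assumed normalized to [0, x_max]
    mu    : decay-curve parameter (the paper later uses ~40 for the base
            layer and 700 for the detail layer)
    x_max : parameter controlling the input brightness level
    """
    return np.log((x / x_max) * (mu - 1.0) + 1.0) / np.log(mu)

# Example: dark and mid-gray pixels are lifted, bright pixels are compressed.
print(tone_map(np.array([0.05, 0.25, 0.9]), mu=40.0))
```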
Spatial Bilateral Filter (SBF): The SBF is a nonlinear filter that represents the intensity of the central pixel as a weighted average of its neighboring pixels; each filtered pixel is a weighted average of the pixels in its neighborhood (as shown in Figure 3). Since the spatial bilateral filter combines the spatial proximity and pixel-value similarity of the central pixel, considering both spatial domain information and grayscale similarity, it preserves edge information while reducing noise and smoothing the image. In this study, the exposure factor is multiplied with the central pixel and its spatial domain kernel value and pixel domain kernel value in the SBF to readjust the intensity of each pixel.
We introduce a pixel intensity difference value $R(m,p)$ into the bilateral filter to estimate the contribution of a pixel from the difference between its intensity and that of the central pixel. The value of $R(m,p)$ is the difference between the intensity of the neighboring pixel $m$ and the intensity of the central pixel $p$. We generalize the intensity difference to any non-luminosity difference value, that is, any relationship satisfying the following properties: $R(x,x)=0$ and $R(x,y)=R(y,x)$. If the triangle inequality holds, the difference value also satisfies $R(x,y)+R(y,z)\ge R(x,z)$, where $x$, $y$, and $z$ are the values of any pixels within the matrix $I$.
The spatial bilateral filter with pixel intensity difference values is shown in Equation (2):
$$S(p,\theta_1,\theta_2)=\frac{1}{W_p}\sum_{m\in K_s}\lambda\,G_{\theta_1}(\|m-p\|,\theta_1)\,G_{\theta_2}(R(m,p),\theta_2)\,I_m \tag{2}$$
Among them:
$$W_p=\sum_{m\in K_s}G_{\theta_1}(\|m-p\|,\theta_1)\,G_{\theta_2}(R(m,p),\theta_2) \tag{3}$$
$$G_{\theta}(x,\theta)=\frac{1}{2\pi\theta}\,e^{-\frac{x^2+\theta^2}{2\theta^2}} \tag{4}$$
$$R(m,p)=I_m-I_p \tag{5}$$
Here, $p$ denotes the central pixel, $K_s$ denotes all pixel points in the $k\times k$ neighborhood of the central pixel, and $m$ is any pixel point within $K_s$. The variables $\theta_1$, $\theta_2$, and $K_s$ control the operation of the bilateral filter: $\theta_1$ controls the rate of spatial Gaussian decay; $\theta_2$ controls the weight of the Gaussian intensity difference, attenuating the contribution of neighboring pixels whose intensity differs too much from that of the central pixel; and $K_s$ determines the extent of the neighborhood around the central pixel. $W_p$ normalizes the sum of the weights.
$p$ is the central pixel and $m$ is a pixel within its neighborhood; in this paper, the neighborhood size is 5 × 5. $G_{\theta_1}$ and $G_{\theta_2}$ are the spatial domain kernel and the pixel domain kernel, respectively, and their values for each pixel are obtained using Equation (4). For each pixel in the neighborhood, its $G_{\theta_1}$ and $G_{\theta_2}$ values are multiplied by the exposure factor $\lambda$ and by the pixel value $I_m$; summing these products over all pixels in the neighborhood, each of which is thereby associated with the central pixel, yields the numerator of the spatial bilateral filter. Multiplying $G_{\theta_1}$ by $G_{\theta_2}$ gives $W_p$ for each pixel, and summing $W_p$ over the neighborhood gives the denominator. Dividing the numerator by the denominator yields the new value of the central pixel.
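The following NumPy sketch follows this per-pixel procedure for a single-channel image. It assumes standard Gaussian kernels for $G_{\theta_1}$ and $G_{\theta_2}$, uses the exposure factor of the central pixel, and picks illustrative parameter values; it is not the authors' implementation.

```python
import numpy as np

def spatial_bilateral_filter(img, lam, theta1=3.0, theta2=0.1, k=5):
    """Exposure-weighted spatial bilateral filter sketch (Equation (2)).

    img : single-channel float image
    lam : per-pixel exposure factor map (same shape as img)
    """
    h, w = img.shape
    r = k // 2
    pad = np.pad(img, r, mode="reflect")
    out = np.zeros_like(img)
    # Spatial-domain kernel over the k x k neighborhood (precomputed once).
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]
    g_space = np.exp(-(ys**2 + xs**2) / (2.0 * theta1**2))
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + k, j:j + k]
            # Pixel-domain kernel built from intensity differences R(m, p).
            g_range = np.exp(-((patch - img[i, j]) ** 2) / (2.0 * theta2**2))
            weights = g_space * g_range
            numerator = np.sum(lam[i, j] * weights * patch)
            denominator = np.sum(weights)
            out[i, j] = numerator / (denominator + 1e-8)
    return out
```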
Tone Mapping (TM): The final step of the exposure generator is tone mapping. The tone mapping algorithm compresses the brightness of the exposed image while maintaining image detail and color, avoiding the loss of important information. In this paper, instead of explicitly generating HDR images, we vary the exposure settings of the image on a pixel-by-pixel basis, allowing each pixel's exposure to be adjusted independently and thus allowing direct tone mapping without the need to construct intermediate HDR images. We assume that PSNR varies with pixel intensity and that detail information is less accurate in dark areas than in evenly exposed areas. We therefore associate the confidence of the tone mapper in the detail with the luminous intensity of the pixel. The tone mapper processes the detail layer and the base layer in different pipelines: one pipeline attenuates the base layer features to achieve the desired contrast, while the other attenuates the detail according to the accuracy estimated from the local luminance. Finally, these two components are recombined to form the final output.
In the EIG's tone mapping, we focus on detail features and base features to effectively preserve detail information and align base tones. First, the image is converted into a luminance component and a color component. Then, the luminance component is converted to the logarithmic domain, and a bilateral filter decomposes the logarithmic-domain image into base and detail layers. The base layer describes the luminance information of the image, and the detail layer describes its texture and edge information. The base layer is extracted using the bilateral filter, and subtracting it from the logarithmic luminance of the original image produces the detail layer. The noise in the color component is attenuated by Gaussian blur, and the final exposure-corrected image is obtained by recombining the luminance and color components. The same spatially uniform tone mapping function $T(x,\mu)$, with different parameters, is used to attenuate the image details and adjust the contrast of the base layer. Equation (1) is applied to the linear intensity of the base layer for uniform tone mapping, with $\mu_1$ approximately 40. The logarithmic intensity of the detail layer is attenuated according to the luminance of the base layer; the attenuation is linear, and at a luminance of 0.5 times the maximum pixel value, half of the high frequencies are blocked. Since the confidence in the detail decreases at dark values, Equation (1) is applied with a different parameter $\mu_2$ ($\mu_2 = 700$) to reduce the peaks at lower intensities. The noise in the color component is attenuated by standard Gaussian blurring. The color component is then processed, and the base and detail layers are combined with the color component to obtain the tone-mapped exposure-corrected image.
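The sketch below illustrates the base/detail split of the luminance channel described above, using OpenCV's bilateral filter. The filter parameters, the exact detail-attenuation rule, and the recombination step are our assumptions, since the paper does not specify them precisely.

```python
import numpy as np
import cv2

def eig_tone_map_luma(luma, mu_base=40.0, mu_detail=700.0):
    """Sketch of the base/detail tone-mapping stage on a luminance channel.

    luma : float32 luminance in (0, 1]; returns tone-mapped luminance.
    """
    log_luma = np.log(luma.astype(np.float32) + 1e-6)
    # Base layer: edge-preserving smoothing of the log luminance.
    base = cv2.bilateralFilter(log_luma, d=9, sigmaColor=0.4, sigmaSpace=5)
    # Detail layer: the high-frequency residual removed by the filter.
    detail = log_luma - base
    # Attenuate the base layer with the uniform tone curve of Equation (1).
    base_lin = np.exp(base)
    base_tm = np.log(base_lin / base_lin.max() * (mu_base - 1) + 1) / np.log(mu_base)
    # Attenuate the detail according to a darkness-dependent confidence,
    # here modeled with the same curve and mu_2 = 700 (assumed rule).
    confidence = np.log(base_lin / base_lin.max() * (mu_detail - 1) + 1) / np.log(mu_detail)
    detail_tm = detail * confidence
    # Recombine base and detail into the corrected luminance.
    return np.clip(base_tm + detail_tm, 0.0, 1.0)
```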

3.2. Cross-Attention Fusion Module (CAFM)

Attention mechanisms have been used to correlate local and global features in low-light image enhancement [26,27,28,29]. Inspired by this, and to better fuse underexposed images of the same scene with their corresponding exposure-corrected images, we use the Cross-Attention Fusion Module (CAFM). Figure 4 shows the overall architecture of the CAFM, which consists of Cross-Attention Fusion (XAF) and Channel Attention Fusion (CAF). The principle of CAFM is to compensate the content of the underexposed image with the exposure-correction information of the other image. CAFM uses the strip pool [30] to locate well-exposed regions in the global image horizontally and vertically from the secondary input ($I_a$) and uses them to complement the primary input ($I_b$), as shown in the XAF in Figure 4. The strip pool [30] averages row pixels using horizontal single-pixel-long kernels and column pixels using vertical single-pixel-long kernels, and then merges them to discover dependencies between features across regions. CAFM further extends the strip pool to cross-fuse two inputs in different domains.
We iteratively apply CAFM four times. First, the exposure-corrected image and the low-light image are converted from the RGB color space to the YCbCr color space, and both images are input to the first CAFM to obtain the first intermediate enhancement result, which is still in the YCbCr color space. The inputs of each of the remaining three CAFMs are the output of the previous CAFM and the exposure-corrected image corresponding to the original low-light input. In image fusion, the RGB channels are highly correlated, and the input image is usually converted from the RGB color space to the YCbCr color space in order to preserve the image color and detail information [26,31]. To illustrate CAFM in detail, we explain the first CAFM in the network; the remaining three CAFMs follow the same principle.
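For reference, a common full-range ITU-R BT.601 RGB-to-YCbCr conversion is sketched below; the paper does not state which YCbCr variant it uses, so the coefficients and value range are assumptions.

```python
import torch

def rgb_to_ycbcr(rgb):
    """Convert an (N, 3, H, W) RGB tensor in [0, 1] to YCbCr (BT.601, full range)."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5        # chroma centered at 0.5
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return torch.cat([y, cb, cr], dim=1)
```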
Cross-Attention Fusion (XAF): We assume that the exposure-corrected image in the YCbCr color space is $I_a \in \mathbb{R}^{C \times H \times W}$ and the low-illumination image is $I_b \in \mathbb{R}^{C \times H \times W}$. $I_b$ is the primary input and $I_a$ is the secondary input. Horizontal and vertical strip pooling is performed on the secondary input $I_a$ to find the well-exposed regions in each channel. The horizontal strip result $I^V_{i,c} \in \mathbb{R}^{H \times 1 \times C}$ and the vertical strip result $I^L_{j,c} \in \mathbb{R}^{1 \times W \times C}$ are obtained from the secondary input, where $C$ is the number of feature channels and $H$ and $W$ are the tensor height and width. Here,
$$I^V_{i,c}=\frac{1}{W}\sum_{j=0}^{W-1}I^a_{i,j,c},\quad 1\le i\le H \tag{6}$$
$$I^L_{j,c}=\frac{1}{H}\sum_{i=0}^{H-1}I^a_{i,j,c},\quad 1\le j\le W \tag{7}$$
Then, $I^V_{i,c}$ and $I^L_{j,c}$ are upsampled, and the upsampled feature tensors are fused to obtain $I^M_{i,j,c}$. $I^M_{i,j,c}$ is converted into an attention mask $M_I$ whose larger values reflect the informative parts of the secondary input; $M_I$ can be expressed as:
$$M_I=\sigma\left(c_{1\times 1}(I^M)\right) \tag{8}$$
where $c_{1\times 1}$ is a 1 × 1 convolutional layer and $\sigma$ is the sigmoid activation function. The primary and secondary inputs are summed to obtain the feature $I_S$. The output of Cross-Attention Fusion (XAF) is $I_S$ plus a convolution of $I_S$ modulated by the attention mask $M_I$:
$$I_O=I_S+c_{3\times 3}\left(M_I\otimes I_S\right) \tag{9}$$
Here, $\otimes$ stands for element-wise multiplication, $I_S$ denotes $(I_a+I_b)$, and $c_{3\times 3}$ is the 3 × 3 convolutional layer.
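A minimal PyTorch sketch of XAF as described by Equations (6)-(9) is given below. The layer widths, the use of nearest-neighbor upsampling, and the additive fusion of the two pooled strips are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XAF(nn.Module):
    """Cross-Attention Fusion sketch: primary = low-light, secondary = exposure-corrected."""
    def __init__(self, channels=3):
        super().__init__()
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv3x3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, primary, secondary):
        n, c, h, w = secondary.shape
        # Strip pooling of the secondary input (Equations (6) and (7)).
        horiz = secondary.mean(dim=3, keepdim=True)           # (N, C, H, 1)
        vert = secondary.mean(dim=2, keepdim=True)            # (N, C, 1, W)
        # Upsample both strips back to (H, W) and fuse them into I^M.
        fused = F.interpolate(horiz, size=(h, w)) + F.interpolate(vert, size=(h, w))
        mask = torch.sigmoid(self.conv1x1(fused))             # attention mask M_I (Equation (8))
        s = primary + secondary                                # I_S = I_a + I_b
        return s + self.conv3x3(mask * s)                      # Equation (9)
```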
Channel Attention Fusion (CAF): After XAF, in order to fuse global information in the channel dimension, we use a channel attention mechanism to integrate the channel information. First, the outputs $I_U$ and $I_D$ of the two Cross-Attention Fusion (XAF) branches are merged to obtain $Q_S$, and the channel information of $Q_S$ is extracted using global average pooling (GAP). Next, a fully connected layer processes $Q_S$ to generate adaptive selection weights $(X_A, X_B)=fc(Q_S)$. The Softmax function converts the adaptive selection weights into attention weights $(X_a, X_b)$ as follows:
$$X_a=\frac{e^{X_A}}{e^{X_A}+e^{X_B}},\qquad X_b=\frac{e^{X_B}}{e^{X_A}+e^{X_B}} \tag{10}$$
Here, $(X_A, X_B)=fc(Q_S)$ denotes the adaptive selection weights, and $(X_a, X_b)$ denotes the attention weights of $I_U$ and $I_D$, respectively. Applying the attention weights $(X_a, X_b)$ to $I_U$ and $I_D$ yields the final feature map $E$:
$$E=I_U\cdot X_a+I_D\cdot X_b \tag{11}$$
Here, $X_a+X_b=[1,1,\dots,1]\in\mathbb{R}^C$ and $E=[E_1,E_2,\dots,E_C]$. Figure 4 shows the architecture of Channel Attention Fusion.
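A corresponding PyTorch sketch of CAF is shown below; the hidden width of the fully connected layer (a reduction factor of 4) is an assumption.

```python
import torch
import torch.nn as nn

class CAF(nn.Module):
    """Channel Attention Fusion sketch implementing Equations (10) and (11)."""
    def __init__(self, channels=3, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),       # selection weights for the two branches
        )

    def forward(self, i_u, i_d):
        n, c, _, _ = i_u.shape
        q = (i_u + i_d).mean(dim=(2, 3))           # merge branches, then GAP -> (N, C)
        weights = self.fc(q).view(n, 2, c)         # adaptive selection weights (X_A, X_B)
        attn = torch.softmax(weights, dim=1)       # per channel, X_a + X_b = 1
        x_a = attn[:, 0].view(n, c, 1, 1)
        x_b = attn[:, 1].view(n, c, 1, 1)
        return i_u * x_a + i_d * x_b               # E = I_U * X_a + I_D * X_b
```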

3.3. Loss Function

Essentially, our network processes the Y, Cb, and Cr channels of the image separately, and our loss function is the mean squared error between the estimated output and the true label in each of the Y, Cb, and Cr channels.
$$L=M(E_Y,G_Y)+\lambda M(E_{Cb},G_{Cb})+\lambda M(E_{Cr},G_{Cr}) \tag{12}$$
Here, $M$ denotes the mean squared error loss function $MSE(\cdot)$; $G_Y$, $G_{Cb}$, and $G_{Cr}$ denote the Y, Cb, and Cr channels of the real label; $E_Y$, $E_{Cb}$, and $E_{Cr}$ denote the Y, Cb, and Cr channels of the enhanced image; and $\lambda$ is a hyperparameter with a value of 0.5.
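Equation (12) corresponds directly to the following PyTorch sketch, assuming (N, 3, H, W) tensors in YCbCr channel order.

```python
import torch.nn.functional as F

def ycbcr_mse_loss(enhanced, label, lam=0.5):
    """Channel-wise MSE loss of Equation (12): Y term plus lambda-weighted Cb/Cr terms."""
    loss_y  = F.mse_loss(enhanced[:, 0], label[:, 0])
    loss_cb = F.mse_loss(enhanced[:, 1], label[:, 1])
    loss_cr = F.mse_loss(enhanced[:, 2], label[:, 2])
    return loss_y + lam * (loss_cb + loss_cr)
```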

4. Experiment

4.1. Experimental Implementation Details

The experimental environment is an Ubuntu 18.04 operating system with a TITAN RTX graphics card with 24 GB of video RAM, an Intel Xeon(R) Silver 4210 CPU, and 376 GB of RAM. The algorithm was implemented in PyTorch 0.4.0 using the ADAM optimizer; the initial learning rate was set to 0.0001 and was decayed by a factor of 0.5 every 200 epochs.
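This schedule maps onto a standard optimizer/scheduler pair, sketched below with a placeholder network; it is our reading of the settings above, not the authors' training script.

```python
import torch

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)  # placeholder for EFCANet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Halve the learning rate every 200 epochs, as described above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
```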
We use the public open-source LOL (Low Light) [1] and Brightening Train [1] datasets, both of which have real labels. LOL and Brightening Train together contain 1500 images. The data loader randomly selects 1300 images from the original dataset as the training set, 100 images as the validation set, and 100 images as the test set, with no intersection between the three. We choose eight public datasets to test the network performance. The datasets with real labels include the LOL dataset, the MIT [32] dataset, and the SICE (Single Image Contrast Enhancer) [33] dataset. The test sets without real labels include the MEF dataset [34], the DICM dataset [35], the VV dataset [36], the NPE dataset [37], and the LIME dataset [38].
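The 1300/100/100 split can be reproduced with the current torch.utils.data API, as in the sketch below; the tiny placeholder dataset and the fixed seed are assumptions for illustration only.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Tiny placeholder standing in for the 1500 LOL + Brightening Train image pairs.
full_dataset = TensorDataset(torch.zeros(1500, 3, 8, 8), torch.zeros(1500, 3, 8, 8))
train_set, val_set, test_set = random_split(
    full_dataset, [1300, 100, 100],
    generator=torch.Generator().manual_seed(0),  # assumed seed; the paper does not state one
)
```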

4.2. Subjective Evaluation

We chose 14 popular low-light image enhancement methods to compare with our proposed method. The 14 compared methods include: Retinex-theory-based methods: SRIE [15] (2016), JED [9] (2018), SDD [39] (2020), and RetinexDIP [18] (2021); methods based on image fusion: BIMEF [14] (2017), FFM [40] (2019), and EFF [13] (2017); and deep-learning-based methods: RetinexNet [1] (2018), KinD [2] (2019), EnGAN [3] (2020), DLN [19] (2020), DRBN [21] (2020), Zero-DCE [22] (2020), and Zero-DCE++ [4] (2021). In this section, we analyze the visual effects of the different methods on different datasets.
First, we compare the proposed method with the methods based on Retinex theory, as shown in Figure 5. Among the Retinex-based methods, we chose SRIE [15], JED [9], SDD [39], and RetinexDIP [18]. We randomly selected four low-light images from the LOL and Brightening Train test datasets to demonstrate the enhanced visual effects of the four methods, as shown in Figure 5. Figure 5a shows the input low-light images. The darker regions of the images produced by the SRIE method are not well exposed; as in the second column, the book bag and other objects in the first image have low luminance, and individual objects cannot be clearly distinguished. The JED method performs noise smoothing during the enhancement process, so the resulting image has less noise and uniform luminance. However, due to excessive smoothing, some texture information is lost, resulting in blurred images; for example, the outline of the building in the third image in the second row is defocused. The brightness of the image enhanced by the SDD method is not sufficiently increased, the image tones are deepened, and the details of the image are not well recovered. Figure 5e shows the images produced by the RetinexDIP method: the overall brightness is not uniform, the color information is not accurate, and the images are prone to noise. For example, in the fifth column, the color of the sky should be blue, but it appears gray-blue after RetinexDIP enhancement. Figure 5f shows the enhancement results of the algorithm proposed in this paper. Our method improves the brightness of the image while maintaining its true colors, and there are no artifacts or image noise. Figure 5g shows the real label, and it can be seen that the enhancement result of our method is highly consistent with the label.
In addition to the methods based on Retinex theory, we also compare the proposed method with the image-fusion-based low-light image enhancement methods BIMEF [14], FFM [40], and EFF [13]. We randomly selected six low-light images from the SICE [33] dataset, including outdoor buildings, outdoor scenery, and indoor scenes, and the results are shown in Figure 6. Figure 6a shows the original low-light images, and Figure 6b shows that the brightness of the images processed with the BIMEF method is not sufficiently enhanced, as if a layer of gray enveloped the images. Figure 6c,d shows the enhancement results of EFF and FFM, respectively. It can be observed that the brightness and sharpness of the images are greatly improved, but some areas are still not fully exposed; as in the fourth image in the second row, the stone columns in the gallery hall are still dark. Figure 6e shows the enhancement results of our method, compared with the real label in Figure 6f. Our image quality is substantially improved, the darker areas of the image become uniform in brightness, and the color information of the whole image is accurate.
In Figure 7, we compare CNN-based low-light image enhancement algorithms: RetinexNet [1], KinD [2], EnGAN [3], DLN [41], Zero-DCE [22], DRBN [21], and Zero-DCE++ [4]. We randomly selected six images from the MIT [32] dataset for visual comparison, and the results are shown in Figure 7. From the visual observation, the overall quality of the enhanced images is good, but some images still show defects. Both RetinexNet and KinD are convolutional neural networks based on Retinex theory, and both effectively improve the brightness of images. However, their enhancement processes only enhance the light component, causing inaccurate color information and vignetting of object edges; for example, the blue dress in the second image of Figure 7b looks blurred because the vividness is excessively increased, and the lawn in the fourth picture of Figure 7c completely loses its original green color. Figure 7d shows the result of the weakly supervised EnGAN, whose brightness and color are more balanced, although the sharpness of the image needs improvement. Figure 7e shows the enhancement result of DLN; the overall effect is good, but white color appears in the sky of the fifth image. The images enhanced by the methods in Figure 7g,h are still relatively dark. Zero-DCE and Zero-DCE++ are unsupervised learning neural networks. On the whole, the images enhanced by Zero-DCE have uniform colors, but dark areas remain due to the lack of guidance from real labels during network learning, while the results of Zero-DCE++ appear overexposed in some regions. Figure 7j shows the enhancement results of the method in this paper; the outlines of objects are clear and the details are complete.
Figure 8 shows the enhancement effects of different enhancement methods on the LOL dataset. We randomly selected 2 images from the LOL dataset to compare the visual effects of the 14 methods. In addition, some regions are selected for magnification display. From the figure, we can observe that the images enhanced by SRIE and SDD are not sufficiently exposed during the enhancement process, resulting in low brightness of the whole image and obvious dark areas. For example, the brightness of the bowling ball area and the face of the person enhanced by SDD is very low. The RetinexNet method causes the image to be severely sharpened, and at the same time, the image appears with chromatic aberration and a lot of noise. However, the images enhanced by JED, BIMEF, DRBN, and KinD have a more uniform luminance and color distribution from the overall view, but there is still noise in the images. The DLN method causes the face of the person to be overexposed. The results of RetinexNet and RetinexDIP enhancement have a lot of noise. The unsupervised methods Zero-DCE and Zero-DCE++ enhance the results with uniform luminance, but the images have uneven tones. In general, compared with the real image, the enhanced image of our proposed method has appropriate brightness, no obvious dark areas, rich and natural color without distortion, and better preservation of image detail information.
Figure 9 shows the enhancement effect of different methods on the cloudy-weather images in the NPE dataset. In everyday observation, the sky in cloudy or rainy weather appears gray, and it is very important to preserve the hue of this particular scene during image enhancement. Overexposed areas can be observed in the results of the EnGAN, DLN, and DRBN methods, causing severe loss of image information; for example, dark clouds in the sky are enhanced to white or light green. The brightness of the middle part of the SRIE-enhanced image is uniform, but black artifacts exist in the bottom edge region of the images enhanced by JED, RetinexDIP, KinD, and RetinexNet. The results of Zero-DCE are not sufficiently exposed, and there is chromatic aberration and noise in the color of the green lawn. For the image-fusion-based methods BIMEF, EFF, and FFM, the overall enhancement results are very similar, but the brightness of the image is still relatively dark. Zero-DCE++ produces an image with a grayish-purple sky, losing the original gray tone. Figure 9p shows the enhancement results of our proposed method, which outperforms the other methods in terms of both the brightness of the image and the naturalness of the color information.
Figure 10 shows the enhancement effects of different methods on the outdoor landscape images in the MEF dataset. The original images have low brightness because they were captured against backlight. We use the 14 comparison methods mentioned above and the method in this paper to enhance the images. The original images are rich in color and contain diverse object textures and shapes, and this information needs to be preserved during enhancement. It can be seen that SRIE enhancement is not effective, and the image is not fully exposed. There are some dark areas in the results of BIMEF, EFF, FFM, KinD, EnGAN, and DLN, and the color information of the enhanced images is not accurate; for example, the color of the lawn is deepened and the hue of the flowers is diminished. Figure 10n shows the enhancement result of DRBN, in which the sky area is over-enhanced and the white clouds in the original image have completely disappeared. The unsupervised methods Zero-DCE and Zero-DCE++ suffer from color inaccuracy due to the lack of label guidance; as in Figure 10m,o, the color of the sky appears bluish purple. Figure 10p shows the enhanced image of our method, which has a good balance of color and luminance compared with the other methods.
Figure 11 shows the enhancement effects of different methods on the indoor scenes in the DICM dataset, again using the 14 comparison methods mentioned above and the method in this paper. The original images have rich colors and diverse object textures and shapes, and this information needs to be preserved during enhancement. It can be seen that the brightness of the image generated by SRIE is low and the contrast is poor, and the brightness distribution in the results of BIMEF, FFM, KinD, and Zero-DCE is not uniform. Figure 11p shows the image enhanced by our proposed method, which effectively enhances the contrast and the details in dark areas compared with the other methods.
Figure 12 shows the enhancement effect of each method on the night scene images in the LIME dataset. The images generated by the RetinexNet and DLN methods are blurred and have inaccurate color information. The images generated by BIMEF, JED, FFM, and SDD are clear, with distinct object outlines, but the overall brightness needs to be improved. The images generated by EnGAN are overexposed in some areas. The images recovered by Zero-DCE have uniform colors, but the brightness is dark. The brightness of the DRBN-enhanced image is low, and some regions are still not clearly visible. The Zero-DCE++-enhanced image achieves a better balance of brightness and color. Compared with the above methods, the images enhanced by the method in this paper have uniform color and brightness and clear object outlines.

Objective Evaluation

In Section 4.2, we subjectively analyzed the enhancement effects of the other methods and the method in this paper. The subjective analysis is relatively intuitive, but sufficient statistical data are still needed to illustrate the superiority of our method. Therefore, we use objective metrics for a quantitative analysis of image quality. First, we measure the performance of each method in Figure 5, Figure 6 and Figure 7 with objective metrics. Then, we objectively evaluate the performance of each method on the complete LOL [1], MIT [32], NPE [37], MEF [34], VV [36], DICM [35], and LIME [38] datasets. Finally, we analyze the objective evaluation values. For the datasets with real labels, LOL [1] and MIT [32], we use six metrics: PSNR, SSIM [42], BRISQUE [43], NIQE [44], PI [45], and RMSE [45]. For the datasets without real labels, MEF [34], VV [36], DICM [35], NPE [37], and LIME [38], we use the no-reference indicators BRISQUE [43] and NIQE [44].
PSNR is based on the image pixel statistics, which calculates the difference between the grayscale value of the image and the corresponding pixel of the real label. The larger the PSNR value is, the better the noise immunity of the evaluated image is and the better the image quality is.
SSIM is based on structural information and measures the structural similarity between the evaluated image and the real label image from three perspectives: brightness, contrast, and structure, respectively; larger SSIM values indicate better image recovery and lower distortion.
The PI value represents the subjective perceptual quality of an image: the lower the PI value, the more realistic the image looks and the better its perceptual quality.
The RMSE value measures the deviation between the algorithm score value and the subjective score value, and the smaller the value, the better the performance of the algorithm.
BRISQUE is a reference-free spatial-domain image quality assessment metric, which extracts contrast-normalized coefficients from the image, extracts the fitted Gaussian distribution features, and inputs them into an SVM regressor to obtain the image quality assessment result. The smaller the BRISQUE value, the better the image quality recovery.
NIQE is also a reference-free image quality assessment metric, which uses a natural scene statistical model and a multivariate Gaussian model to calculate the feature distance of the image. A smaller value of NIQE indicates a higher quality of the network-enhanced image.
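For the full-reference metrics, a scikit-image-based sketch is shown below (assuming a recent scikit-image version and uint8 RGB inputs); the no-reference metrics NIQE, BRISQUE, and PI rely on pretrained natural-scene statistics models and are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(enhanced, label):
    """Compute PSNR, SSIM, and RMSE between an enhanced image and its label (uint8 RGB)."""
    psnr = peak_signal_noise_ratio(label, enhanced, data_range=255)
    ssim = structural_similarity(label, enhanced, channel_axis=-1, data_range=255)
    diff = label.astype(np.float64) - enhanced.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))
    return psnr, ssim, rmse
```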
Table 1 shows the metric values of the enhanced images for the different methods in Figure 5. Our method and the JED method achieve better results for PSNR and SSIM metrics, which indicates that the enhanced images contain more detailed content and structural information. Compared with the second best metrics, our PI, RMSE, NIQE, and BRISQUE are reduced by 0.7374, 42.5141, 0.9203, and 3.4586, respectively.
Table 2 shows the metric values of the enhanced images for the different methods in Figure 6. The EFF method ranks second in the PSNR, SSIM, PI, and RMSE metrics, while the BIMEF method ranks second in the NIQE and BRISQUE metrics. EFF studies the relationship between two maps with different exposure levels, uses illumination estimation to estimate the exposure-rate map, and adjusts each pixel to normal exposure according to the estimated map. BIMEF [14] is a low-light image enhancement algorithm based on Multiple Exposure Fusion (MEF). Unlike EFF and BIMEF, we use cross-attention fusion to fuse the two images, drawing on both global and local features. Compared with EFF and BIMEF, our method achieves the best results in all six metrics.
Table 3 shows the metric values of the enhanced images for the different methods in Figure 7. As can be seen from the table, our method achieves good results in each metric compared with the state-of-the-art methods.
In the performance test of the network, we quantitatively evaluated the LOL [1] dataset using six metrics: PSNR, SSIM, PI, RMSE, BRISQUE, and NIQE; the evaluation results are shown in Table 4. The LOL dataset contains images that are rich in detail and complex. In terms of PSNR and SSIM, the EFF and Zero-DCE++ methods are ahead of the other compared methods. The GAN-based method EnlightenGAN lacks the guidance of real labels in the learning process, so it does not perform as well as our method in terms of image quality. JED, FFM, and BIMEF use fusion strategies to enhance images but still cannot fully fuse global and local information, so the quality of their enhanced images is poor and the metrics are not satisfactory. We use double attention to fuse the images, focusing on the balance of global and local information. As can be seen from the table, our proposed method achieves the best performance on the LOL dataset, which also shows that our method better enhances the texture details of the images.
Table 5 shows the objective metric values of the different methods on the whole MIT [32] dataset. We used PSNR, SSIM, PI, RMSE, BRISQUE, and NIQE to evaluate the various enhancement methods. Our method achieved a PSNR of 25.6867 and an SSIM of 0.9197. In terms of the PSNR and RMSE metrics, EFF achieved the second-best results; EFF fuses two images with different exposure levels, and the quality of its enhancement results is easily affected by the fusion scheme, so its PSNR is lower than ours. For the no-reference metrics PI and NIQE, EnGAN and DRBN also obtain good results. Nevertheless, our method achieves the best results on all the metrics, which demonstrates its effectiveness.
For the five datasets without real labels, MEF, DICM, NPE, VV, and LIME, we objectively compared the methods on the complete datasets using the BRISQUE and NIQE quality evaluation metrics, and the results are shown in Table 6. EnGAN outperforms the other compared methods on the MEF, DICM, VV, and LIME datasets. EnGAN uses a double discriminator to guide the learning of global and local information without the supervision of real labels, which is prone to detail loss. Despite the absence of reference labels in these datasets, the statistics in Table 6 show that our method outperforms the 14 other low-light image enhancement methods in terms of both the spatial information and the naturalness of the image.

4.3. Ablation Experiments

The EFCANet method enhances the image by using cross-attention fusion units in a progressive, iterative manner. Therefore, we perform ablation experiments to find the optimal number of iterations. We trained models with one to seven iterations of the cross-attention fusion unit. The enhanced images obtained with different numbers of iterations are shown in Figure 13. It can be observed that when the number of iterations is one, the clarity of the enhanced image is relatively poor. When the number of iterations is two, the hue information of the image is inaccurate, the image becomes green, and the objects in the scene are blurred. When the number of iterations is three, the enhancement effect is obviously improved, but the scene has more artifacts. We increased the number of iterations to four, five, six, and seven, and found that the image contrast and brightness were best when the number of iterations was four. In addition, we objectively evaluated the enhancement results for different numbers of iterations using the BRISQUE and NIQE metrics, and Table 7 shows the corresponding values. Smaller BRISQUE and NIQE values indicate better results. As can be seen from the table, the results after four iterations have the lowest BRISQUE and NIQE values. Therefore, we determined that the optimal number of iterations is four.
In order to obtain a clearer picture of the contribution of each module, we perform four sets of ablation experiments on the network model. In the ablation experiments, 1500 images from RetinexNet [1] were used for training. The image size, learning rate, number of epochs, and number of recursions were the same for the four sets of experiments. We randomly selected three images from the DICM and MEF datasets to compare the visual effects of the enhanced images from the different ablation experiments, which are shown in Figure 14.
In the first set of ablation experiments (b), we remove the exposure image generator, and the next cross-attention fusion mechanism is input with two low-light images. The experimental results in Figure 14b show that the model lacks exposure-corrected images, fusing only the two low-light images does not achieve the desired effect, and the enhanced images clearly have low contrast and luminance.
In the second set of ablation experiments (c), we do not perform the color space conversion, so that the images are always kept in the RGB color space. Figure 14c shows that the color of the algorithm-enhanced image appears grayish purple and has low sharpness and uneven brightness due to the lack of color and detail information.
In the third set of ablation experiments (d), we remove the XAF in CAFM and keep only the channel fusion attention to focus on the feature information within the channel. Compared with Figure 14c, the color information and image brightness of this row of images have a large improvement. However, the richness and comprehensiveness of information is reduced due to the lack of fusion of low-light image information and exposure-corrected image information, as well as the reduction in global information, resulting in blurring of the enhanced image and some regions being over-enhanced.
In the fourth set of ablation experiments (e), in contrast to the third set, we remove the CAF branch from CAFM and retain only the cross-attention fusion (XAF). Compared with Figure 14d, the image quality improves significantly, but the brightness of the enhanced image remains non-uniform owing to the lack of local information within each channel.
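Similarly, the sketch below illustrates cross-attention between two feature maps, with queries taken from the low-light features and keys/values from the exposure-corrected features; the single-head formulation and 1 x 1 projections are assumptions for illustration, not the exact XAF design, and the attention map is quadratic in the number of spatial positions.

# Cross-attention sketch: queries from one stream, keys/values from the other.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, low_feat, exp_feat):
        n, c, h, w = low_feat.shape
        q = self.q(low_feat).flatten(2).transpose(1, 2)   # (N, HW, C)
        k = self.k(exp_feat).flatten(2)                   # (N, C, HW)
        v = self.v(exp_feat).flatten(2).transpose(1, 2)   # (N, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)    # (N, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(n, c, h, w)
        return low_feat + out                             # residual fusion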
Figure 14f shows the enhancement results of the proposed method. Compared with the first four sets of experiments, the full model performs better in terms of contrast, color balance, brightness uniformity, and detail texture recovery, which confirms the rationality and effectiveness of the proposed design.

5. Conclusions

In this study, we first analyze the problems of existing image-fusion-based low-light image enhancement methods. To address these problems, we propose an exposure fusion cross-attention network (EFCANet) for low-light image enhancement. A single low-light image is input to the model, and an exposure-corrected image is generated by the exposure image generator (EIG). EFCANet enhances low-light images in a progressive recursive manner, with a single Cross-Attention Fusion Module (CAFM) acting as the recursive unit and a total of four recursive units. To maintain the balance of luminance and color, each recursive unit fuses the exposure-corrected image and the low-light image in the YCbCr color space. We evaluate the enhanced images both qualitatively and quantitatively, and the effectiveness and rationality of the network structure are verified by ablation experiments. In the experimental analysis, EFCANet outperforms the compared methods on several metrics. Although EFCANet achieves good enhancement results, it currently handles only static images and does not support real-time or video enhancement. In future work, we aim to improve the real-time performance of the algorithm so that it can be deployed in applications with real-time requirements, such as autonomous driving.

Author Contributions

Conceptualization, J.L.; methodology, Z.Y.; software, F.L.; validation, Z.Y.; formal analysis, J.L.; investigation, J.L.; resources, F.L.; data curation, F.L.; writing—original draft preparation, Z.Y.; visualization, F.L.; supervision, J.L.; project administration, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Natural Science Foundation of China (62272281, 61972235), Shandong Natural Science Foundation of China (ZR2021MF107, ZR2022MA076), Yantai science and technology innovation development plan (2022JCYJ031).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Wei, C.; Wang, W.; Yang, W.; Liu, J. Deep retinex decomposition for low-light enhancement. arXiv 2018, arXiv:1808.04560.
2. Zhang, Y.; Zhang, J.; Guo, X. Kindling the darkness: A practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 1632–1640.
3. Jiang, Y.; Gong, X.; Liu, D.; Cheng, Y.; Fang, C.; Shen, X.; Yang, J.; Zhou, P.; Wang, Z. EnlightenGAN: Deep light enhancement without paired supervision. IEEE Trans. Image Process. 2021, 30, 2340–2349.
4. Li, C.; Guo, C.; Chen, C.L. Learning to enhance low-light image via zero-reference deep curve estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4225–4238.
5. Thomas, G.; Manickavasagan, A.; Khriji, L.; Al-Yahyai, R. Contrast enhancement using brightness preserving histogram equalization technique for classification of date varieties. J. Eng. Res. 2014, 11, 55–63.
6. Land, E.H. The retinex theory of color vision. Sci. Am. 1977, 237, 108–129.
7. Jobson, D.J.; Rahman, Z.-u.; Woodell, G.A. Properties and performance of a center/surround retinex. IEEE Trans. Image Process. 1997, 6, 451–462.
8. Jobson, D.J.; Rahman, Z.-U.; Woodell, G.A. A multiscale retinex for bridging the gap between color images and the human observation of scenes. IEEE Trans. Image Process. 1997, 6, 965–976.
9. Ren, X.; Li, M.; Cheng, W.-H.; Liu, J. Joint enhancement and denoising method via sequential decomposition. In Proceedings of the 2018 IEEE International Symposium on Circuits and Systems (ISCAS), Florence, Italy, 27–30 May 2018; pp. 1–5.
10. Sharifi, A.; Amini, J.; Sumantyo, J.T.S.; Tateishi, R. Speckle reduction of PolSAR images in forest regions using fast ICA algorithm. J. Indian Soc. Remote Sens. 2015, 43, 339–346.
11. Sharifi, A.; Amini, J. Forest biomass estimation using synthetic aperture radar polarimetric features. J. Appl. Remote Sens. 2015, 9, 097695.
12. Nejati, M.; Karimi, M.; Soroushmehr, S.R.; Karimi, N.; Samavi, S.; Najarian, K. Fast exposure fusion using exposedness function. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 2234–2238.
13. Ying, Z.; Li, G.; Ren, Y.; Wang, R.; Wang, W. A new low-light image enhancement algorithm using camera response model. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 3015–3022.
14. Ying, Z.; Li, G.; Gao, W. A bio-inspired multi-exposure fusion framework for low-light image enhancement. arXiv 2017, arXiv:1711.00591.
15. Fu, X.; Zeng, D.; Huang, Y.; Zhang, X.-P.; Ding, X. A weighted variational model for simultaneous reflectance and illumination estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2782–2790.
16. Zhuang, P.; Li, C.; Wu, J. Bayesian retinex underwater image enhancement. Eng. Appl. Artif. Intell. 2021, 101, 104171.
17. Al-Ameen, Z. Nighttime image enhancement using a new illumination boost algorithm. IET Image Process. 2019, 13, 1314–1320.
18. Zhao, Z.; Xiong, B.; Wang, L.; Ou, Q.; Yu, L.; Kuang, F. RetinexDIP: A unified deep framework for low-light image enhancement. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 1076–1088.
19. Wang, L.-W.; Liu, Z.-S.; Siu, W.-C.; Lun, D.P. Lightening network for low-light image enhancement. IEEE Trans. Image Process. 2020, 29, 7984–7996.
20. Wang, W.; Wei, C.; Yang, W.; Liu, J. GLADNet: Low-light enhancement network with global awareness. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 751–755.
21. Yang, W.; Wang, S.; Fang, Y.; Wang, Y.; Liu, J. From fidelity to perceptual quality: A semi-supervised approach for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3063–3072.
22. Guo, C.; Li, C.; Guo, J.; Loy, C.C.; Hou, J.; Kwong, S.; Cong, R. Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1780–1789.
23. Li, J.; Feng, X.; Hua, Z. Low-light image enhancement via progressive-recursive network. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4227–4240.
24. Mertens, T.; Kautz, J.; Van Reeth, F. Exposure fusion. In Proceedings of the 15th Pacific Conference on Computer Graphics and Applications (PG’07), Washington, DC, USA, 29 October–2 November 2007; pp. 382–390.
25. Raman, S.; Chaudhuri, S. Bilateral filter based compositing for variable exposure photography. In Eurographics (Short Papers); The Eurographics Association: London, UK, 2009; pp. 1–4.
26. Prabhakar, K.R.; Srikar, V.S.; Babu, R.V. DeepFuse: A deep unsupervised approach for exposure fusion with extreme exposure image pairs. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4714–4722.
27. Huang, S.-W.; Peng, Y.-T.; Chen, T.-H.; Yang, Y.-C. Two-exposure image fusion based on cross attention fusion. In Proceedings of the 2021 55th Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 31 October–3 November 2021; pp. 867–872.
28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008.
29. Zhang, C.; Yan, Q.; Zhu, Y.; Li, X.; Sun, J.; Zhang, Y. Attention-based network for low-light image enhancement. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
30. Hou, Q.; Zhang, L.; Cheng, M.-M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012.
31. Yin, J.-L.; Chen, B.-H.; Peng, Y.-T.; Tsai, C.-C. Deep prior guided network for high-quality image fusion. In Proceedings of the 2020 IEEE International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6.
32. Bychkovsky, V.; Paris, S.; Chan, E.; Durand, F. Learning photographic global tonal adjustment with a database of input/output image pairs. In Proceedings of the CVPR 2011, Colorado Springs, CO, USA, 20–25 June 2011; pp. 97–104.
33. Cai, J.; Gu, S.; Zhang, L. Learning a deep single image contrast enhancer from multi-exposure images. IEEE Trans. Image Process. 2018, 27, 2049–2062.
34. Ma, K.; Zeng, K.; Wang, Z. Perceptual quality assessment for multi-exposure image fusion. IEEE Trans. Image Process. 2015, 24, 3345–3356.
35. Lee, C.; Lee, C.; Kim, C.-S. Contrast enhancement based on layered difference representation of 2D histograms. IEEE Trans. Image Process. 2013, 22, 5372–5384.
36. Vonikakis, V.; Kouskouridas, R.; Gasteratos, A. On the evaluation of illumination compensation algorithms. Multimed. Tools Appl. 2018, 77, 9211–9231.
37. Wang, S.; Zheng, J.; Hu, H.-M.; Li, B. Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Trans. Image Process. 2013, 22, 3538–3548.
38. Guo, X.; Li, Y.; Ling, H. LIME: Low-light image enhancement via illumination map estimation. IEEE Trans. Image Process. 2016, 26, 982–993.
39. Hao, S.; Han, X.; Guo, Y.; Xu, X.; Wang, M. Low-light image enhancement with semi-decoupled decomposition. IEEE Trans. Multimed. 2020, 22, 3025–3038.
40. Dai, Q.; Pu, Y.-F.; Rahman, Z.; Aamir, M. Fractional-order fusion model for low-light image enhancement. Symmetry 2019, 11, 574.
41. Kou, F.; Li, Z.; Wen, C.; Chen, W. Multi-scale exposure fusion via gradient domain guided image filtering. In Proceedings of the 2017 IEEE International Conference on Multimedia and Expo (ICME), Hong Kong, China, 10–14 July 2017; pp. 1105–1110.
42. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
43. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708.
44. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212.
45. Blau, Y.; Michaeli, T. The perception-distortion tradeoff. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6228–6237.
Figure 1. Comparison between state-of-the-art methods and ours. RetinexNet (2018) [1], KinD (2019) [2], EnGAN (2020) [3], Zero-DCE++ (2021) [4].
Figure 2. Overall structure of EFCANet. CAFM is Cross-Attention Fusion Module, and XAF is cross-attention fusion. CAF is Channel Attention Fusion.
Figure 3. Details of the tone mapping process.
Figure 4. CAFM is Cross-Attention Fusion Module, XAF is cross-attention fusion, and CAF is Channel Attention Fusion. GAP denotes global average pooling; FC Layer is fully connected layer.
Figure 5. Example of comparison with Retinex-theory-based low-light image enhancement method on the LOL [1] dataset.
Figure 6. Example of comparison with a low-light image enhancement method based on exposure image fusion on the SICE (Single Image Contrast Enhancer) [33] dataset.
Figure 7. Example of comparison with CNN-based low-light image enhancement methods on the MIT [32] dataset.
Figure 8. Example of comparison with 14 low-light image enhancement methods on the LOL [1] dataset.
Figure 9. Visual effect of the cloudy weather images in the NPE [37] dataset after being enhanced.
Figure 10. Comparison of the visual effects of the enhanced outdoor building images in the MEF [34] dataset.
Figure 11. Comparison of the visual effects of the enhanced indoor images in the DICM [35] dataset.
Figure 12. Comparison of the visual effect of the enhanced night view images in the LIME [38] dataset.
Figure 13. Comparison of visual effects with a different number of iterations.
Figure 14. Comparison of visual effects of ablation experiments. (a) input image, (b) no exposure image generator, (c) no conversion to YCbCr color space, (d) no cross-fusion attention used, (e) no channel fusion attention used, and (f) the method of this paper.
Table 1. Average performance of different enhancement methods in Figure 5.
Method | SRIE | JED | SDD | RetinexDIP | Ours
PSNR↑ | 13.4335 | 14.8575 | 14.1392 | 11.2798 | 22.2035
SSIM↑ | 0.5558 | 0.6847 | 0.6528 | 0.6019 | 0.8820
PI↓ | 5.1526 | 4.8825 | 4.7763 | 4.4450 | 3.7076
RMSE↓ | 69.3866 | 62.608 | 64.2121 | 67.2756 | 20.0939
NIQE↓ | 6.7049 | 5.9012 | 5.5274 | 7.0337 | 4.6071
BRISQUE↓ | 39.1997 | 23.4526 | 26.9602 | 28.6887 | 19.9940
Table 2. Average performance of different enhancement methods in Figure 6.
Method | BIMEF | EFF | FFM | Ours
PSNR↑ | 13.1149 | 14.0921 | 12.7239 | 21.7908
SSIM↑ | 0.7410 | 0.7438 | 0.7104 | 0.8095
PI↓ | 2.8376 | 2.7542 | 2.9721 | 2.6137
RMSE↓ | 49.7947 | 47.0027 | 52.8759 | 41.9082
NIQE↓ | 3.4503 | 3.5421 | 3.5096 | 3.2326
BRISQUE↓ | 9.1046 | 9.5589 | 14.4858 | 13.8614
Table 3. Average performance of the different enhancement methods in Figure 7.
Method | PSNR↑ | SSIM↑ | PI↓ | RMSE↓ | NIQE↓ | BRISQUE↓
RetinexNet (2018) | 15.6477 | 0.7897 | 3.6348 | 36.7010 | 6.1259 | 32.9049
KinD (2019) | 18.6718 | 0.8046 | 2.8930 | 40.1974 | 4.0720 | 28.5728
EnGAN (2020) | 11.2798 | 0.6019 | 4.4450 | 67.2756 | 7.0337 | 28.6887
DLN (2020) | 17.0217 | 0.5277 | 2.7925 | 40.5834 | 4.2376 | 23.3972
Zero-DCE (2020) | 17.6574 | 0.8098 | 3.0649 | 30.4134 | 4.8660 | 25.1409
SDD (2020) | 18.6515 | 0.8095 | 3.0368 | 33.4495 | 4.6589 | 29.1570
DRBN (2020) | 17.3580 | 0.8480 | 3.7196 | 30.1505 | 4.5263 | 23.9609
Zero-DCE++ (2021) | 16.4965 | 0.5538 | 4.0115 | 35.7106 | 4.1429 | 19.1223
RetinexDIP (2021) | 17.2111 | 0.7544 | 3.9552 | 29.4634 | 4.5509 | 25.3002
Ours | 23.8017 | 0.8746 | 2.7460 | 24.7726 | 4.1447 | 18.7094
Table 4. Objective evaluation results of the six indicators PSNR, SSIM, PI, RMSE, NIQE, and BRISQUE for the LOL [1] dataset.
Method | PSNR↑ | SSIM↑ | PI↓ | RMSE↓ | NIQE↓ | BRISQUE↓
SRIE (2016) | 14.7860 | 0.6014 | 5.0187 | 47.2893 | 6.5299 | 37.6744
BIMEF (2017) | 18.5550 | 0.7233 | 4.5758 | 35.9697 | 7.7284 | 30.0861
EFF (2017) | 20.4500 | 0.6193 | 5.1562 | 24.7017 | 9.2437 | 45.5565
RetinexNet (2018) | 14.5686 | 0.2990 | 5.2698 | 29.7274 | 9.4277 | 33.8642
JED (2018) | 18.0634 | 0.7725 | 4.5566 | 36.6428 | 5.7988 | 29.4081
FFM (2019) | 17.6911 | 0.7680 | 4.6958 | 36.4465 | 6.9621 | 28.7708
KinD (2019) | 16.5800 | 0.5765 | 3.5422 | 47.0003 | 4.9219 | 28.9662
EnGAN (2020) | 10.5615 | 0.6318 | 3.9861 | 39.7336 | 4.6520 | 20.4200
DLN (2020) | 17.9582 | 0.7383 | 3.7175 | 35.8579 | 6.2213 | 21.3218
Zero-DCE (2020) | 18.9855 | 0.6768 | 4.5768 | 33.9804 | 8.0578 | 30.1077
SDD (2020) | 17.1535 | 0.7259 | 4.2133 | 39.7718 | 5.3781 | 29.4991
DRBN (2020) | 14.6856 | 0.6011 | 4.0122 | 28.941 | 4.9865 | 28.9887
Zero-DCE++ (2021) | 20.0743 | 0.7903 | 3.9062 | 53.7578 | 6.4713 | 31.6658
RetinexDIP (2021) | 17.6561 | 0.6612 | 5.3600 | 31.9070 | 6.9657 | 23.2744
Ours | 25.066 | 0.8905 | 3.6612 | 16.5207 | 4.9034 | 18.7594
Table 5. Objective evaluation results of the six metrics PSNR, SSIM, PI, RMSE, NIQE, and BRISQUE for the MIT [32] dataset.
Method | PSNR↑ | SSIM↑ | PI↓ | RMSE↓ | NIQE↓ | BRISQUE↓
SRIE (2016) | 18.5994 | 0.6479 | 4.6264 | 32.2827 | 8.2338 | 37.0769
BIMEF (2017) | 19.3802 | 0.8839 | 3.2687 | 31.201 | 4.7516 | 19.2054
EFF (2017) | 21.07 | 0.7090 | 4.0820 | 24.1850 | 7.7022 | 32.7328
RetinexNet (2018) | 16.3971 | 0.7724 | 3.5378 | 35.603 | 5.8523 | 32.9832
JED (2018) | 18.9021 | 0.8199 | 3.8976 | 32.1988 | 5.6357 | 35.8391
FFM (2019) | 17.7743 | 0.8517 | 3.3305 | 37.0954 | 4.5366 | 19.9998
KinD (2019) | 19.6922 | 0.8606 | 3.091 | 25.571 | 5.1806 | 24.4180
EnGAN (2020) | 19.3250 | 0.8789 | 3.05 | 27.9136 | 4.3873 | 20.9074
DLN (2020) | 17.9633 | 0.8643 | 3.2012 | 29.6263 | 4.3616 | 20.3596
Zero-DCE (2020) | 19.1880 | 0.8837 | 3.2012 | 29.6263 | 4.8345 | 23.5798
SDD (2020) | 18.4178 | 0.8263 | 3.6015 | 33.7773 | 4.7409 | 28.4139
DRBN (2020) | 18.5074 | 0.8549 | 3.4016 | 30.4341 | 4.3457 | 24.9967
Zero-DCE++ (2021) | 16.9810 | 0.5293 | 4.1063 | 30.197 | 4.4001 | 19.8254
RetinexDIP (2021) | 15.1113 | 0.7614 | 3.3943 | 34.3869 | 4.7571 | 23.0474
Ours | 23.6867 | 0.8197 | 2.9344 | 17.0415 | 4.1834 | 18.0237
Table 6. Objective evaluation results of nonreference indicators BRISQUE and NIQE for the public datasets MEF [34], DICM [35], VV [36], NPE [37], and LIME [38].
Method | BRISQUE (MEF) | BRISQUE (DICM) | BRISQUE (VV) | BRISQUE (NPE) | BRISQUE (LIME) | NIQE (MEF) | NIQE (DICM) | NIQE (VV) | NIQE (NPE) | NIQE (LIME)
SRIE (2016) | 31.6755 | 31.0909 | 30.0013 | 28.8648 | 21.3854 | 9.8002 | 4.4626 | 4.2338 | 4.0076 | 4.3107
BIMEF (2017) | 20.5770 | 24.2961 | 23.1839 | 21.9251 | 13.7832 | 4.3061 | 3.8263 | 3.6850 | 3.6893 | 4.8900
EFF (2017) | 33.3979 | 34.7787 | 35.0649 | 33.0947 | 32.7853 | 4.6912 | 3.3572 | 3.6916 | 3.4547 | 3.9745
RetinexNet (2018) | 24.9352 | 27.5574 | 25.8767 | 25.1027 | 16.9106 | 6.1001 | 4.4492 | 4.1033 | 4.2984 | 5.4387
JED (2018) | 25.1958 | 28.3969 | 29.3638 | 28.5549 | 15.4578 | 6.4179 | 3.9760 | 4.8492 | 3.8585 | 6.1167
FFM (2019) | 25.5495 | 25.6837 | 29.4840 | 23.6145 | 13.8984 | 4.7857 | 3.7660 | 3.5581 | 3.7240 | 4.9322
KinD (2019) | 25.9532 | 24.9554 | 24.5015 | 23.2977 | 14.8946 | 5.6242 | 4.1372 | 3.5892 | 3.7670 | 6.6546
EnGAN (2020) | 14.3043 | 14.3979 | 19.5189 | 22.4630 | 12.8414 | 3.2211 | 3.1189 | 3.3489 | 3.8172 | 3.3794
DLN (2020) | 22.3135 | 23.2821 | 25.8513 | 23.6826 | 13.9512 | 3.3556 | 4.0048 | 3.6935 | 3.8892 | 4.09
Zero-DCE (2020) | 24.545 | 24.8596 | 22.7712 | 24.0938 | 13.7468 | 4.4834 | 4.5601 | 3.5707 | 3.8097 | 4.6631
SDD (2020) | 28.3008 | 27.8422 | 27.4322 | 27.1678 | 15.0549 | 6.1955 | 3.8403 | 4.3646 | 3.7466 | 5.5152
DRBN (2020) | 26.6506 | 18.7275 | 18.2681 | 17.3514 | 18.2028 | 5.7662 | 3.8799 | 3.9150 | 3.7202 | 4.6236
Zero-DCE++ (2021) | 17.8427 | 20.3962 | 22.3832 | 20.6055 | 19.9268 | 3.3954 | 3.5376 | 3.6212 | 3.7690 | 3.9949
RetinexDIP (2021) | 16.4742 | 19.5039 | 21.7850 | 19.9879 | 18.3420 | 3.2827 | 3.7192 | 3.5898 | 3.5473 | 3.7112
Ours | 11.6015 | 8.2625 | 13.0735 | 13.8618 | 12.5453 | 3.0018 | 3.07 | 3.3356 | 3.4629 | 3.0663
Table 7. Metric values of the enhanced result sets for different numbers of iterations.
Metric | Iteration 1 | Iteration 2 | Iteration 3 | Iteration 4 | Iteration 5 | Iteration 6 | Iteration 7
BRISQUE↓ | 32.8803 | 30.0622 | 23.1613 | 17.9312 | 19.9733 | 21.278 | 21.7397
NIQE↓ | 5.235 | 5.0973 | 4.4524 | 3.9756 | 4.4731 | 4.2737 | 4.5315
