Article

Boosting Noise Reduction Effect via Unsupervised Fine-Tuning Strategy

School of Mathematics and Computer Sciences, Nanchang University, Nanchang 330031, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(5), 1742; https://doi.org/10.3390/app14051742
Submission received: 7 January 2024 / Revised: 10 February 2024 / Accepted: 13 February 2024 / Published: 21 February 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Over the last decade, supervised denoising models trained on extensive datasets have exhibited remarkable performance in image denoising. However, these models offer limited flexibility, and their noise reduction capability degrades to varying degrees in practical scenarios, particularly when the noise distribution of a given noisy image deviates from that of the training images. To tackle this problem, we put forward a two-stage denoising model in which an unsupervised fine-tuning phase is attached after a supervised denoising model has processed the input noisy image and produced a denoised image (regarded as a preprocessed image). More specifically, in the first stage we replace the convolution block adopted by the U-shaped network framework (utilized in the deep image prior method) with the Transformer module; the resultant model is referred to as a U-Transformer. The U-Transformer model is trained on noisy images and their labels to preprocess the input noisy images. In the second stage, we condense the supervised U-Transformer model into a simplified version incorporating only one Transformer module with fewer parameters. Additionally, we shift its training mode to unsupervised training, following an approach similar to that employed in the deep image prior method. This stage aims to further eliminate the minor residual noise and artifacts present in the preprocessed image, resulting in clearer and more realistic output images. Experimental results illustrate that the proposed method achieves significant noise reduction in both synthetic and real images, surpassing state-of-the-art methods. This superiority stems from the supervised model’s ability to rapidly process given noisy images, while the unsupervised model leverages its flexibility to generate a fine-tuned network with enhanced noise reduction capability. Moreover, because the supervised model supplies higher-quality preprocessed images, the proposed unsupervised fine-tuning model requires fewer parameters, facilitating rapid training and convergence and resulting in high overall execution efficiency.

1. Introduction

Degradation of image quality is inevitable during the processes of image formation, transmission, reception, and processing, owing to illumination and interference caused by noise [1,2,3]. Noise can significantly impact the performance of subsequent high-level visual processing. Hence, prior to conducting high-level vision tasks, it is crucial to employ image denoising techniques. These techniques aim to eliminate noise from images and reconstruct high-quality, clean images characterized by well-defined edges and rich details [4,5,6]. Over the past few decades, the field of image denoising has witnessed vigorous development, leading to the evolution of numerous excellent denoising algorithms [7,8,9,10].
In the realm of traditional image-denoising models, the use of self-similarity between non-local patches has been deeply explored by NLM [11], BM3D [12], and WNNM [13] to address the issue of image denoising, and this idea has continued to be extensively employed until very recently. As an example, in [14] Song and Huang introduced an innovative expectation maximization-based framework for concurrent stripe estimation and image denoising. Within this framework, they derived a nonparametric estimator to precisely compute the conditional expectation of the true image, opting to utilize a modified non-local means (NLM) [11] algorithm for this purpose. While these traditional methods achieved satisfactory denoising results at the time, the inherent complexity of the denoising problem led to intricate algorithm design and long execution times. Over the past decade, the rapid advancement of deep neural networks (DNNs) in image processing and analysis has brought about significant breakthroughs in image denoising tasks, leveraging their remarkable learning capabilities and nonlinear mapping abilities [15,16]. As one of the early endeavors in this field, Jain and Seung applied a simple convolutional neural network (CNN) to image denoising, demonstrating performance comparable to wavelet and Markov random field (MRF) methods [17]. Subsequently, researchers have proposed a series of more advanced deep-learning methods, including TNRD [18], DnCNN [19], FFDNet [20], GANID [21], and CBDNet [22], which have exhibited excellent denoising performance compared to traditional non-deep-learning methods. Zhang et al. [19] incorporated a deep architecture, residual-learning mechanisms [23], and batch normalization processing [24] into image denoising with DnCNN. This approach stacks convolution layers, rectified linear units [25], and BN blocks. The network first estimates the residual noise within the input noisy image, which is then subtracted from the original noisy image to yield a clean output. This model enhances network training performance, expedites the process, and achieves superior results in removing additive white Gaussian noise (AWGN). However, it struggles to yield the desired outcomes when the noise level is unknown. To enhance the model’s generalization ability, Zhang et al. proposed a fast and flexible denoising network (FFDNet) [20]. FFDNet effectively and efficiently handles a wide range of noise levels by incorporating an adjustable noise level map as an input. In the latest study [26], Karaca et al. enhanced the denoising performance by incorporating an efficient attention module, called the convolutional block attention module (CBAM), into the FFDNet model. However, the generalization ability of FFDNet and its variants can be further improved for complex noise scenarios. DRUNet [27] is a residual connection-based network designed for denoising. By combining the strengths of DnCNN and FFDNet, it employs a multi-level residual architecture to learn the intricate relationship between real signals and noise in images. Additionally, it extracts high-level feature representations through deep network structures. Compared to traditional single-level denoising networks, DRUNet excels at capturing complex noise patterns in elaborate images, resulting in more precise and reliable interference removal. However, DRUNet requires lengthy training times and needs further enhancements to address the challenges posed by extremely low signal-to-noise ratio (SNR) conditions.
Furthermore, its effectiveness in suppressing noise interference within complex real-world scenes remains limited. Anwar et al. proposed a modular one-stage real-image denoising network (RIDNet) [28] in 2019 to enhance the practicality of denoising algorithms. RIDNet consists of three components: a feature extraction module, a feature learning residual module, and a reconstruction module. By incorporating residuals within the residual structure to alleviate low-frequency information propagation and employing feature attention mechanisms to explore channel correlations, RIDNet effectively accomplishes denoising tasks in real-world scenarios. SwinIR [29], proposed by Liang et al., integrates shallow and deep feature extraction with high-quality image reconstruction. This model effectively learns the mapping relationship between low-quality images and their corresponding high-quality counterparts, restoring clear details and mitigating noise interference. Moreover, it employs a deep transformer network structure to efficiently eliminate image noise and enhance visual quality. SwinIR exhibits superior denoising performance compared to the widely adopted CNN method. However, it should be noted that its training time is relatively lengthy, and its effectiveness for very low-quality images may be limited. In summary, once the supervised denoising models are trained, their parameters are kept fixed, resulting in a constant execution time for any image processing task, thereby enhancing efficiency. However, this process relies heavily on acquiring and labeling a substantial amount of expensive labeled data. Additionally, the trained model may only demonstrate suitability for scenarios similar to the training data and might not generalize well in unknown scenarios.
Researchers have addressed the poor flexibility and training difficulty of supervised denoising models by adopting unsupervised or self-supervised methods and proposing enhanced models, which offer increased adaptability and generalization ability. An example is the weakly supervised denoising model Noise2Noise (N2N) proposed by Lehtinen et al., which uses pairs of noisy images from the same scene for training without requiring ground-truth images and achieves noise reduction performance comparable to supervised models [30]. However, its applicability is limited by the need for a large number of diverse noisy images of the same scene. Nevertheless, the statistical plausibility of using noisy images as supervision has been demonstrated, inspiring subsequent research. Subsequent models, such as Noise2Void and Noise2Self, were introduced to train deep neural networks using only unorganized noisy image pairs [31,32]. Noise2Void avoids learning the identity mapping by employing blind-spot networks that utilize neighboring pixels to predict each pixel. Similar strategies are employed in the parallel work of Noise2Self. Despite achieving good denoising results, these models train on noisy images that are heavily correlated with the observed noisy image, posing challenges and costs in practical collection. In contrast to blind-spot methods, Noisier2Noise and Noisy-as-Clean generate training pairs by adding synthetic noise to the observed noisy image [33,34]. The Noisy-as-Clean (NAC) method introduces additional noise to the original noisy image, creating noise–noise pairs for training deep neural networks. By employing the NAC strategy, an existing supervised noise reduction model can seamlessly transition into an unsupervised model through a simple modification of the training pair. However, accurately determining a reliable noise model remains challenging in real-world scenarios. To enhance denoising performance, Neighbor2Neighbor constructs a noisy pair by subsampling a given noisy image [35]. However, its performance still relies heavily on a large-scale training dataset. Recently, there has been a focus on addressing the challenges posed by noisy images, with some pioneering works highlighting the significance of incorporating information from both noisy and neighboring pixels during image recovery. However, these methods are often highly sensitive to the standard deviation of the noise added to clean images, a parameter that is typically unknown without access to the clean images themselves. To mitigate this challenge, Wang et al. introduced Noise2Info in 2023 [36], a method that aims to extract this critical information, namely the standard deviation of the noise, solely from the characteristics of the noisy images. In contrast to methods requiring numerous noisy images, Self2Self [37], introduced by Quan et al., utilizes a single noisy image without ground truth for training, applying Bernoulli sampling to generate a sequence of noisy images for network training. Quadratic regularization of sampled pixels is performed through local convolution, enhancing network performance. Nevertheless, relying on a single image for training may lead to overfitting issues. Zero-shot Noise2Noise (ZS-N2N), as proposed by Mansour and Heckel in 2023 [38], represents a novel approach to image denoising. This method is designed to address scenarios where data availability is scarce and computing resources are limited.
Unlike traditional denoising methods that require pairs of clean and noisy images for training, ZS-N2N trains a lightweight network using only a single noisy image. The training target is derived from downsampled image pairs, allowing the network to learn denoising capabilities efficiently. Similarly, the deep image prior (DIP) method was proposed to train solely on a single degraded image, leveraging structural priors embedded within the network architecture itself [39]. However, its sensitivity to the target image means that further enhancements are required to improve performance. In summary, unsupervised and self-supervised denoising models still exhibit a significant performance gap compared to supervised models, owing to challenges in obtaining high-quality training data and their relatively slow training process. Moreover, they are more sensitive to initial conditions and hyper-parameters.
Improving the noise reduction effect has always been the primary goal pursued by researchers. Although many of the supervised methods proposed to date perform well, they are tied to specific datasets and cannot achieve good results across diverse scenarios. Meanwhile, unsupervised methods are broadly adaptable and can compensate for the shortcomings of supervised denoising. However, the denoising performance of unsupervised models still exhibits a considerable gap compared to supervised models, assuming the availability of a large training dataset. Given that supervised, unsupervised, and self-supervised models each have advantages and disadvantages in noise reduction, a natural idea is to merge them to enhance the robustness and generalization ability of the noise reduction model and thereby achieve better noise reduction results. In this paper, a novel two-stage denoising model, which enhances denoising performance through an unsupervised fine-tuning strategy, is proposed. More specifically, in the first stage we propose replacing the convolutional layers with Transformer modules in the UNet network framework of the DIP method, resulting in a new supervised denoising network (named U-Transformer) with stronger denoising performance. After the given noisy image is processed by the supervised U-Transformer denoising model, the quality of the resulting denoised image (referred to as the preprocessed image) is significantly improved. However, slight residual noise and artifacts remain due to data bias, resulting in a small deviation of the image intensity from the original image. Aiming to achieve better denoising results, we introduce an extremely simplified network comprising only one Transformer module for fine-tuning the preprocessed image under the unsupervised training method. Experimental results show that our method achieves excellent denoising results in both synthetic and real-world scenarios and outperforms the current leading methods. The major contributions of this paper are as follows:
(1)
By combining supervised and unsupervised methods, this study presents a holistic approach to address the image denoising problem, aiming to leverage the benefits of labeled data (the prior knowledge embedded within it) in the supervised method and the adaptive nature (fine-tuning capability) of the unsupervised method. Therefore, our combined approach enhances the robustness, generalization ability, and adaptability of the denoising model.
(2)
Our noise reduction model achieves plug-and-play capability. While the fine-tuning stage we use relies on the noise reduction performance of the supervised denoising models, in theory it is a relatively independent process. In practice, during the preprocessing stage, any supervised model can be used as a processing method for the given noisy images. This demonstrates the modularity of our approach, enabling us to apply the proposed fine-tuning strategy and combine it with any future advanced supervised models that may arise. As a result, we have the opportunity to obtain a noise reduction model with better noise reduction performance.
(3)
The efficiency of the unsupervised noise reduction stage has been enhanced. Utilizing the preprocessing model yields high-quality outcomes (i.e., preprocessed images), resulting in a small difference between the preprocessed image and the final denoised image. This subtle difference significantly reduces the difficulty of non-linear mapping in the fine-tuning network. Consequently, with only a minimalist network comprising one Transformer module, we can accomplish the fine-tuning task. This network possesses a significant advantage in the efficiency of network parameter updates, resulting in much less execution time required during the fine-tuning phase compared to conventional unsupervised fine-tuning methods.
The paper is structured as follows: Section 2 offers an overview of the background knowledge and the DIP denoising model pertinent to our study. In Section 3, we elaborate on our networks employed in the two stages, detailing their structure and hybrid loss function, respectively. Section 4 presents the experimental dataset, settings, and results, comparing them with other state-of-the-art (SOTA) denoising algorithms. Lastly, Section 5 provides a comprehensive summary of our work.

2. Related Work

2.1. Formulation

Various factors can lead to the degradation of images due to noise during acquisition and transmission. Assuming the observed value y denotes a noisy image and its corresponding clean image is represented by x, a common assumption regarding noise distribution is AWGN with variance $\sigma^2$, denoted by $\mathcal{N}(0; \sigma^2)$. The degradation process is expressed as:
$y = x + n$
Hence, the task of image denoising involves an inverse problem, requiring the restoration of the latent clean image, x, from the observed data, y. This problem is considered a classic ill-posed problem.
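For concreteness, the degradation model above can be simulated directly. The following sketch is a minimal illustration only (the tensor shape and the noise level of 25 are chosen as examples, not settings prescribed by the paper); it adds zero-mean Gaussian noise of a given standard deviation to a clean image:

```python
import torch

def add_awgn(x: torch.Tensor, sigma: float = 25.0) -> torch.Tensor:
    """Simulate y = x + n with n drawn from N(0, sigma^2).

    Assumes pixel intensities on a [0, 255] scale, so sigma corresponds to
    noise levels such as 15, 25, or 50 used later in the experiments.
    """
    noise = torch.randn_like(x) * sigma
    return x + noise

# Example: corrupt a synthetic 256 x 256 grayscale image with sigma = 25.
clean = torch.rand(1, 1, 256, 256) * 255.0
noisy = add_awgn(clean, sigma=25.0)
```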
Current methods for image denoising are broadly categorized into two groups: model-based [11,12,40,41,42,43,44,45,46] and learning-based methods [18,19,20,47,48,49,50,51]. Model-based approaches can be conceptualized from a Bayesian perspective through a maximum a posteriori (MAP) estimation problem:
$x = \arg\max_{x} \log P(y \mid x) + \log P(x)$
Here, $\log P(y \mid x)$ represents the log-likelihood of the observed noisy image, y. It gauges the discrepancy between x and y and is linked to the data fidelity term $F(x, y)$. To enhance denoising efficacy, traditional model-based methods need to incorporate a regularization term. This term confines the search space and imposes constraints on the characteristics of the resulting image. However, this approach involves intricate design complexities and demonstrates suboptimal execution efficiency, compared to the current mainstream supervised methods.
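As a concrete instance of this formulation, for AWGN with variance $\sigma^2$ the likelihood satisfies $\log P(y \mid x) = -\lVert y - x \rVert^2 / (2\sigma^2) + \mathrm{const}$, so the MAP problem reduces to the familiar variational form
$x = \arg\min_{x} \frac{1}{2}\lVert y - x \rVert^2 + \sigma^2 R(x)$,
where the first term plays the role of the data fidelity term $F(x, y)$ and $R(x) = -\log P(x)$ is the regularization term (e.g., a total variation or sparsity prior).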

2.2. Supervised Model

Diverging from model-based methods, which hinge on meticulously designed image priors, learning-based approaches acquire mapping functions to restore high-quality images. This is achieved through the optimization of a loss function applied to a training set comprising pairs of degraded–clean images, denoted as $\{(y_n, x_n)\}_{n=1}^{N}$. DNN techniques commonly employ a supervised training strategy, guided by external priors and datasets, to minimize the predicted discrepancy between the iterative image and the clean reference.
$\min_{\theta} L(F_{\theta}(y_i), x_i)$
The clean–noisy image pairs, denoted as $x_i$ and $y_i$, undergo processing through a nonlinear function, $F_{\theta}$, parameterized by $\theta$. This function maps the noisy patch to the predicted clean patch. Regardless of the complexity and advancement in the design of a supervised model, its fundamental nature remains that of a mapping function transforming a noisy image into a clear image, a transformation supported by an extensive amount of training data. Nevertheless, due to its reliance on large-scale datasets for training, it encounters a data dependency issue. Changes in the input may lead to inconsistencies between the final generated image distribution and the distribution of the training data. This limitation hampers the transferability of the supervised network, which lacks a mechanism to adapt to such shifts.
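As an illustration of this supervised objective, a minimal PyTorch training loop might look as follows. The `denoiser` network and data `loader` are placeholders for any concrete model (such as the U-Transformer introduced in Section 3) and any dataset of clean–noisy pairs; the L1 criterion anticipates the loss later used for the Stage1 model. This is a sketch under those assumptions, not the exact training code of the paper:

```python
import torch
import torch.nn as nn

def train_supervised(denoiser: nn.Module, loader, epochs: int = 10, lr: float = 1e-4) -> nn.Module:
    """Minimize L(F_theta(y_i), x_i) over a set of clean-noisy training pairs.

    `loader` is assumed to yield (noisy, clean) tensor batches.
    """
    optimizer = torch.optim.Adam(denoiser.parameters(), lr=lr)
    criterion = nn.L1Loss()  # mean absolute error between prediction and label
    for _ in range(epochs):
        for noisy, clean in loader:
            optimizer.zero_grad()
            loss = criterion(denoiser(noisy), clean)
            loss.backward()
            optimizer.step()
    return denoiser
```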

2.3. DIP

Deep image prior (DIP), introduced by Ulyanov et al. [39], is an unsupervised learning technique that focuses on optimizing network parameters to adapt to input noisy images for comprehensive reconstruction and denoising purposes. This optimization process is uniquely based on a single noisy image as the training target, leveraging the implicit constraints inherent in deep learning for online training. The training procedure can be expressed as follows:
$\theta^{*} = \arg\min_{\theta} E(f_{\theta}(z); y)$
$x^{*} = f_{\theta^{*}}(z)$
where f refers to a neural network, $\theta$ represents the network parameters (initially randomly initialized), and z is a tensor following a random distribution (similar to the input of a generative adversarial network). Through training, the optimal parameter solution, denoted as $\theta^{*}$, is obtained, and $x^{*}$ corresponds to the optimal image output generated by the network. Notably, the DIP starts with the random initialization of a convolutional neural network and iteratively adjusts its parameters by minimizing a loss function to closely align the generated image with the input noisy image. This loss function typically involves calculating pixel differences between the two images. By iteratively optimizing the parameters, the DIP effectively produces reconstructions that closely resemble their original counterparts.
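The DIP procedure above can be sketched in a few lines of PyTorch. The network `f`, the learning rate, and the iteration count below are illustrative placeholders rather than the exact settings of [39], and in practice early stopping is needed because a sufficiently long optimization eventually reproduces the noise itself:

```python
import torch
import torch.nn as nn

def dip_denoise(f: nn.Module, y: torch.Tensor, iters: int = 3000, lr: float = 1e-3) -> torch.Tensor:
    """Fit a randomly initialized network f_theta to a single noisy image y.

    z is a fixed random input tensor; the denoised estimate is f_theta*(z).
    """
    z = torch.rand_like(y)  # random code tensor, kept fixed during optimization
    optimizer = torch.optim.Adam(f.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(iters):
        optimizer.zero_grad()
        loss = mse(f(z), y)  # E(f_theta(z); y) realized as a pixel-wise MSE
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return f(z)  # x* = f_theta*(z)
```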
In the domain of image processing, the DIP method has emerged as a powerful approach for unsupervised learning in image restoration. Its ability to recover images without relying on extensive training datasets makes it highly adaptable for image denoising. The efficacy of the DIP technique dynamically adapts to different application scenarios and intrinsic image characteristics, allowing for optimal performance in diverse contexts. However, when compared to supervised models, the DIP falls short in the presence of a substantial amount of labeled training data. Moreover, as shown in Figure 1, utilizing a five-layer UNet network [52] to obtain multi-scale information during the iterative DIP process leads to reduced iteration speed. While this approach enhances denoising effectiveness, it also increases computational complexity and time costs. Additionally, the DIP method gradually learns the target image, progressively approaching the desired outcome during the denoising process. The quality of the target image directly impacts the denoising effect. Extensive experimental results indicate that using a poor-quality image as the target image results in a wide network search range and suboptimal denoising outcomes (for specific details, refer to the ablation experiments). To achieve superior denoising results, it is advisable to consider using higher-quality target images or employing other preprocessing techniques to enhance the image quality of the target image. In summary, although the DIP method offers significant advantages in unsupervised learning for image restoration, its performance can be further improved by using a higher-quality preprocessed image as the target image instead of directly using the noisy image. These enhancements are likely to yield more efficient and effective image denoising results.

3. Methodology

3.1. Basic Idea

The severe data bias issues encountered by supervised denoising models can be addressed by leveraging the flexibility of unsupervised denoising models such as DIP. These models generate neural network parameters tailored to the specific noisy image for denoising. Thus, supervised and unsupervised models complement each other effectively. Our observations indicate that the quality of the target image significantly influences the noise reduction effect during unsupervised fine-tuning (see Section 4.2 for details). To further enhance the denoising performance of supervised models, we propose a novel approach in this paper. As shown in Figure 2, our approach is divided into two stages. This approach utilizes an unsupervised training method similar to DIP for fine-tuning. The key to our approach lies in the quality of the preprocessed images. Therefore, we introduce a novel supervised noise reduction model, i.e., U-Transformer. This model replaces convolutional layers with Transformer modules in the classical UNet framework to preprocess the given noisy image. It generates a higher-quality image as input (target image) for the unsupervised fine-tuning network. As the gap between the preprocessed image and the final denoised image is already minimal, complex neural networks for non-linear mapping between them are unnecessary. Consequently, there is no need for complex multi-scale UNet architectures. A simplified network with only one Transformer block is sufficient. This results in a fine-tuning task that can be accomplished with minimal iterative update cost. In summary, we seamlessly integrate supervised and unsupervised methods while prioritizing execution efficiency. The proposed approach is expected to enhance the performance of supervised denoising models, including those with potentially higher denoising performance in the future. This improvement comes with increased robustness and scalability, achieved with minimal computational cost.
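The overall flow of Figure 2 can be summarized by the following sketch, in which `stage1` stands for the pretrained supervised U-Transformer and `stage2_finetune` for the per-image unsupervised fine-tuning step detailed in Section 3.3; both names are placeholders used only for illustration:

```python
import torch

def two_stage_denoise(y: torch.Tensor, stage1, stage2_finetune) -> torch.Tensor:
    """Sketch of the proposed two-stage pipeline.

    Stage 1: a frozen supervised model maps the noisy image y to a
    preprocessed image x_p. Stage 2: a small single-block network is
    fitted to this specific image in an unsupervised, DIP-like manner,
    using both y and x_p as target images.
    """
    with torch.no_grad():
        x_p = stage1(y)            # supervised preprocessing
    x_f = stage2_finetune(y, x_p)  # per-image unsupervised fine-tuning
    return x_f
```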

3.2. Supervised Preprocessing Stage

Due to the remarkable capabilities of CNNs in learning generalized image priors from extensive data, CNN-based models have become prevalent in image restoration and related tasks. The UNet, a classic CNN model, is particularly renowned for its distinctive architecture—a U-shaped encoder–decoder structure with skip connections. This design facilitates efficient information flow across various resolution levels, enabling the model to capture both high-level semantic features and intricate details. The skip connections play a crucial role in fusing feature maps from the encoder and decoder pathways, contributing to the precise localization of objects in the image. Nevertheless, the convolution operation stands out as the core module, resulting in a relatively small receptive field for processing. Recently, another influential class of architectures, the Transformer, has demonstrated significant performance gains in both natural language and high-level vision tasks. In the supervised noise reduction stage, drawing inspiration from the U-shaped network in the DIP method [39,52] and aiming to further enhance network performance, we opt to incorporate the Transformer block into the U-shaped network to preprocess the input noisy images. This strategic integration leverages the strengths of both UNet’s architecture and Transformer’s capabilities.
In terms of specific implementation, our proposed image restoration approach involves the utilization of a U-shaped Transformer (U-Transformer) network, as illustrated in Figure 3. The U-shaped network architecture serves as the foundation of our model, where the convolution operation in each layer of the original UNet is substituted with a Transformer block. Within the Transformer block, improvements have been incorporated into both the multi-head self-attention (MSA) and feed-forward network (FFN) components. To enhance computational efficiency, our MSA module applies self-attention across channels rather than spatial dimensions. This approach computes feature cross-covariance within channels, generating a transposed attention map that implicitly encodes global context information. The chosen FFN promotes information flow across hierarchical stages, allowing each stage to focus on details that complement the MSA’s capabilities. During the training phase, we employ the Adam optimizer and the L1 loss function for model optimization, defined as
$\mathrm{Loss}_1 = \mathrm{MAE} = \lVert x_p - x_t \rVert_1$
where $x_p$ represents the output image of the network and $x_t$ represents the ground truth image.
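To make the channel-wise (transposed) self-attention concrete, the following sketch shows one possible realization in PyTorch. The layer sizes, the depth-wise convolution for local context, and the learned temperature are illustrative choices in the spirit of the description above, not the exact Stage1 implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """Self-attention applied across channels: each head forms a (C x C)
    attention map, so the cost grows with channels rather than pixels."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.heads = heads
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        # depth-wise convolution injects local context before attention
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, kernel_size=3, padding=1, groups=dim * 3)
        self.project = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # reshape to (batch, heads, channels_per_head, pixels)
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature  # channel cross-covariance
        attn = attn.softmax(dim=-1)
        out = (attn @ v).reshape(b, c, h, w)
        return self.project(out) + x  # residual connection

# Usage example: a 32-channel feature map.
feats = torch.rand(1, 32, 64, 64)
out = ChannelSelfAttention(dim=32)(feats)
```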

3.3. Unsupervised Fine-Tuning Stage

As mentioned earlier, the images obtained after the supervised noise reduction process are not guaranteed to achieve the best results. To obtain a better denoising effect, a feasible approach is to input these higher-quality images into a small fine-tuning network to further improve image quality. In this work, we reduce the U-Transformer network, which originally consists of multiple Transformer blocks, to a simplified network with only one Transformer block. Ablation experiments show that a single Transformer block suffices to attain satisfactory denoising results while expediting network optimization. As shown in Figure 4, the proposed simplified network utilizes a hybrid loss function, defined as
$\mathrm{Loss}_2 = \mathrm{MSE}_1 + \mathrm{MSE}_2 = \lambda \cdot \lVert x_f - y \rVert^2 + \lVert x_f - x_p \rVert^2$
where $x_f$ denotes the output image of the fine-tuning network, y represents the noisy image (i.e., the first target image), $\lambda$ represents the weight of the noisy image, and $x_p$ corresponds to the preprocessed image (i.e., the second target image). Unlike the loss function employed in the standard DIP method, here the preprocessed image $x_p$ serves as a second target image. It should be clarified that this approach significantly reduces the search space for the output image $x_f$, ensuring its image quality.
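A minimal sketch of this fine-tuning stage is shown below. Here `net` stands for the simplified single-Transformer-block network, the weight `lam` plays the role of $\lambda$, and the learning rate and iteration count are placeholder values rather than the exact experimental settings:

```python
import torch
import torch.nn as nn

def finetune_stage2(net: nn.Module, y: torch.Tensor, x_p: torch.Tensor,
                    lam: float = 0.4, iters: int = 3000, lr: float = 1e-3) -> torch.Tensor:
    """Unsupervised fine-tuning with the hybrid loss
    Loss2 = lam * ||x_f - y||^2 + ||x_f - x_p||^2,
    where y is the noisy image and x_p the Stage1 preprocessed image."""
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    mse = nn.MSELoss()
    for _ in range(iters):
        optimizer.zero_grad()
        x_f = net(x_p)  # the preprocessed image serves as the network input
        loss = lam * mse(x_f, y) + mse(x_f, x_p)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return net(x_p)
```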

4. Experimental Results

4.1. Experimental Environment and Datasets

We assessed the efficacy and benefits of our proposed two-stage image denoising model by conducting performance comparisons with various state-of-the-art methods, including BM3D [12], DnCNN [19], FFDNet [20], DIP [39], IRCNN [53], DAGL [54], DeamNet [55], DRUNet [56], and SwinIR [29]. Among these methods, BM3D belongs to the category of filter-based denoising methods. It groups similar blocks in noisy images and then recovers their common structure through spatial domain averaging or transform domain filtering, representing an early classic model. DnCNN and FFDNet, proposed by Zhang et al., mark a significant milestone by introducing a deep-learning-based framework for image denoising. The unsupervised method DIP demonstrates that appropriate hand-crafted network architectures correspond to better hand-crafted priors. We also include several state-of-the-art supervised methods: IRCNN trains a set of fast and efficient CNN image denoisers by combining flexible model-based optimization methods and fast discriminative learning methods, DAGL is a state-of-the-art hybrid method, DeamNet implements an innovative end-to-end trainable and interpretable deep denoising network, DRUNet achieves plug-and-play image restoration, and SwinIR is a relatively mature Transformer-based approach. For the purpose of comparison, we refer to the model used in the first stage with supervised training as Stage1 and the model used in the second stage with unsupervised fine-tuning as Stage2. Denoising experiments were carried out on SET12, which comprises 12 grayscale images. This set includes 7 images sized 256 × 256 (Cameraman, House, Pepper, Fishstar, Monarch, Airplane, Parrot) and 5 images sized 512 × 512 (Lena, Barbara, Boat, Man, Couple). Additionally, we used BSD68, a dataset containing 68 images randomly chosen from BSD500, characterized by rich texture details. Urban100 consists of 100 high-resolution images featuring a variety of real-world structures, including indoor, urban, and architectural scenes. These diverse and complex structures make it particularly suitable for noise reduction tasks. For our experiment, we randomly selected 30 images from this dataset. SIDD comprises approximately 30,000 noisy images captured across 10 different scenes under varying lighting conditions, using a selection of five representative mobile phones. This dataset is commonly utilized for training and evaluating real-world image denoising algorithms. Due to the large size of the images in this dataset, we performed cropping during our experiments. Specifically, we employed 10 randomly chosen images for the denoising experiments. For visualization purposes, Figure 5 displays all 12 images of SET12 used in our experiments, while Figure 6 and Figure 7 showcase 10 representative images each from the BSD68 and Urban100 datasets, respectively. Additionally, Figure 8 presents 10 representative images from the SIDD dataset. To objectively assess the performance of the denoising methods, we utilized the widely accepted PSNR metric to evaluate the denoising outcomes across different methods. The experiments were conducted using PyTorch 1.7 on a single NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and a Lenovo desktop equipped with a 2.1 GHz Intel Core i7-6700K CPU and 16 GB of RAM.
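For reference, the PSNR values reported below can be computed as in the following sketch, which assumes 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio (in dB) between a reference image and a
    denoised estimate, assuming intensities on a [0, max_val] scale."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)
```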

4.2. Ablation Experiments

To verify whether the image quality of the input and target images has a significant impact on unsupervised fine-tuning training, we conducted a series of experiments. Specifically, we chose four representative noise reduction methods (FFDNet, DAGL, DRUNet, and SwinIR) as well as our own Stage1 model (i.e., U-Transformer) as preprocessors to denoise the given noisy images from the SET12 dataset. Gaussian noise with a standard deviation of 25 was added to each image. The obtained denoised image, $x_p$, is then input into the fine-tuning network to boost the image denoising effect. Note that both the preprocessed image, $x_p$, and the noisy image, y, are used as target images, and the final image, $x_f$, is generated by fine-tuning the Stage2 model. We meticulously recorded the PSNR values of the preprocessed images and the final denoised images obtained from the various noisy images and calculated their averages. The experimental results are shown in Table 1. It can be observed from Table 1 that the final denoising effect is positively correlated with the image quality of the preprocessed images. Through the analysis of data correlations, it is evident that the PSNR data for rows $x_p$ and $x_f$ exhibit a strong positive linear relationship, with a Pearson correlation coefficient of 0.99. This suggests a close association between the quality of the output image, $x_f$, and that of the preprocessed image, $x_p$. More specifically, a higher quality of the preprocessed image, $x_p$, corresponds to a higher PSNR value for the output image, $x_f$. Furthermore, the data in Table 1 also indicate that the proposed Stage1 model is expected to outperform current mainstream denoising models in terms of PSNR. It should be noted that, while using other models, such as DnCNN, as the preprocessing model can lead to larger performance improvements (even up to 0.3 dB), the final denoising result is still not as good as that of our U-Transformer (Stage1) due to the lower PSNR values of its input, $x_p$. Therefore, in the unsupervised fine-tuning stage, we choose to use the preprocessed images, $x_p$, generated by U-Transformer as the input and target images for fine-tuning the Stage2 model.
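The reported correlation corresponds to a one-line computation; in the sketch below the two arrays are hypothetical placeholders standing in for the per-method average PSNR values of $x_p$ and $x_f$, not the actual measurements of Table 1:

```python
import numpy as np

# Hypothetical per-method average PSNR values (dB) for x_p and x_f; the real
# values are those reported in Table 1.
psnr_xp = np.array([29.1, 29.4, 29.8, 30.0, 30.3])
psnr_xf = np.array([29.3, 29.7, 30.0, 30.2, 30.6])

# Pearson correlation coefficient between the two rows.
r = np.corrcoef(psnr_xp, psnr_xf)[0, 1]
print(f"Pearson correlation: {r:.2f}")
```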
To further evaluate the performance improvement achieved by using noisy images and preprocessed images as target images, we tested various configurations of the loss function employed during the training of the Stage2 model. From the data in Table 2, it can be observed that the proposed hybrid loss function ensures the optimal denoising effect during the fine-tuning stage. The reason behind this is that, although preprocessed images exhibit high image quality, some smoothing of pixel values is inevitable after the preprocessing stage. On the other hand, noisy images, while containing more noise, may have pixels that are not as significantly altered. Hence, there exists a certain complementarity between preprocessed and noisy images. When both are used as target images, this complementarity helps the Stage2 model converge to more reasonable positions, resulting in a better denoising effect.
As mentioned earlier, the noisy image still plays a crucial role as the primary target image in the loss function. To validate this role, we conducted a series of experiments on the images from the SET12 dataset with noise levels of 15, 25, and 50, respectively. Table 3 lists the specific weight values for noise levels of 15, 25, and 50, which are set to 0.2, 0.4, and 0.8, respectively, through trial and error. The following experiments all use the parameters listed in Table 3. These experiments indicate that, as the severity of noise increases, the Stage2 model relies more on information from pixels in the noisy image that are not affected by noise for image restoration.
To validate whether the Stage2 model can achieve the fine-tuning task using only a single Transformer block, we tested the denoising performance with different numbers of Transformer blocks. We employed various network structures, as shown in Table 4, consisting of networks with different numbers of blocks. Each configuration was run 10 times on images of sizes 256 × 256 and 512 × 512, respectively, with 3000 iterations in each run (to ensure model convergence), in order to calculate the average running time. The execution times with only one block for image sizes of 256 × 256 and 512 × 512 are 30.94 s and 110.33 s, respectively. The denoising performance obtained using only one Transformer block is comparable to that of multiple blocks, with an acceptable difference in PSNR, yet it requires the least execution time (only between one-eighth and one-fifth of the maximum execution time). Therefore, the proposed Stage2 model in this paper utilizes only a single Transformer block.

4.3. Experimental Results and Analysis

In this section, we present the PSNR results of our proposed two-stage denoising method. To validate the effectiveness of our approach, we conducted experiments on both synthetic and real-world noisy images. Initially, we performed denoising experiments on synthetic images from the SET12, BSD68, and Urban100 datasets and compared them with other supervised and unsupervised denoising methods. Table 5, Table 6 and Table 7 provide a detailed comparison of various denoising methods on each noisy image from the SET12 dataset, while Table 8 provides an overall performance comparison of various denoising methods on the SET12, BSD68, and Urban100 datasets at different noise levels. The best results are highlighted in bold. It can be observed from Table 5, Table 6, Table 7 and Table 8 that, in terms of the PSNR evaluation metric, the proposed method outperforms the second-ranked algorithm by a margin of 0.4 dB on the SET12 dataset, 0.14 dB on the BSD68 dataset, and 0.30 dB on the Urban100 dataset. This indicates that our proposed method achieves superior results compared to mainstream supervised learning approaches such as FFDNet, DAGL, DeamNet, DRUNet, and SwinIR, in addition to unsupervised learning methods such as DIP and N2V.
Furthermore, to assess the efficacy of our proposed methodology on real-world noisy images, we conducted experiments on a randomly selected subset of 10 images from the SIDD dataset and compared the denoising outcomes with state-of-the-art algorithms including BM3D, FFDNet, SwinIR, DRUNet, DnCNN, IRCNN, and DIP. Given that real-world noisy images present greater complexity than synthetic ones, conventional approaches exhibit comparatively inferior performance when confronted with real-world noise. Table 9 displays the average PSNR value of the competing methods, with the best result highlighted in bold. Based on the experimental results, our proposed method outperforms the second-ranked denoiser by 0.7 dB, which demonstrates the distinct advantages of our approach for addressing real-world denoising problems.

4.4. Visual Comparisons

In image tasks, visual quality is an important indicator for evaluating model performance. To visually compare the denoising effect of our proposed method, after applying Gaussian noise with a level of 25 to the Airplane image from the Set12 dataset, we used various competing methods to denoise the noisy Airplane image and calculated the PSNR value of each denoised image (see Figure 9). The visual contrast of the locally enlarged area is also shown in Figure 9. For better visualization, we chose to evaluate the text on the wing section of the airplane and enlarged it. As can be seen from Figure 9, the denoised images obtained by the SwinIR and Stage1 methods are overly smooth, resulting in the loss of texture details; the DIP method has a poor denoising effect; and the denoised lines obtained by the DAGL, DnCNN, FFDNet, and N2V methods are slightly fuzzy. In contrast, our proposed method retains good texture details and improves visibility in the enlarged area, indicating that the denoising result is closer to the ground-truth image; this observation is also confirmed by the PSNR value. In summary, compared with the above methods, the proposed improvements give our two-stage denoising technique the strongest ability to preserve image details, achieving remarkable results.
Additionally, as illustrated in Figure 10, experiments conducted on real SIDD datasets to assess the generalization capability of the proposed method (i.e., Stage2 model) demonstrate effective noise elimination while preserving details within the magnification box. The overall color reproduction of the Stage2 model is observed to be closer to the original image compared to other methods and Stage1. Although DIP can also eliminate noise, its denoising effect is not as good as the method described in this paper, and its performance in terms of PSNR is also significantly inferior to ours.

5. Conclusions

In this paper, a novel two-stage denoising method supported by an unsupervised fine-tuning strategy is proposed to boost the denoising effect with the minimum time cost. Compared to the original DIP, our method takes the improved U-Transformer-generated high-quality preprocessed image as the target image, and we employ a simplified network model with only one Transformer block for fine-tuning, achieving superior denoising results with a relatively short execution time. The empirical results show that our approach exceeds both the original DIP technique and the most advanced supervised denoising methods, especially in scenarios where high-quality denoising is required. The reasons for this can be attributed to the following: (1) From a macro perspective, our proposed approach effectively integrates both supervised and unsupervised models while offering several advantages. It initially employs supervised denoising training to acquire the preprocessed image, characterized by high image quality, rendering it suitable for use as the target image. Based on this, the simplified unsupervised network can be used for fine-tuning, which significantly improves the overall denoising effect. (2) The unsupervised optimization process relies solely on a simplified network consisting of one Transformer block, which greatly speeds up the entire denoising process and improves efficiency. In summary, our proposed solution offers a promising and effective approach that demonstrates significant enhancements in removing noise from both synthetic and real-world images, surpassing other mainstream denoising methods. It should be noted that the proposed unsupervised fine-tuning method exhibits excellent scalability. In future research, as new powerful supervised models are introduced, the proposed method is poised to leverage these advancements, anticipating even higher denoising performance.

Author Contributions

Conceptualization, X.J. and S.X.; methodology, S.X.; software, X.J. and C.Z.; validation, X.J., C.Z. and S.J.; formal analysis, X.J., J.W. and S.X.; investigation, X.J. and S.J.; resources, S.X. and X.J.; data curation, X.J. and S.J.; writing—initial draft preparation, S.J.; review and editing, X.J., J.W. and S.X.; visualization, C.Z. and X.J.; supervision, S.X. and X.J.; project administration, S.X.; funding acquisition, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China, grant number 62162043.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, S.L.A.; Kouzani, A.Z.; Hu, E.J. Automated detection of lung nodules in computed tomography images: A review. Mach. Vis. Appl. 2012, 23, 151–163. [Google Scholar] [CrossRef]
  2. Zha, Z.; Yuan, X.; Wen, B.; Zhou, J.; Zhang, J.; Zhu, C. From rank estimation to rank approximation: Rank residual constraint for image restoration. IEEE Trans. Image Process. 2019, 29, 3254–3269. [Google Scholar] [CrossRef] [PubMed]
  3. Chen, H.; Gu, J.; Liu, Y.; Magid, S.A.; Dong, C.; Wang, Q.; Pfister, H.; Zhu, L. Masked image training for generalizable deep image denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2023), Vancouver, BC, Canada, 17–24 June 2023; pp. 1692–1703. [Google Scholar]
  4. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 11, 257–276. [Google Scholar] [CrossRef]
  5. Neshatavar, R.; Yavartanoo, M.; Son, S.; Lee, K.M. CVF-SID: Cyclic multi-variate function for self-supervised image denoising by disentangling noise from image. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 17583–17591. [Google Scholar]
  6. Zhang, Y.; Li, D.; Law, K.L.; Wang, X.; Qin, H.; Li, H. Idr: Self-supervised image denoising via iterative data refinement. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022; pp. 2098–2107. [Google Scholar]
  7. Zhang, Y.; Li, K.; Li, K.; Sun, G.; Kong, Y.; Fu, Y. Accurate and fast image denoising via attention guided scaling. IEEE Trans. Image Process. 2021, 30, 6255–6265. [Google Scholar] [CrossRef] [PubMed]
  8. Byun, J.; Cha, S.; Moon, T. Fbi-denoiser: Fast blind image denoiser for poisson-gaussian noise. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 20–25 June 2021; pp. 5768–5777. [Google Scholar]
  9. Pang, T.; Zheng, H.; Quan, Y.; Ji, H. Recorrupted-to-recorrupted: Unsupervised deep learning for image denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 20–25 June 2021; pp. 2043–2052. [Google Scholar]
  10. Cheng, S.; Wang, Y.; Huang, H.; Liu, D.; Fan, H.; Liu, S. Nbnet: Noise basis learning for image denoising with subspace projection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 20–25 June 2021; pp. 4896–4906. [Google Scholar]
  11. Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2005), San Diego, CA, USA, 20–25 June 2005; pp. 60–65. [Google Scholar]
  12. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095. [Google Scholar] [CrossRef]
  13. Gu, S.; Zhang, L.; Zuo, W.; Feng, X. Weighted nuclear norm minimization with application to image denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2014), Columbus, OH, USA, 23–28 June 2014; pp. 2862–2869. [Google Scholar]
  14. Song, L.; Huang, H. Simultaneous destriping and image denoising using a nonparametric model with the em algorithm. IEEE Trans. Image Process. 2023, 32, 1065–1077. [Google Scholar] [CrossRef] [PubMed]
  15. Liang, J.; Liu, R. Stacked denoising autoencoder and dropout together to prevent overfitting in deep neural network. In Proceedings of the International Congress on Image and Signal Processing (CISP 2015), Shenyang, China, 14–16 October 2015; pp. 697–701. [Google Scholar]
  16. Xu, Q.; Zhang, C.; Zhang, L. Denoising convolutional neural network. In Proceedings of the International Conference on Information and Automation (ICIA 2015), Lijiang, China, 8–10 August 2015; pp. 1184–1187. [Google Scholar]
  17. Jain, V.; Seung, S. Natural image denoising with convolutional networks. In Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–11 December 2008; Volume 21. [Google Scholar]
  18. Chen, Y.; Pock, T. Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1256–1272. [Google Scholar] [CrossRef]
  19. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef]
  20. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622. [Google Scholar] [CrossRef]
  21. Zhong, Y.; Liu, L.; Zhao, D.; Li, H. A generative adversarial network for image denoising. Multimed. Tools Appl. 2020, 79, 16517–16529. [Google Scholar] [CrossRef]
  22. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 1712–1722. [Google Scholar]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2016), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  24. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning (ICML 2015), Lille, France, 6–11 July 2015; pp. 448–456. [Google Scholar]
  25. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  26. Karaca, N.; Çiftçi, S. Image denoising with CNN-based attention. In Proceedings of the 2023 4th International Informatics and Software Engineering Conference (IISEC), Ankara, Turkiye, 21–22 December 2023; pp. 1–6. [Google Scholar]
  27. Devalla, S.K.; Renukanand, P.K.; Sreedhar, B.K.; Subramanian, G.; Zhang, L.; Perera, S.; Mari, J.M.; Chin, K.S.; Tun, T.A.; Strouthidis, N.G.; et al. DRUNET: A dilated-residual U-Net deep learning network to segment optic nerve head tissues in optical coherence tomography images. Biomed. Opt. Express 2018, 9, 3244–3265. [Google Scholar] [CrossRef] [PubMed]
  28. Anwar, S.; Barnes, N. Real image denoising with feature attention. In Proceedings of the International Conference on Computer Vision (ICCV 2019), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3155–3164. [Google Scholar]
  29. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the International Conference on Computer Vision Workshops (ICCVW 2021), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  30. Jaakko, L.; Jacob, M.; Jon, H.; Samuli, L.; Tero, K.; Miika, A.; Timo, A. Noise2Noise: Learning image restoration without clean data. In Proceedings of the International Conference on Machine Learning (ICML 2018), Stockholm, Sweden, 10–15 July 2018; pp. 2971–2980. [Google Scholar]
  31. Krull, A.; Buchholz, T.O.; Jug, F. Noise2Void - learning denoising from single noisy images. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, USA, 15–20 June 2019; pp. 2124–2132. [Google Scholar]
  32. Batson, J.; Royer, L. Noise2self: Blind denoising by self-supervision. In Proceedings of the International Conference on Machine Learning (ICML 2019), Long Beach, CA, USA, 10–15 June 2019; pp. 524–533. [Google Scholar]
  33. Moran, N.; Schmidt, D.; Zhong, Y.; Coady, P. Noisier2noise: Learning to denoise from unpaired noisy data. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 12064–12072. [Google Scholar]
  34. Xu, J.; Huang, Y.; Cheng, M.M.; Liu, L.; Zhu, F.; Xu, Z.; Shao, L. Noisy-as-clean: Learning self-supervised denoising from corrupted image. IEEE Trans. Image Process. 2020, 29, 9316–9329. [Google Scholar] [CrossRef] [PubMed]
  35. Huang, T.; Li, S.; Jia, X.; Lu, H.; Liu, J. Neighbor2Neighbor: Self-supervised denoising from single noisy images. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2021), Nashville, TN, USA, 20–25 June 2021; pp. 14776–14785. [Google Scholar]
  36. Wang, J.; Di, S.; Chen, L.; Wai Ng, C.W. Noise2Info: Noisy image to information of noise for self-supervised image denoising. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 15988–15997. [Google Scholar]
  37. Quan, Y.; Chen, M.; Pang, T.; Ji, H. Self2self with dropout: Learning self-supervised denoising from single image. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2020), Seattle, WA, USA, 13–19 June 2020; pp. 1887–1895. [Google Scholar]
  38. Mansour, Y.; Heckel, R. Zero-Shot Noise2Noise: Efficient Image Denoising without any Data. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14018–14027. [Google Scholar]
  39. Ulyanov, D.; Vedaldi, A.; Lempitsky, V. Deep image prior. Int. J. Comput. Vis. 2020, 128, 1867–1888. [Google Scholar] [CrossRef]
  40. Rudin, L.I.; Osher, S.; Fatemi, E. Nonlinear total variation based noise removal algorithms. Phys. D Nonlinear Phenom. 1992, 60, 259–268. [Google Scholar] [CrossRef]
  41. Candès, E.J.; Donoho, D.L. Curvelets: A Surprisingly Effective Nonadaptive Representation for Objects with Edges; Department of Statistics, Stanford University: Stanford, CA, USA, 1999. [Google Scholar]
  42. Nikolova, M. A variational approach to remove outliers and impulse noise. J. Math. Imaging Vis. 2004, 20, 99–120. [Google Scholar] [CrossRef]
  43. Elad, M.; Aharon, M. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process. 2006, 15, 3736–3745. [Google Scholar] [CrossRef] [PubMed]
  44. Chen, G.; Kégl, B. Image denoising with complex ridgelets. Pattern Recognit. 2007, 40, 578–585. [Google Scholar] [CrossRef]
  45. Foi, A.; Katkovnik, V.; Egiazarian, K. Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images. IEEE Trans. Image Process. 2007, 16, 1395–1411. [Google Scholar] [CrossRef]
  46. Mallat, S. A Wavelet Tour of Signal Processing; Elsevier: Amsterdam, The Netherlands, 1999. [Google Scholar]
  47. Lefkimmiatis, S. Universal denoising networks: A novel CNN architecture for image denoising. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR 2018), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3204–3213. [Google Scholar]
  48. Lin, K.; Li, T.H.; Liu, S.; Li, G. Real photographs denoising with noise domain adaptation and attentive generative adversarial network. In Proceedings of the Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2019), Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  49. Li, G.; Xu, X.; Zhang, M.; Liu, Q. Densely connected network for impulse noise removal. Pattern Anal. Appl. 2020, 23, 1263–1275. [Google Scholar] [CrossRef]
  50. Lyu, Q.; Guo, M.; Pei, Z. DeGAN: Mixed noise removal via generative adversarial networks. Appl. Soft Comput. 2020, 95, 106478. [Google Scholar] [CrossRef]
  51. Tian, C.; Xu, Y.; Li, Z.; Zuo, W.; Fei, L.; Liu, H. Attention-guided CNN for image denoising. Neural Netw. 2020, 124, 117–129. [Google Scholar] [CrossRef] [PubMed]
  52. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  53. Zhang, K.; Zuo, W.; Gu, S.; Zhang, L. Learning deep CNN denoiser prior for image restoration. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2808–2817. [Google Scholar]
  54. Mou, C.; Zhang, J.; Wu, Z. Dynamic attentive graph learning for image restoration. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 4308–4317. [Google Scholar]
  55. Ren, C.; He, X.; Wang, C.; Zhao, Z. Adaptive consistency prior based deep network for image denoising. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8592–8602. [Google Scholar]
  56. Zhang, K.; Li, Y.; Zuo, W.; Zhang, L.; Van Gool, L.; Timofte, R. Plug-and-Play image restoration with deep denoiser prior. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 6360–6376. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The U-shaped network architecture employed by the DIP method.
Figure 2. The architecture of our two-stage denoising approach. It consists of two stages: supervised preprocessing and fine-tuned image generation. The first stage takes a noisy image, y, and generates a preprocessed image, x_p, while the second stage fine-tunes this image and produces an improved denoised image, x_f, as the final output.
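To make the data flow of Figure 2 concrete, the following minimal Python/PyTorch sketch shows how the two stages could be chained. The function names (u_transformer, fine_tune_fn) are illustrative placeholders and not the authors' released code.

```python
import torch

def denoise_two_stage(noisy: torch.Tensor, u_transformer, fine_tune_fn) -> torch.Tensor:
    """Hypothetical wrapper around the two stages shown in Figure 2.

    Stage 1: the pretrained, supervised U-Transformer maps the noisy input y
             to a preprocessed image x_p.
    Stage 2: a lightweight network is fine-tuned on this single image in an
             unsupervised fashion and returns the final output x_f.
    """
    with torch.no_grad():
        x_p = u_transformer(noisy)      # supervised preprocessing (fixed weights)
    x_f = fine_tune_fn(noisy, x_p)      # per-image unsupervised fine-tuning
    return x_f
```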
Figure 3. The architecture of the proposed U-Transformer. Our U-Transformer employs a U-shaped framework with the Transformer block at its core. Each Transformer block comprises two sequentially executed modules: a multi-head self-attention (MSA) module and a feed-forward network (FFN) module. Within the MSA module, depthwise convolutions capture local context before cross-covariance is computed across channels, generating an attention map that implicitly encodes global context. The FFN module consists of two branches: one performs normal convolutional feature extraction, while the other generates gating weights. These weights are multiplied element-wise with the features, and the result is combined with a residual connection.
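As a reading aid for Figure 3, the snippet below is a minimal PyTorch sketch of such a Transformer block: channel-wise (cross-covariance) self-attention preceded by depthwise convolutions, followed by a gated feed-forward branch. Layer sizes, the normalization choice (GroupNorm), and the module names are our assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelfAttention(nn.Module):
    """MSA variant sketched in Figure 3: depthwise convolutions capture local
    context, then attention is computed as cross-covariance across channels."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.qkv_dw = nn.Conv2d(dim * 3, dim * 3, 3, padding=1, groups=dim * 3)
        self.proj = nn.Conv2d(dim, dim, 1)
        self.temperature = nn.Parameter(torch.ones(heads, 1, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv_dw(self.qkv(x)).chunk(3, dim=1)
        # reshape to (batch, heads, channels_per_head, pixels)
        q = q.reshape(b, self.heads, c // self.heads, h * w)
        k = k.reshape(b, self.heads, c // self.heads, h * w)
        v = v.reshape(b, self.heads, c // self.heads, h * w)
        attn = F.normalize(q, dim=-1) @ F.normalize(k, dim=-1).transpose(-2, -1)
        attn = (attn * self.temperature).softmax(dim=-1)   # channel-by-channel attention map
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out)

class GatedFFN(nn.Module):
    """FFN with two branches: one extracts features, the other produces gating
    weights that are multiplied element-wise with the features."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Conv2d(dim, hidden * 2, 1)
        self.dw = nn.Conv2d(hidden * 2, hidden * 2, 3, padding=1, groups=hidden * 2)
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):
        feat, gate = self.dw(self.expand(x)).chunk(2, dim=1)
        return self.project(feat * F.gelu(gate))

class TransformerBlock(nn.Module):
    """MSA and FFN executed sequentially, each with a residual connection."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.GroupNorm(1, dim), nn.GroupNorm(1, dim)
        self.msa, self.ffn = ChannelSelfAttention(dim, heads), GatedFFN(dim)

    def forward(self, x):
        x = x + self.msa(self.norm1(x))
        return x + self.ffn(self.norm2(x))
```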
Figure 4. The architecture of fine-tuned image generation. The Transformer block used therein is the same as in Figure 3.
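The fine-tuning stage of Figure 4 can be summarized as a short, DIP-style optimization loop over a single image. The sketch below assumes the lightweight stage-2 network takes the preprocessed image x_p as input and is trained with the composite objective compared later in Table 2; the step count, learning rate, and input convention are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fine_tune(noisy, x_p, model, steps=300, lr=1e-3, lam=0.4):
    """Unsupervised, per-image fine-tuning of the lightweight stage-2 network.

    `model` is a small network built around a single Transformer block;
    `lam` plays the role of the noise-level-dependent weight reported in Table 3.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        x_f = model(x_p)                                            # refined estimate
        loss = lam * F.mse_loss(x_f, noisy) + F.mse_loss(x_f, x_p)  # composite objective
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return model(x_p)                                           # final output x_f
```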
Figure 5. Twelve widely used test images in the SET12 dataset. (a) Cameraman. (b) House. (c) Peppers. (d) Starfish. (e) Monarch. (f) Airplane. (g) Parrot. (h) Lena. (i) Barbara. (j) Boat. (k) Man. (l) Couple.
Figure 6. Ten representative test images in the BSD68 dataset.
Figure 7. Ten representative test images in the Urban100 dataset.
Figure 8. Ten representative test images in the SIDD dataset.
Figure 9. Visual comparison of denoising effects using different methods on the Airplane image corrupted by a noise level of 25. (a) Original. (b) Noisy. (c) DIP/27.89 dB. (d) BM3D/28.42 dB. (e) IRCNN/29.05 dB. (f) FFDNet/29.05 dB. (g) DnCNN/29.06 dB. (h) DRUNet/29.35 dB. (i) DAGL/29.41 dB. (j) SwinIR/29.44 dB. (k) Stage1/29.47 dB. (l) Stage2/29.83 dB.
Figure 10. Visual comparison of denoising effects using different methods on a real noisy image from the SIDD dataset. (a) Original. (b) Noisy. (c) DRUNet/29.35 dB. (d) SwinIR/29.61 dB. (e) DnCNN/29.62 dB. (f) BM3D/37.07 dB. (g) IRCNN/37.62 dB. (h) FFDNet/37.94 dB. (i) DAGL/38.00 dB. (j) DIP/38.68 dB. (k) Stage1/41.15 dB. (l) Stage2/42.76 dB.
Table 1. Average PSNR values of different methods on the Set12 dataset with a noise level of 25. x_p denotes the preprocessed image and x_f the final output; the Improvement row gives the PSNR gain of the final output over the preprocessed image.

Image       | DnCNN | DAGL  | DRUNet | IRCNN | SwinIR | U-Transformer (Stage1)
x_p         | 30.35 | 30.86 | 30.94  | 30.37 | 31.01  | 31.04
x_f         | 30.65 | 31.04 | 31.10  | 30.57 | 31.17  | 31.23
Improvement | +0.30 | +0.18 | +0.16  | +0.20 | +0.16  | +0.19
Table 2. Average PSNR values of different loss functions on the Set12 dataset with a noise level of 25. Bold values indicate the best PSNR result.

Loss Function | L1 = λ·||x_f - y||² + ||x_f - x_p||² | L2 = ||x_f - y||² | L3 = ||x_f - x_p||²
PSNR          | 31.23                                | 30.91             | 30.90
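Written out explicitly (with x_f the fine-tuned output, x_p the preprocessed image, y the noisy input, and λ the weight tuned in Table 3), the three objectives compared in Table 2 read as follows; the squared L2-norm form is the standard reading of the notation above.

```latex
\begin{aligned}
\mathcal{L}_1 &= \lambda \,\lVert x_f - y \rVert_2^2 + \lVert x_f - x_p \rVert_2^2,\\
\mathcal{L}_2 &= \lVert x_f - y \rVert_2^2,\\
\mathcal{L}_3 &= \lVert x_f - x_p \rVert_2^2 .
\end{aligned}
```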
Table 3. The optimal weights for different noise levels on the Set12 and BSD68 datasets.

Noise Level | 15  | 25  | 50
Weight      | 0.2 | 0.4 | 0.8
Table 4. The execution time (and corresponding PSNR) required for denoising images of different resolutions using different numbers of Transformer blocks.

Number of Blocks | Size      | Time (s) | PSNR
1                | 256 × 256 | 30.94    | 30.98
1                | 512 × 512 | 110.33   | 33.07
2                | 256 × 256 | 60.24    | 31.05
2                | 512 × 512 | 212.45   | 33.12
3                | 256 × 256 | 89.54    | 31.18
3                | 512 × 512 | 317.49   | 33.13
4                | 256 × 256 | 119.52   | 31.17
4                | 512 × 512 | 424.34   | 33.20
5                | 256 × 256 | 147.92   | 31.16
5                | 512 × 512 | 524.55   | 33.13
Table 5. PSNR values of different methods on 12 images in the Set12 dataset with a noise level of 15. Bold values indicate the best PSNR results.

Methods   | BM3D  | DnCNN | FFDNet | DIP   | DAGL  | DeamNet | DRUNet | IRCNN | SwinIR | Stage1 | Stage2
Cameraman | 31.90 | 32.14 | 32.42  | 30.79 | 32.69 | 32.76   | 32.91  | 32.53 | 33.04  | 33.02  | 33.16
House     | 34.92 | 34.96 | 35.01  | 33.89 | 35.76 | 35.68   | 35.83  | 34.88 | 35.91  | 36.18  | 36.29
Peppers   | 32.71 | 33.09 | 33.10  | 32.03 | 33.41 | 33.42   | 33.56  | 33.21 | 33.66  | 33.63  | 33.75
Starfish  | 31.12 | 31.92 | 32.02  | 31.03 | 32.61 | 32.50   | 32.44  | 31.96 | 32.57  | 32.58  | 32.65
Monarch   | 31.95 | 33.08 | 32.77  | 31.55 | 33.48 | 33.51   | 33.61  | 32.98 | 33.68  | 33.73  | 33.83
Airplane  | 31.10 | 31.54 | 31.58  | 30.42 | 31.95 | 31.90   | 31.99  | 31.66 | 32.08  | 32.07  | 32.20
Eagle     | 31.31 | 31.64 | 31.77  | 30.60 | 31.96 | 32.01   | 32.13  | 31.88 | 32.16  | 32.14  | 32.31
Lena      | 34.23 | 34.52 | 34.63  | 33.42 | 34.84 | 34.86   | 34.93  | 34.50 | 34.97  | 34.99  | 35.04
Barbara   | 33.04 | 32.03 | 32.50  | 29.46 | 33.73 | 33.16   | 33.44  | 32.41 | 33.98  | 33.56  | 33.85
Boat      | 32.09 | 32.36 | 32.35  | 31.09 | 32.64 | 32.62   | 32.71  | 32.36 | 32.81  | 32.83  | 32.97
Man       | 31.90 | 32.37 | 32.40  | 30.77 | 32.52 | 32.52   | 32.61  | 32.36 | 32.67  | 32.63  | 32.68
Couple    | 32.04 | 32.38 | 32.45  | 30.86 | 32.67 | 32.68   | 32.78  | 32.37 | 32.83  | 32.81  | 32.89
Avg.      | 32.36 | 32.67 | 32.75  | 31.33 | 33.19 | 33.13   | 33.25  | 32.76 | 33.36  | 33.35  | 33.47
Table 6. PSNR values of different methods on 12 images in the Set12 dataset with a noise level of 25. Bold values indicate the best PSNR results.

Methods   | BM3D  | DnCNN | FFDNet | DIP   | DAGL  | DeamNet | DRUNet | IRCNN | SwinIR | Stage1 | Stage2
Cameraman | 29.38 | 30.03 | 33.06  | 28.36 | 30.32 | 30.38   | 30.61  | 30.12 | 30.69  | 30.72  | 30.98
House     | 32.86 | 33.04 | 33.27  | 31.61 | 33.80 | 33.73   | 33.93  | 33.02 | 33.91  | 34.18  | 34.45
Peppers   | 30.20 | 30.73 | 30.79  | 29.20 | 31.04 | 31.09   | 31.22  | 30.81 | 31.28  | 31.31  | 31.49
Starfish  | 28.52 | 29.24 | 29.33  | 28.58 | 30.09 | 29.89   | 29.88  | 29.21 | 29.96  | 30.01  | 30.17
Monarch   | 29.30 | 30.37 | 30.14  | 28.97 | 30.72 | 30.72   | 30.89  | 30.20 | 30.95  | 31.02  | 31.21
Airplane  | 28.42 | 29.06 | 29.05  | 27.93 | 29.41 | 29.30   | 29.35  | 29.05 | 29.44  | 29.47  | 29.80
Eagle     | 28.94 | 29.35 | 29.43  | 28.07 | 29.54 | 29.60   | 29.72  | 29.47 | 29.75  | 29.75  | 30.04
Lena      | 32.07 | 32.40 | 32.59  | 31.32 | 32.84 | 32.85   | 32.97  | 32.40 | 32.98  | 33.00  | 33.07
Barbara   | 30.66 | 29.67 | 29.98  | 28.94 | 31.49 | 30.74   | 31.23  | 29.93 | 31.69  | 31.39  | 31.64
Boat      | 29.87 | 30.19 | 30.23  | 29.08 | 30.49 | 30.48   | 30.58  | 30.17 | 30.64  | 30.73  | 30.83
Man       | 29.59 | 30.06 | 30.10  | 28.82 | 30.20 | 30.17   | 30.30  | 30.02 | 30.32  | 30.33  | 30.38
Couple    | 29.70 | 30.05 | 30.18  | 28.61 | 30.38 | 30.44   | 30.56  | 30.05 | 30.57  | 30.60  | 30.70
Avg.      | 29.96 | 30.35 | 30.43  | 29.12 | 30.86 | 30.78   | 30.94  | 30.37 | 31.01  | 31.04  | 31.23
Table 7. PSNR values of different methods on 12 images in the Set12 dataset with a noise level of 50. Bold values indicate the best PSNR results.

Methods   | BM3D  | DnCNN | FFDNet | DIP   | DAGL  | DeamNet | DRUNet | IRCNN | SwinIR | Stage1 | Stage2
Cameraman | 26.19 | 27.03 | 27.03  | 24.56 | 27.42 | 27.42   | 27.80  | 27.16 | 27.79  | 27.88  | 28.56
House     | 29.57 | 29.91 | 30.43  | 28.42 | 31.04 | 31.16   | 31.26  | 29.91 | 31.11  | 31.56  | 32.13
Peppers   | 26.72 | 27.35 | 27.43  | 25.73 | 27.60 | 27.76   | 27.87  | 27.33 | 27.91  | 28.04  | 28.63
Starfish  | 24.90 | 25.60 | 25.77  | 24.53 | 26.46 | 26.47   | 26.49  | 25.48 | 26.55  | 26.66  | 27.12
Monarch   | 25.71 | 26.84 | 26.88  | 25.59 | 27.30 | 27.18   | 27.31  | 26.66 | 27.31  | 27.39  | 27.92
Airplane  | 25.17 | 25.82 | 25.90  | 24.41 | 26.21 | 26.07   | 26.08  | 25.78 | 26.14  | 26.17  | 26.82
Eagle     | 25.87 | 26.48 | 26.58  | 24.53 | 26.66 | 26.71   | 26.92  | 26.48 | 26.91  | 26.92  | 27.75
Lena      | 29.02 | 29.34 | 29.68  | 28.02 | 29.92 | 29.93   | 30.15  | 29.36 | 30.11  | 30.19  | 30.51
Barbara   | 27.21 | 26.32 | 26.48  | 23.29 | 28.26 | 27.60   | 28.16  | 26.17 | 28.41  | 28.31  | 28.74
Boat      | 26.73 | 27.18 | 27.32  | 25.78 | 27.56 | 27.59   | 27.66  | 27.17 | 27.70  | 27.83  | 28.08
Man       | 26.79 | 27.17 | 27.31  | 26.08 | 27.39 | 27.36   | 27.46  | 27.14 | 27.45  | 27.51  | 27.72
Couple    | 26.47 | 26.87 | 27.07  | 25.26 | 27.33 | 27.43   | 27.59  | 26.86 | 27.53  | 27.61  | 27.88
Avg.      | 26.70 | 27.18 | 27.32  | 25.52 | 27.75 | 27.72   | 27.90  | 27.12 | 27.91  | 28.01  | 28.48
Table 8. Average PSNR results of different methods on Set12, BSD68, and Urban30 with noise levels of 15, 25, and 50. Bold values indicate the best PSNR results.

Methods | Noise Level = 15        | Noise Level = 25        | Noise Level = 50        | Average
        | Set12 | BSD68 | Urban30 | Set12 | BSD68 | Urban30 | Set12 | BSD68 | Urban30 | Set12 | BSD68 | Urban30
BM3D    | 32.36 | 31.07 | 33.14   | 29.96 | 28.56 | 30.45   | 26.70 | 25.62 | 26.36   | 29.67 | 28.42 | 29.98
DnCNN   | 32.67 | 31.62 | 33.30   | 30.35 | 29.17 | 30.44   | 27.18 | 26.23 | 26.47   | 30.07 | 29.01 | 30.07
FFDNet  | 32.75 | 31.65 | 32.92   | 30.43 | 29.21 | 30.34   | 27.32 | 26.30 | 26.74   | 30.17 | 29.06 | 30.30
DAGL    | 33.19 | 31.87 | 34.99   | 30.87 | 29.42 | 32.88   | 27.75 | 26.49 | 28.70   | 30.60 | 29.26 | 32.19
DeamNet | 33.13 | 31.87 | 34.29   | 30.78 | 29.42 | 31.67   | 27.72 | 26.53 | 28.09   | 30.55 | 29.27 | 31.35
DRUNet  | 33.25 | 31.92 | 34.48   | 30.94 | 29.48 | 32.15   | 27.90 | 26.59 | 28.70   | 30.69 | 29.33 | 31.77
IRCNN   | 32.76 | 31.64 | 33.05   | 30.37 | 29.15 | 30.23   | 27.12 | 26.19 | 26.40   | 30.08 | 28.99 | 29.89
DIP     | 30.03 | 29.76 | 28.75   | 29.12 | 27.38 | 27.88   | 25.52 | 24.17 | 24.44   | 28.22 | 27.11 | 27.02
SwinIR  | 33.36 | 31.97 | 34.95   | 31.01 | 29.51 | 32.54   | 27.91 | 26.59 | 29.02   | 30.76 | 29.36 | 32.17
Stage1  | 33.35 | 31.96 | 35.04   | 31.04 | 29.52 | 32.69   | 28.01 | 26.63 | 29.40   | 30.80 | 29.37 | 32.27
Stage2  | 33.47 | 32.00 | 35.18   | 31.23 | 29.61 | 32.91   | 28.48 | 26.88 | 29.62   | 31.06 | 29.50 | 32.57
Table 9. Average PSNR values of different methods on 10 images in the SIDD dataset. Bold values indicate the best PSNR result.

Methods | BM3D  | DnCNN | FFDNet | DIP   | DAGL  | DRUNet | IRCNN | SwinIR | Stage1 | Stage2
PSNR    | 29.67 | 25.50 | 28.32  | 35.74 | 37.52 | 26.95  | 29.54 | 29.24  | 37.69  | 38.27
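All quantitative comparisons above report PSNR in dB. For reference, a minimal computation is sketched below, assuming images scaled to [0, 1]; for 8-bit images, data_range would be 255.

```python
import numpy as np

def psnr(reference: np.ndarray, estimate: np.ndarray, data_range: float = 1.0) -> float:
    """Peak signal-to-noise ratio (in dB) between a clean reference and a denoised estimate."""
    mse = np.mean((reference.astype(np.float64) - estimate.astype(np.float64)) ** 2)
    return 10.0 * np.log10((data_range ** 2) / mse)
```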