Article

Deep Learning-Based Synthesized View Quality Enhancement with DIBR Distortion Mask Prediction Using Synthetic Images

School of Information Engineering, Guangdong University of Technology, Guangzhou 510006, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(21), 8127; https://doi.org/10.3390/s22218127
Submission received: 2 October 2022 / Revised: 20 October 2022 / Accepted: 20 October 2022 / Published: 24 October 2022

Abstract

Recently, deep learning-based image quality enhancement models have been proposed to improve the perceptual quality of distorted synthesized views impaired by compression and the Depth Image-Based Rendering (DIBR) process in multi-view video systems. However, due to the lack of Multi-view Video plus Depth (MVD) data, the training data available for quality enhancement models are limited, which restricts the performance and progress of these models. Augmenting the training data is a feasible way to strengthen synthesized view quality enhancement (SVQE) models. In this paper, a deep learning-based SVQE model using additional synthetic synthesized view images (SVIs) is suggested. To simulate the irregular geometric displacement of DIBR distortion, a random irregular polygon-based SVI synthesis method built on existing massive RGB/RGBD data is proposed, and a synthetic synthesized view database is constructed that includes synthetic SVIs and the corresponding DIBR distortion masks. Moreover, to further guide the SVQE models to focus more precisely on DIBR distortion, a DIBR distortion mask prediction network that predicts the position and variance of the DIBR distortion is embedded into the SVQE models. The experimental results on public MVD sequences demonstrate that the PSNR performance of existing SVQE models, e.g., DnCNN, NAFNet, and TSAN, pre-trained on NYU-based synthetic SVIs can be improved by 0.51, 0.36, and 0.26 dB on average, respectively, while the MPPSNRr performance can be elevated by 0.86, 0.25, and 0.24 on average, respectively. In addition, by introducing the DIBR distortion mask prediction network, the SVI quality obtained by DnCNN and NAFNet pre-trained on NYU-based synthetic SVIs can be further enhanced by 0.02 and 0.03 dB on average in terms of PSNR and by 0.004 and 0.121 on average in terms of MPPSNRr.

1. Introduction

With the development of video capture and display technology, 3D video systems can provide people with an increasingly immersive and realistic sensation, such as six-degrees-of-freedom (6DoF) video, which approaches the experience of interacting with the real world. However, along with the sensory impact of an immersive visual experience, the associated data volume has increased dozens of times, which brings great challenges to the collection, storage, and transmission of virtual reality content. In order to alleviate the pressure on storage and bandwidth, it is necessary to increase the compression ratio or use sparser viewpoints to synthesize the virtual (synthesized) view. These processes inevitably introduce distortion into the video and damage the perceptual quality experienced by users. To improve users’ visual experience, it is necessary to enhance the image quality of the synthesized view.
A synthesized view image (SVI) contains both compression distortion and synthesis distortion caused by the DIBR process, and conventional image denoising and restoration models struggle to eliminate these distortions because of their complexity. Learning-based image denoising and restoration models have proved effective at dealing with such distortions thanks to their powerful learning ability. In SynVD-Net [1], eliminating compression and DIBR distortion in the synthesized video was modeled as a perceptual video denoising problem, and a perceptual loss was derived and integrated with image denoising/restoration models, e.g., DnCNN [2], a U-shaped sub-network of CBDNet [3], and a Residual Dense Network (RDN) [4], to enhance the perceptual quality. In [5], a Two-Stream Attention Network (TSAN) was proposed by combining a global stream, which extracts global context information, with a local stream, which extracts local variance information. Following [5], a Residual Distillation Enhanced Network (RDEN)-guided lightweight synthesized video quality enhancement (SVQE) method [6] was proposed to reduce the huge complexity while effectively dealing with the distortion in the synthesized view. However, the existing Multi-view Video plus Depth (MVD) databases contain few sequences; thus, insufficient noisy/clean sample pairs with varied content are available for learning. This may limit the ability of SVQE models and make it difficult to fairly evaluate their capabilities.
How to improve SVQE model performance with limited training data remains a non-trivial problem. There are different ways to improve the performance, such as data augmentation, model structure regularization, pre-training, transfer learning, and semi-supervised learning, among which the latter three learning-based techniques are usually conducted together with data augmentation. Because massive natural RGB or RGBD images are easily accessible, in this paper, these data are utilized to simulate DIBR distortion and construct synthetic SVIs, on which the SVQE models can first be pre-trained before being fine-tuned on the limited MVD data. In addition, to further improve the quality of SVIs, human perception of the virtual view is considered by embedding into the SVQE models a DIBR distortion mask prediction network that predicts the position of the DIBR distortion. The major contributions of this paper are threefold.
  • A transfer learning-based scheme for the SVQE task is proposed, in which the SVQE model is first pre-trained on a synthetic synthesized view database, then fine-tuned on the MVD database;
  • A synthetic synthesized view database is constructed using a data synthesis method, based on random irregular polygon generation, that simulates the special characteristics of SVI distortion; the method is built on public RGB/RGBD databases and validated with well-known state-of-the-art denoising and SVQE models;
  • A sub-network is employed to predict the DIBR distortion mask and is embedded into SVQE models trained with synthetic SVIs. Explicitly introducing the DIBR distortion position information is shown to be effective in elevating the performance of SVQE models.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents the motivation, the proposed random polygon-based DIBR distortion simulation, and the DIBR mask prediction method. Section 4 describes the dataset preparation. Section 5 presents the experimental results and detailed analyses. Section 6 concludes the paper.

2. Related Work

In this section, the development of image denoising methods and SVQE models is first reviewed, followed by a brief introduction to image data augmentation and data synthesis, and concluded with recent research on distortion mask prediction in image denoising and restoration.

2.1. SVQE Models

Image denoising or restoration is a classical low-level image processing task that has attracted enduring attention from academia and industry. The attributes of state-of-the-art image denoising and SVQE methods are listed in Table 1. Starting with hand-crafted feature-based methods, NLM [7] and BM3D [8] are the most classical conventional image denoising methods, which utilize the non-local self-similarity of images or image sparsity in the transform domain. Recently, with the development of deep learning, numerous image denoising and restoration methods have emerged. For example, DnCNN [2], FFDNet [9], CBDNet [3], RDN [4], and NAFNet [10] were proposed successively with increasing denoising ability. These methods were initially proposed for uniform noise, such as Gaussian noise and real image capturing noise. With the great success of transformers in various computer vision tasks, transformer-based networks, such as Restormer [11] and SwinIR [12], have been proposed for low-level image processing tasks, e.g., real image denoising and image super-resolution. These transformer-based image restoration networks can model long-range relationships in images, which is beneficial to image restoration, especially for DIBR structure distortion. However, they consume large computational resources and require a large amount of data. In this paper, we mainly focus on CNN-based SVQE models.
For compression distortion caused by codecs, VRCNN [13] and other methods [14,15] were proposed to deal with the blocking artifacts and texture blur caused by compression. In an MVD-based 3D video system, a virtual view is synthesized from compressed texture and depth videos through the DIBR process, so it contains both uniform compression distortion (mainly transferred from the texture images) and irregular synthesis distortion (mainly originating from the DIBR process and the distorted depth images). To improve the perceptual quality of SVIs under compression, Zhu et al. [16] proposed a network adapted from DnCNN that utilizes neighboring view information to enhance the reference synthesized view and refine the synthesized view obtained from compressed texture and depth videos. Later, Pan et al. proposed TSAN [5] and RDEN [6] to improve the SVI quality. To improve the perceptual synthesized video quality, SynVD-Net [1] was proposed by deriving a CNN-friendly loss from a perceptual synthesized video quality metric to reduce flicker distortion. These SVQE models can effectively enhance the SVI quality. However, due to the limited MVD data, the potential of SVQE models may not be fully exploited.
Table 1. The attributes of state-of-the-art image denoising and SVQE methods.
| Methods | Hand-Crafted | CNN-Based | Transformer-Based | General 2D Images | Synthesized Views | Main Noise Types |
|---|---|---|---|---|---|---|
| NLM [7], BM3D [8] | ✓ | | | ✓ | | Gaussian noise |
| DnCNN [2], FFDNet [9], CBDNet [3], RDN [4], NAFNet [10] | | ✓ | | ✓ | | Gaussian noise, blur, real image noise, low resolution |
| Restormer [11], SwinIR [12] | | | ✓ | ✓ | | Real image noise, low resolution |
| VRCNN [13], Zhu et al. [16], TSAN [5], RDEN [6], SynVD-Net [1] | | ✓ | | | ✓ | Compression distortion, DIBR distortion |

2.2. Image Data Augmentation and Data Synthesis

Image data augmentation has been widely used in learning-based computer vision tasks and includes basic (classic and typical) [17,18,19,20] and deep learning-based [21,22,23,24] methods. The basic data augmentation methods can be categorized as data warping [17], e.g., geometric and color transformations [18], image mixing [19,20], random erasing, and so on. These augmentation methods use oversampling or data warping to preserve the label. The deep learning-based data augmentation methods can be classified as GAN-based [21], neural style transfer [22], adversarial training [23,24], and so on. The above data augmentation methods are general and can partially improve the performance of related image processing tasks. Another common way to cope with limited data in computer vision applications is to pre-train a model on a large-scale external database via transfer learning or to use domain adaptation methods. However, there is often a feature gap between an external database or pre-trained model and the downstream task [25].
Recently, domain-specific data synthesis methods, which can exploit strong prior knowledge about the target images, have been used in many tasks and have demonstrated their effectiveness in real image denoising [3], rain removal [26], shadow removal [27,28], and other tasks [29,30]. Because real MVD data are limited, it is difficult to obtain real multi-view data and the associated information, e.g., camera parameters and depth values, which makes synthesizing SVIs difficult. To tackle this issue, DIBR distortion simulation has been proposed in some image quality assessment (IQA) studies for 3D synthesized images. In [31], a DIBR distortion simulation was proposed to predict DIBR-synthesized image quality without a time-consuming real DIBR process. However, this virtual view synthesis method uses the wavelet transform to mix high-frequency signals near texture and depth edges and cannot simulate the random geometric displacement distortion caused by depth value errors. In [32], DIBR distortion simulation was realized by a hand-crafted and GAN-based method to solve the data shortage problem. However, the DIBR distortion synthesized by a GAN may not match the distribution of real DIBR distortion well, and the GAN requires a large amount of labeled data for training, which is troublesome. In this paper, we aim to propose a simple data synthesis method for DIBR distortion simulation.

2.3. Distortion Mask Prediction

In some image restoration tasks, e.g., real image denoising [3], de-raining [33], and shadow removal [28,34], the restoration task is explicitly or implicitly divided into two sub-tasks, i.e., distortion mask (or label) estimation and image restoration/denoising. In [3], a noise estimation sub-network is embedded into an image denoising framework. In [33], the de-raining network is composed of a rain-density-aware network for rain density label prediction and a de-raining network, which are jointly learned. In [28], a Dual Hierarchical Aggregation Network (DHAN) was proposed that can simultaneously output a shadow mask and a shadow-erased image using GAN-synthesized shadow images. In [34], a distortion localization network is integrated with an image restoration network to handle spatially varying distortion. The works in [3,28,33] all used self-created synthetic images, whereas pseudo distortion labels were used in [34].

3. Method

In this section, the motivation is first illustrated. The pipeline of the DIBR distortion simulation is then described, in which different kinds of local noise are compared and the proposed random irregular polygon-based DIBR distortion generation method is introduced. With it, synthetic databases can be constructed that contain synthetic SVIs and the corresponding DIBR distortion masks. Finally, the DIBR distortion mask prediction sub-network is introduced and integrated with SVQE models trained on the constructed synthetic SVIs. The definitions of the key variables and acronyms used in this section are listed in Table 2.

3.1. Motivation

Nowadays, the Internet is abundant in high-quality, specially constructed image databases and user-uploaded images/videos. Thus, it is easy to obtain enough data for a learning task such as compressed image quality enhancement (denoted as task $T_s$) by collecting original RGB images and producing their compressed counterparts with compression tools. Suppose a domain $D$ consists of a data/feature space $\chi$ and a marginal probability distribution $P(X)$ [35], where $X = \{x_1, x_2, \ldots, x_n\} \in \chi$. However, in an MVD video system, insufficient data are available for recovering high-quality images from distorted synthesized images, i.e., synthesized view quality enhancement (denoted as learning task $T_t$), which may weaken the performance of SVQE models. Confronted with the situation that abundant image pairs $D_s = \{(x_{s1}, y_{s1}), (x_{s2}, y_{s2}), \ldots, (x_{sn}, y_{sn})\}$ (ground truth/distorted images) can be collected for $T_s$ but only limited image pairs $D_t = \{(x_{t1}, y_{t1}), (x_{t2}, y_{t2}), \ldots, (x_{tn}, y_{tn})\}$ can be gathered for task $T_t$, it naturally occurs that transfer learning could be utilized to transfer from $T_s$ to $T_t$, as shown in Figure 1a. However, due to the discrepancy between $D_s$ and $D_t$, the knowledge learned and transferred from task $T_s$ may only concern compression distortion elimination, which may be suboptimal when applied to task $T_t$. In addition, compression distortion is relatively regular, while the DIBR distortion in an SVI is more irregular and harder to handle.
To bridge the gap between domains $D_s$ and $D_t$ and make better use of the big data in domain $D_s$, a method is proposed to generate synthetic noise simulating the DIBR distortion, so that the learned synthetic noise distribution approaches the true noise distribution in synthesized images and the massive data in domain $D_s$ can be effectively utilized. As shown in Figure 1b, a DIBR distortion simulation module is introduced after image compression, and synthetic synthesized images are generated accordingly.

3.2. DIBR Distortion Simulation

Figure 2 shows the pipeline of the DIBR distortion simulation. Original images from the NYU [36] and DIV2K [37] databases (public RGB/RGBD databases) are first compressed using a codec with a given Quantization Parameter (QP). The associated depth images are available for RGBD images or can be generated by monocular depth estimation methods [38,39]. Then, the DIBR distortion is generated along depth edges because depth edges are assumed to be the most likely areas where DIBR distortion resides. Next, the proposed random irregular polygon-based DIBR distortion generation method is applied to the compressed RGB/RGBD data. In this way, the synthetic synthesized view database is constructed, which includes synthetic synthesized images and the corresponding DIBR distortion masks.
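As a rough illustration of the edge-localization step above, the following sketch (using OpenCV and NumPy; the function name and threshold values are our own assumptions, not settings from the paper) detects strong depth discontinuities and dilates them into a band that can serve as the candidate region for placing synthetic DIBR distortion.

```python
import cv2
import numpy as np

def depth_edge_mask(depth, grad_thresh=8.0, dilate_px=7):
    """Locate strong depth discontinuities as candidate DIBR distortion regions.

    depth: HxW depth map; grad_thresh and dilate_px are illustrative values only.
    """
    d = depth.astype(np.float32)
    # Horizontal/vertical depth gradients via Sobel filtering.
    gx = cv2.Sobel(d, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(d, cv2.CV_32F, 0, 1, ksize=3)
    grad = np.sqrt(gx ** 2 + gy ** 2)
    # Keep only strong discontinuities (likely foreground/background borders).
    edges = (grad > grad_thresh).astype(np.uint8)
    # Dilate so the mask covers a band around each strong depth edge.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    return cv2.dilate(edges, kernel)
```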

3.3. Different Local Noise Comparison and Proposed Random Irregular Polygon-Based DIBR Distortion Generation

Figure 3a,b demonstrate SVIs with DIBR distortion from the sequences Lovebird1 and Balloons, including cracks, fragments, and irregular geometric displacements along object edges. To investigate which kind of noise resembles the DIBR distortion more closely, three different noise patterns, i.e., Gaussian noise, speckle noise, and patch shuffle-based noise, are compared. Gaussian noise is a well-known noise with a normal distribution. Speckle noise is a type of granular noise that often exists in medical ultrasound images and synthetic aperture radar (SAR) images. Patch shuffle [40] randomly shuffles the pixels in a local patch of images or feature maps during training and is used to regularize the training of classification-related CNN models. Taking the DIBR distortion simulation effects for Lovebird1 as an example, as shown in Figure 3, different synthetic SVIs are obtained by adding Gaussian noise, speckle noise, and patch shuffle-based noise to the compressed neighboring captured views along areas with strongly discontinuous depth, respectively. The real SVI is shown as the anchor. Denoting the captured view as $I$, the synthetic synthesized view obtained with random noise can be written as
$$I_{syn} = (\mathbf{1} - M) \odot I + M \odot \frac{I + I_\delta}{2}, \tag{1}$$
where $I_{syn}$ denotes the synthetic SVI, $I$ denotes the compressed captured view image, $\mathbf{1}$ denotes the matrix with all elements equal to 1, $M$ denotes the mask corresponding to the detected strong depth edges, $\odot$ denotes the element-wise (dot) product, and $I_\delta$ denotes the image with added random noise, i.e., Gaussian noise, speckle noise, or the patch-shuffled version of $I$. It can be observed that the $I_{syn}$ synthesized with Gaussian or speckle noise does not visually resemble the synthesis distortion well, while the $I_{syn}$ synthesized with patch shuffle-based noise resembles it somewhat, in that the pixels within a local patch appear disordered and irregular.
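For concreteness, a minimal NumPy sketch of Equation (1) is given below; it blends a locally noised copy of the compressed view back into the image inside the depth-edge mask M. The Gaussian, speckle, and patch-shuffle generators are plain stand-ins for the variants compared in Figure 3, and the parameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def add_local_noise(img, mask, kind="gaussian", sigma=10.0, patch=8, rng=None):
    """Equation (1): I_syn = (1 - M) * I + M * (I + I_delta) / 2.

    img: HxW float image (e.g., the Y component); mask: HxW array in {0, 1}.
    """
    rng = rng or np.random.default_rng()
    img = img.astype(np.float32)
    mask = mask.astype(np.float32)
    if kind == "gaussian":                      # additive Gaussian noise
        noised = img + rng.normal(0.0, sigma, img.shape)
    elif kind == "speckle":                     # multiplicative granular noise
        noised = img * (1.0 + rng.normal(0.0, sigma / 255.0, img.shape))
    elif kind == "patch_shuffle":               # shuffle pixels inside local patches
        noised = img.copy()
        h, w = img.shape
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                block = noised[y:y + patch, x:x + patch].copy().ravel()
                rng.shuffle(block)
                noised[y:y + patch, x:x + patch] = block.reshape(patch, patch)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return (1.0 - mask) * img + mask * (img + noised) / 2.0
```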
An SVI with DIBR distortion can be viewed as tiny movements of textures within random polygon-shaped areas along depth transition regions. To better simulate this irregular geometric distortion, a simple random polygon generation method whose irregularity and spikiness can be controlled is introduced in this section. A random polygon generation method can be found in [41]. Following the method in [42], to generate a random polygon, a set of angularly sorted random points is first generated, and the vertices are then connected in that order. Specifically, given a center point P, a group of points is sampled on a circle around P. Randomness is added by varying the angular spacing between sequential points and the radial distance of each point from the center. The process can be formulated as
$$\theta_i = \theta_{i-1} + \frac{1}{k}\,\theta_i', \qquad \theta_i' \sim U\!\left(\frac{2\pi}{n} - \epsilon,\ \frac{2\pi}{n} + \epsilon\right), \qquad k = \frac{\sum_i \theta_i'}{2\pi}, \qquad r_i = \mathrm{clip}\!\left(\mathcal{N}(R, \sigma),\ 0,\ 2R\right), \tag{2}$$
where $\theta_i$ and $r_i$ represent the angle and radius of the i-th point with respect to the assumed center point, respectively. $\theta_i'$ denotes the random variable controlling the angular spacing between sequential points, which follows a uniform distribution with minimum value $2\pi/n - \epsilon$ and maximum value $2\pi/n + \epsilon$, where $n$ denotes the number of vertices. Moreover, $r_i$ follows a Gaussian distribution with the given radius $R$ as its mean and $\sigma$ as its variance. $R$ adjusts the magnitude of the generated polygon. $\epsilon$ adjusts the irregularity of the generated polygon by controlling the angular variation through the interval size of $U$. $\sigma$ adjusts the spikiness of the generated polygon by controlling the radius variation through the normal distribution. Larger $\epsilon$ and $\sigma$ indicate stronger irregularity and spikiness, and vice versa, as shown in Figure 4.
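A compact sketch of the random polygon generator described by Equation (2) is given below. It is our own reconstruction following the approach of [42]: angular steps drawn from U(2π/n − ε, 2π/n + ε) are normalized so that they sum to 2π, and radii are drawn from N(R, σ) and clipped; expressing irregularity and spikiness as fractions of 2π/n and R is an assumption for readability.

```python
import numpy as np

def random_polygon(center, R=20.0, n=8, irregularity=0.5, spikiness=0.3, rng=None):
    """Generate the vertices of a random irregular polygon around `center`.

    irregularity (epsilon) perturbs the angular spacing and spikiness (sigma)
    perturbs the radius; both are given here as fractions of 2*pi/n and R.
    """
    rng = rng or np.random.default_rng()
    eps = irregularity * 2 * np.pi / n
    sigma = spikiness * R
    # Angular steps theta_i' ~ U(2*pi/n - eps, 2*pi/n + eps), normalized to sum to 2*pi.
    steps = rng.uniform(2 * np.pi / n - eps, 2 * np.pi / n + eps, n)
    steps *= 2 * np.pi / steps.sum()
    angles = np.cumsum(steps) + rng.uniform(0, 2 * np.pi)
    # Radii r_i = clip(N(R, sigma), 0, 2R).
    radii = np.clip(rng.normal(R, sigma, n), 0, 2 * R)
    cx, cy = center
    xs = cx + radii * np.cos(angles)
    ys = cy + radii * np.sin(angles)
    return np.stack([xs, ys], axis=1)   # (n, 2) array of (x, y) vertices
```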
Thus, the synthetic SVI generated with the proposed random polygon noise can be obtained as
$$I_{syn} = (\mathbf{1} - M) \odot I + M \odot \frac{I + I_{sh}}{2}, \qquad I_{sh}(\psi) = I(\psi + \eta), \tag{3}$$
where $\psi$ denotes the set of points within a local region generated by the random polygon method, and $\eta$ denotes a random offset vector by which all points in $\psi$ are bodily shifted to form $I_{sh}$. $I_{sh}$ is fused with $I$ in the strong depth-edge regions. In Figure 3f,j, it can be observed that the DIBR distortion generated by shifting textures within random polygon areas along the edges visually resembles the real distortion.
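The composition in Equation (3) can then be sketched as follows: for each center point sampled on a strong depth edge, a random polygon is rasterized and the texture inside it is blended with a randomly shifted copy of itself. The helper random_polygon refers to the illustrative sketch above, and the number of polygons and the shift range are assumptions, not the paper's settings.

```python
import numpy as np
from skimage.draw import polygon as rasterize_polygon   # fills polygon interiors
# random_polygon(center, ...) is assumed from the earlier Equation (2) sketch.

def synthesize_svi(img, edge_mask, num_poly=30, max_shift=6, rng=None):
    """Equation (3): shift textures inside random polygons placed on depth edges.

    Returns the synthetic SVI and its ground-truth DIBR distortion mask.
    """
    rng = rng or np.random.default_rng()
    base = img.astype(np.float32)
    out = base.copy()
    h, w = base.shape
    dibr_mask = np.zeros((h, w), np.uint8)
    ys, xs = np.nonzero(edge_mask)
    if len(ys) == 0:
        return out, dibr_mask
    for _ in range(num_poly):
        i = rng.integers(len(ys))                        # random center on a depth edge
        verts = random_polygon((xs[i], ys[i]), rng=rng)  # (n, 2) array of (x, y) vertices
        rr, cc = rasterize_polygon(verts[:, 1], verts[:, 0], shape=(h, w))
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)   # random offset eta
        rr_src = np.clip(rr + dy, 0, h - 1)
        cc_src = np.clip(cc + dx, 0, w - 1)
        # I_sh(psi) = I(psi + eta), blended with I inside the polygon region.
        out[rr, cc] = (base[rr, cc] + base[rr_src, cc_src]) / 2.0
        dibr_mask[rr, cc] = 1
    return out, dibr_mask
```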

3.4. DIBR Distortion Mask Prediction Network Embedding

Existing IQA models for SVIs demonstrate that determining the DIBR distortion position is the key procedure for quality assessment [43,44], which suggests that knowing and paying more attention to the DIBR distortion position may help SVQE models enhance SVI quality. Therefore, how to incorporate the DIBR distortion position into SVQE models becomes a new issue. The intuitive way is to directly concatenate the DIBR distortion position with the distorted image as a whole input; Figure 5a shows a sketch of this approach. The experiments in Section 5 validate that knowing the DIBR distortion position is helpful for synthesized image quality enhancement. However, the ground truth DIBR distortion position is usually unknown, so the position has to be detected or estimated. Inspired by de-raining [33] and shadow removal [28,34], SVI quality enhancement can be regarded as two tasks, i.e., DIBR distortion mask estimation and image restoration/denoising. Reviewing these works, there are three main ways to combine the mask estimation and image restoration networks, i.e., a successive (series) network, a parallel (multi-task) network, and a parallel interactive network, as sketched in Figure 5b–d. In addition to different network organizations, attention mechanisms, such as spatial attention [5], self-attention [10], and non-local attention [45], are also considered in existing denoising or restoration networks. In this work, we mainly focus on networks that explicitly combine DIBR mask prediction and DIBR distortion elimination and mainly test the successive (series) network shown in Figure 5b.
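A minimal PyTorch sketch of the successive (series) arrangement in Figure 5b is given below: a small fully convolutional mask-estimation sub-network (its exact depth and width are assumptions, loosely following the CBDNet-style estimator mentioned in Section 5.4) predicts the DIBR distortion map, which is concatenated with the distorted SVI and fed to an SVQE backbone such as DnCNN or NAFNet adapted to a two-channel input. During pre-training on the synthetic database, the predicted mask could additionally be supervised with the stored ground-truth DIBR distortion mask alongside the restoration loss.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Lightweight sub-network predicting the DIBR distortion mask from the SVI."""
    def __init__(self, channels=32, depth=5):
        super().__init__()
        layers = [nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MaskGuidedSVQE(nn.Module):
    """Successive scheme: mask prediction followed by a quality-enhancement backbone."""
    def __init__(self, backbone):
        super().__init__()
        self.mask_net = MaskEstimator()
        self.backbone = backbone            # e.g., a DnCNN/NAFNet taking 2-channel input

    def forward(self, distorted):
        mask = self.mask_net(distorted)                       # predicted DIBR distortion map
        enhanced = self.backbone(torch.cat([distorted, mask], dim=1))
        return enhanced, mask
```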

4. Datasets Preparation

Two datasets, i.e., the RGBD database NYU Depth Dataset V2 [36] and the RGB database DIV2K [37], were employed to construct synthetic SVI databases for pre-training the SVQE models. The MVD dataset from the SIAT Synthesized Video Quality Database [46] was used as the benchmark dataset for SVQE.
NYU-based synthetic SVIs: NYU Depth Dataset V2 consists of RGB and raw depth images of various indoor scenes captured by a Microsoft Kinect, originally proposed for image segmentation and depth estimation. The database comprises 1449 labeled pairs of aligned RGB and depth images and 407,024 unlabeled frames. In our experiment, only the 1449 labeled image pairs are employed. The resolution of the images is 640 × 480. The images were compressed by x264 with the QP set to 35, an intermediate distortion level. The NYU-based synthetic SVIs were generated through Equations (2) and (3) on the Y component of the compressed images.
DIV2K-based synthetic SVIs: The DIV2K dataset consists of 1000 RGB images of 2K resolution with various content, which was proposed for super-resolution. In our experiment, 750 images were employed for training. The compression and DIBR distortion generation procedures are the same as those for NYU Depth Dataset V2, except that the QP was set to 45 for these high-resolution images, because the distortion at QP 35 is barely noticeable on the DIV2K images.
Example images of constructed synthetic SVIs based on NYU Depth Dataset V2 and DIV2K are shown in Figure 6.
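Putting the pieces together, a hedged end-to-end sketch of how one pre-training sample could be produced from an aligned RGB/depth pair is shown below. It reuses the illustrative depth_edge_mask and synthesize_svi helpers from Section 3 and leaves the x264 encode/decode step as a placeholder function, since the exact encoding pipeline is not reproduced here.

```python
import numpy as np
# Assumes the illustrative helpers sketched in Section 3 are in scope:
#   depth_edge_mask(depth) and synthesize_svi(img, edge_mask).

def make_training_sample(clean_y, depth, compress_fn):
    """Build one (distorted, clean, mask) triple for pre-training an SVQE model.

    clean_y: HxW luma image; depth: aligned depth map;
    compress_fn: placeholder for the x264 encode/decode at the chosen QP.
    """
    compressed = compress_fn(clean_y)                # compression distortion
    edges = depth_edge_mask(depth)                   # candidate DIBR regions
    distorted, dibr_mask = synthesize_svi(compressed, edges)
    return distorted.astype(np.float32), clean_y.astype(np.float32), dibr_mask
```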
MVD: The MVD dataset is the same as that in [1] and includes 12 common MVD sequences with a variety of content. Selected reference views were compressed and then used to synthesize an intermediate view. Note that the 3DV-ATM v10.0 software [47] was used for compression and the VSRS-1D-Fast software [48] was utilized to render the intermediate virtual view from the reference views. In the experiments, five sequences were selected for training, and the remaining seven sequences were used for testing. The testing sequences are denoted as Seqs-H.264 for simplicity. In our tests, we only trained and tested on the intermediate distortion level, and 10 or 21 images were collected from each distorted video, yielding 94 training frames in total. Detailed information about the sequences, view resolutions, reference and rendered views, and compression parameter pairs ($QP_t$, $QP_d$) for the texture and depth reference views can be found in [1].
The detailed information of the two constructed synthetic SVI databases and the MVD database is given in Table 3. The original databases and image resolutions used to generate the synthetic SVIs are listed, together with the noise contained in the SVIs and the number of images in the synthetic SVI databases for pre-training. Similarly, the related information of the real MVD dataset is presented.

5. Experimental Results and Analysis

In this section, the experimental configuration is first described. The proposed random polygon-based noise for DIBR distortion simulation is then verified. Afterward, quantitative comparisons among the SVQE models are conducted with and without the SVI datasets generated by the proposed DIBR distortion simulation method. Related experiments are also carried out to verify the effectiveness of embedding the DIBR distortion mask prediction sub-network into the SVQE models, and the computational complexity is compared. Finally, the experimental results are discussed.

5.1. Experimental Configuration

SVQE models and training settings: Four deep learning-based image denoising or SVQE models, i.e., DnCNN, VRCNN, TSAN, and NAFNet, were employed as test models. The training scheme is that these models are first pre-trained on the synthetic datasets based on NYU-V2/DIV2K and then fine-tuned on the MVD dataset. The common training settings are a patch size of 128 × 128 and 100 epochs for pre-training and 30 epochs for fine-tuning. The batch size was set to 128 for DnCNN and VRCNN and 32 for TSAN and NAFNet. In addition, Adam was adopted as the optimizer with default settings, i.e., $\beta_1 = 0.9$ and $\beta_2 = 0.999$, for DnCNN, VRCNN, and TSAN; AdamW [49] was adopted with $\beta_1 = 0.9$, $\beta_2 = 0.9$, and a weight decay of $1 \times 10^{-5}$ for NAFNet. The initial and minimum learning rates were set to $1 \times 10^{-4}$ and $1 \times 10^{-6}$ for DnCNN and VRCNN and $1 \times 10^{-3}$ and $1 \times 10^{-7}$ for NAFNet, while the learning rate was kept constant at $1 \times 10^{-4}$ for TSAN. DnCNN, VRCNN, and NAFNet were trained with a cosine decay strategy, and TSAN kept its default setting, i.e., without the cosine decay strategy. In addition, the cropped patches for training were randomly flipped horizontally or rotated by 90°, 180°, or 270°. The experiments were conducted on an Ubuntu 20.04.4 operating system with an Intel Xeon Silver 4216 CPU, 64 GB memory, an NVIDIA RTX A6000 GPU, and the PyTorch platform.
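The two-stage schedule described above can be summarized in the following PyTorch sketch (shown for the DnCNN-like settings; the data loaders, the L2 loss, and the omission of validation/checkpointing are simplifying assumptions).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def train_two_stage(model, syn_loader, mvd_loader, device="cuda"):
    """Pre-train on synthetic SVIs, then fine-tune on the MVD training set."""
    criterion = torch.nn.MSELoss()
    model = model.to(device)
    for loader, epochs in [(syn_loader, 100),      # pre-training on synthetic SVIs
                           (mvd_loader, 30)]:      # fine-tuning on MVD data
        optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
        scheduler = CosineAnnealingLR(optimizer, T_max=epochs, eta_min=1e-6)
        for _ in range(epochs):
            for distorted, clean in loader:        # 128x128 patches, flipped/rotated
                distorted, clean = distorted.to(device), clean.to(device)
                loss = criterion(model(distorted), clean)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            scheduler.step()
    return model
```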
Evaluation metrics: PSNR and IW-SSIM [50] are well-known and widely used metrics for conventional 2D images. MPPSNRr [51] and SC-IQA [52] were proposed specifically for DIBR distortion in SVIs and achieve high correlation between predicted quality scores and subjective scores, which may more truly reflect the perceptual quality of synthesized views. In our experiments, both 2D image quality metrics and SVI quality metrics were used to measure SVI quality.
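As a reference for the simplest metric, PSNR on the Y component can be computed as below (8-bit peak value assumed); IW-SSIM, MPPSNRr, and SC-IQA require their authors' implementations and are not reproduced here.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio (dB) between reference and enhanced Y images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```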

5.2. Verification of Proposed Random Polygon-Based Noise for DIBR Distortion Simulation

In order to verify whether the random irregular polygon-based DIBR distortion generation method is necessary and effective, pre-training databases with only compression distortion and with the other three noise patterns (i.e., Gaussian, speckle, and patch shuffle-based noise) were employed as comparison schemes. Note that the edge regions were located in the same way as in the proposed method when synthesizing SVIs with the other types of random noise. The DnCNN and NAFNet models were first pre-trained on the NYU database with the different distortion schemes and then fine-tuned on the MVD training set.
Table 4 and Table 5 show the denoising performance of DnCNN and NAFNet on the MVD testing sequences Seqs-H.264 under the different distortion schemes, respectively. The best and second-best results for each sequence and on average are highlighted in bold, and the best results are additionally underlined. In terms of both the image quality metrics (i.e., PSNR and IW-SSIM) and the SVI quality metrics (i.e., MPPSNRr and SC-IQA), it can be observed that pre-training on the NYU database with only compression distortion already enhances the distorted synthesized video quality on average compared with the scheme without pre-training. In addition, Gaussian, speckle, patch shuffle-based, and the proposed random irregular polygon-based noise (denoted as randompoly) all contribute to the quality enhancement of the distorted synthesized video. Statistically, by counting the occurrences of the best two results for each sequence and on average in Table 4 and Table 5, the proposed randompoly noise performs best, while the patch shuffle-based noise performs second best. Therefore, it can be inferred that by pre-training on massive distorted images with different types of noise, the SVQE models learn more about how to restore images than by training on the limited MVD data alone. The proposed random irregular polygon-based method, which reflects the geometric displacement well, is more appropriate for simulating the DIBR distortion and can greatly elevate the SVQE models' ability.
To further validate the role of the proposed irregular polygon-based DIBR distortion generation method, a visual quality comparison among the different kinds of local noise and SVQE models is performed. Figure 7 and Figure 8 show the quality comparison for the sequences Dancer and Poznanhall2 when the SVQE models are pre-trained on NYU databases with the five different local synthetic noise schemes. It can be observed that when pre-trained on NYU with only compression distortion, the boundaries along the hands and fingers in Dancer and the pillars in Poznanhall2 are clearer than in the scheme without pre-training, but not as clear as when pre-trained on NYU with the other random distortions. By contrast, the SVQE models trained with the proposed random irregular polygon-based distortion produce visually more pleasant denoised images, with sharper and more complete object boundaries.

5.3. Quantitative Comparisons among SVQE Models Pre-Trained with Synthetic Synthesized Image Database

Table 6 and Table 7 demonstrate the denoising/quality enhancement performance of the four SVQE models, i.e., DnCNN, VRCNN, TSAN, and NAFNet, with the synthetic synthesized image databases (generated from NYU and DIV2K) in terms of image quality metrics and SVI quality metrics. The synthetic databases are termed SynData for simplicity in the following. The original model names denote the image denoising/SVQE models trained only on the MVD data, and model-syn-N/D denotes the models first pre-trained on SynData (NYU/DIV2K) and then fine-tuned on the MVD data. Compared with directly training the four models on MVD data, it can be observed that first pre-training on SynData and then fine-tuning on MVD data enhances the synthesized views by large margins as measured by PSNR, IW-SSIM, MPPSNRr, and SC-IQA. Looking at PSNR in Table 6, DnCNN, NAFNet, and TSAN achieve gains of 0.51, 0.36, and 0.26 dB, respectively, while VRCNN only achieves a gain of 0.08 dB. The same tendency holds for the four models on the other three metrics. The DnCNN and NAFNet models benefit most from the synthetic dataset in terms of both image quality metrics and SVI metrics. Similar findings can also be observed for DIV2K. In addition, because the images in DIV2K have 2K resolution, which is similar to that of the MVD sequences, and the number of patches extracted from DIV2K is larger than that from NYU, using DIV2K as the pre-training dataset yields a better performance on average. The experimental results validate that the proposed SynData with irregular polygon-based distortion benefits current SVQE models. Two further conclusions can be drawn. First, a larger synthetic database with the proposed distortion leads to a better SVQE performance of deep models. Second, different SVQE models benefit differently from pre-training on the proposed synthetic database.

5.4. Effectiveness of Integrating DIBR Distortion Mask Prediction Sub-Network

To further improve the performance of current SVQE models, the role of the DIBR distortion mask was explored by directly feeding the distorted SVIs together with the ground truth DIBR distortion mask as input to the SVQE models. The performance of three DnCNN-based schemes, i.e., DnCNN trained only on MVD (DnCNN) and DnCNN pre-trained on the NYU (DnCNN-syn-N) and DIV2K (DnCNN-syn-D) databases, is listed as the anchor. Table 8 shows that the three corresponding DnCNN-based schemes with ground truth DIBR distortion masks as input, i.e., DnCNN-GTmask, DnCNN-syn-GTmask-N, and DnCNN-syn-GTmask-D, improve the distorted synthesized images considerably, by 0.42, 0.37, and 0.39 dB in terms of PSNR, respectively. This implies that knowing where the DIBR distortion resides is beneficial for removing it.
However, it is actually hard to know the exact position of the DIBR distortion. Thus, similar to rain or shadow removal works, detecting the DIBR distortion is a natural choice. The noise estimation sub-network used in CBDNet was employed as the DIBR distortion estimation network and combined with the denoising/SVQE networks to enhance the quality of the outputs. Different from CBDNet, the local DIBR distortion is estimated rather than a whole distortion map. In the experiments, two representative models, DnCNN and NAFNet, were used, and both databases, NYU and DIV2K, were tested. Table 9 shows the average denoising performance on Seqs-H.264 of DnCNN and NAFNet with and without the DIBR mask prediction network, pre-trained on SynData (NYU and DIV2K), respectively. It can be observed that with the DIBR mask estimation network, the quality of the distorted SVIs produced by DnCNN and NAFNet is elevated on average in terms of the PSNR, MPPSNRr, and SC-IQA metrics on both databases, except for IW-SSIM. In addition, the number of times the deep models with DIBR mask prediction outperform those without it was counted, and the surpassing degree was calculated. The surpassing degree is 0.69 on average and above 0.50 for 4/7 of the sequences, which indicates that the proposed DIBR distortion prediction network works and can further enhance the performance with the proposed synthetic databases.
In Figure 7 and Figure 8, it can be observed that part of the DIBR distortion region is repaired while some imprecise repainting is introduced. For instance, the little finger is clearer with the DIBR distortion mask than without it, while some additional noise is introduced along the arm. The reason may be that the DIBR distortion prediction network cannot precisely predict the DIBR distortion location. Therefore, a more elaborately designed prediction network and architecture are needed for better SVQE performance.

5.5. Computational Complexity Analysis

To test the computational complexity of the proposed random polygon-based (randompoly) DIBR distortion simulation method, experiments on 100 randomly selected images from NYU Depth Dataset V2 were carried out on a desktop with a Windows 10 operating system, an Intel i7-8750H CPU @ 2.20 GHz, and the Matlab R2014a platform. As shown in Table 10, compared with the other random noise methods, the proposed randompoly method takes about 1268.82 milliseconds (ms) on average to synthesize a synthetic SVI, while the other random noise generation methods take about 60∼70 ms. The computing efficiency of the randompoly method shall be further improved in the future. To test the increased time complexity of introducing the DIBR distortion mask prediction sub-network into an SVQE model, experiments were conducted with the same configuration described in Section 5.1. The calculation time per frame was averaged over 200 frames for the compared methods. From Figure 9, it can be observed that compared with the original SVQE models, the SVQE models combined with the DIBR distortion mask prediction sub-network consume slightly more time. Specifically, the DIBR distortion mask prediction sub-network increases the time by about 1.34 ms (24.53%) and 5.70 ms (3.96%) at a resolution of 1024 × 768 and by 4.49 ms (70.12%) and 14.57 ms (3.92%) at a resolution of 1920 × 1088 for DnCNN and NAFNet, respectively, indicating that the added complexity depends on the complexity of the combined SVQE model and the image resolution.

5.6. Discussion

The proposed random irregular polygon-based (randompoly) DIBR distortion simulation method demonstrates superior performance to other kinds of random noise in simulating the DIBR distortion. State-of-the-art denoising/SVQE models pre-trained on synthetic SVIs generated by the proposed randompoly method bring large gains in SVQE performance, both objectively and subjectively, compared with the denoising/SVQE models directly trained on the real MVD dataset. In addition, the proposed DIBR distortion mask prediction sub-network embedded into SVQE models can further enhance the SVQE performance. In the future, a GAN-based or diffusion model-based DIBR simulation method is expected. In addition, deeper investigation is needed into how to augment images with DIBR distortion and how to effectively introduce the DIBR distortion location information into SVQE models. Furthermore, transformer-based denoising models for SVQE with synthetic images could be investigated.

6. Conclusions

In this paper, a transfer learning-based framework for synthesized view quality enhancement (SVQE) is suggested, in which SVQE models are first pre-trained on synthetic synthesized view images (SVIs) generated from substantial RGB/RGBD data, then fine-tuned on a real Multi-view Video plus Depth (MVD) dataset, and finally combined with a DIBR distortion mask prediction network. Different kinds of random noise for simulating the DIBR distortion have been explored, and the results validate that the proposed random irregular polygon-based DIBR distortion method is more effective in improving the performance of existing SVQE models. Substantial experimental results on public MVD sequences demonstrate that existing denoising/SVQE models achieve large gains in both image and SVI quality metrics, as well as superior visual quality, when pre-trained on synthetic images generated by the proposed random irregular polygon-based method. In addition, combining the DIBR distortion mask prediction network with existing SVQE models has been proved valid for further improving SVQE performance.

Author Contributions

Conceptualization, H.Z.; methodology, H.Z.; software, H.Z. and X.Y.; validation, H.Z.; formal analysis, H.Z.; investigation, H.Z. and D.Z.; resources, H.Z., X.Y. and J.C.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, H.Z.; visualization, H.Z.; supervision, J.C. and B.W.-K.L.; project administration, J.C. and B.W.-K.L.; funding acquisition, J.C., B.W.-K.L. and H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Joint Fund of the National Natural Science Foundation of China and Guangdong Province under grant no. U1701266, in part by the Guangdong Provincial Key Laboratory of Intellectual Property and Big Data under grant no. 2018B030322016, and in part by the GuangDong Basic and Applied Basic Research Foundation under grant no. 2021A1515110031.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhang, H.; Zhang, Y.; Zhu, L.; Lin, W. Deep learning-based perceptual video quality enhancement for 3D synthesized view. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5080–5094. [Google Scholar] [CrossRef]
  2. Zhang, K.; Zuo, W.; Chen, Y.; Meng, D.; Zhang, L. Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising. IEEE Trans. Image Process. 2017, 26, 3142–3155. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Guo, S.; Yan, Z.; Zhang, K.; Zuo, W.; Zhang, L. Toward convolutional blind denoising of real photographs. In Proceedings of the 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1712–1722. [Google Scholar] [CrossRef] [Green Version]
  4. Zhang, Y.; Tian, Y.; Kong, Y.; Zhong, B.; Fu, Y. Residual dense network for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 2480–2495. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Pan, Z.; Yu, W.; Lei, J.; Ling, N.; Kwong, S. TSAN: Synthesized view quality enhancement via two-stream attention network for 3D-HEVC. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 345–358. [Google Scholar] [CrossRef]
  6. Pan, Z.; Yuan, F.; Yu, W.; Lei, J.; Ling, N.; Kwong, S. RDEN: Residual distillation enhanced network-guided lightweight synthesized view quality enhancement for 3D-HEVC. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6347–6359. [Google Scholar] [CrossRef]
  7. Buades, A.; Coll, B.; Morel, J. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–26 June 2005; pp. 60–65. [Google Scholar] [CrossRef]
  8. Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Process. 2007, 16, 2080–2095. [Google Scholar] [CrossRef]
  9. Zhang, K.; Zuo, W.; Zhang, L. FFDNet: Toward a fast and flexible solution for CNN-based image denoising. IEEE Trans. Image Process. 2018, 27, 4608–4622. [Google Scholar] [CrossRef] [Green Version]
  10. Chen, L.; Chu, X.; Zhang, X.; Sun, J. Simple baselines for image restoration. arXiv 2022, arXiv:2204.04676. [Google Scholar]
  11. Zamir, S.W.; Arora, A.; Khan, S.; Hayat, M.; Khan, F.S.; Yang, M. Restormer: Efficient transformer for high-resolution image restoration. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 5718–5729. [Google Scholar] [CrossRef]
  12. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image restoration using swin transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops, ICCVW, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  13. Dai, Y.; Liu, D.; Wu, F. A convolutional neural network approach for post-processing in HEVC intra coding. In Proceedings of the 23rd International Conference on MultiMedia Modeling (MMM), Reykjavik, Iceland, 4–6 January 2017; pp. 28–39. [Google Scholar] [CrossRef] [Green Version]
  14. Liu, J.; Zhou, M.; Xiao, M. Deformable convolution dense network for compressed video quality enhancement. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 1930–1934. [Google Scholar] [CrossRef]
  15. Yang, R.; Sun, X.; Xu, M.; Zeng, W. Quality-gated convolutional LSTM for enhancing compressed video. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 532–537. [Google Scholar] [CrossRef] [Green Version]
  16. Zhu, L.; Zhang, Y.; Wang, S.; Yuan, H.; Kwong, S.; Ip, H.H.S. Convolutional neural network-based synthesized view quality enhancement for 3D video coding. IEEE Trans. Image Process. 2018, 27, 5365–5377. [Google Scholar] [CrossRef]
  17. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1106–1114. [Google Scholar]
  18. Takahashi, R.; Matsubara, T.; Uehara, K. Data augmentation using random image cropping and patches for deep CNNs. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2917–2931. [Google Scholar] [CrossRef] [Green Version]
  19. Summers, C.; Dinneen, M.J. Improved mixed-example data augmentation. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1262–1270. [Google Scholar] [CrossRef] [Green Version]
  20. Liang, D.; Yang, F.; Zhang, T.; Yang, P. Understanding mixup training methods. IEEE Access 2018, 6, 58774–58783. [Google Scholar] [CrossRef]
  21. Sixt, L.; Wild, B.; Landgraf, T. RenderGAN: Generating realistic labeled data. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Workshop Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
  22. Zhu, J.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef] [Green Version]
  23. Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; Webb, R. Learning from simulated and unsupervised images through adversarial training. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  24. Wang, X.; Man, Z.; You, M.; Shen, C. Adversarial generation of training examples: Applications to moving vehicle license plate recognition. arXiv 2017, arXiv:1707.03124. [Google Scholar]
  25. Chen, P.; Li, L.; Wu, J.; Dong, W.; Shi, G. Contrastive self-supervised pre-training for video quality assessment. IEEE Trans. Image Process. 2022, 31, 458–471. [Google Scholar] [CrossRef] [PubMed]
  26. Liu, T.; Xu, M.; Wang, Z. Removing rain in videos: A large-scale database and a two-stream ConvLSTM approach. In Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME), Shanghai, China, 8–12 July 2019; pp. 664–669. [Google Scholar] [CrossRef] [Green Version]
  27. Inoue, N.; Yamasaki, T. Learning from synthetic shadows for shadow detection and removal. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 4187–4197. [Google Scholar] [CrossRef]
  28. Cun, X.; Pun, C.; Shi, C. Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting GAN. In Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI), New York, NY, USA, 7–12 February 2020; pp. 10680–10687. [Google Scholar]
  29. Madhusudana, P.C.; Birkbeck, N.; Wang, Y.; Adsumilli, B.; Bovik, A.C. Image quality assessment using synthetic images. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 4–8 January 2022; pp. 93–102. [Google Scholar] [CrossRef]
  30. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324. [Google Scholar] [CrossRef] [Green Version]
  31. Li, L.; Huang, Y.; Wu, J.; Gu, K.; Fang, Y. Predicting the quality of view synthesis with color-depth image fusion. IEEE Trans. Circuits Syst. Video Technol. 2021, 31, 2509–2521. [Google Scholar] [CrossRef]
  32. Ling, S.; Li, J.; Che, Z.; Zhou, W.; Wang, J.; Le Callet, P. Re-visiting discriminator for blind free-viewpoint image quality assessment. IEEE Trans. Multimed. 2021, 23, 4245–4258. [Google Scholar] [CrossRef]
  33. Zhang, H.; Patel, V.M. Density-aware single image de-raining using a multi-stream dense network. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 695–704. [Google Scholar] [CrossRef] [Green Version]
  34. Purohit, K.; Suin, M.; Rajagopalan, A.N.; Boddeti, V.N. Spatially-adaptive image restoration using distortion-guided networks. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2289–2299. [Google Scholar] [CrossRef]
  35. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359. [Google Scholar] [CrossRef]
  36. Silberman, N.; Hoiem, D.; Kohli, P.; Fergus, R. Indoor segmentation and support inference from RGBD images. In Proceedings of the 12th European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; pp. 746–760. [Google Scholar]
  37. Timofte, R.; Gu, S.; Wu, J.; Van Gool, L.; Zhang, L.; Yang, M.H.; Haris, M.; Shakhnarovich, G.; Ukita, N.; Hu, S.; et al. NTIRE 2018 challenge on single image super-resolution: Methods and results. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 852–863. [Google Scholar] [CrossRef]
  38. Ranftl, R.; Lasinger, K.; Hafner, D.; Schindler, K.; Koltun, V. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 1623–1637. [Google Scholar] [CrossRef]
  39. Shih, M.L.; Su, S.Y.; Kopf, J.; Huang, J.B. 3D photography using context-aware layered depth inpainting. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8025–8035. [Google Scholar] [CrossRef]
  40. Kang, G.; Dong, X.; Zheng, L.; Yang, Y. Patchshuffle regularization. arXiv 2017, arXiv:1707.07103. [Google Scholar]
  41. Hada, P.S. Approaches for Generating 2D Shapes. Master’s Dissertation, Department of Computer Science, University of Nevada, Las Vegas, NV, USA, 2014. [Google Scholar]
  42. Random Polygon Generation. Available online: https://stackoverflow.com/questions/8997099/algorithm-to-generate-random-2d-polygon (accessed on 19 October 2022).
  43. Li, L.; Zhou, Y.; Gu, K.; Lin, W.; Wang, S. Quality assessment of DIBR-synthesized images by measuring local geometric distortions and global sharpness. IEEE Trans. Multimed. 2018, 20, 914–926. [Google Scholar] [CrossRef]
  44. Wang, G.; Wang, Z.; Gu, K.; Xia, Z. Blind quality assessment for 3D-synthesized images by measuring geometric distortions and image complexity. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 4040–4044. [Google Scholar] [CrossRef]
  45. Wan, Z.; Zhang, B.; Chen, D.; Zhang, P.; Chen, D.; Liao, J.; Wen, F. Bringing old photos back to life. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 2744–2754. [Google Scholar] [CrossRef]
  46. Liu, X.; Zhang, Y.; Hu, S.; Kwong, S.; Kuo, C.C.J.; Peng, Q. Subjective and objective video quality assessment of 3D synthesized views with texture/depth compression distortion. IEEE Trans. Image Process. 2015, 24, 4847–4861. [Google Scholar] [CrossRef] [PubMed]
  47. Reference Software for 3D-AVC: 3DV-ATM V10.0. Available online: https://hevc.hhi.fraunhofer.de/svn/svn_3DVCSoftware/ (accessed on 19 October 2022).
  48. VSRS-1D-Fast. Available online: https://hevc.hhi.fraunhofer.de/svn/svn_3DVCSoftware (accessed on 19 October 2022).
  49. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Workshop Track Proceedings, Toulon, France, 24–26 April 2017. [Google Scholar]
  50. Wang, Z.; Li, Q. Information content weighting for perceptual image quality assessment. IEEE Trans. Image Process. 2011, 20, 1185–1198. [Google Scholar] [CrossRef] [PubMed]
  51. Sandić-Stanković, D.; Kukolj, D.; Le Callet, P. Multi-scale synthesized view assessment based on morphological pyramids. Eur. J. Electr. Eng. 2016, 67, 3–11. [Google Scholar] [CrossRef] [Green Version]
  52. Tian, S.; Zhang, L.; Morin, L.; Déforges, O. SC-IQA: Shift compensation based image quality assessment for DIBR-synthesized views. In Proceedings of the 2018 IEEE Visual Communications and Image Processing (VCIP), Taichung, Taiwan, 9–12 December 2018; pp. 1–4. [Google Scholar] [CrossRef]
Figure 1. Transferring from learning task $T_s$ to task $T_t$. (a) Compressed image quality enhancement to synthesized image quality enhancement. (b) Synthetic synthesized image quality enhancement to synthesized image quality enhancement.
Figure 2. Overview of DIBR distortion simulation pipeline.
Figure 3. Comparison of DIBR distortion simulation effects by local random noise. (a,b) are SVIs from sequences Lovebird1 and Balloons, respectively, and the enlarged areas are representative areas with both compression and DIBR distortion. (c–j) represent the DIBR distortion simulation effects of the rectangle areas in (a,b) by Gaussian, speckle, patch shuffle-based, and the proposed random irregular polygon-based noise on compressed captured views of Lovebird1 and Balloons, respectively.
Figure 4. Examples of generated random polygons. n denotes the number of vertices, ϵ denotes irregularity, and σ denotes spikiness. (a) n = 6, ϵ = 0, σ = 0. (b) n = 6, ϵ = 0.5, σ = 0. (c) n = 6, ϵ = 0, σ = 0.5. (d) n = 6, ϵ = 0.7, σ = 0.7. (e) n = 15, ϵ = 0.7, σ = 0.7.
Figure 5. Four possible ways of image denoising/restoration networks integrating with DIBR distortion position. (a) Intuitive way of integrating ground truth DIBR distortion position. (b) Successive networks with DIBR distortion prediction. (c) Parallel networks with DIBR distortion prediction. (d) Parallel interactive network with DIBR distortion prediction.
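As a rough illustration of variant (b) in Figure 5 (successive networks with DIBR distortion prediction), the following PyTorch sketch shows how a small mask prediction sub-network could be placed in front of an arbitrary SVQE backbone. The module names, layer widths, and the 1 × 1 fusion layer are our own assumptions for illustration and are not taken from the paper's released implementation.

```python
import torch
import torch.nn as nn

class MaskPredictor(nn.Module):
    """Tiny stand-in for a DIBR distortion mask prediction sub-network (assumed layout)."""
    def __init__(self, channels=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid(),  # soft mask in [0, 1]
        )

    def forward(self, x):
        return self.body(x)

class SuccessiveSVQE(nn.Module):
    """Variant (b): predict the mask first, then feed image + mask to the enhancer."""
    def __init__(self, enhancer):
        super().__init__()
        self.mask_net = MaskPredictor()
        self.fuse = nn.Conv2d(4, 3, 1)   # merge RGB + mask back to 3 channels
        self.enhancer = enhancer          # e.g., a DnCNN- or NAFNet-style backbone

    def forward(self, x):
        mask = self.mask_net(x)
        fused = self.fuse(torch.cat([x, mask], dim=1))
        return self.enhancer(fused), mask

# Usage sketch: any 3-channel-to-3-channel enhancer can be plugged in.
enhancer = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
                         nn.Conv2d(64, 3, 3, padding=1))
model = SuccessiveSVQE(enhancer)
out, mask = model(torch.randn(1, 3, 64, 64))
```

In variants (c) and (d) of Figure 5, the two branches would instead run in parallel, with variant (d) additionally exchanging intermediate features rather than being chained.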
Figure 6. Example images of the NYU- and DIV2K-based synthetic image datasets. Zoom in for a better view of the synthetic DIBR distortion. (a) NYU-based synthetic images. (b) DIV2K-based synthetic images.
Figure 7. Visual quality comparison of two denoising models, i.e., DnCNN and NAFNet, for SVQE of Dancer after pre-training on the synthetic synthesized image database generated from the NYU database with different random noise types, i.e., compress, Gaussian, speckle, patch shuffle, and randompoly (the proposed DIBR distortion simulation method). ‘Mask’ denotes the denoising models further integrated with a DIBR distortion mask prediction sub-network using synthetic images generated by the ‘randompoly’ method.
Figure 8. Visual quality comparison of two denoising models, i.e., DnCNN and NAFNet, for SVQE of Poznanhall2 after pre-training on the synthetic synthesized image database generated from the NYU database with different random noise types, i.e., compress, Gaussian, speckle, patch shuffle, and randompoly (the proposed DIBR distortion simulation method). ‘Mask’ denotes the denoising models further integrated with a DIBR distortion mask prediction sub-network using synthetic images generated by the ‘randompoly’ method.
Figure 9. Time complexity comparisons between SVQE models with and without DIBR distortion mask prediction sub-network.
Table 2. Definitions of key variables and acronyms.
Variables | Descriptions
D_s / D_t, χ, T_s / T_t | The source/target domain, the feature space, and the source/target learning task, respectively
X = {x_1, x_2, …, x_n} | The set of data samples, which belongs to the feature space χ
D_s = {(x_s1, y_s1), (x_s2, y_s2), …, (x_sn, y_sn)}, D_t = {(x_t1, y_t1), (x_t2, y_t2), …, (x_tn, y_tn)} | Ground truth/distorted image pairs for the source/target learning tasks
I, I_δ, I_sh, I_syn | A captured view image, I with added random noise, I with added synthetic geometric distortion generated by the proposed random polygon method, and the synthetic synthesized image, respectively
M | Mask indicating whether an area in I corresponds to strong depth edges
θ_i, r_i, n, R | The angle and the radius between the i-th vertex and the assumed center point, the number of vertices, and the average radius of the generated random polygon, respectively
ϵ, σ | Random variables indicating the irregularity and spikiness of the generated random polygon, respectively
MVD, SVQE, SVI, IQA | Acronyms for Multi-view Video plus Depth, synthesized view quality enhancement, synthesized view image, and image quality assessment, respectively
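To make the polygon parameters in Table 2 concrete, the following Python sketch shows one plausible way to sample n vertices (θ_i, r_i) around an assumed center with average radius R, irregularity ϵ, and spikiness σ, in the spirit of the examples in Figure 4. The function name, clipping bounds, and normalization details are our own assumptions rather than the authors' released code.

```python
import math
import random

def random_irregular_polygon(cx, cy, R, n, irregularity, spikiness):
    """Sample n vertices of a random polygon around (cx, cy).

    R            -- average radius of the polygon
    irregularity -- epsilon in [0, 1], jitter of the angular steps between vertices
    spikiness    -- sigma in [0, 1], jitter of each vertex radius around R
    (Illustrative sketch only; parameter handling is assumed, not taken from the paper.)
    """
    irregularity = irregularity * 2 * math.pi / n
    spikiness = spikiness * R

    # Angular steps: perturb the regular step 2*pi/n, then renormalize
    # so that all steps still sum to a full turn.
    steps = [random.uniform(2 * math.pi / n - irregularity,
                            2 * math.pi / n + irregularity) for _ in range(n)]
    norm = 2 * math.pi / sum(steps)
    steps = [s * norm for s in steps]

    # Radii: Gaussian jitter around the mean radius R, clipped to a sane range.
    points = []
    angle = random.uniform(0, 2 * math.pi)
    for step in steps:
        r_i = min(max(random.gauss(R, spikiness), 0.1), 2 * R)
        points.append((cx + r_i * math.cos(angle), cy + r_i * math.sin(angle)))
        angle += step
    return points

# Example: a 6-vertex polygon with irregularity 0.5 and spikiness 0.5,
# comparable to the settings illustrated in Figure 4.
vertices = random_irregular_polygon(cx=100, cy=100, R=30, n=6,
                                    irregularity=0.5, spikiness=0.5)
```

Filling such a polygon into the mask M and perturbing the covered pixels would then approximate the irregular geometric displacement of DIBR distortion, in the spirit of the pipeline sketched in Figure 2.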
Table 3. Dataset descriptions.
Datasets | Origins/Benchmark | Resolution | Contained Noise | Training | Testing
Synthetic SVI datasets (for pre-training) | NYU Depth Dataset V2 [36] | 640 × 480 | Compression and synthetic DIBR distortion | 1449 images | /
Synthetic SVI datasets (for pre-training) | DIV2K [37] | 2K ¹ | Compression and synthetic DIBR distortion | 750 images | /
Real MVD dataset | SIAT Database [46] | 1920 × 1088 / 1024 × 768 | Compression and DIBR distortion | 94 images | 1200 images
¹ The images in the DIV2K dataset are of various 2K resolutions, e.g., 1356 × 2040, 2040 × 1536, and 2040 × 960.
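For reference, the PSNR values reported in Tables 4–8 below follow the standard definition. The following minimal sketch (our own illustration, not the authors' evaluation code) shows how per-frame PSNR against the reference synthesized view can be computed and then averaged over a sequence.

```python
import numpy as np

def psnr(reference, enhanced, max_val=255.0):
    """PSNR (dB) between two 8-bit images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

# Per-sequence scores such as those in Tables 4-8 are averages of per-frame
# PSNR values (illustrative usage; frame loading is omitted):
# seq_psnr = np.mean([psnr(ref, out) for ref, out in frame_pairs])
```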
Table 4. SVQE performance comparison of DnCNN on Seqs-H.264 by pre-training on the synthetic synthesized image database with different random noise types generated from the NYU database. ‘Randompoly’ is the proposed DIBR distortion simulation method and is highlighted. The best and second-best results are highlighted in bold, and the best results are additionally underlined.
Metrics | Models | Kendo | Newspaper | Lovebird1 | Poznanhall2 | Dancer | Outdoor | Poznancarpark | Average
PSNR | w/o pre-train | 33.58 | 29.79 | 31.98 | 34.94 | 30.90 | 33.15 | 30.96 | 32.19
PSNR | Compress | 34.05 | 29.93 | 32.15 | 35.31 | 31.89 | 33.73 | 31.50 | 32.65
PSNR | Gaussian | 34.12 | 29.96 | 32.21 | 35.31 | 31.97 | 33.64 | 31.51 | 32.67
PSNR | Speckle | 34.09 | 29.94 | 32.21 | 35.26 | 31.94 | 33.66 | 31.47 | 32.65
PSNR | Patch shuffle | 34.19 | 29.88 | 32.16 | 35.31 | 32.16 | 33.75 | 31.50 | 32.71
PSNR | Randompoly | 34.09 | 29.93 | 32.18 | 35.29 | 32.17 | 33.73 | 31.49 | 32.70
IW-SSIM | w/o pre-train | 0.9318 | 0.9095 | 0.9402 | 0.9067 | 0.9332 | 0.9642 | 0.9215 | 0.9296
IW-SSIM | Compress | 0.9365 | 0.9124 | 0.9420 | 0.9108 | 0.9421 | 0.9679 | 0.9253 | 0.9338
IW-SSIM | Gaussian | 0.9355 | 0.9132 | 0.9424 | 0.9098 | 0.9433 | 0.9671 | 0.9249 | 0.9338
IW-SSIM | Speckle | 0.9351 | 0.9136 | 0.9427 | 0.9087 | 0.9430 | 0.9675 | 0.9252 | 0.9337
IW-SSIM | Patch shuffle | 0.9351 | 0.9136 | 0.9419 | 0.9094 | 0.9448 | 0.9675 | 0.9252 | 0.9339
IW-SSIM | Randompoly | 0.9357 | 0.9132 | 0.9425 | 0.9100 | 0.9447 | 0.9678 | 0.9252 | 0.9342
MPPSNRr | w/o pre-train | 36.62 | 31.53 | 36.05 | 37.73 | 29.40 | 34.42 | 34.27 | 34.29
MPPSNRr | Compress | 36.98 | 31.98 | 36.58 | 37.82 | 31.99 | 35.10 | 34.79 | 35.03
MPPSNRr | Gaussian | 37.05 | 32.07 | 36.58 | 37.71 | 32.38 | 35.02 | 34.79 | 35.09
MPPSNRr | Speckle | 37.03 | 32.13 | 36.54 | 37.77 | 32.21 | 34.91 | 34.78 | 35.05
MPPSNRr | Patch shuffle | 37.04 | 32.17 | 36.58 | 37.87 | 32.67 | 35.09 | 34.81 | 35.18
MPPSNRr | Randompoly | 37.13 | 32.10 | 36.57 | 37.91 | 32.66 | 34.88 | 34.79 | 35.15
SC-IQA | w/o pre-train | 19.77 | 17.06 | 19.32 | 20.32 | 15.66 | 21.86 | 16.56 | 18.65
SC-IQA | Compress | 20.22 | 17.55 | 19.76 | 20.45 | 18.01 | 24.48 | 17.39 | 19.70
SC-IQA | Gaussian | 20.26 | 17.55 | 19.88 | 20.49 | 18.06 | 23.96 | 17.46 | 19.67
SC-IQA | Speckle | 20.17 | 17.49 | 20.06 | 20.43 | 18.20 | 24.07 | 17.37 | 19.68
SC-IQA | Patch shuffle | 20.29 | 17.55 | 19.70 | 20.49 | 18.46 | 24.78 | 17.43 | 19.81
SC-IQA | Randompoly | 20.28 | 17.57 | 19.97 | 20.51 | 18.19 | 24.39 | 17.38 | 19.75
Table 5. SVQE performance comparison of NAFNet on Seqs-H.264 by pre-training on the synthetic synthesized image database with different random noise types generated from the NYU database. ‘Randompoly’ is the proposed DIBR distortion simulation method and is highlighted. The best and second-best results are highlighted in bold, and the best results are additionally underlined.
Metrics | Models | Kendo | Newspaper | Lovebird1 | Poznanhall2 | Dancer | Outdoor | Poznancarpark | Average
PSNR | w/o pre-train | 34.00 | 29.86 | 32.32 | 35.39 | 31.28 | 33.42 | 31.50 | 32.54
PSNR | Compress | 34.30 | 30.02 | 32.30 | 35.43 | 32.19 | 33.90 | 31.66 | 32.83
PSNR | Gaussian | 34.16 | 29.96 | 32.27 | 35.45 | 32.49 | 33.86 | 31.62 | 32.83
PSNR | Speckle | 34.26 | 29.97 | 32.35 | 35.40 | 32.35 | 33.66 | 31.63 | 32.80
PSNR | Patch shuffle | 34.29 | 30.07 | 32.32 | 35.40 | 32.41 | 33.86 | 31.66 | 32.86
PSNR | Randompoly | 34.27 | 30.04 | 32.42 | 35.41 | 32.59 | 33.89 | 31.66 | 32.90
IW-SSIM | w/o pre-train | 0.9386 | 0.9136 | 0.9434 | 0.9151 | 0.9342 | 0.9671 | 0.9245 | 0.9338
IW-SSIM | Compress | 0.9417 | 0.9156 | 0.9444 | 0.9158 | 0.9454 | 0.9694 | 0.9284 | 0.9373
IW-SSIM | Gaussian | 0.9413 | 0.9163 | 0.9448 | 0.9166 | 0.9477 | 0.9691 | 0.9279 | 0.9377
IW-SSIM | Speckle | 0.9414 | 0.9158 | 0.9447 | 0.9156 | 0.9463 | 0.9685 | 0.9272 | 0.9371
IW-SSIM | Patch shuffle | 0.9413 | 0.9159 | 0.9441 | 0.9156 | 0.9468 | 0.9693 | 0.9280 | 0.9373
IW-SSIM | Randompoly | 0.9416 | 0.9164 | 0.9456 | 0.9158 | 0.9488 | 0.9694 | 0.9285 | 0.9380
MPPSNRr | w/o pre-train | 36.99 | 32.04 | 36.57 | 37.93 | 32.09 | 34.76 | 34.80 | 35.02
MPPSNRr | Compress | 37.20 | 32.20 | 36.82 | 37.97 | 32.71 | 35.41 | 34.91 | 35.32
MPPSNRr | Gaussian | 37.15 | 32.23 | 36.87 | 37.95 | 33.08 | 35.40 | 34.96 | 35.38
MPPSNRr | Speckle | 37.22 | 32.23 | 36.86 | 37.95 | 32.97 | 35.44 | 34.84 | 35.36
MPPSNRr | Patch shuffle | 37.25 | 32.27 | 36.85 | 37.87 | 33.14 | 35.51 | 34.90 | 35.40
MPPSNRr | Randompoly | 37.27 | 32.11 | 36.64 | 38.04 | 33.11 | 35.48 | 34.96 | 35.37
SC-IQA | w/o pre-train | 20.04 | 17.63 | 20.02 | 20.59 | 17.71 | 23.59 | 17.28 | 19.55
SC-IQA | Compress | 20.33 | 17.56 | 20.06 | 20.54 | 18.10 | 24.20 | 17.48 | 19.75
SC-IQA | Gaussian | 20.35 | 17.53 | 20.36 | 20.60 | 18.64 | 24.34 | 17.36 | 19.88
SC-IQA | Speckle | 20.39 | 17.50 | 20.48 | 20.59 | 18.35 | 23.53 | 17.47 | 19.76
SC-IQA | Patch shuffle | 20.46 | 17.57 | 20.04 | 20.59 | 18.43 | 24.51 | 17.48 | 19.87
SC-IQA | Randompoly | 20.55 | 17.61 | 20.34 | 20.65 | 18.97 | 24.30 | 17.50 | 19.99
Table 6. SVQE comparison measured by image quality metrics among DnCNN-, VRCNN-, TSAN-, and NAFNet-based schemes by pre-training on the synthetic databases from NYU and DIV2K on Seqs-H.264. Model (DnCNN, VRCNN, TSAN, NAFNet) represents the baselines, and model-syn-N/D represents the existing models (DnCNN, VRCNN, TSAN, NAFNet) combined with the transfer learning scheme using our proposed synthetic images. The performance gains of the proposed model-syn-N/D over model are highlighted in bold.
Metrics | Models | Kendo | Newspaper | Lovebird1 | Poznanhall2 | Dancer | Outdoor | Poznancarpark | Average
PSNR | DnCNN | 33.58 | 29.79 | 31.98 | 34.94 | 30.90 | 33.15 | 30.96 | 32.19
PSNR | DnCNN-syn-N | 34.09 | 29.93 | 32.18 | 35.29 | 32.17 | 33.73 | 31.49 | 32.70 (+0.51)
PSNR | DnCNN-syn-D | 34.15 | 29.93 | 32.19 | 35.30 | 32.33 | 33.82 | 31.52 | 32.75 (+0.56)
PSNR | VRCNN | 33.90 | 29.84 | 32.09 | 35.14 | 31.52 | 33.28 | 31.36 | 32.45
PSNR | VRCNN-syn-N | 33.99 | 29.87 | 32.08 | 35.20 | 31.59 | 33.55 | 31.39 | 32.52 (+0.08)
PSNR | VRCNN-syn-D | 34.13 | 29.94 | 32.12 | 35.24 | 32.03 | 33.75 | 31.44 | 32.66 (+0.21)
PSNR | TSAN | 33.93 | 29.99 | 32.27 | 35.03 | 31.64 | 33.42 | 31.08 | 32.48
PSNR | TSAN-syn-N | 34.12 | 29.88 | 32.16 | 35.32 | 32.48 | 33.84 | 31.40 | 32.74 (+0.26)
PSNR | TSAN-syn-D | 34.20 | 30.04 | 32.30 | 35.38 | 32.45 | 33.80 | 31.53 | 32.81 (+0.33)
PSNR | NAFNet | 34.00 | 29.86 | 32.32 | 35.39 | 31.28 | 33.42 | 31.50 | 32.54
PSNR | NAFNet-syn-N | 34.27 | 30.04 | 32.42 | 35.41 | 32.59 | 33.89 | 31.66 | 32.90 (+0.36)
PSNR | NAFNet-syn-D | 34.40 | 30.09 | 32.34 | 35.51 | 32.73 | 34.04 | 31.71 | 32.97 (+0.44)
IW-SSIM | DnCNN | 0.9318 | 0.9095 | 0.9402 | 0.9067 | 0.9332 | 0.9642 | 0.9215 | 0.9296
IW-SSIM | DnCNN-syn-N | 0.9357 | 0.9132 | 0.9425 | 0.9100 | 0.9447 | 0.9678 | 0.9252 | 0.9342 (+0.0046)
IW-SSIM | DnCNN-syn-D | 0.9377 | 0.9135 | 0.9436 | 0.9116 | 0.9454 | 0.9681 | 0.9261 | 0.9351 (+0.0056)
IW-SSIM | VRCNN | 0.9324 | 0.9115 | 0.9406 | 0.9062 | 0.9401 | 0.9662 | 0.9227 | 0.9314
IW-SSIM | VRCNN-syn-N | 0.9337 | 0.9121 | 0.9404 | 0.9068 | 0.9408 | 0.9666 | 0.9234 | 0.9320 (+0.0006)
IW-SSIM | VRCNN-syn-D | 0.9336 | 0.9129 | 0.9410 | 0.9064 | 0.9449 | 0.9675 | 0.9240 | 0.9329 (+0.0015)
IW-SSIM | TSAN | 0.9330 | 0.9138 | 0.9399 | 0.9066 | 0.9407 | 0.9665 | 0.9209 | 0.9316
IW-SSIM | TSAN-syn-N | 0.9330 | 0.9138 | 0.9399 | 0.9130 | 0.9471 | 0.9665 | 0.9270 | 0.93433 (+0.0027)
IW-SSIM | TSAN-syn-D | 0.9424 | 0.9160 | 0.9445 | 0.9151 | 0.9459 | 0.9686 | 0.9277 | 0.93716 (+0.0055)
IW-SSIM | NAFNet | 0.9386 | 0.9136 | 0.9434 | 0.9151 | 0.9342 | 0.9671 | 0.9245 | 0.9338
IW-SSIM | NAFNet-syn-N | 0.9416 | 0.9164 | 0.9456 | 0.9158 | 0.9488 | 0.9694 | 0.9285 | 0.9380 (+0.0042)
IW-SSIM | NAFNet-syn-D | 0.9439 | 0.9171 | 0.9452 | 0.9169 | 0.9491 | 0.9702 | 0.9291 | 0.9388 (+0.0050)
Table 7. SVQE comparison measured by SVI metrics among DnCNN-, VRCNN-, TSAN-, and NAFNet-based schemes by pre-training on the synthetic databases from NYU and DIV2K on Seqs-H.264. Model (DnCNN, VRCNN, TSAN, NAFNet) represents the baselines, and model-syn-N/D represents the existing models (DnCNN, VRCNN, TSAN, NAFNet) combined with the transfer learning scheme using our proposed synthetic images. The performance gains of the proposed model-syn-N/D over model are highlighted in bold.
Metrics | Models | Kendo | Newspaper | Lovebird1 | Poznanhall2 | Dancer | Outdoor | Poznancarpark | Average
MPPSNRr | DnCNN | 36.62 | 31.53 | 36.05 | 37.73 | 29.40 | 34.42 | 34.27 | 34.29
MPPSNRr | DnCNN-syn-N | 37.13 | 32.10 | 36.57 | 37.91 | 32.66 | 34.88 | 34.79 | 35.15 (+0.86)
MPPSNRr | DnCNN-syn-D | 37.11 | 31.89 | 36.51 | 37.92 | 33.01 | 35.24 | 34.77 | 35.21 (+0.92)
MPPSNRr | VRCNN | 36.89 | 32.06 | 36.54 | 37.78 | 31.69 | 34.69 | 34.71 | 34.91
MPPSNRr | VRCNN-syn-N | 36.97 | 32.04 | 36.33 | 37.84 | 31.91 | 34.84 | 34.77 | 34.96 (+0.05)
MPPSNRr | VRCNN-syn-D | 36.96 | 32.22 | 36.64 | 37.79 | 32.89 | 35.44 | 34.71 | 35.24 (+0.33)
MPPSNRr | TSAN | 36.79 | 32.21 | 36.58 | 37.69 | 32.58 | 35.29 | 34.73 | 35.12
MPPSNRr | TSAN-syn-N | 37.28 | 32.23 | 36.64 | 37.82 | 33.39 | 35.38 | 34.84 | 35.37 (+0.24)
MPPSNRr | TSAN-syn-D | 37.26 | 32.06 | 36.65 | 37.89 | 33.43 | 35.33 | 34.87 | 35.35 (+0.23)
MPPSNRr | NAFNet | 36.99 | 32.04 | 36.57 | 37.93 | 32.09 | 34.76 | 34.80 | 35.02
MPPSNRr | NAFNet-syn-N | 37.27 | 32.11 | 36.64 | 38.04 | 33.11 | 35.48 | 34.96 | 35.37 (+0.25)
MPPSNRr | NAFNet-syn-D | 37.46 | 32.22 | 36.83 | 38.00 | 33.52 | 35.55 | 34.89 | 35.49 (+0.47)
SC-IQA | DnCNN | 19.77 | 17.06 | 19.32 | 20.32 | 15.66 | 21.86 | 16.56 | 18.65
SC-IQA | DnCNN-syn-N | 20.28 | 17.57 | 19.97 | 20.51 | 18.19 | 24.39 | 17.38 | 19.75 (+1.10)
SC-IQA | DnCNN-syn-D | 20.30 | 17.58 | 19.95 | 20.47 | 18.31 | 24.01 | 17.37 | 19.71 (+1.06)
SC-IQA | VRCNN | 20.11 | 17.46 | 19.51 | 20.27 | 18.14 | 22.88 | 17.30 | 19.38
SC-IQA | VRCNN-syn-N | 20.16 | 17.52 | 19.47 | 20.34 | 17.66 | 24.12 | 17.28 | 19.51 (+0.13)
SC-IQA | VRCNN-syn-D | 20.14 | 17.55 | 19.47 | 20.32 | 17.60 | 24.97 | 17.16 | 19.60 (+0.22)
SC-IQA | TSAN | 19.88 | 17.59 | 19.50 | 20.28 | 17.34 | 24.53 | 16.78 | 19.42
SC-IQA | TSAN-syn-N | 20.33 | 17.39 | 19.33 | 20.52 | 18.58 | 24.47 | 16.89 | 19.65 (+0.23)
SC-IQA | TSAN-syn-D | 20.34 | 17.57 | 19.75 | 20.66 | 18.55 | 23.72 | 17.14 | 19.68 (+0.26)
SC-IQA | NAFNet | 20.04 | 17.63 | 20.02 | 20.59 | 17.71 | 23.59 | 17.28 | 19.55
SC-IQA | NAFNet-syn-N | 20.55 | 17.61 | 20.34 | 20.65 | 18.97 | 24.30 | 17.50 | 19.99 (+0.44)
SC-IQA | NAFNet-syn-D | 20.65 | 17.60 | 20.04 | 20.74 | 19.15 | 25.05 | 17.59 | 20.12 (+0.57)
Table 8. SVQE comparison measured by PSNR between DnCNN-based schemes and those with ground truth DIBR distortion masks on Seqs-H.264. DnCNN-syn-GTmask-N and DnCNN-syn-GTmask-D are abbreviated as DnCNN-syn-GM-N and DnCNN-syn-GM-D, respectively. The performance gains of the DnCNN-based schemes with ground truth DIBR distortion masks over those without masks are highlighted in bold.
Models | Kendo | Newspaper | Lovebird1 | Poznanhall2 | Dancer | Outdoor | Poznancarpark | Average
DnCNN | 33.58 | 29.79 | 31.98 | 34.94 | 30.90 | 33.15 | 30.96 | 32.19
DnCNN-GTmask | 34.35 | 30.13 | 32.16 | 34.60 | 32.61 | 33.59 | 30.85 | 32.61 (+0.42)
DnCNN-syn-N | 34.09 | 29.92 | 32.18 | 35.30 | 32.14 | 33.74 | 31.50 | 32.70
DnCNN-syn-GM-N | 35.20 | 30.72 | 32.44 | 35.09 | 33.05 | 34.00 | 31.03 | 33.07 (+0.37)
DnCNN-syn-D | 34.12 | 29.91 | 32.16 | 35.32 | 32.29 | 33.79 | 31.53 | 32.73
DnCNN-syn-GM-D | 35.35 | 30.83 | 32.51 | 34.92 | 33.22 | 34.17 | 30.86 | 33.12 (+0.39)
Table 9. SVQE comparison between DnCNN- and NAFNet-based schemes (model-syn-N/D) and those with DIBR distortion prediction (model-syn-mask-N/D) on Seqs-H.264. The results of the SVQE models with DIBR distortion prediction are highlighted in bold only when they are superior to the same model pre-trained on the same synthetic database.
Model | PSNR | IW-SSIM | MPPSNRr | SC-IQA
DnCNN-syn-N | 32.70 | 0.9342 | 35.148 | 19.75
DnCNN-syn-mask-N | 32.72 | 0.9341 | 35.152 | 19.80
DnCNN-syn-D | 32.75 | 0.9352 | 35.208 | 19.71
DnCNN-syn-mask-D | 32.78 | 0.9345 | 35.264 | 19.82
NAFNet-syn-N | 32.90 | 0.9380 | 35.372 | 19.99
NAFNet-syn-mask-N | 32.93 | 0.9380 | 35.493 | 20.00
NAFNet-syn-D | 32.97 | 0.9388 | 35.493 | 20.12
NAFNet-syn-mask-D | 32.98 | 0.9388 | 35.528 | 20.08
Table 10. Computational complexity comparison among different types of random noise for DIBR distortion simulation (in milliseconds (ms)).
Database | Gaussian | Speckle | Patch shuffle | Randompoly
NYU Depth Dataset V2 | 67.90 | 65.61 | 76.45 | 1268.82