Article

Blind Image Quality Assessment Based on Classification Guidance and Feature Aggregation

Weipeng Cai, Cien Fan, Lian Zou, Yifeng Liu, Yang Ma and Minyuan Wu
1 School of Electronic Information, Wuhan University, Wuhan 430072, China
2 National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data (NEL-PSRPC), Beijing 100041, China
* Author to whom correspondence should be addressed.
Electronics 2020, 9(11), 1811; https://doi.org/10.3390/electronics9111811
Submission received: 29 September 2020 / Revised: 23 October 2020 / Accepted: 24 October 2020 / Published: 2 November 2020
(This article belongs to the Section Computer Science & Engineering)

Abstract:
In this work, we present a convolutional neural network (CNN) named CGFA-CNN for blind image quality assessment (BIQA). A unique two-stage strategy is utilized which first identifies the distortion type in an image using Sub-Network I and then quantifies this distortion using Sub-Network II. Different from most deep neural networks, we extract hierarchical features as descriptors to enhance the image representation and design a feature aggregation layer, trained in an end-to-end manner, that applies Fisher encoding to visual vocabularies modeled by Gaussian mixture models (GMMs). To account for both authentic and synthetic distortions, the hierarchical features combine the characteristics of a CNN trained on our self-built dataset and a CNN trained on ImageNet. We evaluated our algorithm on four publicly available databases, and the results demonstrate that our CGFA-CNN achieves superior performance over other methods on both synthetic and authentic databases.

1. Introduction

Digital pictures may suffer from various distortions during acquisition, transmission, and compression, leading to unsatisfactory perceived visual quality or a certain level of annoyance. Thus, it is crucial to predict the quality of digital pictures in many applications, such as compression, communication, printing, display, analysis, registration, restoration, and enhancement [1,2,3]. Generally, image quality assessment approaches can be classified into three kinds according to the additional information needed. Specifically, full-reference image quality assessment (FR-IQA) [4,5,6,7] and reduced-reference image quality assessment (RR-IQA) [8,9,10] need full and partial information of reference images, respectively, while blind image quality assessment (BIQA) [11,12,13,14] measures quality without any information from the reference image. Thus, BIQA methods are more attractive in many practical applications because the reference image is usually unavailable or hard to obtain.
Early studies mainly focused on one or more specific distortion types, such as Gaussian blur [15], blockiness from JPEG compression [16], or ringing arising from JPEG2000 compression [17]. However, images may be affected by unknown distortion in many practical scenarios. In contrast, general BIQA methods aim to work well for arbitrary distortion, which can be classified into two categories according to the features extracted, i.e., Natural Scene Statistics (NSS)-based methods and training-based methods.
NSS-based methods [18] assume that non-distorted natural images obey certain perceptually relevant statistical laws that are violated by the presence of common image distortions, and they attempt to describe an image by its scene statistics in different domains. For example, BRISQUE [19] derives features from the locally normalized luminance coefficients in the spatial domain. M3 [20] utilizes joint local contrast features from the gradient magnitude (GM) map and the Laplacian of Gaussian (LOG) response. Later, a perceptually motivated and feature-driven model was deployed in FRIQUEE [21], in which a large collection of features defined in various complementary, perceptually relevant color and transform-domain spaces is drawn from among the most successful BIQA models produced to date.
However, knowledge-driven feature extraction and data-driven quality prediction are separated in the above methods. It has been demonstrated that training-based methods outperform NSS-based methods by a large margin because a fully data-driven BIQA solution becomes possible. For example, CORNIA [22] constructs a codebook in an unsupervised manner, using raw image patches as local descriptors and soft assignment for encoding. Considering that the feature sets generally adopted in previous methods come from zero-order statistics and are insufficient for BIQA, HOSA [23] constructs a much smaller codebook using K-means clustering [24] and introduces higher-order statistics. In contrast to these methods, which capture spatially normalized coefficients and codebook-based features, CNN-based methods learn features automatically from beginning to end. For example, TSCN [25] aims to learn the complicated relationship between visual appearance and perceived quality via a two-stream convolutional neural network. DIQA [26] defines two separate CNN branches to learn objective distortion and human visual sensitivity, respectively.
In this work, we propose an end-to-end BIQA method based on classification guidance and feature aggregation, which is accomplished by two sub-networks that share features in the early layers. Due to the lack of training data, we construct a large-scale dataset by synthesizing distortions and pre-train Sub-Network I to classify an image into a specific distortion type from a set of pre-defined categories. We find that the proposed method can hardly achieve high accuracy on authentic images if it is exposed only to synthetic distortions during training. Therefore, we extract hierarchical features from the shared layers of the two sub-networks and from another CNN (VGG-16 [27]) pre-trained on ImageNet [28], whose pictures contain distortions that occur as a natural consequence of photography, and form a unified feature group.
Sub-Network II takes the hierarchical features and the classification information as inputs to predict the perceptual quality. The combination of the two sub-networks provides the learning framework with distortion probabilities that support quality perception and with proper parameter initialization in an end-to-end training manner. We design a feature aggregation layer that converts inputs of arbitrary size to a fixed-length representation. Then, a fully connected layer is exploited as a linear regression model to map the high-dimensional features to quality scores. This allows the proposed CGFA-CNN to accept an image of any size as input, so there is no need to perform any transformation of the images (such as cropping or scaling), which would affect perceptual quality scores.
The paper is structured as follows. In Section 2, previous work on CNN-based BIQA related to our work is briefly reviewed. In Section 3, details of the proposed method are described. In Section 4, experimental results on the public IQA databases and the corresponding analysis are presented. In Section 5, the work of this paper is concluded.

2. Related Work

In this section, we provide a brief survey of the major solutions to the lack of training data in BIQA and a review of recent studies related to our work.
Because the number of parameters to be trained in a CNN is usually very large, the training set needs to contain sufficient data to avoid over-fitting. However, the numbers of samples and image contents in the public quality-annotated image databases are rather limited, which cannot meet the need for end-to-end training of a deep network. Currently, there are two main methods to tackle this challenge.
The first method is to train the model on image patches. For example, deepIQA [29] randomly samples image patches from the entire image as inputs and predicts the quality score of local regions by assigning the mean opinion score (MOS) of the whole image to all patches within it. Although taking small patches as inputs for data augmentation is superior to using the whole image in a given dataset, this method still suffers from limitations because local image quality varies with content across spatial locations even when the distortion is homogeneous. To resolve this problem, BIECON [30] makes use of existing FR-IQA algorithms to assign quality labels to sampled image patches, but the performance of such a network depends highly on that of the FR-IQA models. Other methods such as dipIQ [31], which generates discriminable image pairs by involving FR-IQA models, may suffer from similar problems.
The second method is to pre-train a network on large-scale datasets from other fields. For each pre-trained architecture, two types of back-end training strategies are available: replacing the last layer of the pre-trained CNN model with a regression layer and fine-tuning it on the IQA database to conduct image quality prediction, or using SVR to regress the features extracted by the pre-trained networks onto subjective scores. For instance, DeepBIQ [32] reports on the use of different features extracted from CNNs pre-trained for image classification tasks on ImageNet [28] and Places365 [33] as a generic image description. Kim et al. [34] selected the well-known deep CNN models AlexNet [35] and ResNet50 [36], pre-trained for image classification on ImageNet [28], as the architectures of their baseline models. Methods that directly inherit the weights of models pre-trained for general image classification suffer from low relevance to BIQA and unnecessary complexity.
To better address the shortage of training data, MEON [37] proposes a cascaded multi-task framework, which first trains a distortion type identification network on large-scale pre-defined samples. A quality prediction network is then trained, taking advantage of the distortion information obtained in the first stage. Furthermore, DB-CNN [38] not only constructs a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions, but also uses ImageNet [28] to pre-train another CNN for authentic distortions. Motivated by MEON [37] and DB-CNN [38], we also construct a pre-training set based on the Waterloo Exploration Database [39] and PASCAL VOC [40] for synthetic distortions. Besides, both the distortion type and the distortion level are considered at the same time, which results in better quality-aware initializations and richer distortion information.
Although previous DNN-based BIQA methods have achieved significant performance, they usually comprise convolutional and pooling layers for feature extraction and employ fully connected layers for regression, which suffers from three limitations. First, techniques such as average or maximum pooling are too simple to produce accurate representations. Second, a fully connected layer is destructive to the high-dimensional structure and spatial invariance of the local features. Third, such CNNs typically require a fixed image size: to feed the network, images have to be resized or cropped, and either scaling or cropping causes a perceptual difference from the assigned quality labels. To tackle these challenges, we explore more sophisticated pooling techniques based on clustering approaches such as the Bag of Visual Words (BOW) [41], the Vector of Locally Aggregated Descriptors (VLAD) [42], and Fisher Vectors [43]. Studies have shown that integrating VLAD as a differentiable module in a neural network can significantly improve the aggregated representation for place recognition [44] and video classification [45]. Our proposed feature aggregation layer acts as a pooling layer on top of the convolutional layers and converts inputs of arbitrary size to a fixed-length representation. Afterward, using a fully connected layer for regression does not require any preprocessing of the input image.

3. The Proposed Method

The framework of CGFA-CNN is illustrated in Figure 1. Sub-Network I, which is first pre-trained on a self-built dataset, classifies an image into a specific distortion type and initializes the shared layers for the further learning process. Sub-Network II predicts the perceptual quality of the same image; it is fine-tuned on the IQA databases and takes advantage of the distortion information obtained from Sub-Network I. The feature aggregation layer (FV layer) and the classification-guided gating unit (CGU) are described in Section 3.3 and Section 3.4.

3.1. Distortion Type Identification

3.1.1. Construction of the Pre-Training Dataset

Due to the deficiency of the available quality-annotated samples, we firstly construct a large-scale dataset based on Waterloo Database [39] and PASCAL VOC Database [40]. The former contains 4744 images and can be loosely categorized into seven classes. The latter contains 17,125 images covering 20 categories. In this paper, we merge the two databases and obtain 21,869 pristine images with various contents. Then, nine types of distortion are introduced: JPEG compression, JPEG2000 compression, Gaussian blur, white Gaussian noise, contrast stretching, pink noise, image quantization with color dithering, over-exposure, and under-exposure. We synthesize each image with five distortion levels following Ma et al. [39] except for over-exposure and under-exposure, where only three levels are generated according to Ma et al. [46]. The constructed dataset consists of 896,629 images, which are organized into 41 subcategories according to the distortion type and degradation level. We label these images by the subcategory they belong to.
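As an illustration only, the snippet below sketches how a few of these distortion types could be synthesized at several levels from a pristine image; the parameter grids (blur radii, JPEG quality factors, contrast factors) are assumptions for demonstration and not the exact settings of Ma et al. [39,46].

```python
# Sketch: synthesizing a few distortion types at several levels.
# The level parameters below are illustrative assumptions only.
from PIL import Image, ImageFilter, ImageEnhance
import io

def gaussian_blur(img, level):
    radii = [1, 2, 4, 8, 16]                      # assumed blur radii per level
    return img.filter(ImageFilter.GaussianBlur(radii[level]))

def jpeg_compress(img, level):
    qualities = [60, 40, 25, 15, 7]               # assumed JPEG quality factors
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=qualities[level])
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def contrast_change(img, level):
    factors = [0.8, 0.6, 0.4, 0.3, 0.2]           # assumed contrast factors
    return ImageEnhance.Contrast(img).enhance(factors[level])

SYNTHESIZERS = {"gblur": gaussian_blur, "jpeg": jpeg_compress, "contrast": contrast_change}

def synthesize(pristine_path):
    """Yield (distorted_image, class_label) pairs; each label indexes one
    (distortion type, level) subcategory used to supervise Sub-Network I."""
    img = Image.open(pristine_path).convert("RGB")
    label = 0
    for name, fn in SYNTHESIZERS.items():
        for level in range(5):
            yield fn(img, level), label
            label += 1
```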

3.1.2. Sub-Network I Architecture

Inspired by the VGG-16 network architecture [27], we design a similar structure with some modifications to identify the distortion type of the input image. Details are given in Table 1. The tailored VGG-16 network comprises a stack of convolutions (Conv) for feature extraction, one maximum pooling layer (MaxPool) for feature fusion, and three fully connected layers (FC) for feature regression. All hidden layers are equipped with the Rectified Linear Unit (ReLU) [35] and Batch Normalization (BN) [47]. We denote the input mini-batch training data by $\{(X^{(n)}, p^{(n)})\}_{n=1}^{N}$, where $X^{(n)}$ is the $n$th input image and $p^{(n)}$ is a multi-class indicator vector of the ground-truth distortion type. We append a soft-max layer at the end and define the soft-max function as
$$\hat{p}_i^{(n)}(X^{(n)};\mathbf{W}) = \frac{\exp\!\big(y_i^{(n)}(X^{(n)};\mathbf{W})\big)}{\sum_{j=1}^{C}\exp\!\big(y_j^{(n)}(X^{(n)};\mathbf{W})\big)},$$
where $\hat{p}^{(n)} = [\hat{p}_1^{(n)}, \ldots, \hat{p}_C^{(n)}]^T$ is a $C$-dimensional probability vector of the $n$th input in a mini-batch, indicating the probability of each distortion type. Model parameters of Sub-Network I are collectively denoted by $\mathbf{W}$. A cross-entropy loss is used to train this sub-network:
$$\ell_s\big(\{X^{(n)}\};\mathbf{W}\big) = -\sum_{n=1}^{N}\sum_{i=1}^{C} p_i^{(n)} \log \hat{p}_i^{(n)}(X^{(n)};\mathbf{W}).$$
Notably, in the fine-tuning phase, except for the shared layers, the rest of Sub-Network I only participates in the forward propagation and the parameters are fixed.
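The following PyTorch sketch illustrates this pre-training objective and the freezing of the non-shared part of Sub-Network I during fine-tuning; the module names (SubNetworkI, shared_layers, head) are placeholders, not the authors' code.

```python
# Sketch (PyTorch): pre-training Sub-Network I with cross-entropy over the 41
# (distortion type, level) subcategories, then freezing its non-shared layers.
import torch
import torch.nn as nn

class SubNetworkI(nn.Module):
    """Placeholder for the tailored VGG-16-style classifier of Table 1."""
    def __init__(self, shared_layers: nn.Module, head: nn.Module):
        super().__init__()
        self.shared_layers = shared_layers   # convolutional stack shared with Sub-Network II
        self.head = head                     # MaxPool + FC layers producing the logits

    def forward(self, x):
        return self.head(self.shared_layers(x))   # raw logits y_i(X; W)

def pretrain_step(model, images, labels, optimizer):
    # labels: integer subcategory indices (0..40); nn.CrossEntropyLoss applies
    # log-softmax internally, matching the soft-max and loss defined above.
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

def freeze_non_shared(model: SubNetworkI):
    # In the fine-tuning phase only the shared layers keep learning; the rest
    # of Sub-Network I is used in forward passes with fixed weights.
    for p in model.head.parameters():
        p.requires_grad = False
```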

3.2. Feature Extraction and Fusion

In Figure 2, we can see that the representation of different distortion types varies across convolutional layers. Therefore, using only the features extracted from the last convolution is not enough to predict the quality of an image. Inspired by the idea of combining complementary features and the hierarchical feature extraction strategy of our previous work [48], we extract features from low-level, middle-level, and high-level convolutional layers as descriptors by rescaling and concatenating them. Sub-Network I, pre-trained on the synthesized dataset, identifies a given image's distortion type; we find that it takes advantage of synthetic images but fails to handle authentically distorted ones. More details can be found in Section 4.5. Therefore, we model synthetic and authentic distortions by two separate CNNs and fuse the two feature sets into a unified representation for final quality prediction. The tailored VGG-16 pre-trained on ImageNet, which contains many realistic natural images of different perceptual quality, is added to extract relevant features for authentic images. The proposed CGFA-CNN takes a raw image of size $H \times W \times 3$ as input and predicts its perceptual quality. The acquired fused feature group has size $\frac{H}{16} \times \frac{W}{16} \times D$, where $D$ is the number of channels of the hierarchical features. Sub-Network II takes the fused feature group and the estimated probability vector $\hat{p}^{(n)}$ as inputs.
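A minimal sketch of this fusion step, assuming the multi-level feature maps have already been extracted, rescales each map to H/16 × W/16 and concatenates along the channel dimension:

```python
# Sketch (PyTorch): rescaling multi-level feature maps to H/16 x W/16 and
# concatenating them into a single fused feature group with D channels.
import torch
import torch.nn.functional as F

def fuse_hierarchical_features(feature_maps, target_hw):
    """feature_maps: list of tensors shaped (B, C_i, H_i, W_i) taken from
    low/middle/high convolutional layers of the shared stack and of VGG-16.
    target_hw: (H // 16, W // 16)."""
    resized = [F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False)
               for f in feature_maps]
    return torch.cat(resized, dim=1)   # (B, D, H/16, W/16), D = sum of C_i
```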

3.3. Feature Aggregation Layer and Encoding

In this paper, we design a feature aggregation layer that employs Fisher Vectors (FV) [43] to perform the feature aggregation and encoding procedures. Because the GMM [49] and FV are not differentiable and thus do not permit theoretically valid backpropagation, we define an FV layer to yield a quality-aware feature vector $f$. The implementation is shown in Figure 3.
As illustrated in Figure 1, the fused feature group is an $\frac{H}{16} \times \frac{W}{16} \times D$ map, which can be considered as a set of $D$-dimensional descriptors extracted at $\frac{H}{16} \times \frac{W}{16}$ spatial locations. We utilize a GMM to obtain the cluster centers $C$ of its $K$ components and the encoding vector $f$ of the image descriptors $X$.

3.3.1. GMM Clustering

A Gaussian mixture model $p(x\mid\theta)$ is a mixture of $K$ multivariate Gaussian distributions [49], which can be formulated as
$$p(x\mid\theta) = \sum_{k=1}^{K} p(x\mid\mu_k,\Sigma_k)\,\pi_k,$$
$$p(x\mid\mu_k,\Sigma_k) = \frac{\exp\!\big(-\tfrac{1}{2}(x-\mu_k)^T \Sigma_k^{-1}(x-\mu_k)\big)}{\sqrt{(2\pi)^D \det\Sigma_k}},$$
$$\theta = (\pi_1,\mu_1,\Sigma_1,\ldots,\pi_K,\mu_K,\Sigma_K),$$
where $\theta$ is the vector of model parameters. For each Gaussian component, $\pi_k$ is the prior probability, $\mu_k$ is the mean, and $\Sigma_k$ is the diagonal covariance matrix. The parameters are learned from a training set of descriptors $x_1,\ldots,x_N$. The GMM defines the assignments $q_{ki}$ ($k=1,\ldots,K$; $i=1,\ldots,N$) of the $N$ descriptors to the $K$ Gaussian components:
$$q_{ki} = \frac{p(x_i\mid\mu_k,\Sigma_k)\,\pi_k}{\sum_{j=1}^{K} p(x_i\mid\mu_j,\Sigma_j)\,\pi_j}, \quad k = 1,\ldots,K.$$
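In practice, such a diagonal-covariance GMM can be fitted offline to a sample of descriptors, for example with scikit-learn; the sketch below computes the responsibilities $q_{ki}$ of the equation above (the authors' exact clustering procedure is not detailed here, so this is an illustrative assumption).

```python
# Sketch: fitting a diagonal-covariance GMM to sampled descriptors and
# computing the soft assignments q_ki (responsibilities) of the equation above.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(descriptors: np.ndarray, K: int = 32):
    """descriptors: (N, D) array sampled from the fused feature maps."""
    gmm = GaussianMixture(n_components=K, covariance_type="diag", max_iter=200)
    gmm.fit(descriptors)
    return gmm

def responsibilities(gmm, descriptors):
    # predict_proba returns q_ki, the posterior of component k given x_i.
    return gmm.predict_proba(descriptors)      # shape (N, K)
```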

3.3.2. Fisher Encoding

Fisher encoding captures both the first- and second-order differences between the image descriptors and the centers of a GMM. The construction of the encoding begins by learning a GMM model $\theta$. For each $k = 1,\ldots,K$, define the vectors
$$u_k = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ki}\, \Sigma_k^{-\frac{1}{2}} (x_i - \mu_k),$$
$$v_k = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ki} \big[ (x_i-\mu_k)^T \Sigma_k^{-1} (x_i-\mu_k) - 1 \big].$$
The Fisher encoding of the set of local descriptors is then given by the concatenation of $u_k$ and $v_k$ for all $K$ components, giving an encoding of size $2 \times D \times K$:
$$f_{\mathrm{Fisher}} = [u_1^T, \ldots, u_K^T, v_1^T, \ldots, v_K^T]^T.$$
To integrate the Fisher vector as a differentiable module in a neural network, we replace the hard assignment of descriptor $x_i$ to cluster $k$ with a soft assignment
$$a_k(x_i) = \frac{e^{-\alpha \lVert x_i - c_k \rVert^2}}{\sum_{j=1}^{K} e^{-\alpha \lVert x_i - c_j \rVert^2}}. \quad (10)$$
Then, we can write the FV representation as
$$FV_1(j,k) = \sum_{i=1}^{N} a_k(x_i)\, \frac{x_i(j) - c_k(j)}{\sigma_k(j)},$$
$$FV_2(j,k) = \sum_{i=1}^{N} a_k(x_i) \left[ \left( \frac{x_i(j) - c_k(j)}{\sigma_k(j)} \right)^{2} - 1 \right],$$
where $FV_1$ and $FV_2$ capture the first- and second-order statistics, respectively, $x_i(j)$ is the $j$th dimension of the $i$th descriptor, and $c_k(j)$ is the $j$th dimension of the $k$th cluster center. $c_k$ and $\sigma_k$ ($k \in \{1,\ldots,K\}$) are the learnable clusters and the clusters' diagonal covariances. We define $\alpha$ as a positive parameter ranging between 0 and 1.
Let $\omega_k = 2\alpha c_k$ and $b_k = -\alpha \lVert c_k \rVert^2$; Equation (10) can then be written as
$$a_k(x_i) = \frac{e^{\omega_k^T x_i + b_k}}{\sum_{j=1}^{K} e^{\omega_j^T x_i + b_j}}, \quad (12)$$
where $\omega_k$, $b_k$, and $c_k$ are sets of trainable parameters for each cluster $k$.
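A compact PyTorch sketch of such a differentiable FV layer is given below; it follows the 1 × 1 convolution plus soft-max formulation of Equation (12), treats $c_k$ and $\sigma_k$ as learnable parameters, and adds a simple L2 normalization of the output as an extra assumption. It is an illustration of the technique, not the authors' implementation.

```python
# Sketch (PyTorch): a differentiable Fisher-vector aggregation layer.
# Soft assignments come from a 1x1 convolution + soft-max (cf. Eq. (12));
# c_k and sigma_k are learnable cluster centers / diagonal deviations.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FVLayer(nn.Module):
    def __init__(self, dim: int, num_clusters: int = 32):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)     # w_k, b_k
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))   # c_k
        self.sigmas = nn.Parameter(torch.ones(num_clusters, dim))     # sigma_k

    def forward(self, x):
        B, D, H, W = x.shape
        a = F.softmax(self.assign(x), dim=1)            # (B, K, H, W) soft assignments
        a = a.flatten(2)                                # (B, K, N), N = H*W
        desc = x.flatten(2).transpose(1, 2)             # (B, N, D) local descriptors
        diff = desc.unsqueeze(1) - self.centers.unsqueeze(0).unsqueeze(2)  # (B, K, N, D)
        diff = diff / self.sigmas.unsqueeze(0).unsqueeze(2)
        fv1 = (a.unsqueeze(-1) * diff).sum(dim=2)                # (B, K, D) first order
        fv2 = (a.unsqueeze(-1) * (diff ** 2 - 1.0)).sum(dim=2)   # (B, K, D) second order
        f = torch.cat([fv1, fv2], dim=1).flatten(1)              # (B, 2*K*D)
        return F.normalize(f, dim=1)        # L2 normalization (added assumption)
```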

3.3.3. Beyond the FV Aggregation

The source of discontinuity in the traditional Bag of Visual Words (BOW) [41] and the Vector of Locally Aggregated Descriptors (VLAD) [42] is the hard assignment $q_{ki}$ of descriptors $x_i$ to cluster centers $c_k$. To make this operation differentiable, we again replace the hard assignment of descriptor $x_i$ to a cluster with the soft assignment established in Equation (12), obtaining differentiable representations. We denote the resulting modules as the BOW layer and the VLAD layer, respectively. The differentiable BOW and VLAD representations can be written as
$$BOW_k = \sum_{i=1}^{N} a_k(x_i),$$
$$VLAD(j,k) = \sum_{i=1}^{N} a_k(x_i)\,\big(x_i(j) - c_k(j)\big),$$
where $a_k(x_i)$ denotes the membership of descriptor $x_i$ in cluster $k$. BOW is the histogram of the number of image descriptors assigned to each visual word; it therefore produces a $K$-dimensional vector, while VLAD is a simplified non-probabilistic version of the FV and produces a $D \times K$-dimensional vector.
The soft assignment $a_k(x_i)$ can be regarded as a two-step process: (i) perform a $1 \times 1$ convolution with a set of $K$ filters $\omega_k$ and biases $b_k$, producing the output $\omega_k^T x_i + b_k$; (ii) apply a soft-max function to obtain the soft assignment of descriptor $x_i$ to cluster $k$. Notably, for BOW encoding there is no need to store the sum of residuals for each visual word, i.e., the difference vectors between the descriptors and their corresponding cluster centers. A sketch of the resulting BOW and VLAD layers is given below.
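As with the FV layer, the following PyTorch modules are illustrative, assumption-based sketches (not the authors' code) of the differentiable BOW and VLAD aggregations built on the same 1 × 1 convolution soft assignment.

```python
# Sketch (PyTorch): differentiable BOW and VLAD aggregation using the same
# 1x1 convolution + soft-max assignment as the FV layer above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BOWLayer(nn.Module):
    def __init__(self, dim: int, num_clusters: int = 1024):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)    # only w_k, b_k

    def forward(self, x):
        a = F.softmax(self.assign(x), dim=1)          # (B, K, H, W)
        return a.flatten(2).sum(dim=2)                # (B, K) soft histogram

class VLADLayer(nn.Module):
    def __init__(self, dim: int, num_clusters: int = 64):
        super().__init__()
        self.assign = nn.Conv2d(dim, num_clusters, kernel_size=1)    # w_k, b_k
        self.centers = nn.Parameter(torch.randn(num_clusters, dim))  # c_k

    def forward(self, x):
        a = F.softmax(self.assign(x), dim=1).flatten(2)              # (B, K, N)
        desc = x.flatten(2).transpose(1, 2)                          # (B, N, D)
        diff = desc.unsqueeze(1) - self.centers.unsqueeze(0).unsqueeze(2)  # (B, K, N, D)
        vlad = (a.unsqueeze(-1) * diff).sum(dim=2)                   # (B, K, D) residuals
        return vlad.flatten(1)                                       # (B, K*D)
```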
The advantage of the BOW aggregation is that it aggregates the descriptors into a more compact representation, and fewer parameters (only $\omega_k$ and $b_k$) are trained in a discriminative manner. The drawback is that significantly more clusters are needed to obtain a rich representation. VLAD computes the first-order residuals between the descriptors and the cluster centers, making the representation relatively rich while keeping the number of parameters moderate ($\omega_k$, $b_k$, and $c_k$). In contrast, the FV aggregation concatenates both the first- and second-order aggregated residuals, but many more parameters need to be learned, including $\omega_k$, $b_k$, $c_k$, and $\sigma_k$.
As discussed in Section 4.5, we also experimented with average and maximum pooling of the image descriptors $X$. The results show that FV is superior to the reference BOW and VLAD approaches, while simply using average or maximum pooling results in poor performance.

3.4. Classification-Guided Gating Unit and Quality Prediction

We pre-train Sub-Network I to identify the distortion type of the input, and Sub-Network II takes the estimated probability vector $\hat{p}$ from Sub-Network I as part of its input. To introduce this classification prior, a classification-guided gating unit (CGU) is utilized to emphasize informative features and suppress less useful ones. The CGU combines $\hat{p}$ and $f_{\mathrm{Fisher}}$ to produce a score vector $\hat{f}$:
$$\hat{f} = \hat{p} \cdot \sigma\big(\mathbf{W} \cdot f_{\mathrm{Fisher}} + \mathbf{b}\big),$$
where $\sigma$ is a linear mapping and $\mathbf{W}$, $\mathbf{b}$ are learnable parameters. A further mapping then yields the overall quality score $q$; to increase nonlinearity, two fully connected layers are applied as this regression mapping.
For Sub-Network II, the $L_1$ function is used as the empirical loss:
$$\ell = \frac{1}{N}\sum_{i=1}^{N}\big|q_i - \hat{q}_i\big|,$$
where $q_i$ is the MOS of the $i$th image in a mini-batch and $\hat{q}_i$ is the quality score predicted by CGFA-CNN. A sketch of the CGU and this loss follows.
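The sketch below is one reading of the CGU and the L1 training objective; since the exact form of σ and of the two-layer regression is not fully specified above, the sigmoid gate and the hidden width used here are assumptions.

```python
# Sketch (PyTorch): classification-guided gating followed by quality regression.
# p_hat is the distortion-probability vector from Sub-Network I; f_fisher is the
# aggregated feature. The sigmoid gate and hidden size are assumptions.
import torch
import torch.nn as nn

class CGUQualityHead(nn.Module):
    def __init__(self, fisher_dim: int, num_classes: int, hidden: int = 256):
        super().__init__()
        self.gate = nn.Linear(fisher_dim, num_classes)   # W, b of the CGU
        self.regressor = nn.Sequential(                  # two FC layers for the score
            nn.Linear(num_classes, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 1),
        )

    def forward(self, f_fisher, p_hat):
        f_hat = p_hat * torch.sigmoid(self.gate(f_fisher))   # element-wise gating by p_hat
        return self.regressor(f_hat).squeeze(-1)             # predicted quality q

l1_loss = nn.L1Loss()   # empirical L1 loss between predicted scores and MOS
```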

4. Experimental Results and Discussions

4.1. Database Description and Experimental Protocol

(1) IQA databases: The experiments were conducted on three singly distorted synthetic IQA databases, namely LIVE [50], CSIQ [51], and TID2013 [52], and on the authentic LIVE Challenge database [53]. LIVE contains five distortion types, namely JPEG compression (JPEG), JPEG2000 compression (JP2K), white noise (WN), Gaussian blur (GB), and fast-fading error (FF), at 7–8 degradation levels. CSIQ contains six distortion types, namely JPEG compression (JPEG), JPEG2000 compression (JP2K), global contrast decrements (CC), additive pink Gaussian noise (PN), additive white Gaussian noise (WN), and Gaussian blur (GB), at 3–5 degradation levels. TID2013 contains 24 specific distortion types: additive Gaussian noise, additive noise in color components, spatially correlated noise, masked noise, high-frequency noise, impulse noise, quantization noise, Gaussian blur, image denoising, JPEG compression, JPEG2000 compression, JPEG transmission errors, JPEG2000 transmission errors, non-eccentricity pattern errors, local block-wise distortions, mean shift, contrast change, change of color saturation, multiplicative Gaussian noise, comfort noise, lossy compression of noisy images, color quantization with dither, chromatic aberrations, and sparse sampling and reconstruction, which are denoted as #01–#24, respectively.
(2) Evaluation Criteria: Two evaluation criteria are adopted as follows to benchmark BIQA models:
  • Spearman's rank-order correlation coefficient (SRCC) is a nonparametric measure:
    $$\mathrm{SRCC} = 1 - \frac{6\sum_{i} d_i^2}{I\,(I^2 - 1)},$$
    where $I$ is the number of test images and $d_i$ is the rank difference between the MOS and the model prediction of the $i$th image.
  • Pearson linear correlation coefficient (PLCC) measures the linear correlation between the predictions and the subjective scores:
    $$\mathrm{PLCC} = \frac{\sum_{i}(q_i - q_m)(\hat{q}_i - \hat{q}_m)}{\sqrt{\sum_{i}(q_i - q_m)^2}\,\sqrt{\sum_{i}(\hat{q}_i - \hat{q}_m)^2}},$$
    where $q_i$ and $\hat{q}_i$ stand for the MOS and the model prediction of the $i$th image, respectively, and $q_m$ and $\hat{q}_m$ are their means.
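Both criteria can be computed directly with SciPy, as in the short sketch below.

```python
# Sketch: computing SRCC and PLCC between MOS values and model predictions.
import numpy as np
from scipy import stats

def evaluate(mos: np.ndarray, pred: np.ndarray):
    srcc, _ = stats.spearmanr(mos, pred)   # rank-order correlation
    plcc, _ = stats.pearsonr(mos, pred)    # linear correlation
    return srcc, plcc
```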
For the synthetic databases LIVE, CSIQ, and TID2013, we divided the distorted images into two splits of non-overlapping content: 80% were used as fine-tuning samples and the remaining 20% were left as testing samples. For the LIVE Challenge database, the distorted images were likewise divided into two groups, 80% for training and 20% for testing. This random process was repeated ten times, and the average SRCC and PLCC are reported as the final results. Besides, the three synthetic databases were used for cross-database experiments, with one database serving as the training set and another as the testing set.
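A content-disjoint split of this kind can be sketched as follows; the grouping key (the reference-image identifier ref_id) is an assumed field of the sample records, not a name from the paper.

```python
# Sketch: a content-disjoint 80/20 split, repeated over ten random sessions.
# Images are grouped by their reference (pristine) content so that no content
# appears in both the fine-tuning and the testing split.
import random
from collections import defaultdict

def split_by_content(samples, train_ratio=0.8, seed=0):
    """samples: list of (image_path, mos, ref_id) tuples."""
    by_ref = defaultdict(list)
    for s in samples:
        by_ref[s[2]].append(s)
    refs = list(by_ref)
    random.Random(seed).shuffle(refs)
    cut = int(train_ratio * len(refs))
    train = [s for r in refs[:cut] for s in by_ref[r]]
    test = [s for r in refs[cut:] for s in by_ref[r]]
    return train, test

# Ten random sessions; the average SRCC/PLCC over the sessions is reported:
# splits = [split_by_content(samples, seed=k) for k in range(10)]
```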
We compared the proposed CGFA-CNN against several state-of-the-art BIQA methods, including three based on NSS (BRISQUE [19], M3 [20], and FRIQUEE [21]), two based on codebook learning (CORNIA [22] and HOSA [23]), and eight based on CNNs (BIECON [30], dipIQ [31], deepIQA [29], ResNet50+ft [34], MEON [37], DIQA [26], TSCN [25], and DB-CNN [38]). Because the source codes of some methods are not publicly available, we copy the corresponding metrics from their papers.

4.2. Experimental Settings

Parameters in Sub-Network I were initialized by He's method [54], and Adam was adopted as the optimizer with its default parameters and a mini-batch size of 64. The learning rate was decayed logarithmically from $10^{-4}$ to $10^{-6}$ over 30 epochs. The construction details of the pre-training dataset are described in Section 3.1. The dataset was randomly divided into two subsets, 80% for training and 20% for testing. All images were first scaled to $256 \times 256 \times 3$ and then cropped to $224 \times 224 \times 3$ as inputs. The top-1 and top-5 errors were 3.842% and 0.026%, respectively.
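One way to realize such a logarithmic decay from 10^-4 to 10^-6 over 30 epochs is a constant per-epoch multiplicative factor; the sketch below uses PyTorch's ExponentialLR for this purpose (the authors' exact scheduler is not stated).

```python
# Sketch (PyTorch): Adam with a learning rate decayed logarithmically
# from 1e-4 to 1e-6 over 30 epochs (constant multiplicative factor per epoch).
import torch

def make_optimizer_and_scheduler(model, epochs: int = 30,
                                 lr_start: float = 1e-4, lr_end: float = 1e-6):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr_start)
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))   # about 0.853 per epoch
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler
```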
In the fine-tuning phase, the shared layers were directly initialized with the parameters of Sub-Network I. Adam was used as the optimizer with its default parameters for 20 epochs, and the learning rate was set to $10^{-5}$. Except for the LIVE database, images were input without any pre-processing during training with a mini-batch size of 8. Since the LIVE database contains images of different sizes, its images were randomly cropped to $320 \times 320$ during training, and each crop was assigned the quality annotation of the corresponding image. All images were input without any preprocessing during testing. We implemented all of our models in the PyTorch 0.4.1 deep learning framework, and the numerical calculations presented in this paper were performed on the supercomputing system at the Supercomputing Center of Wuhan University. We will release the code at https://github.com/Cwp1107/CGFA-CNN.

4.3. Consistency Experiment

We investigated the effectiveness of CGFA-CNN on LIVE, TID2013, CSIQ and LIVE Challenge databases and the results are presented in Table 2. The results of each specific distortion type on LIVE, CSIQ, and TID2013 databases are reported in Table 3, Table 4 and Table 5. The top three SRCC and PLCC results are highlighted in red, green, and blue, respectively.
Based on the results in Table 2, we make the following observations. First, on LIVE, DIQA [26] achieves state-of-the-art accuracy, surpassing CGFA-CNN by about 0.004 in SRCC and PLCC, and most methods perform strongly; however, their results on CSIQ and TID2013 are rather diverse. Second, CGFA-CNN achieves accuracy comparable to DB-CNN [38] and ResNet50+ft [34] on LIVE Challenge, both of which are pre-trained on the ImageNet [28] database. This suggests that CNNs pre-trained on ImageNet [28] can extract relevant features for authentically distorted images.
Performance on individual distortion types on LIVE, CSIQ, and TID2013 is shown in Table 3, Table 4 and Table 5. On LIVE, we find that CGFA-CNN is superior to the other methods for most distortions except fast-fading error, which is not introduced into the pre-training dataset because there is no open-source implementation or detailed description of it. On CSIQ, CGFA-CNN has obvious advantages compared with the other methods, especially for contrast change and pink noise. On TID2013, CGFA-CNN achieves state-of-the-art performance on 10 of the 24 distortions, and its overall accuracy stands out among the other methods. In addition, we find that CGFA-CNN performs well when a distortion shares similar artifacts with a distortion synthesized in the pre-training dataset. For example, additive Gaussian noise, additive noise in color components, and high-frequency noise are all grainy noise; quantization noise and image color quantization with dither exhibit similar appearances; and Gaussian blur, image denoising, and sparse sampling and reconstruction all introduce blur effects. Therefore, although the pre-training dataset constructed in this paper does not cover all distortion types, CGFA-CNN still achieves impressive gains in performance.

4.4. Cross-Database Experiment

To analyze the generalization ability of the proposed method, we trained CGFA-CNN on one full database and evaluated it on another. Specifically, a model was trained on CSIQ and evaluated on either LIVE or TID2013, and vice versa. The results are reported in Table 6. It can be concluded that CGFA-CNN generalizes well to distortions that have not been seen during training.

4.5. Comparison among Different Experimental Settings

In this section, we first investigate the performance of the different feature aggregation layers considered in this paper as a function of the number of GMM components $K$. Experiments were conducted on LIVE, and the results are shown in Figure 4. We observe that SRCC gradually increases and eventually stabilizes as $K$ increases. Besides, CGFA-CNN FV, CGFA-CNN VLAD, and CGFA-CNN BOW attain highly competitive prediction accuracy when $K$ is set to 32, 64, and 1024, respectively. Overall, CGFA-CNN FV is superior to CGFA-CNN VLAD and CGFA-CNN BOW.
Additionally, we report ablation studies to evaluate the design rationality of CGFA-CNN; the following comparative experiments were conducted: (1) to evaluate the effectiveness of the proposed FV layer, we used maximum pooling (denoted as CGFA-CNN (MaxPool)) and average pooling (denoted as CGFA-CNN (AvgPool)) instead; (2) to examine the validity of the CGU, we predicted the quality score directly by regressing the output feature vector without the CGU (denoted as CGFA-CNN (w/o CGU)); (3) to verify the necessity of hierarchical feature extraction, we extracted features only from high-level convolutional layers (Conv 5-2 of the shared layers and Conv 4-3 of VGG-16) as descriptors (denoted as CGFA-CNN (single feature)); (4) to discuss the optimal settings of the feature aggregation layer, we set BOW with $K = 1024$ (denoted as CGFA-CNN (BOW layer ($K = 1024$))), VLAD with $K = 64$ (denoted as CGFA-CNN (VLAD layer ($K = 64$))), and FV with $K = 32$ (denoted as CGFA-CNN (proposed)); and (5) to demonstrate the contribution of VGG-16 to prediction accuracy on authentic distortions, we included only Sub-Network I pre-trained on the self-built dataset to extract features (denoted as CGFA-CNN (w/o VGG-16)). The results are shown in Table 7. We empirically found that the proposed CGFA-CNN achieves state-of-the-art prediction accuracy on both synthetic and authentic image quality databases. Besides, CGFA-CNN (w/o VGG-16) delivers promising performance only on the synthetic databases, and its results on LIVE Challenge are inferior to CGFA-CNN (proposed), suggesting that authentic distortions cannot be fully modeled by synthetic distortions.

5. Conclusions

In this work, we propose an end-to-end learning framework for BIQA based on classification guidance and feature aggregation, named CGFA-CNN. In the fine-tuning phase, except for the shared convolutional layers, the rest of Sub-Network I only participates in the forward propagation and its parameters are fixed. The fused feature group is aggregated and encoded by the FV layer to obtain a Fisher vector. Then, the Fisher vector is modulated by the CGU to obtain a quality-aware feature, which is mapped to a quality score by the regression model. In the test phase, only forward propagation is required to obtain the quality score. The results on four public IQA databases demonstrate that the proposed method indeed benefits image quality assessment. However, CGFA-CNN is not a unified learning framework because it takes two steps, pre-training and fine-tuning. A promising future direction is to optimize CGFA-CNN for both distortion identification and quality prediction at the same time. For example, an autoencoder could be designed to perform k-means clustering, with a VAE framework introduced for decoding; this approach could replace the two-stage procedure. We also look forward to designing an objective function that could, in principle, reduce the need to rely on external procedures.
CGFA-CNN is versatile and extensible. For example, more distortion types and levels can be added to the pre-training dataset, and the framework can be combined with other backbone networks.

Author Contributions

W.C., C.F. and Y.M. conceived the idea; W.C., C.F. and L.Z. performed the experiments, analyzed the data, and wrote the paper; Y.L. and M.W. developed the proofs. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded partly by the National Key R&D Program of China (Project No. 2017YFC0821603).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, Z.; Sheikh, H.R.; Bovik, A.C. Objective video quality assessment. In The Handbook of Video Databases: Design and Applications; CRC Press: Boca Raton, FL, USA, 2003; Volume 41, pp. 1041–1078. [Google Scholar]
  2. Panetta, K.; Samani, A.; Agaian, S. A Robust No-Reference, No-Parameter, Transform Domain Image Quality Metric for Evaluating the Quality of Color Images. IEEE Access 2018, 6, 10979–10985. [Google Scholar] [CrossRef]
  3. Jian, M.; Ping, A.; Shen, L.; Kai, L. Reduced-Reference Stereoscopic Image Quality Assessment Using Natural Scene Statistics and Structural Degradation. IEEE Access 2017, 6, 2768–2780. [Google Scholar]
  4. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thrity-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  5. Sheikh, H.R.; Bovik, A.C. Image information and visual quality. In Proceedings of the 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, 17–21 May 2004; Volume 3, p. 709. [Google Scholar]
  6. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  7. Chandler, D.M.; Hemami, S.S. VSNR: A wavelet-based visual signal-to-noise ratio for natural images. IEEE Trans. Image Process. 2007, 16, 2284–2298. [Google Scholar] [CrossRef]
  8. Li, Q.; Wang, Z. Reduced-reference image quality assessment using divisive normalization-based image representation. IEEE J. Sel. Top. Signal Process. 2009, 3, 202–211. [Google Scholar] [CrossRef]
  9. Liu, D.; Li, F.; Song, H. Image Quality Assessment Using Regularity of Color Distribution. IEEE Access 2016, 4, 4478–4483. [Google Scholar] [CrossRef]
  10. Wu, J.; Liu, Y.; Li, L.; Shi, G. Attended Visual Content Degradation Based Reduced Reference Image Quality Assessment. IEEE Access 2018, 6, 2169–3536. [Google Scholar] [CrossRef]
  11. Brandão, T.; Queluz, M.P. No-reference image quality assessment based on DCT domain statistics. Signal Process. 2008, 88, 822–833. [Google Scholar] [CrossRef]
  12. Saad, M.A.; Bovik, A.C.; Charrier, C. A DCT statistics-based blind image quality index. IEEE Signal Process. Lett. 2010, 17, 583–586. [Google Scholar] [CrossRef] [Green Version]
  13. Moorthy, A.; Bovik, A. A Two-Step Framework for Constructing Blind Image Quality Indices. IEEE Signal Process. Lett. 2010, 17, 513–516. [Google Scholar] [CrossRef]
  14. Li, J.; Yan, J.; Deng, D.; Shi, W.; Deng, S. No-reference image quality assessment based on hybrid model. Signal Image Video Process. 2017, 11, 985–992. [Google Scholar] [CrossRef]
  15. Ferzli, R.; Karam, L.J. A no-reference objective image sharpness metric based on the notion of just noticeable blur (JNB). IEEE Trans. Image Process. 2009, 18, 717–728. [Google Scholar] [CrossRef]
  16. Wang, Z.; Sheikh, H.R.; Bovik, A.C. No-reference perceptual quality assessment of JPEG compressed images. In Proceedings of the International Conference on Image Processing, Rochester, NY, USA, 22–25 September 2002; Volume 1. [Google Scholar]
  17. Marziliano, P.; Dufaux, F.; Winkler, S.; Ebrahimi, T. Perceptual blur and ringing metrics: Application to JPEG2000. Signal Process. Image Commun. 2004, 19, 163–172. [Google Scholar] [CrossRef] [Green Version]
  18. Bovik, A.C. Automatic prediction of perceptual image and video quality. Proc. IEEE 2013, 101, 2008–2024. [Google Scholar]
  19. Mittal, A.; Moorthy, A.K.; Bovik, A.C. No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 2012, 21, 4695–4708. [Google Scholar] [CrossRef]
  20. Xue, W.; Mou, X.; Zhang, L.; Bovik, A.C.; Feng, X. Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features. IEEE Trans. Image Process. 2014, 23, 4850–4862. [Google Scholar] [CrossRef] [PubMed]
  21. Ghadiyaram, D.; Bovik, A.C. Perceptual quality prediction on authentically distorted images using a bag of features approach. J. Vis. 2017, 17, 32. [Google Scholar] [CrossRef]
  22. Ye, P.; Kumar, J.; Kang, L.; Doermann, D. Unsupervised feature learning framework for no-reference image quality assessment. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 1098–1105. [Google Scholar]
  23. Xu, J.; Ye, P.; Li, Q.; Du, H.; Liu, Y.; Doermann, D. Blind Image Quality Assessment Based on High Order Statistics Aggregation. IEEE Trans. Image Process. 2016, 25, 4444–4457. [Google Scholar] [CrossRef]
  24. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  25. Yan, Q.; Gong, D.; Zhang, Y. Two-Stream Convolutional Networks for Blind Image Quality Assessment. IEEE Trans. Image Process. 2018, 28, 2200–2211. [Google Scholar] [CrossRef]
  26. Kim, J.; Nguyen, A.D.; Lee, S. Deep CNN-based blind image quality predictor. IEEE Trans. Neural Netw. Learn. Syst. 2018, 30, 11–24. [Google Scholar] [CrossRef] [PubMed]
  27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  28. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  29. Bosse, S.; Maniry, D.; Müller, K.R.; Wiegand, T.; Samek, W. Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 2018, 27, 206–219. [Google Scholar] [CrossRef] [Green Version]
  30. Kim, J.; Lee, S. Fully deep blind image quality predictor. IEEE J. Sel. Top. Signal Process. 2017, 11, 206–220. [Google Scholar] [CrossRef]
  31. Ma, K.; Liu, W.; Liu, T.; Wang, Z.; Tao, D. dipIQ: Blind Image Quality Assessment by Learning-to-Rank Discriminable Image Pairs. IEEE Trans. Image Process. 2017, 26, 3951–3964. [Google Scholar] [CrossRef] [Green Version]
  32. Bianco, S.; Celona, L.; Napoletano, P.; Schettini, R. On the use of deep learning for blind image quality assessment. Signal Image Video Process. 2018, 12, 355–362. [Google Scholar] [CrossRef]
  33. Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 1452–1464. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Kim, J.; Zeng, H.; Ghadiyaram, D.; Lee, S.; Zhang, L.; Bovik, A.C. Deep convolutional neural models for picture-quality prediction: Challenges and solutions to data-driven image quality assessment. IEEE Signal Process. Mag. 2017, 34, 130–141. [Google Scholar] [CrossRef]
  35. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Ma, K.; Liu, W.; Zhang, K.; Duanmu, Z.; Wang, Z.; Zuo, W. End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 2018, 27, 1202–1213. [Google Scholar] [CrossRef]
  38. Zhang, W.; Ma, K.; Yan, J.; Deng, D.; Wang, Z. Blind Image Quality Assessment Using A Deep Bilinear Convolutional Neural Network. IEEE Trans. Circ. Syst. Video Technol. 2018. [Google Scholar] [CrossRef] [Green Version]
  39. Ma, K.; Duanmu, Z.; Wu, Q.; Wang, Z.; Yong, H.; Li, H.; Zhang, L. Waterloo exploration database: New challenges for image quality assessment models. IEEE Trans. Image Process. 2017, 26, 1004–1016. [Google Scholar] [CrossRef]
  40. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Int. J. Comput. Vision 2010, 88, 303–338. [Google Scholar] [CrossRef] [Green Version]
  41. Jégou, H.; Douze, M.; Schmid, C.; Pérez, P. Aggregating local descriptors into a compact image representation. In Proceedings of the CVPR 2010-23rd IEEE Conference on Computer Vision & Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3304–3311. [Google Scholar]
  42. Csurka, G.; Dance, C.; Fan, L.; Willamowski, J.; Bray, C. Visual categorization with bags of keypoints. In Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic, 11–14 May 2004; Volume 1, pp. 1–2. [Google Scholar]
  43. Perronnin, F.; Dance, C. Fisher kernels on visual vocabularies for image categorization. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8. [Google Scholar]
  44. Arandjelovic, R.; Gronat, P.; Torii, A.; Pajdla, T.; Sivic, J. NetVLAD: CNN architecture for weakly supervised place recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 40, 5297–5307. [Google Scholar]
  45. Miech, A.; Laptev, I.; Sivic, J. Learnable pooling with context gating for video classification. arXiv 2017, arXiv:1706.06905. [Google Scholar]
  46. Ma, K.; Zeng, K.; Wang, Z. Perceptual Quality Assessment for Multi-Exposure Image Fusion. IEEE Trans. Image Process. 2015, 24, 3345–3356. [Google Scholar] [CrossRef]
  47. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  48. Ma, Y.; Zhang, W.; Yan, J.; Fan, C.; Shi, W. Blind image quality assessment in multiple bandpass and redundancy domains. Digit. Signal Process. 2018, 80, 37–47. [Google Scholar] [CrossRef]
  49. Chatfield, K.; Lempitsky, V.S.; Vedaldi, A.; Zisserman, A. The devil is in the details: An evaluation of recent feature encoding methods. BMVC 2011, 2, 8. [Google Scholar]
  50. Sheikh, H.R.; Sabir, M.F.; Bovik, A.C. A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 2006, 15, 3440–3451. [Google Scholar] [CrossRef]
  51. Larson, E.C.; Chandler, D.M. Most apparent distortion: Full-reference image quality assessment and the role of strategy. J. Electron. Imaging 2010, 19, 011006. [Google Scholar]
  52. Ponomarenko, N.; Ieremeiev, O.; Lukin, V.; Egiazarian, K.; Jin, L.; Astola, J.; Vozel, B.; Chehdi, K.; Carli, M.; Battisti, F.; et al. Color image database TID2013: Peculiarities and preliminary results. In Proceedings of the European Workshop on Visual Information Processing (EUVIP), Paris, France, 10–12 June 2013; pp. 106–111. [Google Scholar]
  53. Ghadiyaram, D.; Bovik, A.C. Crowdsourced study of subjective image quality. In Proceedings of the 2014 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 84–88. [Google Scholar]
  54. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034. [Google Scholar]
Figure 1. Illustration of CGFA-CNN configurations for BIQA, highlighting the feature aggregation layer (denoted as FV layer) and classification-guided gating unit (denoted as CGU). Features are extracted from the distorted image by Sub-Network I.
Figure 2. A comparison of several distortion types identified by Sub-Network I.
Figure 3. The configurations of the proposed FV layer. Convolution kernel size is 1 × 1 .
Figure 4. Relationship between the SRCC on LIVE and the number of GMM components K for different feature aggregation layers.
Table 1. Architecture of Sub-Network I.

| Layer Name | Type | Patch Size | Stride | Output Size |
| Conv 1-1 | Conv + ReLU + BN | 3 × 3 × 48 | 1 | H × W × 48 |
| Conv 2-1 | Conv + ReLU + BN | 3 × 3 × 48 | 2 | H/2 × W/2 × 48 |
| Conv 2-2 | Conv + ReLU + BN | 3 × 3 × 64 | 1 | H/2 × W/2 × 64 |
| Conv 3-1 | Conv + ReLU + BN | 3 × 3 × 64 | 2 | H/4 × W/4 × 64 |
| Conv 3-2 | Conv + ReLU + BN | 3 × 3 × 64 | 1 | H/4 × W/4 × 64 |
| Conv 4-1 | Conv + ReLU + BN | 3 × 3 × 64 | 2 | H/8 × W/8 × 64 |
| Conv 4-2 | Conv + ReLU + BN | 3 × 3 × 128 | 1 | H/8 × W/8 × 128 |
| Conv 5-1 | Conv + ReLU + BN | 3 × 3 × 128 | 2 | H/16 × W/16 × 128 |
| Conv 5-2 | Conv + ReLU + BN | 3 × 3 × 128 | 1 | H/16 × W/16 × 128 |
| Pool | MaxPool | 1 × 1 × 128 | 1 | 1 × 1 × 128 |
| FC-1 | FC + ReLU | 1 × 1 × 256 | 1 | 1 × 1 × 256 |
| FC-2 | FC + ReLU | 1 × 1 × 256 | 1 | 1 × 1 × 256 |
| FC-3 | FC | 1 × 1 × 41 | 1 | 1 × 1 × 41 |
| Classifier | Soft-max | 1 × 1 × 41 | 1 | 1 × 1 × 41 |
Table 2. Performance comparison on the four IQA databases by two evaluation criteria (SRCC / PLCC). The top three SRCC and PLCC results are highlighted in red, green, and blue, respectively.

| Method | LIVE (SRCC / PLCC) | CSIQ (SRCC / PLCC) | TID2013 (SRCC / PLCC) | LIVE Challenge (SRCC / PLCC) |
| BRISQUE [19] | 0.940 / 0.945 | 0.777 / 0.817 | 0.573 / 0.651 | 0.603 / 0.641 |
| M3 [20] | 0.950 / 0.954 | 0.804 / 0.835 | 0.679 / 0.705 | 0.595 / 0.620 |
| FRIQUEE [21] | 0.948 / 0.955 | 0.844 / 0.889 | 0.668 / 0.705 | 0.694 / 0.710 |
| CORNIA [22] | 0.943 / 0.946 | 0.730 / 0.800 | 0.550 / 0.613 | 0.618 / 0.665 |
| BIECON [30] | 0.958 / 0.960 | 0.815 / 0.823 | 0.717 / 0.762 | 0.595 / 0.613 |
| deepIQA [29] | 0.960 / 0.972 | 0.803 / 0.821 | 0.671 / 0.680 | – |
| ResNet50+ft [34] | 0.950 / 0.954 | 0.876 / 0.905 | 0.712 / 0.756 | 0.819 / 0.849 |
| DIQA [26] | 0.975 / 0.977 | 0.884 / 0.915 | 0.825 / 0.850 | 0.703 / 0.704 |
| TSCN [25] | 0.969 / 0.972 | – | – | – |
| DB-CNN [38] | 0.968 / 0.971 | 0.946 / 0.959 | 0.816 / 0.865 | 0.851 / 0.869 |
| CGFA-CNN | 0.971 / 0.973 | 0.953 / 0.965 | 0.841 / 0.858 | 0.837 / 0.846 |
Table 3. Average SRCC and PLCC results of individual distortion types across ten sessions on the LIVE database. The top three SRCC and PLCC results are highlighted in red, green, and blue, respectively.

| SRCC | JPEG | JP2K | WN | GB | FF |
| BRISQUE [19] | 0.965 | 0.929 | 0.982 | 0.964 | 0.828 |
| M3 [20] | 0.966 | 0.930 | 0.986 | 0.935 | 0.902 |
| FRIQUEE [21] | 0.947 | 0.919 | 0.983 | 0.937 | 0.884 |
| CORNIA [22] | 0.947 | 0.924 | 0.958 | 0.951 | 0.921 |
| HOSA [23] | 0.954 | 0.935 | 0.975 | 0.954 | 0.954 |
| dipIQ [31] | 0.969 | 0.956 | 0.975 | 0.940 | – |
| DIQA [26] | 0.961 | 0.976 | 0.988 | 0.962 | 0.912 |
| TSCN [25] | 0.966 | 0.950 | 0.979 | 0.963 | 0.911 |
| DB-CNN [38] | 0.972 | 0.955 | 0.980 | 0.935 | 0.930 |
| CGFA-CNN | 0.973 | 0.975 | 0.986 | 0.968 | 0.912 |

| PLCC | JPEG | JP2K | WN | GB | FF |
| BRISQUE [19] | 0.971 | 0.940 | 0.989 | 0.965 | 0.894 |
| M3 [20] | 0.977 | 0.945 | 0.992 | 0.947 | 0.920 |
| FRIQUEE [21] | 0.955 | 0.935 | 0.991 | 0.949 | 0.943 |
| CORNIA [22] | 0.962 | 0.944 | 0.974 | 0.961 | 0.943 |
| HOSA [23] | 0.967 | 0.949 | 0.983 | 0.967 | 0.967 |
| dipIQ [31] | 0.980 | 0.964 | 0.983 | 0.948 | – |
| DIQA [26] | – | – | – | – | – |
| TSCN [25] | 0.966 | 0.963 | 0.995 | 0.950 | 0.949 |
| DB-CNN [38] | 0.986 | 0.967 | 0.988 | 0.956 | 0.961 |
| CGFA-CNN | 0.972 | 0.976 | 0.981 | 0.974 | 0.947 |
Table 4. Average SRCC and PLCC results of individual distortion types across ten sessions on the CSIQ database. The top three SRCC and PLCC results are highlighted in red, green, and blue, respectively.

| SRCC | JPEG | JP2K | WN | GB | PN | CC |
| BRISQUE [19] | 0.806 | 0.840 | 0.732 | 0.820 | 0.378 | 0.804 |
| M3 [20] | 0.740 | 0.911 | 0.741 | 0.868 | 0.663 | 0.770 |
| FRIQUEE [21] | 0.869 | 0.846 | 0.748 | 0.870 | 0.753 | 0.838 |
| CORNIA [22] | 0.513 | 0.831 | 0.664 | 0.836 | 0.493 | 0.462 |
| HOSA [23] | 0.733 | 0.818 | 0.604 | 0.841 | 0.500 | 0.716 |
| dipIQ [31] | 0.936 | 0.944 | 0.904 | 0.932 | – | – |
| MEON [37] | 0.948 | 0.898 | 0.951 | 0.918 | – | – |
| DIQA [26] | 0.835 | 0.931 | 0.927 | 0.893 | 0.870 | 0.718 |
| DB-CNN [38] | 0.940 | 0.953 | 0.948 | 0.947 | 0.940 | 0.870 |
| CGFA-CNN | 0.950 | 0.939 | 0.956 | 0.941 | 0.952 | 0.897 |

| PLCC | JPEG | JP2K | WN | GB | PN | CC |
| BRISQUE [19] | 0.828 | 0.887 | 0.742 | 0.891 | 0.496 | 0.835 |
| M3 [20] | 0.768 | 0.928 | 0.728 | 0.917 | 0.717 | 0.787 |
| FRIQUEE [21] | 0.885 | 0.883 | 0.778 | 0.905 | 0.769 | 0.864 |
| CORNIA [22] | 0.563 | 0.883 | 0.778 | 0.905 | 0.632 | 0.543 |
| HOSA [23] | 0.759 | 0.899 | 0.656 | 0.912 | 0.601 | 0.744 |
| dipIQ [31] | 0.975 | 0.959 | 0.927 | 0.958 | – | – |
| MEON [37] | 0.979 | 0.925 | 0.958 | 0.846 | – | – |
| DIQA [26] | – | – | – | – | – | – |
| DB-CNN [38] | 0.982 | 0.971 | 0.956 | 0.969 | 0.950 | 0.895 |
| CGFA-CNN | 0.972 | 0.953 | 0.969 | 0.955 | 0.942 | 0.893 |
Table 5. Average SRCC results of individual distortion types across ten sessions on the TID2013 database. The top three SRCC results are highlighted in red, green, and blue, respectively.

| Method | #01 | #02 | #03 | #04 | #05 | #06 | #07 | #08 | #09 | #10 | #11 | #12 |
| BRISQUE [19] | 0.852 | 0.709 | 0.491 | 0.575 | 0.753 | 0.630 | 0.798 | 0.813 | 0.586 | 0.852 | 0.893 | 0.315 |
| M3 [20] | 0.748 | 0.591 | 0.769 | 0.491 | 0.875 | 0.693 | 0.833 | 0.878 | 0.721 | 0.823 | 0.872 | 0.400 |
| FRIQUEE [21] | 0.730 | 0.573 | 0.866 | 0.345 | 0.345 | 0.847 | 0.730 | 0.764 | 0.881 | 0.839 | 0.813 | 0.498 |
| CORNIA [22] | 0.756 | 0.750 | 0.727 | 0.726 | 0.769 | 0.767 | 0.016 | 0.921 | 0.832 | 0.874 | 0.910 | 0.686 |
| HOSA [23] | 0.833 | 0.551 | 0.842 | 0.468 | 0.897 | 0.809 | 0.815 | 0.883 | 0.854 | 0.891 | 0.730 | 0.710 |
| MEON [37] | 0.813 | 0.722 | 0.926 | 0.728 | 0.911 | 0.901 | 0.888 | 0.887 | 0.797 | 0.860 | 0.891 | 0.746 |
| DIQA [26] | 0.915 | 0.755 | 0.878 | 0.734 | 0.939 | 0.843 | 0.858 | 0.920 | 0.788 | 0.892 | 0.912 | 0.861 |
| DB-CNN [38] | 0.790 | 0.700 | 0.826 | 0.646 | 0.879 | 0.708 | 0.825 | 0.859 | 0.865 | 0.894 | 0.916 | 0.772 |
| CGFA-CNN | 0.812 | 0.804 | 0.851 | 0.845 | 0.910 | 0.794 | 0.867 | 0.933 | 0.866 | 0.914 | 0.922 | 0.763 |

| Method | #13 | #14 | #15 | #16 | #17 | #18 | #19 | #20 | #21 | #22 | #23 | #24 |
| BRISQUE [19] | 0.359 | 0.145 | 0.224 | 0.124 | 0.040 | 0.109 | 0.724 | 0.008 | 0.685 | 0.764 | 0.616 | 0.784 |
| M3 [20] | 0.731 | 0.190 | 0.318 | 0.119 | 0.224 | −0.121 | 0.701 | 0.202 | 0.664 | 0.886 | 0.648 | 0.915 |
| FRIQUEE [21] | 0.660 | 0.076 | 0.032 | 0.254 | 0.585 | 0.589 | 0.704 | 0.318 | 0.641 | 0.768 | 0.737 | 0.891 |
| CORNIA [22] | 0.805 | 0.286 | 0.219 | 0.065 | 0.182 | 0.081 | 0.644 | 0.534 | 0.862 | 0.272 | 0.792 | 0.862 |
| MEON [37] | 0.716 | 0.116 | 0.500 | 0.177 | 0.252 | 0.684 | 0.849 | 0.406 | 0.772 | 0.857 | 0.779 | 0.855 |
| DIQA [26] | 0.812 | 0.659 | 0.407 | 0.299 | 0.687 | −0.151 | 0.904 | 0.655 | 0.930 | 0.936 | 0.756 | 0.909 |
| DB-CNN [38] | 0.773 | 0.270 | 0.444 | 0.646 | 0.548 | 0.631 | 0.711 | 0.752 | 0.860 | 0.833 | 0.732 | 0.902 |
| CGFA-CNN | 0.757 | 0.335 | 0.649 | 0.441 | 0.573 | 0.657 | 0.819 | 0.785 | 0.897 | 0.940 | 0.711 | 0.938 |
Table 6. SRCC comparison in the cross-database setting (training database → testing database). The top three SRCC results are highlighted in red, green, and blue, respectively.

| Method | CSIQ → LIVE | CSIQ → TID2013 | TID2013 → LIVE | TID2013 → CSIQ |
| BRISQUE [19] | 0.847 | 0.454 | 0.790 | 0.590 |
| M3 [20] | 0.797 | 0.328 | 0.873 | 0.605 |
| FRIQUEE [21] | 0.879 | 0.463 | 0.755 | 0.635 |
| CORNIA [22] | 0.853 | 0.312 | 0.846 | 0.672 |
| HOSA [23] | 0.773 | 0.329 | 0.594 | 0.462 |
| DB-CNN [38] | 0.877 | 0.540 | 0.891 | 0.807 |
| CGFA-CNN | 0.891 | 0.533 | 0.898 | 0.774 |
Table 7. SRCC with different settings.

| Method | LIVE | CSIQ | TID2013 | LIVE Challenge |
| CGFA-CNN (MaxPool) | 0.915 | 0.893 | 0.778 | 0.766 |
| CGFA-CNN (AvgPool) | 0.909 | 0.876 | 0.755 | 0.761 |
| CGFA-CNN (w/o CGU) | 0.948 | 0.919 | 0.783 | 0.799 |
| CGFA-CNN (single feature) | 0.931 | 0.890 | 0.757 | 0.765 |
| CGFA-CNN (BOW layer (K = 1024)) | 0.955 | 0.936 | 0.808 | 0.791 |
| CGFA-CNN (VLAD layer (K = 64)) | 0.966 | 0.945 | 0.819 | 0.810 |
| CGFA-CNN (w/o VGG-16) | 0.970 | 0.950 | 0.836 | 0.672 |
| CGFA-CNN (proposed) | 0.973 | 0.953 | 0.841 | 0.837 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
