Article

Fuzzy Windows with Gaussian Processed Labels for Ordinal Image Scoring Tasks

1 Department of Cybernetics, Czech Technical University in Prague, 166 36 Prague, Czech Republic
2 School of Computing and Mathematical Sciences, University of Leicester, Leicester LE1 7RH, UK
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 4019; https://doi.org/10.3390/app13064019
Submission received: 19 February 2023 / Revised: 15 March 2023 / Accepted: 20 March 2023 / Published: 22 March 2023

Abstract

In this paper, we propose a Fuzzy Window with Gaussian Processed Labels (FW-GPL) method to mitigate the overlap problem between neighboring ordinal categories when scoring images. Many published conventional methods treat this challenge as a traditional regression problem and make a strong assumption that each ordinal category owns an adequate intrinsic rank to outline its distribution. Our FW-GPL method aims to refine the ordinal label pattern by using two novel techniques: (1) assembling fuzzy logic into the fully connected layer of convolutional neural networks and (2) transforming the ordinal labels with a Gaussian process. Specifically, it incorporates a heuristic fuzzy logic derived from the ordinal characteristic and simultaneously plugs in ordinal distribution shapes that penalize the difference between the targeted label and its neighbors to ensure a concentrated regional distribution. Accordingly, these proposed windows are leveraged to minimize the influence of majority classes that mislead the prediction of minority samples. Our model is also designed to cope with partially missing segments of the continuous facial-age range and performs competitively when using the whole continuous facial-age dataset. Extensive experimental results on three facial-aging datasets and one ambiguous medical dataset demonstrate that our FW-GPL achieves compelling performance compared to the State-Of-The-Art (SOTA).

1. Introduction

Image scoring, typically known as ordinal classification, is a supervised learning problem that aims to predict a discrete set of ordinal labels. The main difference from a standard classification task is that the categories are related in a natural or implied order. For example, apparent age group estimation grades face images on an ordinal scale: “infants”, “children”, “teenagers”, “youth”, “young adults”, “adults”, “middle-aged”, and “aged”. Ordinal classification can be viewed as a special case of metric regression where the regression targets are discrete and finite. The differences in features between adjacent labels are not always equal; for example, the difference in facial features between “infants” and “children” is more obvious than that between “young adults” and “adults”. However, if the ordinal relationship of the labels is ignored, the ordinal regression problem reduces to a simple multi-class classification issue. As shown in Figure 1, a common problem when learning from ordinal images is that the ambiguity between two neighboring categories usually has a negative effect on training convergence. Therefore, the performance of the learned model tends to degrade across ordinal classes. This challenge has motivated us to develop a robust ordinal image classification approach for analyzing ordinal data.
Ordinal image classification approaches, or ordinal models [1], can be roughly divided into two groups: Single Label Learning with Specific Loss (SLL-Loss) [2,3,4,5] and Label Distribution Based Learning (LDBL) [2,6,7,8,9,10,11,12,13]. SLL-Loss methods typically rely on processing each facial image independently. This ignores the gradual changes of human faces, and facial appearance is thus usually ambiguous with respect to adjacent age classes. LDBL methods tend to map the ordinal ground truth to a Gaussian or Gaussian-like label distribution, but in such long-tailed cases they also ignore the processing of ordinal neighbors and overlapping features.
To address the ambiguous and overlapping features in ordinal data, we propose a Fuzzy Window with Gaussian Processed Labels (FW-GPL) approach to the ordinal classification issue. This also aims to stretch the semantic margins (or enlarge the interclass variance) of ordinal classes, which represent the shared features of neighboring categories. As shown in Figure 1, we assume that two neighboring ordinal classes have a closely shared feature region, which can increase the difficulty of the ordinal classification task. A fuzzy window with Gaussian processed labels is carefully designed on top of deep neural networks so as to reduce the effect of the overlapping features while preserving the age distribution information. In Figure 2, our proposed FW-GPL is composed of two crucial branches: a defuzzifier window and a learning strategy using Gaussian processed labels. The defuzzifier window attempts to reduce the influence that the ambiguously overlapping features have on ordinal neighbors and simultaneously tries to retain the internal ordinal features that represent the real category.
Practically, the Gaussian processed labels allow the incorporation of prior knowledge (e.g., Gaussian-like age distribution) to concentrate on the major class and weaken the influence of remote neighbor classes. To validate the effectiveness of our proposed method, we perform extensive experiments on three widely-used face-aging datasets, including MORPH II [14], FG-NET [15], and CACD [16], as well as one medical ordinal dataset: Curated Breast Imaging Subset of Digital Database of Screening Mammography (CBIS-DDSM). This paper proposes a novel Fuzzy Window with a Gaussian Processed Label (FW-GPL) method for ordinal image scoring, and we achieve a competitive performance compared to State-Of-The-Art (SOTA) methods. The main contributions of this work are summarized as follows:
  • We directly face the internal challenge of the ordinal image classification task and clarify the theoretical reason why the ordinal image classification task is difficult.
  • FW-GPL can indirectly reduce the influence of the overlapping features among the ordinal neighbor classes. This process can effectively improve the scoring performance of the ordinal images.
  • When the ordinal sequence of the images is not consecutive, FW-GPL can achieve performance equivalent to that obtained on fully sequential ordinal data by setting a proper length for the fuzzy window.

2. Related Work

The objective of this learning architecture for the ordinal regression problem is to weaken the influence of the overlapping features $F = \{f_1, f_2, \ldots, f_\epsilon\}$ extracted from the neighboring ordinal categories $C = \{C_1, C_2, \ldots, C_i, \ldots, C_K\}$ ($\epsilon$ is the number of quantized features, and $K$ is the number of categories). Each $C_i$ is an ordinal category containing overlapping features with its neighbors, $\{C_{i-a}, C_{i-a+1}, \ldots, C_{i-1}\}$ and $\{C_{i+1}, \ldots, C_{i+b-1}, C_{i+b}\}$, where the values $a$ and $b$ reflect the relationship between the feature strength of the specific category $C_i$ and its closeness to the neighboring categories (in prior work [17,18], the boundary of the window is $\{a, b\}$; in this paper, we also set the upper bound of the window as $a$ and the lower bound of the window as $b$). Moreover, Gaussian processed labels can prevent extracted features from roughly slipping into one category; that is, they allow neighboring ordinal categories to be divided carefully according to their shared overlapping features.

2.1. Ordinal Classification

In the machine learning field, ordinal classification models are often reassembled by reformulating the problem so as to utilize multiple binary classifiers [19]. Some earlier studies constructed Convolutional Neural Networks (CNNs) [20,21] that replace the last layer of the ordinal classification model with a number of binary classifiers [22]. In this Ordinal Regression CNN (OR-CNN) architecture, the ordinal classification problem is converted into $K$ binary classification tasks. If the maximum value of the ordinal label is $K$, we rearrange the labels with the set $k = \{0, 1, \ldots, K-1\}$ and define each binary classifier as predicting whether the output is greater than $k$ or not. All $K$ binary tasks share the same intermediate layers, but they are assigned distinct weight parameters in the output layer [23]. This OR-CNN architecture relies heavily on the ordinal continuity of the data. If the training dataset has insufficient and intermittent input ordinal labels, or if the dataset has missing data (for example, 150-year-old facial-age data), the fitted OR-CNN cannot recognize the intermittent or missing segment, which inevitably leads to a classification failure.
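As a rough illustration of this K-rank scheme, the following numpy sketch encodes an ordinal label into binary targets and decodes predictions back into a rank; the function names and the 0-based label scale are our own choices, not code from OR-CNN.

```python
import numpy as np

def to_krank_targets(labels, k_max):
    # Binary target t_k = 1 if label > k, for k = 0 .. k_max-1 (the K-rank scheme above).
    labels = np.asarray(labels)
    thresholds = np.arange(k_max)
    return (labels[:, None] > thresholds).astype(np.float32)

def from_krank_probs(probs):
    # Decoded rank = number of binary classifiers predicting "greater than k".
    return (np.asarray(probs) > 0.5).sum(axis=1)

# toy usage on a 0..9 label scale
targets = to_krank_targets([3, 7], k_max=10)
print(targets[0])                  # [1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(from_krank_probs(targets))   # [3 7]
```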

2.2. Windows for Ordinal Classification

MWR [18] uses five neurons and a local window to estimate facial age. It proposes the notion of relative rank (ρ-rank), a new order representation scheme for input and reference instances. The relative rank is estimated iteratively by selecting two reference instances to form a search window and then estimating the ρ-rank within the window. In other words, MWR applies two overlapping windows with reference centers to limit the influence of the relative rank (or “intrinsic rank”), and it also uses a search process to find the most proper position of the centers to reduce the influence of the overlapping “rank”. This has inspired us to develop a fuzzy window that can reduce the overlapping features of neighboring ordinal classes.

2.3. Fuzzy Scoring for Ordinal Classification

Before fuzzy logic was used to separate the feature adhesion between two neighboring categories, an OR-CNN was typically designed for age estimation [24]. There is an expectation layer that takes the predicted distribution and the label set as input and emits its expectation:
$$\tilde{y} = \sum_{k=0}^{K-1} P_k\, l_k,$$
where $P_k$ denotes the predicted probability that the input image belongs to label $l_k$. Given an input image, the expectation regression module minimizes the error between the expected value $\tilde{y}$ and the ground truth $y_{true}$. We use the loss below as the error measurement:
$$\mathrm{Loss}_{err} = |\tilde{y} - y_{true}|,$$
where | · | denotes absolute value. Note that this module does not introduce any new parameters. OR-CNN adopts a general image classification framework that maximizes the probability of the ground-truth class during training. However, because each class is naturally influenced by its neighbors (in Figure 3, we can see that the 20–39 age group has a feature overlap with the 40–59 age group), the training would become unstable.
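A minimal PyTorch sketch of this expectation regression module (the expectation followed by the L1 error defined above); the function and variable names are our own:

```python
import torch

def expectation_regression_loss(logits, label_values, y_true):
    # Expectation of the discrete label values under the predicted distribution,
    # followed by the L1 error against the ground truth.
    probs = torch.softmax(logits, dim=1)
    y_hat = (probs * label_values).sum(dim=1)
    return (y_hat - y_true).abs().mean()

# toy usage: one sample, three ordinal classes with label values 0, 1, 2
logits = torch.tensor([[0.2, 2.0, 0.1]])
label_values = torch.tensor([0.0, 1.0, 2.0])
print(expectation_regression_loss(logits, label_values, torch.tensor([1.0])))
```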
When using fuzzy logic to solve the ordinal regression problem, one strategy achieved outstanding performance by extracting a set of fuzzy rules from an example set and using it as the basic model with a genetic algorithm [25]. Moreover, a method based on monotonicity indexes, an evolutionary fuzzy systems algorithm, was used for ordinal classification and ordinal regression tasks [26]. However, there is no common approach that can cover most ordinal image classification problems because most researchers prefer to develop particular methods or systems to target specific problems.
Inspired by Deep Expectation (DEX), a fuzzy scoring method was used to reduce the influence of the tails and the shared features in each class and thus weaken the feature overlaps during training [17]. Our previous work proposed a fuzzy window that focused on softly pulling the shared features to the optimal position by balancing the position of features against their distance from the center. Under the conditions of Figure 1a, we set the length of the fuzzy window to 3, and the ascending (or descending) trend of the high (or low) score was 1. Alternatively, under the condition of Figure 1c, in order to reduce the influence of the overlapping features, we set the length of the fuzzy window to 5; with this setting, the ascending (or descending) trend of the high (or low) score was 2. Eventually, the output value modified by the fuzzy window tended to slip toward the global average position, and the redistributed probabilities were optimized as follows:
$$\tilde{P}(x_i \mid y_i = i) = \frac{|i - \tilde{V}_o|}{b - a} \times \sum_{j=i-a}^{i+b} \frac{e^{E(y_j, x_j)}}{\sum_{y_1}^{y_K} e^{E(y_j, x_j)}},$$
where $b$ is the upper bound of the fuzzy window, and $a$ is the lower bound. $\tilde{P}$ is the probability after using fuzzy windows, and $E(y_j, x_j)$ is the expectation that $x_j$ is predicted as $y_j$. $\tilde{V}_o$ is used to reduce the conglutination between two neighboring or remote classes and is calculated with:
$$\tilde{V}_o(x_j \mid y_j = j) = \frac{j \times P(x_j \mid y_j)}{\sum_{j=i-a}^{i+b} P(x_j \mid y_j)}.$$
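As an informal illustration, the following numpy sketch computes a probability-weighted center of the window around class $i$, which is one plausible reading of the $\tilde{V}_o$ term that enters the distance $|i - \tilde{V}_o|$ above; the indexing, the aggregation over the window, and the function name are our own assumptions rather than the authors' code.

```python
import numpy as np

def window_center(probs, i, a, b):
    # Probability-weighted center of the window {i-a, ..., i+b} around class i.
    j = np.arange(i - a, i + b + 1)
    p = probs[j]
    return float((j * p).sum() / p.sum())

probs = np.array([0.05, 0.10, 0.20, 0.35, 0.20, 0.07, 0.03])
print(window_center(probs, i=3, a=2, b=2))   # close to 3 when the mass is roughly symmetric
```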

2.4. Soft Labels and Gaussian Processes

Hard labels. This type of label is traditionally used as a one-hot vector. For example, the encoding $h_i = [0; 1; 0]$ means that $x_i$ is annotated to be the second class and $y_i = 2$. However, hard labels are poorly suited to classifying ordinal images, as many of them are ambiguous or have an unclear “borderline”, which makes it difficult to decide which class they should belong to [27]. Hard labels tend to create an artificial gap that rigidly defines the borderline, and this intrinsic drawback can reduce the ability of the network to adapt [28].
Soft labels. By contrast, soft labels annotate categories by representing the corresponding classes with a probability vector. For example, the encoding $h_i = [0.1; 0.7; 0.2]$ indicates that $P(Y = 2 \mid X = x_i) = 0.7$, so the value of the true label has been switched from 1 to 0.7 [27]. Instead of using a single bit, soft labels with probabilities can provide extra information to the training models [29]. Meanwhile, they carry inherited information that can resist disturbance during inference [30,31,32].
Gaussian Processes. Gaussian process approaches for ordinal regression have been studied based on support vector machines [33], deep neural networks [7], and deep learning models with Gaussian distribution labels [34,35,36]. One partial-label machine learning study used a Gaussian Processed (GP) approach to disambiguate the vague labeling information conveyed by the training data [8]. It assumed that there was already an unobservable latent function depending on the Gaussian process in the feature space of each class label. The essential problem, however, is that manually annotated ambiguous labels are ignored, because a Gaussian distribution alone cannot always represent realistic labels without logical deblurring.
For facial-age detection, many technical articles, for example, ranking CNNs [22,37,38,39], Deep Label Distribution Learning (DLDL) [9,30], and DLDL-V2 [2], implicitly adopt the label distribution learning method and assume that the data distribution is Gaussian-like.
The distribution of facial ages can be represented by a Gaussian distribution for which a lookup table is generated beforehand to store multi-part integrals [40]. These integrals can explain the probability of whether an input image belongs to the true chronological age of a given person for whom multiple age samples have been provided. In [10], label distribution learning with a normal distribution variance $\sigma$ was used, and $p_\mu(y, \sigma)$ was proposed to represent the $k$-th ($k \in [0, 99]$) element of $p(y, \sigma)$:
$$p_\mu(y, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(y - \mu)^2}{2\sigma^2}},$$
where $p_\mu$ is the probability that the true age is $\mu$ years old. It represents the connection between the classes $\mu$ and $y$ in a normal distribution view. The optimal $\sigma$ in each iteration depends on the optimal model parameter $\theta^*$:
$$\theta^*(\sigma) = \arg\min_\theta L_{KL}(H, y_{true}, \theta, \sigma),$$
where $L_{KL}(H, y_{true}, \theta^*, \sigma)$ denotes the training loss, $H$ is the training input image, $y_{true}$ is its label, and $KL$ is the Kullback–Leibler divergence.
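To make this label construction concrete, here is a small numpy sketch of a Gaussian label distribution of this form; re-normalizing it to sum to 1 is our own choice rather than part of the definition in [10].

```python
import numpy as np

def gaussian_label_distribution(mu, sigma, n_classes=100):
    # Gaussian density over ages 0..n_classes-1 centered at the true age mu,
    # then re-normalized so the soft label vector sums to 1.
    ages = np.arange(n_classes)
    p = np.exp(-((ages - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
    return p / p.sum()

dist = gaussian_label_distribution(mu=30, sigma=2.0)
print(dist[27:34].round(3))   # mass concentrated around age 30
```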

3. Our Method

For ordinal regression, the most effective and popular method is using multiple binary classifiers to determine the ordinal category for each input (the K-rank approach) [2,4,30]. But the fundamental principle of this strategy is the consistency of the ordinal regression data [22]. In this section, we propose a simple and intuitive method that frames ordinal regression as a traditional classification problem, uses the Gaussian processed labels to stretch the shared features between two ordinal neighbors, and, finally, combines these Gaussian processed labels with a fuzzy window [17] to stabilize the weights on shared features.

3.1. Normalized Gaussian Processed Labels

After we set the equivalent double wings of the fuzzy window, which means $i - a = b - i$, we get the fuzzy $window = \{win_1 = x_{i-a}, \ldots, win_{a+1} = x_i, \ldots, win_{a+b+1} = x_{i+b}\}$. The true label is defined as:
$$Label(x_i \mid y_i = i) = \begin{cases} 0 & \text{for } x_i \neq i \\ 1 & \text{for } x_i = i, \end{cases}$$
and then the Gaussian processed label $Label_G$ can be:
$$Label_G(x_i \mid y_i = i) = \begin{cases} 0 & \text{for } x_i \notin window \\ \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x_i - \mu)^2}{2\sigma^2}} & \text{for } x_i \in window, \end{cases}$$
where $x_i$ is predicted to be $y_i$, $y_i$ is the annotated label, $i$ is the ordinal number, $\mu$ is the serially ordinal number of the true label, and $\sigma$ is the standard deviation. Here, we set $\sigma = 1/\sqrt{2\pi} \approx 0.4$ so as to ensure that $Label_G$ equals 1 when $x_i = \mu$.
In Table 1, we illustrate the essential difference between using GP labels and traditional original labels. We assume there are seven categories in this example, and the output probabilities of these seven categories are artificially designed for illustration. In Table 1, the traditional back-propagation error vector (errors of the output layer = probability outputs − original labels) should be $[0.19, 0.1, 0.01, -0.6, 0.18, 0.09, 0.03]$. With the traditional original labels, only one negative error results from the back-propagation calculation. If, however, we apply the GP labels to the back-propagation processing, the original hard label vector switches from $[0, 0, 0, 1, 0, 0, 0]$ to the soft label vector $[0, 0.07, 0.14, 1, 0.14, 0.07, 0]$. After using GP labels, the back-propagation error vector ($Errors_G$ of the output layer = probability outputs − Gaussian labels) becomes $[0.19, 0.03, -0.13, -0.6, 0.04, 0.02, 0.03]$. The output probability of $C_3$ is lower than a systematic value (here, we assume this value was generated from the Gaussian function), so there are two negative errors, which are used for back-propagation in the next step.
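A short numpy check of the Table 1 numbers (our own illustration; the GP label vector is the Gaussian window rescaled so that its center equals 1):

```python
import numpy as np

# Artificial example from Table 1; the true label is C4.
outputs   = np.array([0.19, 0.10, 0.01, 0.40, 0.18, 0.09, 0.03])
hard      = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
gp_labels = np.array([0.0, 0.07, 0.14, 1.0, 0.14, 0.07, 0.0])

print((outputs - hard).round(2))        # one negative error:  [ 0.19  0.1   0.01 -0.6   0.18  0.09  0.03]
print((outputs - gp_labels).round(2))   # two negative errors: [ 0.19  0.03 -0.13 -0.6   0.04  0.02  0.03]
```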
The ordinal vector is $Ordinal = \{1, 2, \ldots, n\}$, and $m$ is the total number of ordinal categories. Because we used cross-entropy as the loss function, the back-propagation error between the output and the last layer after using $Label_G$ was:
$$g(L) = |P \times Ordinal - y_i|.$$
The gradient of the weight from the $\alpha$-th neuron in layer $L-1$ to the $\beta$-th neuron in layer $L$ after using $Label_G$ was:
$$g(L-1) = \begin{cases} P \times \frac{\partial E_m}{\partial W_{L-1}^t(\alpha, \beta)} & \text{for } x_i \notin window \\ g(L) \times \frac{\partial E_m}{\partial W_{L-1}^t(\alpha, \beta)} & \text{for } x_i \in window, \end{cases}$$
where $E_m$ is the expectation output of the $m$-th category, and $W_{L-1}^t(\alpha, \beta)$ is the weight matrix of the $\alpha$-th neuron in layer $L-1$. We find that the value of $g(L)$ cannot always stay positive, which means that when $x_i \in window$, $g(L-1)$ should be computed as the product of $g(L)$, $sign(P - e^{-\pi(x_i - \mu)^2})$ and $\frac{\partial E_m}{\partial W_{L-1}^t(\alpha, \beta)}$.
The difference between using GP labels and original labels is presented in Figure 4. We assume that there are two adjacent ordinal categories, $C_{i-1}$ and $C_i$, together with shared quantized features, which are represented by the grey area. In the first round of Gradient Descent Directions (GDDs), the original center of the shared quantized features is located at $C_a(0)$. If the true label is $C_i$, then, with the updating of the model weights when using the original labels and back-propagation of errors, the initial location $C_a(0)$ slides to $C_a(1)$ (see Figure 4a). In the second GDD round, if the true label is $C_{i-1}$, according to the vector direction of the pulling force, the center slides to $C_a(2)$. Finally, from the location of $C_a(2)$, the center of the shared features stays close to either $C_i$ or $C_{i-1}$ but not near the borderline.
Alternatively, when using the GP labels, the center of the shared features will fluctuate around the borderline of $\{C_i, C_{i-1}\}$. When using the original labels, the pulling force of back-propagation is unidirectional, which means the center of the shared features moves toward $C_i$ or $C_{i-1}$ during every updating step. However, when using the GP labels, the pulling force of back-propagation is a resultant force generated from both the $C_i$ side and the $C_{i-1}$ side.
This part is very similar to the Fast Gradient Sign Method (FGSM) in both non-targeted and targeted adversarial attacks [41,42,43]:
$$H_{adv} = H + \epsilon \cdot sign(\nabla_H J(H, y_{non\text{-}target}))$$
and
$$H_{adv} = H - \epsilon \cdot sign(\nabla_H J(H, y_{target})),$$
where $H$ is the input image, $H_{adv}$ is the perturbed adversarial image, $J$ is the classification loss function, $y_{non\text{-}target}$ is the true label of the input $H$, $y_{target}$ is the targeted label, and $\epsilon$ controls the step toward the targeted or away from the non-targeted prediction. In our method, this step depends on $sign(P - e^{-\pi(x_i - \mu)^2})$, and the targeted category is the Gaussian processed neighbor of the true label.
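For reference, here is a minimal PyTorch sketch of the standard one-step FGSM update (the usual formulation from the adversarial-attack literature, not code from this paper); model, x, and y are placeholder names, and the toy linear classifier only illustrates the call.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps, targeted=False):
    # One FGSM step: move away from the true label, or toward the target label y if targeted.
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    step = -eps if targeted else eps
    return (x + step * x.grad.sign()).detach()

# toy usage with a stand-in linear classifier
model = torch.nn.Linear(4, 3)
x = torch.randn(2, 4)
y = torch.tensor([0, 2])
x_adv = fgsm(model, x, y, eps=0.01)
```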

3.2. Fuzzy Windows with Normalized Gaussian Processed Labels

In order to make models more stable in practice, we should weaken the influence of gradients and use a lower learning rate or a smaller updating gradient. Additionally, compared with the strategy of using only a plain fully connected layer, the Fuzzy Fully Connected Layer (FFCL) exerts a weaker influence on the whole neural network [17]. Thus, regardless of the strength of the pulling force (the gradient matrices in every layer), the center of the shared features can slide relatively smoothly to the optimal position. This combined method is therefore more beneficial for classification in the output OR-CNN layer.
We use the DEX method as the base, and the true label $y$ is quantized into different label groups, each of which is treated as a class. To train DEX with fuzzy windows and normalized Gaussian processed labels, we replaced the expectation module (the last output layer) with fuzzy windows of different lengths, used a Gaussian function ($\sigma = 1/\sqrt{2\pi} \approx 0.4$) to process the ordinal labels, and, finally, modified the loss function with a typical cross-entropy loss. The back-propagation error between the output and the last layer after using $Label_G$ was:
$$\tilde{\nabla}_{x_i}\, l(x_i, y_{true}) = \begin{cases} \tilde{P} - 0 & \text{for } x_i \notin window \\ \tilde{P} - Label_G & \text{for } x_i \in window, \end{cases}$$
where $\tilde{P}$ is calculated from Equations (3) and (4).
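As a rough illustration of how the Gaussian processed labels interact with the window-restricted error above, the following numpy sketch builds the soft label vector (with $\sigma = 1/\sqrt{2\pi}$, the Gaussian reduces to $e^{-\pi(x-\mu)^2}$ and equals 1 at the center) and subtracts it from the output probabilities; the fuzzy-window reweighting of $\tilde{P}$ from Equations (3) and (4) is omitted, and the function names are our own.

```python
import numpy as np

def gp_soft_labels(mu, n_classes, half_win):
    # Gaussian processed label: zero outside the fuzzy window; inside it,
    # e^{-pi (x - mu)^2}, which is 1 at the true class mu.
    labels = np.zeros(n_classes)
    lo, hi = max(0, mu - half_win), min(n_classes - 1, mu + half_win)
    idx = np.arange(lo, hi + 1)
    labels[idx] = np.exp(-np.pi * (idx - mu) ** 2)
    return labels

def output_error(probs, mu, half_win):
    # Output-layer error: probabilities minus the GP label inside the window,
    # probabilities minus zero outside it.
    return probs - gp_soft_labels(mu, len(probs), half_win)

probs = np.array([0.19, 0.10, 0.01, 0.40, 0.18, 0.09, 0.03])
print(output_error(probs, mu=3, half_win=2).round(2))
```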
In Algorithm 1, we provide the pseudo-code of the fuzzy window with normalized Gaussian processed labels for processing the ordinal regression problem. The first step is to process the labels with a Gaussian distribution. After setting the length of the Gaussian window, $L_{Win}$, the $Label_G$ can be calculated according to the Gaussian processing template in Table 1. However, if the Gaussian window reaches the head or the tail of the whole age sequence (0 or $m$), the elements that are out of range (for example, if the front side of the window $Frt < 0$, with $Frt = i - L_{hWin}$, or if the back side of the window $Bk > m$, with $Bk = i + L_{hWin}$) should be removed. The second step is to use fuzzy logic to eliminate the influence of overlapping features in ordinal neighbor classes. $\tilde{P}_i$ can be calculated using Equation (3), and $\tilde{V}_o$ can be computed from Equation (4). At inference time with the fuzzy window, the final estimate is an expected value: the sum, over the binary classifiers, of the classifier position multiplied by the prediction probability of that specific classifier.
Algorithm 1 Fuzzy Windows with Normalized Gaussian Processed Labels

4. Experiments

In this section, we introduce one medical image dataset (CBIS-DDSM) and four facial-age datasets (IMDB-WIKI, FG-NET, MORPH-2, and CACD, with IMDB-WIKI used for pre-training). In the following, there are three experimental ablation results. The first shows the performance under different selections of the hyperparameter $L_{Win}$. The second ablation study presents the performance of FW-GPL in processing a designed fragmentary ordinal dataset. The last one demonstrates comparison results with SOTA methods on three facial-age datasets.

4.1. Datasets

In this study, we use one medical image dataset and four different facial-age estimation datasets (one of which is used for pre-training).

4.1.1. Ordinal Medical Dataset

Table 2 shows the size of one ordinal medical dataset and its corresponding splits for training and testing.
CBIS-DDSM. CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is a large collection of digitized film mammography images, which includes 3572 images referring to 2689 patient cases. Overall BI-RADS assessments from 0 to 5 are described in this dataset, including BI-RADS score 0 (Incomplete cases), BI-RADS score 1 (Negative cases), BI-RADS score 2 (Benign cases), BI-RADS score 3 (Probably Benign cases), BI-RADS score 4 (Suspicious Abnormal cases), and BI-RADS score 5 (Highly Suspicious Malignant cases). (This CBIS-DDSM dataset is available at https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM, accessed on 18 February 2023.)

4.1.2. Facial-Age Estimation Datasets

Table 3 shows the size of each dataset, and the corresponding splits for training and testing.
IMDB-WIKI. For the IMDB-WIKI dataset (IMDB-WIKI can be downloaded from http://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/, accessed on 18 February 2023), the authors crawled images of celebrities from IMDB (www.imdb.com, accessed on 18 February 2023) and Wikipedia (https://en.wikipedia.org/, accessed on 18 February 2023).
FG-NET. The Face and Gesture Recognition Research Network (FG-NET) [15] aging database consists of 1002 color and grey-scale images, which were taken in a totally uncontrolled environment. On average, there are 12 images for each of the 82 subjects, whose age ranges from 0 to 69 (FG-NET is available at https://yanweifu.github.io/FG_NET_data/, accessed on 18 February 2023).
MORPH-2. The Craniofacial Longitudinal Morphological Face Database (MORPH) [14] is the largest publicly available longitudinal face database containing more than fifty thousand mug shots ( You can find MORPH-2 from https://www.faceaginggroup.com/morph/, accessed on 18 February 2023).
CACD. The Cross-Age Celebrity Dataset (CACD) [16], collected from the Internet, contains 163,446 images of 2000 celebrities. The dataset is split into three parts: 1800 celebrities are used for training, 80 for validation, and 120 for testing (the link for CACD is http://bcsiriuschen.github.io/CARC/, accessed on 18 February 2023).

4.2. Evaluation Metrics

For model evaluation and comparison [44], we computed the Mean Absolute Error (MAE) [45] and Root-Mean-Square Error (RMSE) [46] on the test set after the last training epoch:
$$MAE = \frac{1}{N} \sum_{n=1}^{N} |\tilde{y} - y|,$$
$$RMSE = \sqrt{\frac{1}{N} \sum_{n=1}^{N} |\tilde{y} - y|^2},$$
where $\tilde{y}$ is the output value of the OR-CNNs, $y$ is the real facial-age label, and $N$ is the total number of test samples.
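For completeness, a minimal numpy version of the two metrics (the helper names are our own):

```python
import numpy as np

def mae(y_pred, y_true):
    # Mean Absolute Error.
    return np.mean(np.abs(y_pred - y_true))

def rmse(y_pred, y_true):
    # Root-Mean-Square Error.
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_pred = np.array([22.0, 31.0, 40.0])
y_true = np.array([20.0, 30.0, 44.0])
print(mae(y_pred, y_true))    # ~2.33
print(rmse(y_pred, y_true))   # ~2.65
```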

4.3. Experiment Settings

Following DEX [21], SSR [5], Mean-Variance Loss [3], and C3AE [4], the model can first be pre-trained on the IMDB-WIKI dataset. This method can be embedded into any CNN ordinal classification model. We set the length of the fuzzy window $L_{Win}$ to 10 for facial-age detection and to 3 for breast cancer detection, respectively. We used the Adam optimizer in all the experiments, and, similarly to SSR and C3AE, the initial learning rate, dropout rate, momentum, and weight decay were set to 0.002, 0.2, 0.9, and 0.0001, respectively. The learning rate was 0.001 with a decay every 10 epochs by a factor of 0.9. For comparison with the SOTA methods, each model was trained for two hundred epochs in total with a batch size of 50. During the training steps, to avoid overfitting the overlapping features, we adjusted the training strategy according to Algorithm 2.
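A minimal PyTorch sketch of this optimization setup follows, assuming the initial learning rate of 0.002 quoted above; the toy linear layer stands in for the actual CNN backbone with the FW-GPL head, which is not shown here.

```python
import torch

# Stated setup: Adam, weight decay 1e-4, learning rate decayed by 0.9 every 10 epochs,
# 200 epochs, batch size 50. The Linear layer is only a placeholder backbone.
model = torch.nn.Linear(512, 101)
optimizer = torch.optim.Adam(model.parameters(), lr=0.002, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(200):
    # ... forward/backward passes for each batch of 50 would go here ...
    optimizer.step()       # placeholder for the per-batch updates
    scheduler.step()
```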
Algorithm 2 Training Model

4.4. Hardware and Software

All loss functions and neural network models were implemented in MATLAB 2019b and PyTorch 1.7 and trained on four Tesla V100 graphics cards (the source code is available at https://github.com/ChengKang520/FW-with-GPL-for-Ordinal-Regression, accessed on 18 February 2023).

5. Results and Analysis

To compare with the SOTA results, we summarize the comparison on CBIS-DDSM in Table 4 and the comparison on facial-age detection in Table 5.

5.1. Scoring Breast Cancer Images

As we set the hyperparameter $L_{Win} = 3$ when scoring BI-RADS, our FW-GPL can only work well at reducing the influence of overlapping features among neighboring ordinal classes when there are more than three categories; this can also be seen in Table 4. We find that only when scoring BI-RADS with six categories does FW-GPL show a weak but noticeable improvement. The distance $d$ between BI-RADS score 2 (benign) and BI-RADS score 3 (probably benign) is probably beyond the “boundary”, as is the distance $d$ between BI-RADS score 4 (suspicious abnormal) and BI-RADS score 5 (highly suspicious malignant); therefore, the classification task for BI-RADS is difficult.

5.2. Scoring Facial-Age Images

With respect to scoring facial-age images, we set the number of neurons to 10 and the length of the window to 5. The results of a comparison between our model and SOTA models on three facial-age datasets are summarized in Table 5.
Compared with label distribution learning methods such as Deep Label Distribution Learning V2 (DLDL-V2) [2] and MV Loss [3], FW-GPL leverages a fixed pattern (Gaussian processed labels) to learn features while still accounting for the age distribution, and our FW-GPL does not need to know the age distribution of the image data beforehand. Compared with models that use special loss functions, we find that our FW-GPL achieves competitive results against most SOTA methods, for example, MV Loss [3], SSR [5], and C3AE [4]. This is because the fuzzy window reduces the influence of the conjugation among neighboring ordinal categories; unlike DLDL-V2 [2], MV Loss [3], and SSR [5], it does not consider the whole probability distribution, nor does it, as with C3AE [4], focus only on the two highest output probabilities. The second reason is that the Gaussian processed labels do not need a fitted hyperparameter $\sigma$ [10] to approximate the true age probability distribution. Compared with our FW-GPL, MWR [18] developed global and local relative ordinal regressors (ρ-regressors) to predict ρ-ranks within the entire and specific rank ranges. Furthermore, MWR first refined an initial search window, iteratively moved it by selecting two reference instances, and, lastly, estimated the ρ-rank within the window.

6. Ablation and Discussion

Based on facial-age image classification, we used the ordinal IMDB-WIKI data for the ablation analysis. The ablation study was conducted in three parts: (1) to analyze the influence of the number of neurons, (2) to analyze the influence of the length of the window $L_{Win}$, and (3) to figure out how this model can process incomplete ordinal data.

6.1. Ablation Study I (Influence of the Number of Neurons)

We used the classical pre-trained DEX model as the base. In Table 6 and Table 7, we see that when the neuron number $N$ is 10 or 5, the DEX model achieves the best performance. This finding echoes prior research showing that DEX-family age detection models can achieve better performance when the number of neurons in the output layer is 10 or 5 [18]. In other words, a smaller $N$ has a better error tolerance.

6.2. Ablation Study II (Influence of the Length of the Window $L_{Win}$)

We used two types of output layers ($N = 100$ and $N = 10$) to analyze the performance of FW-GPL under different $L_{Win}$, with the results summarized in Table 8 and Table 9. Considering the relationship between the length of the half window $L_{hWin}$ and the number of neurons in the output layer, we find that when the number of neurons in the output layer is 100, a wider window ($L_{Win} = 50$ or $L_{Win} = 100$) gives better performance. Additionally, when the number of neurons in the output layer is 10, we find the same trend. That is because the wider window can contain sufficient information to estimate the facial age. But if $N \leq L_{hWin}$, there is no further improvement. This indicates that a proper window length can improve the performance of the FW-GPL model.

6.3. Ablation Study III (Incomplete Ordinal Image Data)

We manually removed some age segments of IMDB-WIKI to train the model and tested it on the complete ordinal test data, as shown in Figure 5. In Table 10, we can see that when the number of neurons is 100, the most proper window length is 20. In Table 11, when we set the length of the window to 10, the lowest MAE appears when the number of neurons is 5. Consequently, there is no obvious difference between the incomplete (this section) and complete (Ablation Study II) ordinal image data, and the result is only affected by the number of neurons $N$ and the length of the window $L_{Win}$.

6.4. Advantage and Limitation

By directly facing the challenge of ordinal image classification, our method attempts to reduce the influence of the overlapping features, and the length of the window controls the defuzzification of the ordinal neighbor categories. JREAE [13] used two covariance matrices to capture the underlying correlations from both the input facial features and the output age labels, but this family of methods (e.g., DRF [6] and AVDL [10]) must first take the age distribution of the dataset into account. This is problematic because, after fitting the distribution of the facial-age dataset, there is an inevitable deviation between the real age distribution and the fitted one. To avoid such a problem, our method uses a Gaussian distribution within the window to approximate the relationship between input facial features and output age labels. As shown in Table 5, our method outperforms other LDBL methods, which demonstrates the advantage of using label distribution-based learning methods.
However, the disadvantage is that we only use a naive fuzzy logic window to address the challenge of ordinal image classification tasks. By adaptively adjusting the distance between the real age and the center of the moving window, MWR [18] moves the window to fit the ρ-ranks within entire and specific age ranges. Our method constrains the center of the window by using naive fuzzy logic to adjust the distribution of the facial age within the window. This ignores the influence of remote but highly related features beyond the window. Even though we tried to use longer windows, our method failed to overcome this problem.

7. Conclusions

In this paper, we have proposed a novel method for ordinal image scoring named Fuzzy Window with Gaussian Processed Label learning (FW-GPL). FW-GPL introduces a way to reduce the influence of the overlapping features between two ordinal neighbors. It achieves better performance than other methods on multiple age estimation datasets and one ambiguously annotated medical dataset. Our experiments also show that FW-GPL can handle discontinuous ordinal regression by setting a proper length for the windows.
The idea of using fuzzy logic and a Gaussian process strategy to guide ordinal image classification is promising, and we will explore more possibilities for it. There are several directions for future work. (1) There are many other ordinal medical tasks, for example, scoring the severity of depression and grading spinal cord injuries; we will apply this method to such medical tasks in future research. (2) Our method cannot yet achieve the best SOTA result; we will try to overcome this challenge by infusing FW-GPL into other SOTA models. (3) To save computing cost, we will fine-tune pre-trained models into which FW-GPL has been inserted.

Author Contributions

Conceptualization, C.K. and X.Y.; Writing—original draft, C.K.; Writing—review & editing, X.Y. and D.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the student grant agency of the Czech Technical University, grant number SGS22/165/OHK3/3T/13, and by the Research Centre for Informatics, grant number CZ.02.1.01/0.0/16_019/0000765.

Acknowledgments

The work of Cheng Kang has been supported by the student grant agency of the Czech Technical University in Prague (grant number SGS22/165/OHK3/3T/13). The work of Daniel Novak and Cheng Kang has been supported by the Research Centre for Informatics, grant number CZ.02.1.01/0.0/16_019/0000765.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDD: Major Depressive Disorder
FW-GPL: Fuzzy Window with the Gaussian Processed Labels
GP: Gaussian Processed
SCID-CV: DSM-IV Axis I Disorders, Clinician Version
AI: Artificial Intelligence
ANN: Artificial Neural Network
BI-RADS: Breast Imaging-Reporting and Data System
IMDB: Internet Movie Database
WIKI: Wikipedia
FG-NET: Face and Gesture Recognition Research Network
MORPH-2: Craniofacial Longitudinal Morphological Face Database II
CACD: Cross-Age Celebrity Dataset
FFCL: Fuzzy Fully Connected Layer
CBIS-DDSM: Curated Breast Imaging Subset of Digital Database of Screening Mammography
SOTA: State-Of-The-Art
RMSE: Root-Mean-Square Error
SLL-Loss: Single Label Learning with Specific Loss
OR-CNN: Ordinal Regression CNN
LDBL: Label Distribution Based Learning
DEX: Deep Expectation
DLDL: Deep Label Distribution Learning
DLDL-V2: Deep Label Distribution Learning V2
FGSM: Fast Gradient Sign Method
GDD: Gradient Descent Direction
MORPH: Craniofacial Longitudinal Morphological Face Database
MAE: Mean Absolute Error
ACC: Accuracy
MV: Mean Variance
SSR: Soft Stagewise Regression
C3AE: Compact yet efficient Cascade Context-based Age Estimation

References

  1. Tutz, G. Ordinal regression: A review and a taxonomy of models. Wiley Interdiscip. Rev. Comput. Stat. 2022, 14, e1545. [Google Scholar] [CrossRef]
  2. Gao, B.B.; Liu, X.X.; Zhou, H.Y.; Wu, J.; Geng, X. Learning Expectation of Label Distribution for Facial Age and Attractiveness Estimation. arXiv 2020, arXiv:2007.01771. [Google Scholar]
  3. Pan, H.; Hu, H.; Shan, S.; Chen, X. Mean-Variance Loss for Deep Age Estimation from a Face. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  4. Zhang, C.; Liu, S.; Xu, X.; Zhu, C. C3AE: Exploring the limits of compact model for age estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–19 June 2019; pp. 12587–12596. [Google Scholar]
  5. Yang, T.Y.; Huang, Y.H.; Lin, Y.Y.; Hsiu, P.C.; Chuang, Y.Y. Ssr-net: A compact soft stagewise regression network for age estimation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), Stockholm, Sweden, 13–19 July 2018; Volume 5, p. 7. [Google Scholar]
  6. Shen, W.; Guo, Y.; Wang, Y.; Zhao, K.; Wang, B.; Yuille, A.L. Deep Regression Forests for Age Estimation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2304–2313. [Google Scholar]
  7. Liu, Y.; Wang, F.; Kong, A.W.K. Probabilistic deep ordinal regression based on Gaussian processes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5301–5309. [Google Scholar]
  8. Zhou, Y.; He, J.; Gu, H. Partial label learning via Gaussian processes. IEEE Trans. Cybern. 2016, 47, 4443–4450. [Google Scholar] [CrossRef] [PubMed]
  9. Fan, Y.Y.; Liu, S.; Li, B.; Guo, Z.; Samal, A.; Wan, J.; Li, S.Z. Label distribution-based facial attractiveness computation by deep residual learning. IEEE Trans. Multimed. 2017, 20, 2196–2208. [Google Scholar] [CrossRef] [Green Version]
  10. Wen, X.; Li, B.; Guo, H.; Liu, Z.; Hu, G.; Tang, M.; Wang, J. Adaptive Variance Based Label Distribution Learning for Facial Age Estimation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  11. Berg, A.; Oskarsson, M.; O’Connor, M. Deep ordinal regression with label diversity. In Proceedings of the 2021 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 2740–2747. [Google Scholar]
  12. Li, W.; Huang, X.; Lu, J.; Feng, J.; Zhou, J. Learning probabilistic ordinal embeddings for uncertainty-aware regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13896–13905. [Google Scholar]
  13. Chen, G.; Peng, J.; Wang, L.; Yuan, H.; Huang, Y. Feature constraint reinforcement based age estimation. Multimed. Tools Appl. 2022. [Google Scholar] [CrossRef]
  14. Ricanek, K.; Tesafaye, T. MORPH: A longitudinal image database of normal adult age-progression. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; pp. 341–345. [Google Scholar] [CrossRef]
  15. Panis, G.; Lanitis, A. An Overview of Research Activities in Facial Age Estimation Using the FG-NET Aging Database; Springer International Publishing: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  16. Chen, B.C.; Chen, C.S.; Hsu, W.H. Face Recognition and Retrieval Using Cross-Age Reference Coding With Cross-Age Celebrity Dataset. IEEE Trans. Multimed. 2015, 17, 804–815. [Google Scholar] [CrossRef]
  17. Kang, C.; Yu, X.; Wang, S.H.; Guttery, D.S.; Pandey, H.M.; Tian, Y.; Zhang, Y.D. A heuristic neural network structure relying on fuzzy logic for images scoring. IEEE Trans. Fuzzy Syst. 2020, 29, 34–45. [Google Scholar] [CrossRef] [Green Version]
  18. Shin, N.H.; Lee, S.H.; Kim, C.S. Moving window regression: A novel approach to ordinal regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 18760–18769. [Google Scholar]
  19. Baccianella, S.; Esuli, A.; Sebastiani, F. Evaluation measures for ordinal regression. In Proceedings of the 2009 Ninth International Conference on Intelligent Systems Design and Applications, Pisa, Italy, 30 November–2 December 2009; pp. 283–287. [Google Scholar]
  20. Levi, G.; Hassner, T. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 34–42. [Google Scholar]
  21. Rothe, R.; Timofte, R.; Van Gool, L. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Santiago, Chile, 7–13 December 2015; pp. 10–15. [Google Scholar]
  22. Niu, Z.; Zhou, M.; Wang, L.; Gao, X.; Hua, G. Ordinal regression with multiple output cnn for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4920–4928. [Google Scholar]
  23. Cao, W.; Mirjalili, V.; Raschka, S. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognit. Lett. 2020, 140, 325–331. [Google Scholar] [CrossRef]
  24. Rothe, R.; Timofte, R.; Van Gool, L. Deep expectation of real and apparent age from a single image without facial landmarks. Int. J. Comput. Vis. 2018, 126, 144–157. [Google Scholar] [CrossRef] [Green Version]
  25. Gámez, J.C.; Garcia, D.; González, A.; Perez, R. An approximation to solve regression problems with a genetic fuzzy rule ordinal algorithm. Appl. Soft Comput. 2019, 78, 13–28. [Google Scholar] [CrossRef]
  26. Alcalá-Fdez, J.; Alcalá, R.; González, S.; Nojima, Y.; García, S. Evolutionary fuzzy rule-based methods for monotonic classification. IEEE Trans. Fuzzy Syst. 2017, 25, 1376–1390. [Google Scholar] [CrossRef]
  27. Vega, R.; Gorji, P.; Zhang, Z.; Qin, X.; Rakkunedeth, A.; Kapur, J.; Jaremko, J.; Greiner, R. Sample efficient learning of image-based diagnostic classifiers via probabilistic labels. In Proceedings of the International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 739–747. [Google Scholar]
  28. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  29. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  30. Gao, B.B.; Xing, C.; Xie, C.W.; Wu, J.; Geng, X. Deep label distribution learning with label ambiguity. IEEE Trans. Image Process. 2017, 26, 2825–2838. [Google Scholar] [CrossRef] [Green Version]
  31. Geng, X. Label distribution learning. IEEE Trans. Knowl. Data Eng. 2016, 28, 1734–1748. [Google Scholar] [CrossRef] [Green Version]
  32. Imani, E.; White, M. Improving regression performance with distributional losses. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2157–2166. [Google Scholar]
  33. Chu, W.; Ghahramani, Z.; Williams, C.K. Gaussian processes for ordinal regression. J. Mach. Learn. Res. 2005, 6, 1019–1041. [Google Scholar]
  34. Liu, H.; Lu, J.; Feng, J.; Zhou, J. Ordinal deep feature learning for facial age estimation. In Proceedings of the 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), Washington, DC, USA, 30 May–3 June 2017; pp. 157–164. [Google Scholar]
  35. Zhang, Z.; Lai, C.; Liu, H.; Li, Y.F. Infrared facial expression recognition via Gaussian-based label distribution learning in the dark illumination environment for human emotion detection. Neurocomputing 2020, 409, 341–350. [Google Scholar] [CrossRef]
  36. Rajasekhar, G.P.; Granger, E.; Cardinal, P. Deep domain adaptation with ordinal regression for pain assessment using weakly-labeled videos. Image Vis. Comput. 2021, 110, 104167. [Google Scholar] [CrossRef]
  37. Chen, S.; Zhang, C.; Dong, M.; Le, J.; Rao, M. Using ranking-cnn for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5183–5192. [Google Scholar]
  38. Chen, S.; Zhang, C.; Dong, M. Deep age estimation: From classification to ranking. IEEE Trans. Multimed. 2017, 20, 2209–2222. [Google Scholar] [CrossRef]
  39. Li, K.; Xing, J.; Hu, W.; Maybank, S.J. D2C: Deep cumulatively and comparatively learning for human age estimation. Pattern Recognit. 2017, 66, 95–105. [Google Scholar] [CrossRef] [Green Version]
  40. Tan, Z.; Zhou, S.; Wan, J.; Lei, Z.; Li, S.Z. Age estimation based on a single network with soft softmax of aging modeling. In Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan, 30 November–4 December 2016; pp. 203–216. [Google Scholar]
  41. Kurakin, A.; Goodfellow, I.J.; Bengio, S. Adversarial examples in the physical world. arXiv 2017, arXiv:1607.02533. [Google Scholar]
  42. Tramèr, F.; Kurakin, A.; Papernot, N.; Goodfellow, I.; Boneh, D.; McDaniel, P. Ensemble adversarial training: Attacks and defenses. arXiv 2017, arXiv:1705.07204. [Google Scholar]
  43. Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 9185–9193. [Google Scholar]
  44. Hodson, T.O. Root-mean-square error (RMSE) or mean absolute error (MAE): When to use them or not. Geosci. Model Dev. 2022, 15, 5481–5487. [Google Scholar] [CrossRef]
  45. Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the mean root square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82. [Google Scholar] [CrossRef]
  46. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)?—Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250. [Google Scholar] [CrossRef] [Green Version]
  47. Geras, K.J.; Wolfson, S.; Shen, Y.; Wu, N.; Kim, S.; Kim, E.; Heacock, L.; Parikh, U.; Moy, L.; Cho, K. High-resolution breast cancer screening with multi-view deep convolutional neural networks. arXiv 2017, arXiv:1703.07047. [Google Scholar]
  48. Akselrod-Ballin, A.; Karlinsky, L.; Alpert, S.; Hasoul, S.; Ben-Ari, R.; Barkan, E. A region based convolutional network for tumor detection and classification in breast mammography. In Deep Learning and Data Labeling for Medical Applications; Springer: Berlin/Heidelberg, Germany, 2016; pp. 197–205. [Google Scholar]
  49. Lin, Y.; Shen, J.; Wang, Y.; Pantic, M. FP-Age: Leveraging Face Parsing Attention for Facial Age Estimation in the Wild. arXiv 2021, arXiv:2106.11145. [Google Scholar] [CrossRef]
  50. Deng, Z.; Liu, H.; Wang, Y.; Wang, C.; Yu, Z.; Sun, X. PML: Progressive Margin Loss for Long-tailed Age Classification. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 10498–10507. [Google Scholar]
Figure 1. The challenge of ordinal image classification (or scoring). The X-axis denotes the intrinsic rank of features, and the Y-axis denotes the weights of models. $C_a(0)$ is the initial center of the overlapping feature. We assume that the features in ordinal images have an “intrinsic rank”, and the corresponding ordinal category will show a specific concentration in terms of the “intrinsic rank”. $C_i$ and $C_{i-1}$ are, respectively, the centers of their corresponding neighboring ordinal classes. (a) If the distance $d$ between two centers is remote, the “intrinsic rank” is slack. (b) If the distance $d$ between two centers is approaching the boundary, the “intrinsic rank” is tight. (c) If the distance $d$ between two centers is beyond the boundary, the “intrinsic rank” seems to become a whole part. Under this condition, the classification task would become extremely difficult.
Figure 2. The proposed fuzzy window method (the length of the fuzzy window is 5) with the use of Gaussian processed labels for image scoring tasks.
Figure 3. The left panel presents an example that shows overlapping features between two neighbor groups. The right panel shows the one-hot labels and the Gaussian processed labels.
Figure 4. When two adjacent categories pull the center of the shared features, the resultant force decides where the center will finally stay. (a) When using one-hot labels, if the initial center of the shared features is $C_a(0)$, the resultant vector of the pulling forces toward $C_i$ and $C_{i-1}$ will make the center slip from $C_a(0)$ to $C_a(1)$. Finally, the center of the shared features will move close to either $C_i$ or $C_{i-1}$. (b) However, if we use the Gaussian labels, the center of the shared features will finally vibrate in the middle between $C_i$ and $C_{i-1}$.
Figure 5. This figure shows the condition that the BI-RADS or the facial-age dataset is not consecutive. (a) The class distribution of CBIS-DDSM. (b) The age distribution of the IMDB-WIKI. The blue bars are the fragmentary IMDB-WIKI, whereas the red bars are manually removed.
Table 1. An example of using Gaussian labels. There are seven categories ($C_1$ to $C_7$), a probability vector, original labels, original labels’ errors, a Gaussian window ($\mu$ = 4 and $\sigma$ = 0.5), Gaussian processed labels ($\mu$ = 4 and $\sigma$ = 0.5), and Gaussian processed labels’ errors.

Category              C1      C2      C3      C4      C5      C6      C7
Probability Outputs   0.19    0.1     0.01    0.4     0.18    0.09    0.03
Original Labels       0       0       0       1       0       0       0
Errors                0.19    0.1     0.01    −0.6    0.18    0.09    0.03
Gaussian Window       0       0.05    0.1     0.7     0.1     0.05    0
Gaussian Labels       0       0.07    0.14    1       0.14    0.07    0
Errors_G              0.19    0.03    −0.13   −0.6    0.04    0.02    0.03
Table 2. Sample distribution of CBIS-DDSM dataset based on BI-RADS assessment.

Scores (BI-RADS): 0 / 1 / 2 / 3 / 4 / 5
Training Set (Mass + Calcification): 192 (129 + 63), 1559 (77 + 482), 368 (279 + 89), 1286 (533 + 753), 458 (299 + 159)
Testing Set (Mass + Calcification): 46 (33 + 13), 285 (14 + 71), 109 (85 + 24), 347 (169 + 178), 115 (75 + 40)
Table 3. Facial-age datasets used to evaluate the proposed FW-GPL.

Dataset Name   Train     Test     Val     Total     Label Range
IMDB-WIKI      260,282   –        –       523,051   0–100
FG-NET         990       12       –       1002      0–69
MORPH 2        4380      1095     –       5475      16–70
CACD           145,275   10,571   7600    163,446   –
Table 4. Comparison with existing methods on DDSM in terms of ACC.

Method                                       CNN + FCL        CNN + FFCL       CNN + FW-GPL
Geras [47] (BI-RADS: 0/1/2)                  68.8%            70.1%            70.3%
Akselrod-Ballin [48] (BI-RADS: 2/(3-4-5))    60.0%            62.3%            62.4%
Kang [17] (BI-RADS: 0/(2-3)/(4-5))           72.0%            74.1%            74.2%
Kang [17] (BI-RADS: 0/1/2/3/4/5)             56.34% ± 1.4%    57.40% ± 1.7%    58.29% ± 1.9%
Table 5. In terms of MAEs, our approach is compared with different SOTA methods. (* indicates the model was pre-trained on the IMDB-WIKI dataset.)

Type      Method             MORPH 2   FG-NET   CACD    Paras
Bulky     DEX [21]           3.25      4.63     -       138 M
          DEX * [21]         2.68      3.09     6.52    138 M
          MV [3]             2.41      4.10     -       138 M
          MV * [3]           2.16      2.68     -       138 M
          DLDL-v2 [2]        1.969     -        -       138 M
          FP-Age [49]        2.04      5.60     5.60    138 M
          FP-Age * [49]      1.90      4.68     4.33    138 M
          DRF [6]            2.80      3.47     5.63    -
          PML [50]           2.31      2.16     -       -
          JREAE [13]         2.71      3.390    4.596   -
          MWR [18]           2.13      -        5.68    -
          FW-GPL [Ours]      2.71      4.27     -       138 M
          FW-GPL * [Ours]    2.24      2.73     6.10    138 M
Compact   ORCNN [3]          3.27      6.44     -       479.7 K
          MRCNN [3]          3.42      -        -       479.7 K
          SSR [5]            3.16      -        -       40.9 K
          C3AE [4]           2.78      4.09     -       39.7 K
          C3AE * [4]         2.75      2.95     -       39.7 K
          AVDL * [10]        2.37      2.32     -       11 M
          MWR [18]           2.00      2.23     -       -
          FW-GPL [Ours]      2.72      3.71     -       40.9 K
Table 6. Test performance of the FW-GPL method, with $L_{Win}$ = 10 (length of output neurons $N$ set as [100, 50, 20, 10, 5]).

Method    DEX     DEX with FW-GPL
N         100     50      20      10      5
RMSE      12.46   13.36   12.65   12.60   12.80
MAE       8.94    8.67    8.79    8.62    8.59
Table 7. Test performance of the DEX method (length of output neurons $N$ set as [100, 50, 20, 10, 5]).

Method    DEX     DEX without FW-GPL
N         100     50      20      10      5
RMSE      13.57   13.38   12.86   12.67   12.71
MAE       8.96    8.83    8.77    8.64    8.74
Table 8. Test performance of FW-GPL on the testing data sets (length of output neurons set as 100).

Method    DEX with FW-GPL
L_Win     5       10      20      50      100
RMSE      15.17   15.10   14.58   13.65   13.68
MAE       10.18   10.11   9.78    9.69    9.71
Table 9. Test performance of FW-GPL on the testing data sets (length of output neurons set as 10).

Method    DEX with FW-GPL
L_Win     5       10      20
RMSE      12.91   12.60   12.60
MAE       8.78    8.62    8.62
Table 10. Test performance of the DEX and the FW-GPL with the fragmentary IMDB-WIKI dataset (length of output neurons set as 100).

Method    DEX     FW-GPL
L_Win     0       5       10      20      50
RMSE      12.46   14.73   14.30   13.64   12.72
MAE       8.94    9.43    9.13    8.78    8.81
Table 11. Test performance of the DEX and the FW-GPL with the fragmentary IMDB-WIKI dataset (length of the window set as 10).

Method    DEX     FW-GPL
N         100     50      20      10      5
RMSE      12.46   13.36   12.65   12.60   12.80
MAE       8.94    8.67    8.79    9.08    8.59

