1. Introduction
The performance of deep neural networks (DNNs) still largely depends on the quality of annotations despite the tremendous success they have achieved in a variety of visual tasks. There is a series of works on the learning mechanism of DNNs with noisy labels [
1,
2,
3]. It is observed in [
3] that DNNs are able to perfectly fit a randomly labeled training set owing to their strong memorization ability. When trained on a noisy dataset, the generalization performances of DNNs decrease sharply [
2,
4], due to overfitting of noisy labels. However, the accurate labeling of large scale datasets is almost impractical [
5]. On one hand, it is extremely time consuming and costly to obtain high-quality labels for large scale datasets. Researchers tend to collect data and the labels automatically using social media or online search engines as a cheaper alternative, which inevitably introduces noisy labels. On the other hand, annotators need specific expertise to label some datasets (e.g., medical or agricultural datasets), which will easily produce incorrect labels caused by variability in the labeling by several annotators.
Therefore, noisy label learning has attracted increasing research interest in recent years [
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16,
17,
18]. Recent studies have demonstrated that co-training is conducive to noisy label learning. To hinder the memorization of noisy labels, MentorNet [
6] trains a student network by feeding in small-loss samples selected by a teacher model because the prediction loss of true-labeled samples tends to be smaller than that of noisy samples [
1]. Decoupling [
7] trains two DNNs and only uses samples with divergent predictions from the two models to update them. Each network in Co-teaching [
8] provides small-loss samples to the other. Co-teaching+ [
9] gains further improvement by only selecting small-loss samples from those with different prediction labels from two models. By excluding samples with low certainty of being clean from the training, these methods can effectively avoid overfitting of mislabeled samples, yet they are criticized for leaving a large part of the training dataset unused. JoCoR [
10] jointly trains two networks using a joint loss consisting of cross entropy losses from each model and a KL-Divergence loss [
19] measuring the distance of the two prediction logits. Co-learning [
11] trains a shared feature encoder using a noisy label supervised classifier head along with a self-supervised projection head.
The two parallelly trained models in the above methods will gradually become in agreement with each other in the training process. With the constraining of KL-Divergence loss, JoCoR archives agreement on model predictions while Co-learning maximizes the agreement in latent space. In this study, we innovatively propose to alleviate the memorization of noisy labels by utilizing agreements on parameter gradients.
First, we conduct contrast experiments to explore how noisy labels affect the training process of DNNs. We train two identical networks with the same initialization. A gradient agreement coefficient is designed to evaluate whether the updating directions of two corresponding parameters from the two networks agree at every gradient descent step.
Figure 1 presents the proportion of parameters that achieve agreements among all parameters under different levels of label noise on CIFAR-10. In every batch of the warm up period, the two networks are trained on exactly same input data. Therefore, as shown, the gradient direction of all parameter pairs are in total agreement. After that, we feed data with same class distributions but not consists of same samples to each network in every batch. As
Figure 1 illustrates, the agreement ratio gradually decreases and finally reaches convergence. The higher the noisy rate is, the lower final agreement ratio becomes. While the decreases of agreement ratio occurring on original CIFAR-10 are mainly resulted by intra-class variance, the significant drops of final agreement ratios on other settings are obviously caused by noisy labels, indicating that noisy labels may result in divergent gradient directions in the late stage of training, which ultimately degrades the generation abilities of DNNs.
This phenomenon inspires us to hinder the memorization of noisy labels by identifying divergent gradient directions. However, besides noisy labels, there are several factors which will result in divergence of gradient directions, including different initialization and different class distribution of input samples. To distinguish the divergences caused by noisy labels, we need to eliminate the impact of other factors. Naturally, we propose a novel gradient agreement learning framework (GAL for short). GAL synchronously trains two identical networks with same initialization to avoid the gradient divergence caused by different initialization. Moreover, we propose class distribution agreement sampling strategy to ensure the samples input to the two networks in each iteration share same distribution of annotation classes while avoiding them being identical, thus eliminating the gradient divergence caused by different class distribution of input samples. Therefore, by excluding divergent gradient updates after the warm up steps, GAL can effectively prevent networks from memorizing noisy samples. Then, in every epoch, we use the prediction results of the two networks as pseudo-labels to supervise the training of a third net. In this way, we can rectify noisy labels while avoiding the accumulated confirmation biases.
Compared with previous co-training methods, GAL trains models on the entire training set and gains additional information from hard or noisy samples which have a great possibility not to be selected by small-loss methods such as Co-teaching+. Moreover, by seeking agreement on gradient directions, GAL still allows divergence on the final predictions from the two nets, thus obtaining more significant improvements from model ensembling.
In summary, the contributions of this paper is threefold:
Contrast experiments show that noisy labels may cause more divergent gradients in the late stage of training.
We propose a simple yet effective gradient agreement learning framework that effectively hinders the memorization of noisy labels.
Extensive experiments show the effectiveness and robustness of GAL under different ratios and types of noise, outperforming previous methods.
2. Related Work
We briefly introduce existing literature on noisy label learning in this section.
Regularization. Regularization methods are widely used in the literature and are proved to be able to upgrade the generalization abilities of deep learning models. Recent studies [
12,
13,
14,
15,
16,
17,
18] have proposed various regularization methods to prevent the memorization of noisy labels. Menon [
15] design a new approach to clip gradients, which is robustness to noisy labels. Robust early-learning [
16] dynamically divides parameters into critical and non-critical ones based on their importance for the learning of clean labels. Then different update strategies are applied to the two groups of parameters to avoid overfitting. However, noise rate is needed to divide the parameters, which is not available in real world cases. ELR [
17] proposes an noise robust regularization term to steer models towards label probabilities produced based on model outputs. Label Smoothing [
18] is a technique to train models with an estimated label distribution instead of the one-hot label, thereby hindering the memorization of noisy labels.
Co-training Methods. Recently, kinds of co-training methods for noisy label learning have been developed by researchers. Decoupling [
7] proposes the “disagreement” strategy, only updating two simultaneously trained models based on samples with divergent predictions from the two networks. MentorNet [
6] trains a student network using small-loss samples selected by a cooperating teacher model. Co-teaching [
8] parallelly trains a pair of networks and updates them with small-loss samples selected by the peers. Co-teaching+ [
9] then introduce the “disagreement” strategy into the training process of Co-teaching. In contrast, JoCoR [
10] proposes to select confident samples with agreed predictions. Co-learning [
11] tries to introduce more views of training data by training a shared feature encoder using a noisy label supervised classifier head along with a self-supervised projection head. By seeking agreements in latent space, models become tolerant to noisy labels. Different from previous co-teaching kind methods which select samples with small loss yet leaving a large part of training set unused, the proposed GAL trains models on the whole training set by only excluding parameters with disagreed gradients from updating instead of excluding all the large loss samples, thereby benefiting from more supervisory signals. Moreover, while the previous co-training methods either achieve agreement on the output predictions (e.g., Co-teaching+ and JoCoR) or seek to maximize the agreement in latent space (e.g., Co-learning), GAL seeks agreement on gradient directions of each parameter pairs and still allows divergence on the final predictions from the two nets, thus obtaining more benefits in test accuracy from model ensembling.
Semi-supervised and self-supervised learning. With the rapid developments in the area of Semi-supervised [
20] and Self-supervised learning [
21], recent studies [
22,
23,
24,
25,
26] have leveraged these techniques in noisy label learning. DivideMix [
24] categorizes training samples into clean and noisy sets utilizing a two-component Gaussian Mixture Model [
27]. Then semi-supervised learning is applied to train the networks by treating the clean set and noisy set as labeled and unlabeled set respectively. MOIT+ [
25] employs supervised contrastive learning to pretrain the models. With the learned representations, samples are divided into clean or noisy sets, after which semi-supervised learning is applied to train a classifier. Sel-CL+ [
26] utilizes the low-dimensional features pretrained with unsupervised contrastive learning to select confident pairs of samples for the supervised contrastive training of models, which enables the training of models to benefit from not only the pairs with correct annotations, but also the pairs which are mislabeled from the same class. Despite the promising classification accuracy achieved by these methods, their improvements mainly owe to the strong abilities of semi-supervised and self-supervised learning techniques which partly or completely ignore the annotation labels, thereby not providing new approaches in how to hinder the memorization of noisy labels in supervised learning. In contrast, the proposed method tries to alleviate the impact of noisy labels in the context of supervised learning using the annotations as input.
3. Proposed Method
In this section, we introduce GAL, our proposed framework for noisy label learning, in detail. We utilize a pair of identical network to distinguish noisy gradients and prevent the memorization of noisy labels, thus upgrading the performance of the trained models. The overview of GAL is illustrated in
Figure 2. We propose a triplet network structure (i.e.,
,
and
in
Figure 2). To avoid divergence caused by different initialization,
and
are two identical nets, while
can be any classification network. At each mini-batch after warm-up period, training samples with same class distribution are fed into
and
. To hinder the memorization of noisy labels, for every pair of corresponding parameters, we evaluate whether their gradient directions achieve agreement and only update agreed parameters. To further improve the performance and avoid the accumulated confirmation bias, at every epoch, the predictions of
and
on the training set are used as pseudo labels to supervise the training of
. In general, GAL consists of three parts: class distribution agreement sampling, gradient agreement updating and pseudo-labels supervising. The three components will be introduced in order in the following subsections.
3.1. Class Distribution Agreement Sampling
We present class distribution agreement sampling in this section. In order to avoid gradient disagreement caused by different input data, the samples input to and should share same class distribution in every iteration. However, if the samples are identical, the gradient directions for all the parameters of and will be exactly the same, resulting in failure to distinguish gradient disagreement caused by noisy labels. Thus, we propose a class distribution agreement sampling strategy to ensure roughly the same class distribution for the samples input to and in every batch while avoiding them being identical.
Let batches input to
and
in the
nth iteration as
and
with
being the batch size. As
Figure 3 presents, the sampling process is repeated every four iterative steps. In the first iterative step,
is randomly sampled from the whole training set
S with
K annotation classes. Naturally, the annotated class distribution
of
can be obtained:
where
represents the number of samples belonging to class
j in
. Then
is formed by randomly sampling
instances for every class
j in
among training samples except those in
. In the second iteration, we feed
into
and
into
to make sure the training samples input to
and
are the same in one epoch. In the next two steps, to balance the training process of
and
,
is randomly sampled from
S while
follows the class distribution of
.
3.2. Gradient Agreement Updating
Algorithm 1 briefly describes Gradient Agreement Updating. The training procedure of
and
can be divided into two phases. Due to the large random volatility of the gradient directions at the early stage of training process, gradient agreement updating is not applied in the first phase. Meanwhile, previous works [
2,
3] have observed that DNNs tend to fit training samples with clean labels before memorizing noisy labels. Thus, in the first phase, we warm up the two models using the standard cross-entropy loss and gradient descent for a certain number of epochs. At this stage, parameters of the two identical nets
and
are initialized with the same values. The data fed into
and
is also identical in every batch to avoid unnecessary gradient divergence.
In the next phase, the two nets will gradually over-fit to noisy labels if following the training method in the first phase. Therefore, to hinder the memorization of noisy labels, gradient agreement updating is applied.
and
each consists of
N parameters. We can group the corresponding parameters of the two networks to obtain
N parameter pairs
. In every iteration, using cross entropy loss and the annotation labels, a pair of gradients
and
for
is obtained after each backward propagation. In contrast to standard gradient descent methods which directly add gradients to
, an intermediate parameter pair
is first calculated following Equation (
2):
where
is the learning rate. We then reshape
and
to
D-dimension vectors
and
, respectively.
To measure whether the gradients are clean or noisy, we propose a gradient agreement coefficient
which is defined as follows:
After acquiring
, we then apply parameter updating as follows:
where
is the threshold for determining whether the gradients of a pair of parameters reach an agreement.
Algorithm 1: Gradient Agreement Updating. |
|
3.3. Pseudo-Labels Supervising
The training set consists of input data and corresponding labels . Part of is made up of noisy labels. Models will over-fit to noisy labels if directly using as the supervision for training, downgrading the generalization abilities. The gradient agreement updating method proposed above can effectively prevent and from memorizing noisy labels, yet the performance of trained models can be further strengthened leading by more correct supervision. However, if we directly correct the labels used to supervising and , the training of models will suffer from accumulated confirmation biases. Therefore, we propose to train a third net as shown in Algorithm 2.
First, we use the average of the prediction logits outputted by
and
for sample
to represent the joint prediction:
Then the label with maximum score in
is used as pseudo-label
:
With the logit
predicted by
for sample
according to Equation (
8), the training of
is supervised by two losses, named classification loss and prediction logit loss respectively.
Because
and
are hindered from memorizing noisy labels, the precision of pseudo labels generated is apparently much higher than the original annotation labels. Therefore, the classification loss
is defined as:
where only pseudo labels with confidence scores higher than
will be counted in the loss.
Algorithm 2: Pseudo-Labels Supervising. |
|
The prediction logit loss
is defined as the Kullback–Leibler Divergence between the joint logits and the output prediction logits of
:
Then, the overall loss
of
is the sum of
and
:
Leading by , is effectively prevented from over-fitting to noisy labels. Meanwhile the performance of is further enhanced with which brings in more correct supervision.
5. Conclusions and Future Work
In this study, using contrast experiments, we observe that noisy labels may cause more divergent gradients in the late stage of training. Thus, we propose a novel gradient agreement learning framework (GAL) to tackle the problem of learning with noisy labels. By synchronously training two nets and dynamically excluding divergent gradients, detected using a gradient agreement coefficient, from parameter updating, GAL is highly effective in hindering the memorization of noisy labels. Training a third network with the pseudo labels produced by the two nets further enhances the performance. Extensive experiments on CIFAR-10, CIFAR-100, Animal-10N and Clothing1M datasets demonstrate the effectiveness of GAL.
Limitations. Nowadays, the parameter size of deep neural networks is becoming more and more huge with the introducing of big models. Therefore, the training efficiency of GAL might become a drawback in training big models, with the gradient agreement coefficients for billions of parameter pairs needed to be calculated. Further studies should be conducted to reduce the computational complexity on big models.
Future Work. Our further work will focus on two aspects. (a) Reduce the training time of GAL on big models with deeper studying of the impact of noisy labels on different parameters. (b) Current noisy label learning techniques are mostly studied and applied in the task of image classification. Extending these works to other areas will be interesting.