1. Introduction
Most successful deep-learning architectures rely on large labeled datasets. In many practical scenarios, however, only a small amount of labeled data is available, owing to various constraints, so information on new classes must be acquired from limited labeled data. This is known as few-shot learning (FSL), in which a task uses a small number of labeled samples to predict unlabeled ones. A variety of approaches have been proposed for FSL to address the scarcity of labeled data.
Meta-learning is one of the main methods used in FSL. Model-Agnostic Meta-Learning (MAML) [
1] learns an initialization from which only a few gradient-descent steps on a handful of samples are needed to achieve good results on new problems. Because MAML only adjusts parameters according to different tasks, the trained model is prone to overfitting. Task-Agnostic Meta-Learning for few-shot learning (TAML) [
2] improves on the MAML algorithm by adding a regularization term that explicitly requires the model parameters to show no preference among tasks. Meta-learning with Memory-Augmented Neural Networks (MANN) [
3] uses a recurrent neural network (RNN) to memorize representations of previous tasks. Although this helps when learning new tasks, the RNN's weight updates remain slow, which makes training difficult. Meta-learning with differentiable closed-form solvers [
4] uses simpler differentiable regression methods that have closed-form solutions to replace the original learning algorithms (e.g., the
k-nearest-neighbor (KNN) algorithm and the convolutional neural network) and has inspired our idea of combining a traditional learning algorithm with a neural network. Belonging Network (BeNet) [
5] uses simple statistics (the mean and variance) of the target class to improve performance on the training image set. Regularization using knowledge distillation when learning small datasets [
6] leverages knowledge distillation and finds that increasing the distillation parameter "temperature" can improve model accuracy, especially when training data are scarce. However, if the test and training sets differ greatly in distribution, the model's output is highly unsatisfactory. Task-Aware Feature Embeddings for low-shot learning (TAFE-Net) [
7] innovatively uses meta-learning to dynamically select weight parameters: different weights are selected according to the task, and weight decomposition keeps this computationally feasible. Because few-shot datasets lack corresponding class-description information, the meta-learner's ability to represent the embedded features of a task is limited; hence, the experimental improvement over other algorithms is only slight. How to effectively use the limited dataset information has therefore become the focus of our research.
Metric learning maps images into an embedding space in which images of the same class lie close to each other and images of different classes lie farther apart. The Siamese Neural Network [
8] constrains the structure of the input images and can automatically discover generalizable features of new samples. However, because it is sensitive to differences between the two input images, it can easily misclassify. Matching networks [
9] construct an end-to-end nearest-neighbor classifier. Through meta-learning, the classifier quickly adapts to new tasks with few samples. However, when the label distribution deviates markedly (e.g., in fine-grained settings), the model becomes unusable. Few-shot image classification with Differentiable Earth Mover's Distance and structured classifiers (DeepEMD) [
10] splits an image into multiple tiles, introduces the Earth Mover's Distance (EMD) as a new distance measure, and computes the best matching cost between the tiles of a query image and a support-set image as their degree of similarity. Boosting few-shot learning with adaptive margin loss [
11] improves the classification performance of the original algorithm by introducing an adaptive margin loss based on class-relevant or task-relevant information, using the semantic similarity between categories to generate adaptive margins. However, the two proposed margin-generation methods are unrelated to each other. Improved few-shot visual classification [
12] uses the Mahalanobis distance to measure the distance between samples; however, it focuses largely on finding the most accurate inter-class boundaries for the existing samples and neglects learning image features. These metric-learning methods underline the importance of the distance measure in FSL.
In this study, we propose a new model for FSL called the word embedding distribution propagation graph network (WPGN). As illustrated in
Figure 1, the word embedding distribution graph is merged into the graph neural network (GNN), and the fine-grained few-shot classification task is solved by cyclic calculation. The contributions of this work are summarized below.
First, the WPGN uses the GloVe model to extract class-label information as word vectors and uses WordNet to weight the class-distribution similarity, thereby embedding class semantic information into the GNN. Using the word embedding distribution graph, the WPGN alleviates the low classification accuracy caused by the similarity of fine-grained image features.
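The role of the label word vectors can be illustrated with a toy cosine-similarity computation. The 4-dimensional vectors below are invented for illustration only; real GloVe embeddings are 50- to 300-dimensional and loaded from pretrained files, and this is not the actual WPGN code.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy "word vectors" for three class labels.
label_vectors = {
    "sparrow": np.array([0.90, 0.80, 0.10, 0.20]),
    "finch":   np.array([0.85, 0.75, 0.15, 0.25]),
    "truck":   np.array([0.10, 0.20, 0.90, 0.80]),
}

sim_birds = cosine_similarity(label_vectors["sparrow"], label_vectors["finch"])
sim_far   = cosine_similarity(label_vectors["sparrow"], label_vectors["truck"])
# Semantically close labels yield higher similarity, which is the signal
# the word embedding distribution graph exploits.
assert sim_birds > sim_far
```

In the same spirit, semantically close class labels produce higher similarity weights, giving the GNN extra discriminative signal when image features alone are too similar.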
Second, we replace the ReLU activation function of the GNN with the FReLU [
13] function. Compared with ReLU, FReLU is better suited to vision tasks and further improves classification accuracy. Moreover, in our experiments the Mahalanobis distance classifies better than the Euclidean distance in FSL, because its covariance matrix removes the influence of differing variances and scales among feature components. We therefore use the Mahalanobis distance instead of the Euclidean distance to measure the distance between samples.
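A minimal NumPy sketch shows why the Mahalanobis distance can behave differently from the Euclidean distance; the 2-D feature vectors and the covariance matrix below are hand-picked for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between feature vectors x and y
    under covariance matrix cov."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Two feature vectors whose second component differs greatly,
# but that component also has a large variance.
x = np.array([1.0, 10.0])
y = np.array([2.0, 30.0])
cov = np.array([[1.0, 0.0],
                [0.0, 100.0]])  # high variance in the second dimension

d_m = mahalanobis(x, y, cov)     # sqrt(1 + 400/100) = sqrt(5)
d_e = float(np.linalg.norm(x - y))
# The Mahalanobis distance down-weights the high-variance dimension,
# so d_m is much smaller than the Euclidean distance d_e here.
assert d_m < d_e
```

This down-weighting of high-variance components is the behavior referred to above: the covariance matrix normalizes away scale differences that would dominate a plain Euclidean distance.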
Finally, we combine the Efficient Channel Attention (ECA) [
14] with our backbone network (ResNet-12) into ECAResNet-12, which introduces very few extra parameters and negligible computation while yielding a performance gain. ECAResNet-12 better extracts image feature information: a one-dimensional convolution efficiently achieves local cross-channel interaction, capturing inter-channel dependencies while avoiding dimensionality reduction, which further improves the classification performance of the GNN.
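The cross-channel interaction described above can be sketched as follows. This NumPy toy version (with a hypothetical 8-channel feature map and a hand-picked kernel size k = 3) only illustrates the ECA idea; it is not the actual ECAResNet-12 implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def eca(feature_map, kernel):
    """Sketch of Efficient Channel Attention on a (C, H, W) feature map.
    kernel: 1-D weights of length k for local cross-channel interaction."""
    gap = feature_map.mean(axis=(1, 2))           # global average pool -> (C,)
    attn = np.convolve(gap, kernel, mode="same")  # 1-D conv across channels,
                                                  # no dimensionality reduction
    weights = sigmoid(attn)                       # per-channel attention in (0, 1)
    return feature_map * weights[:, None, None]   # rescale each channel

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 5, 5))  # hypothetical 8-channel feature map
kernel = np.ones(3) / 3.0              # hypothetical k = 3 averaging kernel
out = eca(fmap, kernel)
assert out.shape == fmap.shape
```

Because the attention is a short 1-D convolution over the pooled channel vector, the extra cost is only k weights per block, which matches the claim that ECA adds few parameters and negligible computation.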
The remainder of this paper is organized as follows.
Section 2 describes related work.
Section 3 focuses on the few-shot task and introduces the framework of WPGN in detail.
Section 4 presents the results of the WPGN comparison experiments, ablation studies, and a practical application example, demonstrating the effectiveness of the WPGN for FSL.
4. Experiment
4.1. Experimental Environment and Datasets
The experimental environment of this paper is shown in
Table 4.
We selected three types of standard datasets in FSL: MiniImageNet [
9], CUB-200-2011 [
32] and CIFAR-FS [
4]. The details of the images, classes, training/validation/test set divisions and the image resolutions of each dataset are shown in
Table 5.
As illustrated in
Figure 6, the image features of four different bird classes in the CUB-200-2011 dataset are similar and difficult to distinguish.
4.2. Experimental Settings
The WPGN uses cyclic computation to construct its network structure, comprising the point graph and the word embedding distribution graph; mutual updating between these two graphs is the key feature of the WPGN. The total number of layers therefore affects the final classification results. To find the layer number that best fits the network structure, we trained the WPGN on the CUB-200-2011 dataset with different layer numbers and recorded the classification accuracy of each model. The experimental results are shown in
Figure 7.
Here, the abscissa represents the number of layers: zero means no cyclic calculation, and one means a single cycle. When the layer number increases from zero to five, classification accuracy rises by nearly 17%. Beyond five layers, however, accuracy plateaus and oscillates slightly. We therefore chose five as the final layer number of the WPGN.
To more intuitively show the impact of different layer numbers on the WPGN's classification accuracy, the labeled classes [1, 2, 3, 4, 5] were selected for the experiment, and a heat map was used to show how classification accuracy changes as the number of layers increases.
The brighter parts indicate higher confidence.
Figure 8a uses no cyclic calculation; hence, its classification accuracy is low, resulting in fuzzy predictions and a greater chance of predicting the wrong label.
Figure 8e has five layers; apart from the ground-truth location, the other parts are darker, meaning that the probability of an accurate prediction is much higher than that of an erroneous one.
The parameter settings obtained in the WPGN are shown in
Table 6 and listed by experiment.
4.3. Evaluation
In this paper, classification accuracy was used to evaluate model performance: the higher the accuracy, the better the model. We randomly selected n = 10,000 tasks and report the mean accuracy with its 95% confidence interval. The accuracy of a task is the proportion of query samples that are classified correctly.
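Under this protocol, the reported mean accuracy and 95% confidence interval over the sampled tasks can be computed as in the following sketch; the per-task accuracies here are synthetic values generated purely for illustration, not the paper's results.

```python
import numpy as np

def mean_and_ci95(task_accuracies):
    """Mean accuracy and 95% confidence-interval half-width over n tasks,
    using the normal approximation (1.96 * standard error)."""
    accs = np.asarray(task_accuracies, dtype=float)
    n = accs.size
    mean = accs.mean()
    half_width = 1.96 * accs.std(ddof=1) / np.sqrt(n)
    return mean, half_width

# Hypothetical per-task accuracies for n = 10,000 sampled episodes.
rng = np.random.default_rng(0)
accs = rng.normal(loc=0.78, scale=0.08, size=10_000).clip(0.0, 1.0)

mean, hw = mean_and_ci95(accs)
print(f"accuracy = {mean:.4f} +/- {hw:.4f}")
```

With 10,000 episodes the standard error is small, which is why few-shot papers can report tight confidence intervals despite the high per-episode variance.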
4.4. Experimental Results
For this study, we used ConvNet, ResNet-12 and ECAResNet-12 as the feature-extraction backbone networks on three tasks: 5-way-1 shot/2 shot/5 shot. The experimental results on the CUB-200-2011 dataset are shown in Table 7.
As can be seen from
Table 7, the classification accuracy of the WPGN is higher than that of the other methods under all three backbone networks and all three tasks. With the ECAResNet-12 backbone, the accuracy of the WPGN on the 5-way-1 shot, 5-way-2 shot, and 5-way-5 shot tasks improves by nearly 9.0%, 4.5%, and 4.1%, respectively, over the DPGN. The accuracy of the WPGN under the 5-way-2 shot task is approximately 2% higher than that of the DPGN under 5-way-5 shot. These results demonstrate that the WPGN is robust in fine-grained classification.
The experimental results on the MiniImagenet and CIFAR-FS datasets are shown in Figure 9.
Here, DPGN Conv denotes the DPGN with a ConvNet feature-extraction backbone, WPGN ResNet denotes the WPGN with a ResNet-12 backbone, and WPGN ECARes denotes the WPGN with an ECAResNet-12 backbone. From
Figure 9, we can see that on both the MiniImagenet and CIFAR-FS datasets, the classification accuracy of the WPGN is higher than that of the DPGN on all three tasks. Moreover, with the ECAResNet-12 backbone, the classification performance is better than with ConvNet or ResNet-12. The experiments demonstrate that the WPGN performs better on datasets with fewer obfuscating features: the accuracy on the CIFAR-FS dataset was lower than on MiniImagenet because its backgrounds have a much smaller impact on classification accuracy.
Moreover, compared with the DPGN, our model has lower computational overhead while improving accuracy. FSL is trained in task units; for each task, the first layer of the DPGN distribution graph requires a large number of calculations to initialize, whereas the first layer of the WPGN's word distribution graph only needs to obtain the word vectors of the corresponding categories, so initialization completes much faster. The time used for the same number of training steps of the WPGN and DPGN is shown in
Table 8. For the same number of rounds of training, WPGN requires far fewer calculations than DPGN.
In addition, in terms of training rounds, as shown on the left side of
Figure 10, the loss of the WPGN converges significantly faster than that of the DPGN, which shows that the WPGN requires less total training time. The WPGN converged within 12,000 rounds, so we then reduced its learning rate for further optimization. The DPGN requires at least 15,000 rounds to converge before its learning rate can be reduced; when we tried reducing the learning rate of the DPGN at 12,000 rounds, its accuracy dropped by ~2%. The right side of
Figure 10 shows that, compared with the DPGN, the WPGN converges faster and achieves significantly higher test accuracy. At the same time, after many rounds of training, the performance of the WPGN does not noticeably decline, which indicates that the model is more robust and less prone to overfitting.
Because the WPGN outperforms the DPGN in both computational overhead and accuracy, it holds good promise for real-world applications.
4.5. Ablation Studies
To verify the validity of each component of the WPGN, we added word embedding, the Mahalanobis distance, FReLU, and ECA to the baseline model one by one. In the baseline model, the distance measure is the Euclidean distance and the activation function is Leaky ReLU. The results of the ablation experiments on the CUB-200-2011 and CIFAR-FS datasets under 5-way-1 shot tasks are shown in
Table 9.
As the table shows, adding the word embedding distribution graph increased the classification accuracy on the two datasets by 7.23% and 2.1%. Using the Mahalanobis distance in the similarity calculation increased accuracy by ~0.4%. The FReLU activation function also improved the model's accuracy; although its contribution is smaller than that of the first two innovations, it still helps. Finally, integrating the ECA attention module into ResNet-12 increased accuracy by 1.2%. For these two datasets, each of the four innovations described in this paper improved the classification accuracy of the model.
4.6. Practical Application Example
To demonstrate the potential of the WPGN in practical applications, we applied the trained WPGN to the classification of specific rare birds. In this case, seven species of rare birds from bird habitats were selected, as shown in
Figure 11, of which two belong to storks (upper part) and five to cranes (bottom). The similarity between these birds is high even though they belong to different categories, and it is difficult for anyone who is not a professional ornithologist to distinguish the seven species. Generally speaking, compared with common image classification, fine-grained classification faces images with more similar appearance characteristics, along with interference factors in the collection such as posture, illumination, viewing angle, occlusion, and background, leading to small inter-class differences and large intra-class differences. By using category labels, the WPGN can first increase the distance between storks and cranes: the word-vector distance between storks and cranes is greater than the distances between subcategories. Second, within the subcategories of cranes or storks, the word vectors also separate the classes well according to their labels. Finally, the image information is embedded into the GNN to classify the birds with the aid of semantic information.
The example contains 350 images of the seven species, and we used the WPGN trained on the CUB-200-2011 dataset to test it on a 7-way-1 shot task. The WPGN achieves an accuracy of 82.45% on this task, whereas our baseline model achieves 72.14%. Importantly, the semantic information is obtained without manual tagging. This example illustrates the large potential of the WPGN in practical applications.