Article

Long-Tail Instance Segmentation Based on Memory Bank and Confidence Calibration

1 Beijing Key Laboratory of Information Service Engineering, Beijing Union University, Beijing 100101, China
2 College of Robotics, Beijing Union University, Beijing 100101, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(18), 9366; https://doi.org/10.3390/app12189366
Submission received: 12 July 2022 / Revised: 8 August 2022 / Accepted: 15 September 2022 / Published: 19 September 2022

Abstract

In the field of computer vision, training a well-performing model on a dataset with a long-tail distribution is a challenging task. Image resampling is usually introduced as a simple and effective remedy. However, in instance segmentation, a single image may contain objects of several classes, so image resampling alone cannot produce a sufficiently balanced distribution at the level of object instances. In this paper, we propose an improved instance segmentation method for long-tail datasets based on Mask R-CNN. Specifically, an object-centric memory bank implements an object-level storage and resampling strategy that addresses the category imbalance. In the testing phase, a post-processing calibration adjusts the logit of each class to change the confidence scores, which raises the prediction scores of tail classes. A discrete cosine transform (DCT)-based mask representation is used to obtain high-quality masks, which improves segmentation accuracy. Evaluation on the LVIS dataset demonstrates the effectiveness of the proposed method, which improves the AP performance of EQL by 2.2%.

1. Introduction

Object detection and instance segmentation are basic tasks in computer vision. Over the past decades, many studies have approached these tasks from different perspectives. With the rapid development of neural networks, detection and segmentation of common objects such as pedestrians or traffic signs have achieved unprecedented breakthroughs. The usual route to a better-performing model is therefore to collect a large-scale dataset for training. However, objective factors always prevent the collection of the same number of samples for each category. For example, when creating an animal dataset, one may only be able to collect one-hundredth or even one-thousandth as many images of the Northeast tiger as of common animals such as cats and dogs, owing to the species' scarcity and harsh living conditions. In such a dataset, cats and dogs form the head categories and Northeast tigers form a tail category, and the presence of tail categories is very detrimental to model training [1,2,3]. How to solve this problem has therefore become a popular research topic in computer vision.
Currently, effective methods include class-balanced resampling [4,5], fine-tuning, meta-learning, causal reasoning [6], and few-shot learning techniques such as feature normalization. In addition, repeat factor sampling (RFS) [7] has become a standard, simple resampling method. However, RFS ignores the fact that an image often contains multiple objects of different categories, so resampling an image that contains tail-class objects also duplicates the frequent-class objects in it. As a result, it cannot balance the number of instances per class in the training set. This directly causes the model to assign larger confidence scores to head classes in the testing phase, and it makes the classifier assign head-class labels to segmented objects more often.
In this paper, to improve the resampling of small samples in long-tailed datasets, especially for instance segmentation, and to enhance the performance of long-tailed instance segmentation models, we propose an object-centric dynamic memory bank combined with a post-processing calibration. The main contributions are as follows:
  • First, because a long-tail dataset is by definition sample-imbalanced, a memory bank for rare classes is constructed to address the imbalance. We use the forward-propagated region-of-interest (RoI) features to build a dynamic memory bank that stores these features together with the bounding-box coordinates. The feature information of the tail classes is thus stored and reused in a targeted manner, enabling efficient tail-class resampling without further forward or backward propagation.
  • Second, a post-processing calibration adjusts the logit of each class, changing the confidence scores across classes and yielding more accurate predictions.
The instance segmentation task also places high demands on the quality of mask generation, so this paper seeks to improve not only the sample imbalance in long-tailed instance segmentation but also mask quality. Mask R-CNN cannot capture target edge details when generating a low-resolution mask, and its training complexity grows when generating a high-resolution mask. To reduce the computation and improve mask quality at the same time, a mask method based on the discrete cosine transform (DCT) is therefore used. In summary, this paper proposes an object-centric dynamic memory bank with post-processing calibration and applies the DCT to mask generation. Together, these address the small-sample imbalance of long-tail instance segmentation datasets and the quality of the generated masks, so that the proposed method can be applied well to long-tail instance segmentation scenarios.
This paper is organized as follows: Section 2 introduces the related work; Section 3 describes the details of the proposed method; the experiment and results are presented in Section 4. The conclusion and future work of this study are presented in Section 5.

2. Related Work

2.1. Long-Tail Data Detection and Segmentation

At present, methods for solving the problems of long-tail datasets can be divided into three categories: resampling, re-weighting, and two- or multi-stage methods. Resampling strategies [7] change the number of tail- and head-class samples in a long-tail dataset to balance the categories. RFS, one example of this approach, resamples images containing tail-class labels to reduce category imbalance and is often used as a baseline and in the fine-tuning stage (a sketch of its repeat factor computation follows). Re-weighting strategies [8,9,10,11] assign different weights to training samples of different classes to counter the dominance of head-class features in long-tail datasets. Some researchers use two- or multi-stage methods [12,13]: generally, a feature extraction model is first trained on the original long-tail dataset, and the number of images per category is then balanced using resampling. A model trained in this way can outperform a basic model, but it is complex and takes a long time to train. In contrast to the general resampling method, this paper proposes a mixed sampling strategy.
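For concreteness, the following minimal sketch reproduces the RFS repeat factor computation described in [7]; the function name and the image-label data layout are illustrative and not from the original implementation.

```python
import math

def repeat_factors(image_labels, t=0.001):
    """Image-level repeat factors for RFS, following [7] (sketch).

    image_labels: list of sets, the category ids present in each image.
    t: frequency threshold below which a category is oversampled.
    """
    num_images = len(image_labels)
    # f(c): fraction of training images that contain category c.
    freq = {}
    for labels in image_labels:
        for c in labels:
            freq[c] = freq.get(c, 0) + 1
    # Category-level repeat factor: r(c) = max(1, sqrt(t / f(c))).
    r_cat = {c: max(1.0, math.sqrt(t * num_images / n)) for c, n in freq.items()}
    # Image-level factor: the max over the categories in the image, so an
    # image is repeated as often as its rarest category requires.
    return [max((r_cat[c] for c in labels), default=1.0)
            for labels in image_labels]
```

Because the image-level factor is a maximum over all categories present, any head-class objects in a resampled image are duplicated along with the tail-class ones, which is the limitation the proposed memory bank targets.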

2.2. Few-Shot Learning

Few-shot learning is another way to address long-tail distribution problems. Snell [14] used features of similar classes to obtain better tail-class features. Zhang [15] used meta-learning to add more training samples to solve the few-shot learning problem; this is similar to resampling and can create unique samples more explicitly. GANs [16] are often used to synthesize completely new images to increase the number of samples. Moreover, research [17] uses meta-learning to generate features while bypassing the complexity of generating images. Similarly, the dynamic memory bank proposed in this paper focuses on the object level within images.

2.3. Confidence Calibration

A few models use confidence calibration to solve long-tail problems. Li [18] used classifier normalization as a baseline, but the results were not ideal. Tang [19] developed a calibration pipeline for causal inference with complicated training steps. In this study, the prediction scores are adjusted per category, and a single hyperparameter is estimated for the hundreds of classes in LVIS.
The experiments in [20] show that it is challenging to estimate multiple hyperparameters for tail classes. Therefore, in this paper, we use only a stable hyperparameter to ensure the stability of the training data and improve the accuracy of the results.

2.4. DCT

Widely used in computer vision [21], a DCT transforms RGB images in the spatial domain into components in the frequency domain. With the development of deep learning, many studies have investigated how to integrate this method into the deep-learning framework of computer vision. Ulicny [22] used a convolutional neural network (CNN) to classify images encoded using the DCT, Ehrlich [23] proposed a ResNet in the DCT domain, Lo [24] used DCT representations by feeding the rearranged DCT coefficients into a CNN for semantic segmentation, and Xu [25] studied object detection and instance segmentation in the field of frequency-domain learning, using DCT coefficient parameters as the input of the CNN model. In these studies, DCT was used to extract the features of the model input. In contrast, the proposed method uses DCT to improve mask quality and reduce computational complexity.

3. Proposed Method

Long-tail datasets are a challenge in computer vision. To cope with the very small number of tail-class samples and the huge number of head-class samples in a long-tailed dataset, resampling is usually applied to tail-class targets to balance the class distribution. In this paper, we propose an improved Mask R-CNN for instance segmentation on long-tail datasets. The contributions are divided into three parts: an object-centric dynamic memory bank that resamples at the object level, a post-processing confidence calibration that reduces the inflated scores of head-class objects, and a DCT-based mask representation that improves mask accuracy. These three parts are shown in the red box in Figure 1.

3.1. Dynamic Memory Bank

Mask R-CNN is used as the architectural framework in this paper. RoI features and bounding boxes are obtained from the fully connected layer before the classification and bounding-box regression branches; any RoI-level features can be placed in the memory bank. We set the key classes of the memory bank T to the tail classes so that tail classes are repeated; when necessary, the key classes can be set to any categories. To ensure that T contains only object-level samples, we store the RoI object features with their class labels and bounding-box coordinates. We denote the queue for each class c in T as q_c, where c belongs to the set of key classes. To improve efficiency and save space, each q_c can store at most v samples. Unlike a traditional queue, the queue used in the proposed method does not remove samples when they are drawn from it. Moreover, the memory bank is only used during training, and hence it adds no additional computational cost at inference.
The operations on the memory bank are push, dequeue, and sample. In training iteration P with batch B, all the objects present in image i are denoted o_{ij}, j ∈ {1, …, k}, and the class of each object is denoted r_{ij}. We write an RoI feature f and its bounding-box information b as the pair {f, b}. When a target to be collected is identified, {f, b} is pushed to the top of the class-c queue q_c in memory bank T. We iterate through all images and objects, adding the RoI and bounding-box information to T. Push processing is shown in Figure 2.
Dequeue processing is also shown in Figure 2. Whenever a queue q_c has no space left, a feature–box pair must be dequeued from q_c; the earliest sample added to q_c is the one removed. For example, when a pair {f, b}_p must be added to a full queue q_c, one pair is dequeued from the bottom of q_c, and then {f, b}_p is added.
As shown in Figure 2 (Sample), when a queue q_c is empty, a batch cannot sample class c from memory bank T. The first image of tail class c encountered during training quickly populates q_c; once q_c contains samples, any training batch can draw additional samples from it. For each image I_i in batch B, extra object-level samples can be appended to the batch when needed. A minimal sketch of these three operations follows.
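The following Python sketch illustrates the push, dequeue, and sample semantics just described, under the stated assumptions (bounded FIFO queues keyed by class, and sampling without removal); the class and method names are illustrative.

```python
import random
from collections import defaultdict, deque

class DynamicMemoryBank:
    """Bounded per-class queues of (RoI feature, bounding box) pairs.

    Illustrative sketch: each key (tail/common) class c owns a queue q_c
    holding at most `max_size` pairs; pushing to a full queue evicts the
    oldest pair (dequeue), and sampling copies pairs without removing them.
    """

    def __init__(self, key_classes, max_size=60):
        self.key_classes = set(key_classes)
        # deque(maxlen=...) silently drops the oldest entry when full,
        # which implements the dequeue-from-the-bottom behavior.
        self.queues = defaultdict(lambda: deque(maxlen=max_size))

    def push(self, cls, roi_feature, box):
        # Only key classes are stored, keeping the bank object-centric.
        if cls in self.key_classes:
            self.queues[cls].append((roi_feature, box))

    def sample(self, cls, k):
        # Sampling does not dequeue: stored pairs remain available, so
        # tail objects can be reused across many training batches.
        q = self.queues[cls]
        return random.choices(list(q), k=k) if q else []

# Illustrative use: push tail-class RoI features during the forward pass,
# then append up to 20 sampled pairs to each training batch (Section 4.1).
bank = DynamicMemoryBank(key_classes={3, 17}, max_size=60)
```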
Dynamic resampling strategy: As shown in Figure 3, an effective resampling strategy cannot resample images or objects alone. Image resampling is limited by the head-class data that is duplicated along with the tail classes, and object resampling is limited by how often tail-class objects appear in each batch, so neither is suitable on its own. The two methods must therefore work together in a complementary way, which we call joint resampling. Image resampling updates the feature repository frequently by detecting tail-class targets in the images (Figure 3, RFS). Object resampling pulls tail- and common-class targets from the memory bank to increase the sample size of these two groups without increasing the number of head-class samples. This is implemented by augmenting each batch: up to a fixed number of stored features from the memory bank (20 by default) are appended to the batch. Moreover, the memory bank is only updated when a batch contains an image that itself includes the target object.

3.2. Post-Processing Calibration

The post-processing confidence calibration is applied to a pre-trained classifier sub-network. We consider Faster R-CNN [26] and Mask R-CNN [27] as examples. These networks apply a softmax classifier over C + 1 classes to object proposals, where C is the number of foreground classes and the extra class denotes the background. To prevent the scores from being biased toward frequent classes [28], the logit of each class can be rescaled according to its number of training instances. The key to the method is to keep the logit of the background class unchanged: because the background class and the object classes have different meanings, the value of the background logit does not affect the ranking among the object classes.
In the task of instance segmentation, a standard classification loss function is the cross-entropy loss function, expressed as follows:
L_{CE}(x, y) = -\sum_{c=1}^{C+1} y_c \log\left( p(c \mid x) \right)    (1)
s_c = p(c \mid x) = \frac{\exp(\phi_c(x))}{\sum_{c'=1}^{C} \exp(\phi_{c'}(x)) + \exp(\phi_{C+1}(x))}    (2)
where \phi_c(x) is the logit of class c, usually implemented as \omega_c^T f_\theta(x); \omega_c is a linear classifier for class c; f_\theta is the feature network; and C + 1 indexes the background class. Moreover, N_c denotes the number of training instances of class c, and the primary purpose of this paper is to address the imbalance in N_c at the post-processing stage. Training with Equation (1) tends to give higher scores to frequent categories.
To implement a simple post-processing calibration, the approach proposed in this paper is to scale down the exponentiated logit of each class c in proportion to its class size [29]. After adjusting the logits, the confidence scores are recomputed and normalized over all classes, including the background class, and the label assignment for each object is determined. Before scaling, Equation (2) can be decomposed into Equation (3).
s_c = p(c \mid x) = \frac{\sum_{c'=1}^{C} \exp(\phi_{c'}(x))}{\sum_{c'=1}^{C} \exp(\phi_{c'}(x)) + \exp(\phi_{C+1}(x))} \times \frac{\exp(\phi_c(x))}{\sum_{c'=1}^{C} \exp(\phi_{c'}(x))}    (3)
In contrast to [20], instead of tuning each class individually by a specific factor, the method proposed in this paper sets the factor as a function of the class size following [29], leaving only one hyperparameter to tune. Concretely, a positive factor a_c (Equation (5)) scales the exponentiated logit of \phi_c to obtain the following:
s_c = p(c \mid x) = \frac{\exp(\phi_c(x)) / a_c}{\sum_{c'=1}^{C} \exp(\phi_{c'}(x)) / a_{c'} + \exp(\phi_{C+1}(x))}    (4)
According to Equation (4), if a_c increases monotonically with N_c, the scores of head classes are effectively suppressed. In addition, when the parameter λ in the following equation is 0, Equation (4) reduces to the confidence score of Equation (2).
a_c = N_c^{\lambda}, \quad \lambda \geq 0    (5)
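A minimal NumPy sketch of Equations (4) and (5), assuming the raw logits and per-class instance counts are available at test time; the function name is illustrative:

```python
import numpy as np

def calibrate_scores(logits, class_counts, lam=0.6):
    """Post-processing calibration of Equations (4) and (5) (sketch).

    logits: shape (C + 1,); the last entry is the background class.
    class_counts: N_c for the C foreground classes.
    lam: the single hyperparameter lambda; lam = 0 recovers Equation (2).
    """
    a = np.asarray(class_counts, dtype=np.float64) ** lam      # Eq. (5)
    e = np.exp(np.asarray(logits, dtype=np.float64))
    # Scale each foreground exponentiated logit by 1 / a_c; the background
    # logit is kept unchanged, as required by the method.
    scaled = np.concatenate([e[:-1] / a, e[-1:]])
    return scaled / scaled.sum()                               # Eq. (4)
```

Since the calibration touches only the final scores, it requires no retraining and adds negligible cost at inference.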

3.3. Post-Processing Calibration of the Dynamic Object-Centric Memory Bank

We use an example with three categories in one image (see Figure 4), where the columns correspond to the rare (r), common (c), and frequent (f) classes and the background (bg). Suppose two proposals are found in the image: proposal A (PA) with scores [0.00, 0.30, 0.60, 0.10] and a true label belonging to the frequent class, and proposal B (PB) with scores [0.30, 0.10, 0.50, 0.10] and a true label belonging to the rare class. Assume that the occurrence frequencies of the three classes in this image are N_1 = 1, N_2 = 3, and N_3 = 5. According to Equation (4), the two proposals obtain new scores [0.00, 0.10, 0.12, 0.10] and [0.30, 0.03, 0.10, 0.10], and the final results follow from normalization, as shown by the red probability values in Figure 4. The confidence scores of the different categories are thus adjusted so that PB is assigned to its true rare class, which demonstrates the effectiveness of this method.
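The numbers in this example can be checked with a few lines of code, assuming the listed scores play the role of the exponentiated logits and a_c = N_c (i.e., λ = 1):

```python
import numpy as np

# Scores for [rare, common, frequent, background], with N = [1, 3, 5].
for scores in ([0.00, 0.30, 0.60, 0.10], [0.30, 0.10, 0.50, 0.10]):
    s = np.array(scores)
    s[:3] = s[:3] / np.array([1.0, 3.0, 5.0])  # divide by a_c = N_c
    print(np.round(s, 2), "->", np.round(s / s.sum(), 2))
# Adjusted scores: [0.00 0.10 0.12 0.10] and [0.30 0.03 0.10 0.10];
# after normalization, PB's top class switches from frequent to rare.
```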

3.4. DCT

In the instance segmentation field, the quality of the mask is a crucial metric, and this also holds for instance segmentation on long-tail datasets. The binary grid mask representation is widely used in instance segmentation. Increasing the resolution of the binary mask from 28 × 28 to 128 × 128 improves mask quality; however, the higher resolution also significantly increases computational complexity, which hinders model training. Therefore, the method proposed in this paper uses a DCT mask to improve mask quality while reducing computational complexity. This process is shown in Figure 5.
The DCT method encodes the binary ground-truth mask M_gt into a compressed vector and reconstructs the mask by decoding this vector. For the binary ground-truth mask M_gt ∈ R^{H×W} (H and W are the height and width of the mask, respectively), we resize it with bilinear interpolation to M_{K×K} ∈ R^{K×K}, where K × K is the mask size; we set K to 128 when applying the DCT in Mask R-CNN. As shown in Figure 5, four convolutional layers extract the mask features (configured in the same way as in Mask R-CNN), and three fully connected layers regress the DCT mask vector. The output dimension of the first two fully connected layers is 1024, and the output dimension of the last layer equals the dimension of the DCT mask vector.
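A minimal NumPy/SciPy sketch of this encode–decode pipeline, assuming K = 128 and a 300-dimensional vector as in Section 4.1; the anti-diagonal ordering used here is a simplification of the usual zig-zag scan, and the function names are illustrative:

```python
import numpy as np
from scipy.fft import dctn, idctn

def lowfreq_order(k):
    # Anti-diagonal order: earlier indices are lower-frequency coefficients
    # (a simplification of the zig-zag scan).
    return sorted(((i, j) for i in range(k) for j in range(k)),
                  key=lambda p: (p[0] + p[1], p[0]))

def encode_mask(mask_kxk, n_dim=300):
    """Compress a K x K binary mask into an n_dim DCT vector (sketch)."""
    coeffs = dctn(mask_kxk.astype(np.float64), norm="ortho")
    return np.array([coeffs[i, j]
                     for i, j in lowfreq_order(mask_kxk.shape[0])[:n_dim]])

def decode_mask(vector, k=128):
    """Reconstruct a binary mask from the truncated DCT vector."""
    coeffs = np.zeros((k, k))
    for v, (i, j) in zip(vector, lowfreq_order(k)):
        coeffs[i, j] = v
    return (idctn(coeffs, norm="ortho") >= 0.5).astype(np.uint8)
```

Because most of a mask's energy lies in the low-frequency coefficients, a 300-dimensional vector can represent a 128 × 128 mask far more cheaply than regressing all 16,384 grid cells.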

4. Experiments

4.1. Dataset and Experiment Details

LVIS is a large-vocabulary dataset for object detection and instance segmentation. It contains 164,000 images and 2 million high-quality instance segmentations for over 1000 object classes. Based on the number of images per category, the categories are divided into three groups: rare (tail) classes (1–10 images per class), common classes (11–100 images per class), and frequent classes (>100 images per class). The dataset was not collected against a predefined category list; instead, images were collected first and then labeled according to the natural distribution of the targets in them. In contrast to automated machine annotation, this large amount of manual annotation effectively captures the naturally occurring long-tail distribution of the images.
In this paper, we use the LVIS v0.5 training set for training and the LVIS v0.5 validation set for testing. We treat the tail and common classes with at most 30 images as key classes, giving a total of 706 classes in the memory bank. The images were resized during training so that their shorter and longer edges were 800 and 1333 pixels, respectively. The training platform uses 8 Titan V GPUs with an initial learning rate of 0.02 and momentum SGD with a weight decay of 0.0001; the model is trained for 90,000 iterations, with the learning rate decayed at 60 k and 80 k iterations. The RPN uses 256 regions per image for training. The memory bank has a fixed maximum size to constrain the memory it occupies, set to 60 to balance memory and performance, and up to 20 samples from the memory bank classes are added to each batch.
In this paper, we choose a 128 × 128 mask size and a 300-dimensional DCT mask vector as the default mask representation. In Table 1, we compare the l1 loss and smooth l1 loss for the mask with different weights λ_mask. The results show that l1 and smooth l1 perform similarly and that the DCT mask is robust to the corresponding weights. The best combination, l1 loss with λ_mask = 0.007, is selected. The value of a_c is set with reference to Equation (5), which controls the dependence of a_c on N_c; according to the experiments, λ = 0.6 achieves better results without consuming too many computational resources.
The metric AP_bbox was used to comprehensively evaluate each method; it measures the accuracy of the detected boxes, and the subscripts “r”, “c”, and “f” denote the rare (tail), common, and frequent categories, respectively. AP_m is the mask accuracy.

4.2. Ablation Experiment

In this study, ablation experiments were conducted for each method using the indicators described above; the results are shown in Table 2. First, we validated the post-processing confidence calibration. With a ResNet50 backbone, it raises AP_bbox and AP_r above the Mask R-CNN baseline by 2.8% and 9.1%, respectively; with a ResNet101 backbone, AP_bbox improves by 2.9% and AP_r by 9.8%. These results show that the post-processing confidence calibration improves the prediction scores of tail and common classes by changing the confidence levels of the different classes, which effectively mitigates the instance segmentation problem on long-tail datasets.
Table 2 also presents the effect of the dynamic memory bank. The results show that the accuracy of the model is greatly improved on the tail and common classes: with a ResNet50 backbone, the model improves AP_r by 14.1% and AP_c by 5.7% compared with the baseline. This illustrates that the dynamic memory bank effectively resamples tail-class objects, substantially increasing AP_r and AP_c as well as AP_bbox.
Finally, we analyze the effectiveness of the combined methods. With a ResNet50 backbone, the proposed method yields AP_bbox, AP_r, and AP_c results that are 5.1%, 12.6%, and 6.7% better, respectively, than the baseline. However, compared with the dynamic memory bank alone, AP_r is slightly lower: the memory bank increases the frequency of the tail classes, which leads the calibration to over-reduce the confidence of some rare classes at the later stage. Nevertheless, the substantial improvement of the other three indicators, especially AP_c, indicates the effectiveness and complementarity of the methods.
In addition, we evaluated the effectiveness of the DCT by plotting the training loss curves. As shown in Figure 6, the models with and without DCT were trained on different backbone networks and all converged; however, the training curve of the model with DCT is smoother and less volatile.

4.3. Main Experiments

This study analyzed experimental results obtained on the LVIS v0.5 dataset. Several baselines were evaluated first, including Mask R-CNN, the most representative model in instance segmentation; RFS, which applies only resampling on top of Mask R-CNN; and EQL, which modifies only the cross-entropy loss of Mask R-CNN. Finally, we compared our results with those obtained by combining the two methods (EQL+RFS). The results are presented in Table 3. Compared with the Mask R-CNN baseline, the proposed method gains 5% in accuracy with ResNet50 as the backbone and 5.5% with ResNet101. Compared with EQL+RFS, AP_m improves by 0.8% and 1.9% on the two backbones, respectively. In addition, a good long-tail model must rank true positives higher so that they fall within the per-image detection limit; as shown in Figure 7, our model performs much better than the baseline in this respect. We also compare the AP_r of the proposed method with that of EQL+RFS: the index improves significantly, by 1.9% and 3.3%, respectively, which shows the effectiveness of this method on long-tail data. This suggests that combining memory-bank-centered resampling with post-processing confidence calibration makes the model attend to tail-class objects and perform well in detection and segmentation.
A visualization of the experimental results is shown in Figure 8. EQL+RFS has problems with incorrect recognition and inaccurate mask edges. For example, only our model identifies an instance of “bulletin_board” in the first group. In the second group, our model identifies a “monitor” class object with the clearest mask edge, whereas the other models incorrectly identify it as a “laptop_monitor” with an incorrect mask region. Our model also successfully identifies the “vent” class object in the third group and successfully identifies the tail class “ball_cap”, which was not identified by the first two methods. In the fourth group, it successfully identifies “helmet” and “glove”, which were not identified by the first two methods. In the fifth group, it successfully identifies “shoe” and “sandal”, and in the sixth group, it successfully identifies the “banana” with clear mask edges.
These results show that the proposed method effectively detects and segments the tail-class data. However, as shown in Figure 9, because the memory bank cannot hold all tail-class targets in the dataset, only targets seen within recent batches are available for resampling; storing every target of a class would make the model oversized and inefficient. Therefore, some cases of misidentification or missed detection remain.
The proposed method was also compared with other advanced models using the same metrics as above. First, it was compared with two-stage decoupled training methods; the results are shown in Table 4. The experimental results reveal that these methods improve the overall accuracy mainly by enhancing tail-class segmentation accuracy. Among them, LWS achieves limited improvements because it only learns a scaling factor to adjust the decision boundary of the classifier without changing the classifier itself. Compared with BAGS, the proposed method achieves the same AP_bbox with the ResNet50 backbone and a 1.4% higher AP_bbox with both the ResNet101 and X101 backbones. Moreover, the proposed model is trained jointly with the classifier, and no additional fine-tuning stage is required.
In addition, the proposed method was compared with end-to-end training models. Relative to models using the original cross-entropy loss, the proposed method achieves 2.2% and 0.5% improvements in accuracy on the two backbone networks, respectively. Moreover, the improvements of 2.9% and 2.9% in mask accuracy are substantially better than the results of the above two models.
Finally, the proposed method was compared with other more recent models, including methods for incremental learning [30] and hierarchical classification [31]. The experimental results show that our model performs substantially better than LST, which illustrates its clear advantage. Compared with Forest R-CNN, it is slightly inferior with ResNet50 as the backbone but slightly superior with ResNet101. We also compared the proposed method with Forest R-CNN using X101 as the backbone: the proposed model achieves a 0.4% higher AP_bbox and a 2.2% higher AP_m. The results with different backbones illustrate that the proposed model has a significant advantage over general instance segmentation models and solves the problems of long-tail datasets better than some recent state-of-the-art models.

5. Conclusions

In this paper, we first systematically analyzed the category imbalance problem in long-tail datasets and the unsuitability of conventional deep-learning methods for them, noting that the problem is particularly significant for the segmentation task. To address the difficulties of long-tail instance segmentation, we proposed a dynamic memory bank for tail classes that performs object-level resampling of tail-class objects in images. Furthermore, a post-processing calibration method was proposed to correct the confidence imbalance caused by the uneven number of targets during training. Finally, to compensate for the low accuracy of mask edges, a DCT-based binary mask representation was used to further improve mask quality.
The effectiveness of the proposed model under different settings is demonstrated by the ablation experiments. A remaining challenge is that the accuracy gains come at the cost of long training time, and improving training efficiency needs to be addressed in the future. Nevertheless, the experimental results show that the proposed method effectively addresses long-tailed instance segmentation and improves the quality of mask generation, achieving an accuracy 2.2% higher than that of the EQL baseline.

Author Contributions

Conceptualization, X.F. and W.P.; methodology, X.F. and T.L. (Teng Liu); software, T.L. (Teng Liu) and T.L. (Tianjiao Liang); validation, X.F. and T.L. (Teng Liu); formal analysis, X.F. and H.B.; investigation, X.F. and T.L. (Teng Liu); resources, X.F., T.L. (Tianjiao Liang), W.P. and H.B.; data curation, T.L. (Teng Liu); writing—original draft preparation, X.F.; writing—review and editing, X.F. and W.P.; visualization, T.L. (Teng Liu) and H.L.; supervision, W.P. and H.B.; project administration, W.P.; funding acquisition, H.B. and W.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Nos. 61802019, 61932012, 61871039, 61906017 and 62006020), the Academic Research Projects of Beijing Union University (No. ZK10202202), the Beijing Municipal Education Commission Science and Technology Program (Nos. KM201911417003, KM201911417009 and KM201911417001), the Beijing Union University Research and Innovation Projects for Postgraduates (No.YZ2020K001), and the Premium Funding Project for Academic Human Resources Development in Beijing Union University under Grant BPHR2020DZ02.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tan, J.; Zhang, G.; Deng, H.; Wang, C.; Lu, L.; Li, Q.; Dai, J. 1st place solution of lvis challenge 2020: A good box is not a guarantee of a good mask. arXiv 2020, arXiv:2009.01559. [Google Scholar]
  2. Zhang, S.; Li, Z.; Yan, S.; He, X.; Sun, J. Distribution alignment: A unified framework for long-tail visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2361–2370. [Google Scholar]
  3. Liang, T.; Bao, H.; Pan, W.; Pan, F. Traffic sign detection via improved sparse R-CNN for autonomous vehicles. J. Adv. Transp. 2022, 2022, 3825532. [Google Scholar] [CrossRef]
  4. Kang, B.; Xie, S.; Rohrbach, M.; Yan, Z.; Gordo, A.; Feng, J.; Kalantidis, Y. Decoupling representation and classifier for long-tailed recognition. arXiv 2019, arXiv:1910.09217. [Google Scholar]
  5. Zhou, B.; Cui, Q.; Wei, X.S.; Chen, Z.M. Bbn: Bilateral-branch network with cumulative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9719–9728. [Google Scholar]
  6. Liu, Y.; Wei, Y.; Yan, H.; Li, G.; Lin, L. Causal Reasoning with Spatial-temporal Representation Learning: A Prospective Study. arXiv 2022, arXiv:2204.12037. [Google Scholar]
  7. Gupta, A.; Dollar, P.; Girshick, R. LVIS: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5356–5364. [Google Scholar]
  8. Hsieh, T.I.; Robb, E.; Chen, H.T.; Huang, J.B. DropLoss for long-tail instance segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35. [Google Scholar]
  9. Tan, J.; Wang, C.; Li, B.; Li, Q.; Ouyang, W.; Yin, C.; Yan, J. Equalization loss for long-tailed object recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11662–11671. [Google Scholar]
  10. Wang, J.; Zhang, W.; Zang, Y.; Cao, Y.; Pang, J.; Gong, T.; Lin, D. Seesaw loss for long-tailed instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9695–9704. [Google Scholar]
  11. Cao, K.; Wei, C.; Gaidon, A.; Arechiga, N.; Ma, T. Learning imbalanced datasets with label-distribution-aware margin loss. Adv. Neural Inf. Processing Syst. 2019, 32, 1567–1578. [Google Scholar]
  12. Ren, J.; Yu, C.; Ma, X.; Zhao, H.; Yi, S. Balanced meta-softmax for long-tailed visual recognition. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Vancouver, BC, Canada, 2020; Volume 33, pp. 4175–4186. [Google Scholar]
  13. Wang, T.; Zhu, Y.; Zhao, C.; Zeng, W.; Wang, J.; Tang, M. Adaptive class suppression loss for long-tail object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3103–3112. [Google Scholar]
  14. Snell, J.; Swersky, K.; Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems; 2017; Volume 30, Available online: https://proceedings.neurips.cc/paper/2017/hash/cb8da6767461f2812ae4290eac7cbc42-Abstract.html (accessed on 5 August 2022).
  15. Zhang, R.; Che, T.; Ghahramani, Z.; Bengio, Y.; Song, Y. Metagan: An adversarial approach to few-shot learning. In Advances in Neural Information Processing Systems; 2018; Volume 31, Available online: https://proceedings.neurips.cc/paper/2018/hash/4e4e53aa080247bc31d0eb4e7aeb07a0-Abstract.html (accessed on 5 August 2022).
  16. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  17. Wang, Y.X.; Girshick, R.; Hebert, M.; Hariharan, B. Low-shot learning from imaginary data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7278–7286. [Google Scholar]
  18. Li, Y.; Wang, T.; Kang, B.; Tang, S.; Wang, C.; Li, J.; Feng, J. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10991–11000. [Google Scholar]
  19. Tang, K.; Huang, J.; Zhang, H. Long-tailed classification by keeping the good and removing the bad momentum causal effect. Adv. Neural Inf. Processing Syst. 2020, 33, 1513–1524. [Google Scholar]
  20. Dave, A.; Dollár, P.; Ramanan, D.; Kirillov, A.; Girshick, R. Evaluating large-vocabulary object detectors: The devil is in the details. arXiv 2021, arXiv:2102.01066. [Google Scholar]
  21. Ravì, D.; Bober, M.; Farinella, G.M.; Guarnera, M.; Battiato, S. Semantic segmentation of images exploiting DCT based features and random forest. Pattern Recognit. 2018, 52, 260–273. [Google Scholar] [CrossRef]
  22. Ulicny, M.; Dahyot, R. On using CNN with DCT based image data. In Proceedings of the 19th Irish Machine Vision and Image Processing Conference (IMVIP), Maynooth University, Maynooth, Ireland, 30 August–1 September 2017; Volume 2, pp. 1–8. [Google Scholar]
  23. Ehrlich, M.; Davis, L.S. Deep residual learning in the jpeg transform domain. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 3484–3493. [Google Scholar]
  24. Lo, S.Y.; Hang, H.M. Exploring semantic segmentation on the dct representation. In Proceedings of the ACM Multimedia Asia, Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  25. Xu, K.; Qin, M.; Sun, F.; Wang, Y.; Chen, Y.K.; Ren, F. Learning in the frequency domain. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1740–1749. [Google Scholar]
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Processing Syst. 2015, 28, 1137–1149. [Google Scholar]
  27. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  28. Wang, T.; Li, Y.; Kang, B.; Li, J.; Liew, J.; Tang, S.; Feng, J. The devil is in classification: A simple framework for long-tail instance segmentation. In Proceedings of the 16th European conference on computer vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 728–744. [Google Scholar]
  29. Menon, A.K.; Jayasumana, S.; Rawat, A.S.; Jain, H.; Veit, A.; Kumar, S. Long-tail learning via logit adjustment. arXiv 2020, arXiv:2007.07314. [Google Scholar]
  30. Hu, X.; Jiang, Y.; Tang, K.; Chen, J.; Miao, C.; Zhang, H. Learning to segment the tail. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14045–14054. [Google Scholar]
  31. Wu, J.; Song, L.; Wang, T.; Zhang, Q.; Yuan, J. Forest R-CNN: Large-vocabulary long-tailed object detection and instance segmentation. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1570–1578. [Google Scholar]
Figure 1. Framework of the proposed method.
Figure 2. Dynamic object-centric memory bank operation.
Figure 3. Specific resampling strategy in memory banks.
Figure 4. Post-processing calibration application.
Figure 5. DCT network.
Figure 6. Effect of DCT on training loss.
Figure 7. AP with different number of image detections.
Figure 8. Visualization of experimental results.
Figure 9. Limitations of the model in this paper. The left column is the ground-truth label, and the right column is the detection result of the proposed method. The areas indicated by the red arrows show the limitations of the proposed method.
Table 1. The results of loss with different weights.

Loss        Weight   LVIS AP
l1          0.005    39.3
l1          0.006    39.4
l1          0.007    39.6
l1          0.008    39.2
smooth l1   0.004    39.1
smooth l1   0.005    39.2
smooth l1   0.006    39.5
smooth l1   0.007    39.3
Table 2. Results of ablation experiments.

Model        Backbone   Post-Processing Calibration   Memory Bank   AP_bbox   AP_r   AP_c   AP_f
Mask R-CNN   Res50      ×                             ×             20.8      3.1    19.4   29.8
Ours         Res50      √                             ×             23.6      12.2   22.7   29.3
Ours         Res50      ×                             √             25.6      17.2   25.1   29.8
Ours         Res50      √                             √             25.9      15.7   26.1   30.4
Mask R-CNN   Res101     ×                             ×             22.3      3.8    20.9   30.8
Ours         Res101     √                             ×             25.2      13.6   24.4   31.0
Ours         Res101     ×                             √             27.3      19.1   26.8   31.2
Ours         Res101     √                             √             27.5      18.4   27.9   31.0

“√”: the component in this column is used; “×”: the component in this column is not used.
Table 3. Comparison of the results on LVIS v0.5.

Model        Backbone   AP_bbox   AP_r   AP_c   AP_f   AP_segm
Mask R-CNN   Res50      20.8      3.1    19.4   29.8   21.2
RFS          Res50      -         14.5   24.3   28.4   24.4
EQL          Res50      23.6      8.5    23.9   29.3   24.0
EQL+RFS      Res50      25.4      16.0   25.4   29.1   26.1
Ours         Res50      25.8      17.9   24.8   30.1   26.9
Mask R-CNN   Res101     22.3      3.8    20.9   30.8   22.8
EQL          Res101     25.9      9.2    26.9   31.1   25.9
EQL+RFS      Res101     27.1      15.9   27.9   30.6   27.4
Ours         Res101     27.8      19.2   27.3   31.9   29.3
Table 4. Comparison with state-of-the-art methods on the LVIS v0.5 validation set.

Method         Backbone   AP_bbox   AP_r   AP_c   AP_f   AP_segm
Mask R-CNN     Res50      20.8      3.1    19.4   29.8   21.2
LST            Res50      -         -      -      -      23.0
LWS            Res50      -         14.4   24.4   26.8   24.1
EQL            Res50      23.6      8.5    23.9   29.3   24.0
RFS            Res50      -         14.5   24.3   28.4   24.4
Forest R-CNN   Res50      25.9      18.3   26.4   27.6   25.6
BAGS           Res50      25.8      18.0   26.9   28.7   26.3
Ours           Res50      25.8      17.9   24.8   30.1   26.9
Mask R-CNN     Res101     22.3      4.3    22.7   30.2   -
BAGS           Res101     26.4      16.8   25.8   30.9   -
ACSL           Res101     27.3      19.3   27.6   30.7   27.5
Forest R-CNN   Res101     27.5      20.1   27.9   28.3   26.9
Ours           Res101     27.8      19.2   27.3   31.9   29.3
BAGS           X101       27.8      18.8   27.3   32.1   -
Forest R-CNN   X101       28.8      20.6   29.2   31.7   28.5
Ours           X101       29.2      19.5   29.0   33.5   30.7
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
