Article

Margin and Shared Proxies: Advanced Proxy Anchor Loss for Out-of-Domain Intent Classification

Department of Industrial and Systems Engineering, Dongguk University, Seoul 04620, Republic of Korea
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(6), 2312; https://doi.org/10.3390/app14062312
Submission received: 25 January 2024 / Revised: 4 March 2024 / Accepted: 7 March 2024 / Published: 9 March 2024
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract

Out-of-Domain (OOD) intent classification is an important task for a dialog system, as it allows appropriate responses to be generated. Previous studies aiming to solve the OOD intent classification task have generally adopted metric learning methods to generate decision boundaries in the embedding space. However, these existing methods struggle to capture the high-dimensional semantic features of data, as they learn decision boundaries using scalar distances. They also rely on generated OOD samples for training; however, such samples are biased and cannot cover all real-world OOD intents, which is a further limitation. In the current paper, we attempt to overcome these challenges by using Advanced Proxy-Anchor loss, which introduces a margin proxy and a shared proxy. First, to generate a decision boundary that reflects the high-dimensional semantic features of the training data, we use a margin proxy, a learnable embedding vector. Next, the shared proxy, which is shared by all In-Domain (IND) samples, is introduced to make it possible to learn the discriminative features between IND intents and the OOD intent, ultimately leading to the improved classification of OOD samples. We conduct evaluations of the proposed method using three benchmark datasets. The experimental results demonstrate that our method achieves improved performance compared to the methods described in previous studies.

1. Introduction

Recently, research examining conversation systems has been actively conducted, which has made the accurate classification of user utterance intent increasingly important [1,2,3,4]. In particular, a task-oriented dialog system classifies the utterances that are input by the user into pre-defined intents; this is an essential function for a conversation process [5,6,7]. For example, as shown in Figure 1, a banking system classifies each user utterance as a banking intent such as a deposit, withdrawal, or balance check, and then proceeds with the corresponding follow-up task. However, if a user inputs an utterance whose intent does not match any of the pre-defined intents (e.g., asking a banking chatbot to make a hotel reservation), it is nevertheless classified as one of the pre-defined banking intents. Intents that are not pre-defined in the dialog system are referred to as OOD intents. It is therefore essential for a dialog system to use an OOD intent classification technique to prepare for situations in which a user inputs OOD intent utterances [8].
To solve this problem, many researchers regard the OOD intent classification task as a (k + 1)-class classification task. For example, Scheirer et al. [9] propose the concept of open space risk as an evaluation metric for OOD classification. Intuitively, open space risk means that data that have not been encountered during the training process can be located anywhere in the embedding space. Zhang et al. [10] propose an Adaptive Decision Boundary (ADB) method that optimizes the decision boundary. To reduce the open space risk, that method limits the space in which OOD samples can exist inside the decision boundary by making the intra-class variance small and the inter-class variance large. However, since OOD intents can be located anywhere in the embedding space, this method is vulnerable in cases where OOD samples are embedded inside the In-Domain (IND) decision boundary. Moreover, the decision boundaries generated through ADB have difficulty learning the high-dimensional semantic features of data embeddings. To solve this problem, Zhou et al. [11] propose the KNN-Contrastive learning method, which is designed to avoid the risk of OOD samples being located inside the IND decision boundary. However, it is unstable when sampling random k neighbors. Lang et al. [12] generate OOD samples by utilizing a self-supervised learning method and propose an Adaptive Soft pseudo labeling (ASoul) method that labels each sample with a soft labeling technique rather than one-hot encoding. However, the OOD samples generated in this way are biased, and they cannot cover all real-world OOD intents [11,13,14].
In this paper, we propose Advanced Proxy-Anchor loss, which captures the high-dimensional semantic features of data and learns the discriminative features between OOD intents and IND intents without needing to use OOD intent samples in training. Our method performs an OOD intent classification task by introducing a margin proxy and shared proxy to the Proxy-Anchor loss [15], which is an existing distance-based classification approach. The proxy, which is a concept that was introduced by Movshovitz-Attias et al. [16] in the metric learning approach, is an embedding vector that learns the distances from samples and represents each class of the training dataset.
In previous studies [10,17], the decision boundary has been learned by directly calculating the scalar distance from the center of each IND intent. However, since these methods only learn whether each sample is located inside the decision boundary, it is difficult for them to capture high-dimensional features. To overcome this limitation, we utilize the margin proxy in the proposed model. The margin proxy is an embedding vector that determines the decision boundary and learns a high-dimensional semantic distance by adaptively adjusting its distance to the centroid proxy as well as its distance to positive samples. Margin proxies are created in a one-to-one correspondence with the centroid proxies representing each IND intent (like the moon orbiting the earth).
Moreover, to learn the discriminative features between IND intents and the OOD intent, we introduce a shared proxy that represents a class to which all IND samples belong. Existing methods learn from single-label samples, but our method utilizes a simple multi-label scheme in which each sample belongs to both its positive class and the shared class. IND samples are trained to reduce their distances from both the positive IND centroid proxy and the shared proxy, and OOD samples can be clearly classified in the embedding space, as only their distance from the shared proxy is reduced. The proposed method does not require any additional classifiers or any changes to the model structure.
In this paper, through experiments using three intent classification benchmark datasets, we confirmed that the proposed method achieves an improved performance compared to previous works. The contributions of our study can be summarized as follows:
  • We propose an Advanced Proxy-Anchor loss to conduct the OOD intent classification task.
  • Our method, which introduces new concepts of the margin proxy and shared proxy, effectively captures high-dimensional features to determine decision boundaries and learns discriminative features between IND intents and the OOD intent.
  • The results of experiments performed on three benchmark datasets demonstrate that our method achieves significant improvements compared to baseline outcomes.

2. Related Work

2.1. OOD Detection

Many recent works have attempted to solve the problem of OOD detection. For example, MSP [18] uses the maximum Softmax probability to detect OOD intents. GOT [19] detects OOD intents by utilizing energy scores aligned with the density of the inputs. SCL [20] and LMCL [21] use supervised contrastive learning and margin loss, respectively, to minimize intra-class variance and maximize inter-class variance. ADB [10] detects OOD samples by learning adaptive decision boundaries between IND intents and the OOD intent. Outlier [22] generates OOD samples using a pre-trained model to learn the difference between IND and OOD. ODIST [23] creates pseudo OOD samples using pre-trained language models and includes them in the learning stage. KNN-Contrastive [11] uses KNN-contrastive learning to extract discriminative semantic features that can aid OOD classification. ASoul [12] performs (k + 1)-class classification using a semi-supervised learning method with adaptive soft pseudo labels (ASoul) for OOD intent classification; that method also uses a feature generator to produce pseudo OOD samples for training.

2.2. Proxy-Anchor Loss

Proxy-Anchor loss, proposed by Kim et al. [15], is one of the proxy-based metric learning methods [16,24] that introduce a proxy, a learnable parameter. It is mainly used for image retrieval, face recognition, and few-shot learning tasks [25]. This method creates as many proxies as there are classes in the training dataset. In the embedding space, each proxy is trained to have a short distance from positive samples and a long distance from negative samples, thus allowing the model to classify effectively. Proxy-Anchor loss overcomes the disadvantages of previous proxy-based methods [26,27,28,29] because it effectively learns the semantic relationships between data points by learning the relationship between proxies and all data in a batch. Moreover, since it does not involve a pair sampling process, its computational complexity is low, and it avoids the problem of pair-based methods [30,31], in which learning can be hindered when too-easy negative data are sampled (i.e., when the loss easily becomes 0).

3. Proposed Method

To improve OOD classification performance, we propose Advanced Proxy-Anchor loss, which captures high-dimensional semantic features to determine decision boundaries and then learns the discriminative features between IND intents and the OOD intent. The proposed method achieves these two goals by introducing two new concepts into Proxy-Anchor loss [15], the margin proxy and the shared proxy, both of which are learnable embedding vectors. The margin proxy learns not only the distance from the center of each IND intent, but also the distance from the samples, to generate a decision boundary. The proposed method can also effectively distinguish between IND intents and OOD intents through a multi-label scheme using a shared proxy.
This section first describes the method used to create an intent representation. Next, after reviewing Proxy-Anchor loss [15], the basis of the proposed method, we explain Advanced Proxy-Anchor loss. Finally, we present the method we use to classify intents in the embedding space based on the generated decision boundaries.

3.1. Intent Representation

To extract the intent representation of each sample, we use BERT [32], a pre-trained language model. Following the method of Lin et al. [33], we obtain the embedding vector $e_i = [[CLS], T_1, \ldots, T_N]$ of the i-th IND sentence through BERT. We then perform mean-pooling to obtain the intent representation $x_i$, as expressed in the following equation:

$$x_i = \text{Mean-pooling}([[CLS], T_1, \ldots, T_N]), \quad (1)$$

where $x_i \in \mathbb{R}^H$, $N$ is the sequence length, and $H$ is the hidden size of BERT.
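For illustration, this step can be sketched with the Hugging Face transformers library; the function and variable names below are illustrative, not from a released implementation:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def intent_representation(sentence: str) -> torch.Tensor:
    """Return the mean-pooled intent representation x_i in R^H (H = 768)."""
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=50)
    with torch.no_grad():
        outputs = bert(**inputs)
    # Mean-pool over all token positions ([CLS], T_1, ..., T_N),
    # ignoring padding via the attention mask.
    token_embeddings = outputs.last_hidden_state          # (1, N, H)
    mask = inputs["attention_mask"].unsqueeze(-1)         # (1, N, 1)
    x_i = (token_embeddings * mask).sum(1) / mask.sum(1)  # (1, H)
    return x_i.squeeze(0)
```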

3.2. Classic Proxy-Anchor Loss

Proxy-Anchor loss randomly initializes one proxy for each class and uses the created proxy as an anchor to learn the distance to all samples within a batch. The equation of Proxy-Anchor loss is as follows:
$$\mathcal{L} = \frac{1}{|P^+|} \sum_{p \in P^+} \log\left(1 + \sum_{x \in X_p^+} e^{-\alpha (s(x,p) - \delta)}\right) + \frac{1}{|P|} \sum_{p \in P} \log\left(1 + \sum_{x \in X_p^-} e^{\alpha (s(x,p) + \delta)}\right), \quad (2)$$

where $\delta > 0$ and $\alpha > 0$ are hyperparameters that respectively denote the margin and the scaling factor. $P$ is the set of all proxies, and $P^+$ is the set of positive proxies of the samples in a batch. Moreover, $X$ is the set of embedding vectors of all samples in a batch, $X_p^+$ is the set of positive samples of each proxy $p$, and $X_p^- = X \setminus X_p^+$. The similarity function $s$ denotes cosine similarity.
Of the two terms of Proxy-Anchor loss, the first intuitively pulls positive samples closer to each proxy in the embedding space, while the second pushes negative samples further away. Here, through the term $s(x,p) - \delta$, the cosine similarity between a positive sample and its proxy is made larger than the margin (i.e., the sample is made to be located inside the decision boundary). The margin of Proxy-Anchor loss is set directly by the researcher and is the same for all proxies, which limits its ability to reflect the specific characteristics of each IND intent.
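For concreteness, a compact PyTorch sketch of Equation (2) follows; tensor shapes and names are illustrative rather than taken from the official implementation:

```python
import torch
import torch.nn.functional as F

def proxy_anchor_loss(x, labels, proxies, alpha=32.0, delta=0.1):
    """A sketch of classic Proxy-Anchor loss (Kim et al. [15]).

    x:       (B, H) batch of embeddings
    labels:  (B,)   class index of each sample
    proxies: (C, H) one learnable proxy per class
    """
    # Cosine similarity s(x, p) between every sample and every proxy: (B, C).
    sim = F.linear(F.normalize(x), F.normalize(proxies))

    pos_mask = F.one_hot(labels, proxies.size(0)).bool()  # (B, C)
    neg_mask = ~pos_mask

    # First term: pull positive samples inside the margin of their proxy.
    pos_exp = torch.where(pos_mask,
                          torch.exp(-alpha * (sim - delta)),
                          torch.zeros_like(sim))
    # Second term: push negative samples beyond the margin.
    neg_exp = torch.where(neg_mask,
                          torch.exp(alpha * (sim + delta)),
                          torch.zeros_like(sim))

    with_pos = pos_mask.any(dim=0)  # proxies with positives in the batch (P+)
    pos_term = torch.log1p(pos_exp.sum(dim=0)[with_pos]).sum() / with_pos.sum()
    neg_term = torch.log1p(neg_exp.sum(dim=0)).sum() / proxies.size(0)
    return pos_term + neg_term
```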

3.3. Advanced Proxy-Anchor Loss

As implied by the name, Advanced Proxy-Anchor loss is designed to overcome the limitations of previous OOD intent classification studies by achieving advancements over Proxy-Anchor loss [15]. Our proposed method utilizes three types of proxies: a centroid proxy, a margin proxy, and a shared proxy.
The centroid proxy $p$ represents each class, as in Proxy-Anchor loss [15]; $p_k \in \mathbb{R}^H$ is a learnable embedding vector that represents the centroid proxy of the k-th IND intent.
The margin proxy $m_k \in \mathbb{R}^H$ determines the decision boundary through its distance to the corresponding centroid proxy $p_k$. The concept of a margin also exists in Proxy-Anchor loss; however, since that margin is a hyperparameter set by the researchers, several experiments must be conducted to find its optimal value [17]. Moreover, although various studies have attempted to learn the margin that determines the decision boundary in the OOD detection task [10,34,35], they set a low-dimensional distance itself as the learning parameter, limiting their ability to capture the features of the high-dimensional embedding vector. Since each margin proxy in our method is a vector of the same dimension as the data embedding, our method captures the semantic features of each sample more effectively than previous approaches.
Only one shared proxy exists in the embedding space, and it is assigned a margin proxy in the same manner as each centroid proxy. The shared proxy $p_{shared} \in \mathbb{R}^H$ is treated as a positive proxy for all IND samples during training; that is, every sample is in a multi-label state, belonging to both its original centroid proxy and the shared proxy. Chun et al. [36] intentionally overfit a model when training data are sparse. Inspired by this method, we overfit the model so that all samples are located inside the decision boundary of the shared proxy. However, since the model is trained with multi-label samples, it can overfit to the shared proxy while still learning the semantic features of the intent to which each sample originally belongs; in other words, the model is only weakly overfitted. With this intentional overfitting, IND samples are located inside the decision boundaries of both the shared proxy and their original proxy. Furthermore, since OOD samples do not share the semantic features of IND intents, they are located only inside the decision boundary of the shared proxy.
Advanced Proxy-Anchor loss is formulated as follows:
$$\mathcal{L} = \frac{1}{|P^+|} \sum_{p \in P^+} \log\left(1 + \sum_{x \in X_p^+} e^{-\alpha M_{pos}}\right) + \frac{1}{|P|} \sum_{p \in P} \log\left(1 + \sum_{x \in X_p^-} e^{\alpha M_{neg}}\right), \quad (3)$$

where the notation of each variable is the same as that of Proxy-Anchor loss, and $M_{pos}$ and $M_{neg}$ are, respectively, defined as follows:

$$M_{pos} = \begin{cases} s(x, m), & \text{if } d > 0 \\ d, & \text{if } d \leq 0, \end{cases} \quad (4)$$

$$M_{neg} = \begin{cases} d, & \text{if } d > 0 \\ s(x, m), & \text{if } d \leq 0, \end{cases} \quad (5)$$

where $m$ is the margin proxy corresponding to the centroid proxy $p$, and $d$ is defined as:

$$d = s(x, p) - s(m, p), \quad (6)$$
where $d$ is the difference between the similarity of the sample to the centroid proxy and the similarity of the margin proxy to the centroid proxy; this value determines whether the sample is located inside the decision boundary. Advanced Proxy-Anchor loss not only places each sample inside the decision boundary of its positive proxy, but also shrinks the decision boundary, limiting the space in which OOD samples can be located. Here, the decision boundary is made small but still large enough to contain the samples. We divide the samples into intra-samples and inter-samples according to whether they lie inside each decision boundary in the embedding space. Figure 2 shows the process used to train inter-samples and intra-samples in our proposed method.
First, since a positive intra-sample ($d > 0$ in Equation (4)) is already inside the decision boundary, it does not need to be moved further. In this case, the space in which OOD samples can exist is limited, as the distance between the margin proxy and the samples is made small but still large enough to include the intra-sample. By contrast, a negative intra-sample ($d > 0$ in Equation (5)) is in the wrong location and should be moved outside the decision boundary. Thus, Advanced Proxy-Anchor loss makes the distance between the sample and the centroid proxy greater than the distance between the sample and the margin proxy.
We now describe the inter-sample cases, in which a sample exists outside the positive or negative decision boundary. Since a positive inter-sample ($d \leq 0$ in Equation (4)) is outside the decision boundary, it is desirable for the sample to move inside the positive decision boundary. Therefore, the distance between the sample and the centroid proxy should become smaller than the distance between the sample and the margin proxy. In contrast, a negative inter-sample ($d \leq 0$ in Equation (5)) already exists outside the decision boundary. In this case, we increase the distance between the sample and the margin proxy to ensure that each decision boundary is distinct. If each of these cases were optimized separately, the decision boundary would become extremely large or small, and OOD intent classification could not be performed properly. However, since our method computes all the cases together, an appropriate decision boundary is obtained.
As shown by the gray decision boundary in Figure 2b, the decision boundary of the shared proxy is positive for all samples, so by the end of training it becomes large enough to include all samples in the embedding space. As mentioned earlier, since all IND samples are in a multi-label state that is positive for both their centroid proxy and the shared proxy, the two decision boundaries overlap with each other. However, since the embedding vector of an OOD sample does not have the features of any IND intent, it exists outside the decision boundaries of all IND proxies and is located either inside only the decision boundary of the shared proxy or outside all decision boundaries. Because the embedding space is structured in this way, our proposed method is effective both in learning the discriminative features between IND intents and the OOD intent and in distinguishing them in the embedding space.
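To make the piecewise definitions concrete, the following PyTorch sketch implements Equations (3)-(6) under one possible reading: the exponent signs follow classic Proxy-Anchor loss, the shared proxy is modeled as an extra class that is positive for every IND sample, and all names are illustrative rather than from a released implementation:

```python
import torch
import torch.nn.functional as F

def advanced_proxy_anchor_loss(x, labels, centroids, margins,
                               shared, shared_margin, alpha=64.0):
    """Sketch of Advanced Proxy-Anchor loss (Equations (3)-(6))."""
    # Append the shared proxy (and its margin proxy) as an extra class index C,
    # treated as positive for every IND sample (multi-label scheme).
    p = torch.cat([centroids, shared.unsqueeze(0)])       # (C+1, H)
    m = torch.cat([margins, shared_margin.unsqueeze(0)])  # (C+1, H)

    x_n, p_n, m_n = F.normalize(x), F.normalize(p), F.normalize(m)
    s_xp = x_n @ p_n.t()                 # s(x, p): (B, C+1)
    s_xm = x_n @ m_n.t()                 # s(x, m): (B, C+1)
    s_mp = (m_n * p_n).sum(-1)           # s(m, p): (C+1,)

    d = s_xp - s_mp                      # Equation (6), broadcast over the batch
    m_pos = torch.where(d > 0, s_xm, d)  # Equation (4)
    m_neg = torch.where(d > 0, d, s_xm)  # Equation (5)

    pos_mask = F.one_hot(labels, p.size(0)).bool()
    pos_mask[:, -1] = True               # shared class positive for all samples
    neg_mask = ~pos_mask

    # A production version would clamp the exponents for numerical stability.
    pos_exp = torch.where(pos_mask, torch.exp(-alpha * m_pos),
                          torch.zeros_like(d))
    neg_exp = torch.where(neg_mask, torch.exp(alpha * m_neg),
                          torch.zeros_like(d))

    active = pos_mask.any(dim=0)         # proxies with positives in the batch (P+)
    pos_term = torch.log1p(pos_exp.sum(0)[active]).sum() / active.sum()
    neg_term = torch.log1p(neg_exp.sum(0)).sum() / p.size(0)
    return pos_term + neg_term
```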

3.4. Classification with Decision Boundaries

As described in Section 3.3, after the decision boundaries are trained through our proposed method, classification is finally performed. The classification process to obtain the predicted intent $\hat{y}$ for each sample follows the equation:
$$\hat{y} = \begin{cases} \underset{k \in Y}{\operatorname{argmax}}\ s(x, p_k), & \text{if } d_k > 0 \\ \text{OOD}, & \text{otherwise}, \end{cases} \quad (7)$$
where Y is the entire set of IND intents. If the sample is located inside the decision boundary of intent k, it is classified as intent k. If the sample exists in the space where the decision boundaries of two intents overlap, it is predicted as the intent of the closer centroid proxy. If it is not located inside any of the decision boundaries, it is predicted as an OOD intent.
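This decision rule can be sketched as follows, reusing the notation of the previous sketch; the batched handling and the `ood_label` sentinel are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def predict_intent(x, centroids, margins, ood_label=-1):
    """Pick the closest centroid whose boundary contains the sample, else OOD."""
    x_n = F.normalize(x, dim=-1)
    p_n = F.normalize(centroids, dim=-1)
    m_n = F.normalize(margins, dim=-1)
    s_xp = x_n @ p_n.t()            # (B, K): similarity to each centroid proxy
    s_mp = (m_n * p_n).sum(-1)      # (K,):  boundary "radius" in similarity terms
    d = s_xp - s_mp                 # d_k > 0 <=> inside the boundary of intent k
    inside = d > 0
    # Among intents whose boundary contains the sample, choose the closest centroid.
    scores = s_xp.masked_fill(~inside, float("-inf"))
    preds = scores.argmax(dim=-1)
    # Samples inside no boundary are predicted as OOD.
    preds[~inside.any(dim=-1)] = ood_label
    return preds
```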

4. Experiments

4.1. Datasets

To verify the proposed method, we conduct experiments on the following three benchmark datasets. CLINC [37] is a dataset created for intent classification; it covers 150 intents across 10 domains, including banking, airlines, hotels, credit cards, insurance, loans, and restaurants. We follow the settings used by Lang et al. [12]. StackOverflow [38] is a dataset published on Kaggle.com and consists of 20 technical question intents. BANKING [39] is a dataset from the banking domain consisting of 77 intents. The detailed statistics are presented in Table 1.

4.2. Baselines

We compare Advanced Proxy-Anchor loss with the following previously reported OOD intent classification methods described in Section 2.1: MSP [18], DOC [40], OpenMax [41], LMCL [21], ADB [10], Outlier [22], SCL [20], GOT [19], ODIST [23] and ASoul [12].

4.3. Evaluation Metrics

Following many prior studies [10,11,12], we employ four metrics. First, we use the accuracy and F1-score over all IND and OOD intents to check the overall performance of the trained model (Acc-ALL, F1-ALL). Next, to separately check the model's performance on IND intent classification and OOD intent classification, we use the average F1-score over all IND intents (F1-IND) and the F1-score of the OOD intent (F1-OOD). All F1-scores are macro-averaged F1-scores.
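As a reference for how these four scores can be computed, the sketch below uses scikit-learn; encoding the OOD intent as one extra class id is an illustrative assumption:

```python
from sklearn.metrics import accuracy_score, f1_score

def ood_metrics(y_true, y_pred, ind_labels, ood_label):
    """Compute Acc-ALL, F1-ALL, F1-IND, and F1-OOD for one test run."""
    all_labels = list(ind_labels) + [ood_label]
    # Per-class F1 in the order given by `all_labels` (OOD last).
    per_class = f1_score(y_true, y_pred, labels=all_labels, average=None)
    return {
        "Acc-ALL": accuracy_score(y_true, y_pred),
        "F1-ALL": per_class.mean(),       # macro F1 over IND + OOD classes
        "F1-IND": per_class[:-1].mean(),  # macro F1 over IND intents only
        "F1-OOD": per_class[-1],          # F1 of the single OOD class
    }
```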

4.4. Experimental Settings

Following the settings used in previous studies [10,11,12], we randomly sample 25%, 50%, or 75% of the intents as IND and use them for training. The remaining unsampled intents are regarded as OOD and are used together with the sampled intents for validation and testing. We select the model with the highest F1-ALL performance on the validation set and test it accordingly.
We utilize the BERT-base model (bert-base-uncased) to generate the intent representation. The max sequence length of BERT is 50, the hidden size is 768, and the scaling factor α is 64. Moreover, we adopt the AdamW optimizer and set the learning rate to 2 × 10⁻⁵ for the sample embeddings, 2 × 10⁻⁴ for margin proxies, and 2 × 10⁻³ for centroid and shared proxies. The results of all experiments are reported as the average of 10 repeated runs with different random seeds in the intent sampling process.
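These per-parameter learning rates translate naturally into AdamW parameter groups. The sketch below assumes a model object exposing `bert`, `margin_proxies`, `centroid_proxies`, and `shared_proxy` attributes (illustrative names, not from a released implementation):

```python
import torch

# One parameter group per learning rate from Section 4.4.
optimizer = torch.optim.AdamW([
    {"params": model.bert.parameters(), "lr": 2e-5},  # sample embeddings (encoder)
    {"params": [model.margin_proxies], "lr": 2e-4},   # margin proxies
    {"params": [model.centroid_proxies,
                model.shared_proxy], "lr": 2e-3},     # centroid and shared proxies
])
```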

4.5. Results

Table 2 presents our experimental results. The performance figures of all baselines were taken from Zhang et al. [10], Zhou et al. [11], and Lang et al. [12]. Based on the results of our experiment, we can draw the following conclusions. First, our method improves significantly over ADB, which is similar to our method in that it focuses on decision boundary learning. This suggests that the decision boundary created by Advanced Proxy-Anchor loss is effective for OOD intent classification. Second, our method achieves state-of-the-art performance, surpassing ODIST and FM (Feature Mixup) + ASoul, both of which directly utilize OOD samples for training, in all settings except for one case. This indicates that our method successfully learns the discriminative features between IND intents and OOD intents without needing additional data. FM + ASoul, which shows the best performance at the CLINC dataset's 75% known intent ratio, forcibly distorts IND samples through Mixup [42] to generate OOD samples and uses them for learning. However, since this approach only generates samples that differ from the existing IND samples, the data are biased, and the method is limited in terms of real-world applicability [11,13,14]. Finally, in OOD intent classification, when the known intent ratio is low, the number of IND samples decreases, so it is common for F1-IND performance to drop compared to settings with a high known intent ratio. However, our method is relatively robust to changes in the known intent ratio compared to the baseline models, and its F1-IND scores show the largest improvement over the baselines. This shows that our method not only better separates IND intents from the OOD intent but also better classifies among the IND intents than the baseline models.

4.6. Ablation Studies

We conduct ablation studies to verify the effectiveness of the margin proxy and the shared proxy, which are components of the Advanced Proxy-Anchor loss. Each ablation model is as follows:
  • w/o Shared is a model in which the shared proxy is removed from the proposed method. That is, in this setting, all samples are learned in a single-label state rather than a multi-label state;
  • w/o Margin is a model without the margin proxy; it is a variant used to check the difference between setting the margin as a vector and as a scalar. This model uses the margin as a trainable scalar value rather than as an embedding vector;
  • w/o ALL is a setting in which all the components proposed in this study are removed and OOD intent classification is performed with the classic Proxy-Anchor loss [15].
As can be observed in Table 3, our method outperforms the ablation models in all settings. We can summarize the results of the ablation studies as follows. First, the w/o Margin model shows very poor results on all datasets apart from StackOverflow. This means that the w/o Margin model faces difficulty in determining the decision boundary for OOD intent classification, suggesting that it is effective to initialize the margin as a high-dimensional vector. Compared to the proposed model, the w/o Shared model exhibits an overall performance degradation, with a particularly significant decline in F1-OOD. These results demonstrate that the newly introduced multi-label scheme using a shared proxy is effective for detecting OOD intents. Comparing the w/o ALL model with the baselines in Table 2 and Table 3, we observe that it outperforms many of the baseline models. Altogether, these findings suggest that the Proxy-Anchor loss framework itself is effective for the OOD intent classification task.

4.7. Cluster Analysis

The analysis involves embedding the test set through the trained model and examining the distribution of the resulting vectors in the embedding space. As the model was trained with distance-based functions, clusters are expected to form in the embedding space for each intent. To evaluate the quality of these clusters, silhouette analysis [43] and the average separation [44] of clusters are utilized. The silhouette coefficient used in the silhouette analysis ranges from −1 to 1, where values closer to 1 indicate more successful clustering. The separation metric, which ranges between 0 and 1, is considered successful when it is close to 0.
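For reference, the silhouette part of this analysis can be reproduced with scikit-learn as sketched below; `test_embeddings` and `labels` are assumed to come from the trained encoder (with OOD as its own class), and the separation metric [44] is omitted:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# test_embeddings: (N, H) intent representations from the trained model
# labels: (N,) intent ids, with the OOD intent as one extra class
embeddings = np.asarray(test_embeddings)
score = silhouette_score(embeddings, labels, metric="cosine")
print(f"Silhouette coefficient: {score:.4f}")  # in [-1, 1]; closer to 1 is better
```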
Table 4 presents the silhouette coefficients and separation values for the proposed method and the variant models. Bold values indicate the best performance, while underlined values represent the second-best performance.
The evaluation results show that the proposed method outperforms the other models in silhouette coefficient and separation across most configurations. This suggests that the proposed method is effective in embedding data samples based on semantic distances. Moreover, the models without margin proxies (w/o Margin and w/o ALL) exhibit significantly worse (higher) separation values compared to the models utilizing margin proxies, indicating the effectiveness of margin proxies in separating intents. Although the w/o Margin model shows much worse separation than the w/o Shared model, there are cases where it surpasses the w/o Shared model in the silhouette coefficient. This implies that shared proxies can help distinguish OOD samples and enhance cohesion within the OOD class.
Figure 3 visualizes the embedding vectors of the StackOverflow test set at a 25% sampling rate. Dimensionality reduction to 2D is applied using t-SNE, and each color represents an intent class (red for Cocoa, orange for Oracle, green for Hibernate, blue for Scala, purple for OSX, and brown for the OOD intent).
Qualitatively examining the visualizations, we can observe that the proposed method generates the most appropriate embeddings. The w/o Shared model has more difficulty distinguishing between OOD and IND than the other models. It can also be confirmed that in w/o ALL and w/o Margin, which do not use the margin proxy, the embedding vectors belonging to a single class are more scattered than in the models that use the margin proxy.
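A visualization in the style of Figure 3 can be sketched as follows; `embeddings` is the test-set embedding matrix from the previous sketch, and `labels_named` is an assumed array of string intent labels:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Reduce the (N, H) embeddings to 2D; colors follow the Figure 3 caption.
points = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
for intent, color in [("Scala", "blue"), ("Oracle", "orange"),
                      ("Hibernate", "green"), ("Cocoa", "red"),
                      ("OSX", "purple"), ("OOD", "brown")]:
    mask = np.asarray(labels_named) == intent
    plt.scatter(points[mask, 0], points[mask, 1], s=5, c=color, label=intent)
plt.legend()
plt.savefig("stackoverflow_tsne.png")
```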

5. Conclusions

In this paper, we propose Advanced Proxy-Anchor loss, which introduces the concepts of the margin proxy and the shared proxy, both high-dimensional embedding vectors, into Proxy-Anchor loss for OOD intent classification. In doing so, we overcome problems encountered by existing studies. Prior methods could not sufficiently learn the semantic features of samples because they used a low-dimensional margin. They also had to use biased OOD samples during training to learn the discriminative semantic features between IND intents and the OOD intent. Unlike previous studies, the proposed method generates a decision boundary that reflects the semantic features of samples through the margin proxy, a learnable high-dimensional embedding vector. Moreover, through the multi-label scheme using a shared proxy, it learns the discriminative features between IND intents and the OOD intent without using OOD samples in training. The results of experiments on three benchmark datasets demonstrate that our proposed method overcomes the aforementioned limitations of prior methods and achieves SOTA performance in the OOD intent classification task. In the future, we plan to further validate our method using datasets from various domains, such as DialoGLUE [45] or CoLA [46], and to apply our method to various metric learning frameworks that can utilize the concept of a proxy. Furthermore, our method, as a metric learning approach using proxies, should benefit from lower computational costs compared to pair-based methods such as contrastive learning. Therefore, we plan to explore the superiority of the proposed method through experiments comparing its computational cost with that of various baselines.

Author Contributions

Conceptualization, J.P. and J.R.; methodology, J.P.; software, J.P.; validation, J.P. and B.K.; formal analysis, J.P.; investigation, J.P.; resources, J.P. and S.H.; data curation, J.P. and S.J.; writing—original draft preparation, J.P.; writing—review and editing, J.P. and B.K.; visualization, J.P.; supervision, J.R.; project administration, J.R.; funding acquisition, J.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (21163MFDS502) from the Ministry of Food and Drug Safety in 2024.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: CLINC at https://github.com/clinc/oos-eval (accessed on 29 January 2024), StackOverflow at https://github.com/jacoxu/StackOverflow (accessed on 29 January 2024), and BANKING at https://huggingface.co/datasets/banking77 (accessed on 29 January 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qin, L.; Che, W.; Li, Y.; Ni, M.; Liu, T. Dcr-net: A deep co-interactive relation network for joint dialog act recognition and sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8665–8672. [Google Scholar]
  2. Qin, L.; Che, W.; Li, Y.; Wen, H.; Liu, T. A stack-propagation framework with token-level intent detection for spoken language understanding. arXiv 2019, arXiv:1909.02188. [Google Scholar]
  3. Chen, Q.; Zhuo, Z.; Wang, W. Bert for joint intent classification and slot filling. arXiv 2019, arXiv:1902.10909. [Google Scholar]
  4. Min, Q.; Qin, L.; Teng, Z.; Liu, X.; Zhang, Y. Dialogue state induction using neural latent variable models. arXiv 2020, arXiv:2008.05666. [Google Scholar]
  5. Li, C.H.; Yeh, S.F.; Chang, T.J.; Tsai, M.H.; Chen, K.; Chang, Y.J. A conversation analysis of non-progress and coping strategies with a banking task-oriented chatbot. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 25–30 April 2020; pp. 1–12. [Google Scholar]
  6. Larson, S.; Leach, K. A Survey of Intent Classification and Slot-Filling Datasets for Task-Oriented Dialog. arXiv 2022, arXiv:2207.13211. [Google Scholar]
  7. Hasani, M.F.; Gaol, F.L.; Soewito, B.; Warnars, H.L.H.S. Deep learning and Threshold Probability for Out of Scope Intent Detection in Task Oriented Chatbot. In Proceedings of the 2022 3rd International Conference on Artificial Intelligence and Data Sciences (AiDAS), Ipoh, Malaysia, 7–8 September 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 322–327. [Google Scholar]
  8. Akbari, M.; Mohades, A.; Shirali-Shahreza, M.H. A Hybrid Architecture for Out of Domain Intent Detection and Intent Discovery. arXiv 2023, arXiv:2303.04134. [Google Scholar]
  9. Scheirer, W.J.; de Rezende Rocha, A.; Sapkota, A.; Boult, T.E. Toward Open Set Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1757–1772. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, H.; Xu, H.; Lin, T.E. Deep open intent classification with adaptive decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 2–9 February 2021; Volume 35, pp. 14374–14382. [Google Scholar]
  11. Zhou, Y.; Liu, P.; Qiu, X. KNN-contrastive learning for out-of-domain intent classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5129–5141. [Google Scholar]
  12. Lang, H.; Zheng, Y.; Sun, J.; Huang, F.; Si, L.; Li, Y. Estimating Soft Labels for Out-of-Domain Intent Detection. arXiv 2022, arXiv:2211.05561. [Google Scholar]
  13. Shafaei, A.; Schmidt, M.; Little, J.J. A less biased evaluation of out-of-distribution sample detectors. arXiv 2018, arXiv:1809.04729. [Google Scholar]
  14. Si, Q.; Liu, Y.; Meng, F.; Lin, Z.; Fu, P.; Cao, Y.; Wang, W.; Zhou, J. Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning. arXiv 2022, arXiv:2210.04563. [Google Scholar]
  15. Kim, S.; Kim, D.; Cho, M.; Kwak, S. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3238–3247. [Google Scholar]
  16. Movshovitz-Attias, Y.; Toshev, A.; Leung, T.K.; Ioffe, S.; Singh, S. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 360–368. [Google Scholar]
  17. Phan, N.; Tran, S.; Huy, T.D.; Duong, S.T.; Nguyen, C.D.T.; Bui, T.; Truong, S.Q. Adaptive Proxy Anchor Loss for Deep Metric Learning. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1781–1785. [Google Scholar]
  18. Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv 2016, arXiv:1610.02136. [Google Scholar]
  19. Ouyang, Y.; Ye, J.; Chen, Y.; Dai, X.; Huang, S.; Chen, J. Energy-based unknown intent detection with data manipulation. arXiv 2021, arXiv:2107.12542. [Google Scholar]
  20. Zeng, Z.; He, K.; Yan, Y.; Liu, Z.; Wu, Y.; Xu, H.; Jiang, H.; Xu, W. Modeling discriminative representations for out-of-domain detection with supervised contrastive learning. arXiv 2021, arXiv:2105.14289. [Google Scholar]
  21. Lin, T.E.; Xu, H. Deep unknown intent detection with margin loss. arXiv 2019, arXiv:1906.00434. [Google Scholar]
  22. Zhan, L.M.; Liang, H.; Liu, B.; Fan, L.; Wu, X.M.; Lam, A. Out-of-scope intent detection with self-supervision and discriminative training. arXiv 2021, arXiv:2106.08616. [Google Scholar]
  23. Shu, L.; Benajiba, Y.; Mansour, S.; Zhang, Y. Odist: Open world classification via distributionally shifted instances. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3751–3756. [Google Scholar]
  24. Teh, E.W.; DeVries, T.; Taylor, G.W. Proxynca++: Revisiting and revitalizing proxy neighborhood component analysis. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 448–464. [Google Scholar]
  25. Kaya, M.; Bilge, H.Ş. Deep metric learning: A survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
  26. Alhuzali, H.; Ananiadou, S. SpanEmo: Casting multi-label emotion classification as span-prediction. arXiv 2021, arXiv:2101.10038. [Google Scholar]
  27. Lee, J.; Lee, W. CoMPM: Context Modeling with Speaker’s Pre-trained Memory Tracking for Emotion Recognition in Conversation. arXiv 2021, arXiv:2108.11626. [Google Scholar]
  28. Zhu, L.; Pergola, G.; Gui, L.; Zhou, D.; He, Y. Topic-driven and knowledge-aware transformer for dialogue emotion detection. arXiv 2021, arXiv:2106.01071. [Google Scholar]
  29. Deng, D.; Chen, Z.; Shi, B.E. Multitask emotion recognition with incomplete labels. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), Buenos Aires, Argentina, 16–20 November 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 592–599. [Google Scholar]
  30. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  31. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning PMLR 2020, Online, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  32. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  33. Lin, T.E.; Xu, H.; Zhang, H. Discovering new intents via constrained deep adaptive clustering with cluster refinement. In Proceedings of the AAAI Conference on Artificial Intelligence 2020, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8360–8367. [Google Scholar]
  34. Fei, G.; Liu, B. Breaking the closed world assumption in text classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 506–514. [Google Scholar]
  35. Liu, X.; Li, J.; Mu, J.; Yang, M.; Xu, R.; Wang, B. Effective Open Intent Classification with K-center Contrastive Learning and Adjustable Decision Boundary. arXiv 2023, arXiv:2304.10220. [Google Scholar] [CrossRef]
  36. Chun, J.; Park, J.C.; Olberg, S.; Zhang, Y.; Nguyen, D.; Wang, J.; Kim, J.S.; Jiang, S. Intentional deep overfit learning (IDOL): A novel deep learning strategy for adaptive radiation therapy. Med. Phys. 2022, 49, 488–496. [Google Scholar] [CrossRef]
  37. Larson, S.; Mahendran, A.; Peper, J.J.; Clarke, C.; Lee, A.; Hill, P.; Kummerfeld, J.K.; Leach, K.; Laurenzano, M.A.; Tang, L.; et al. An evaluation dataset for intent classification and out-of-scope prediction. arXiv 2019, arXiv:1909.02027. [Google Scholar]
  38. Xu, J.; Wang, P.; Tian, G.; Xu, B.; Zhao, J.; Wang, F.; Hao, H. Short text clustering via convolutional neural networks. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, Denver, CO, USA, 5 June 2015; pp. 62–69. [Google Scholar]
  39. Casanueva, I.; Temčinas, T.; Gerz, D.; Henderson, M.; Vulić, I. Efficient intent detection with dual sentence encoders. arXiv 2020, arXiv:2003.04807. [Google Scholar]
  40. Shu, L.; Xu, H.; Liu, B. Doc: Deep open classification of text documents. arXiv 2017, arXiv:1709.08716. [Google Scholar]
  41. Bendale, A.; Boult, T.E. Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1563–1572. [Google Scholar]
  42. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  43. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
  44. Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227. [Google Scholar] [CrossRef] [PubMed]
  45. Mehri, S.; Eric, M.; Hakkani-Tur, D. Dialoglue: A natural language understanding benchmark for task-oriented dialogue. arXiv 2020, arXiv:2009.13570. [Google Scholar]
  46. Trotta, D.; Guarasci, R.; Leonardelli, E.; Tonelli, S. Monolingual and cross-lingual acceptability judgments with the Italian CoLA corpus. arXiv 2021, arXiv:2109.12053. [Google Scholar]
Figure 1. An example of a banking system. If a user utterance containing a pre-defined intent (blue text) is entered, the task proceeds normally. However, if an utterance with an OOD intent (red text) is entered, the system responds that it cannot proceed with banking.
Figure 2. The embedding space before and after training with Advanced Proxy-Anchor loss. Dashed arrows indicate the direction in which each vector moves. In (a), focusing on the red intent class, the red star outside the decision boundary (a positive inter-sample) moves to a location inside the red decision boundary, and the blue star inside the red decision boundary (a negative intra-sample) moves to the blue decision boundary. Moreover, the blue star outside the decision boundary (a negative inter-sample) moves away from the red margin proxy. Finally, the red margin proxy is trained to reduce its distance to the red stars (positive intra-samples) inside the red decision boundary. The shared proxy's decision boundary grows to enclose all samples and serves to include OOD samples. After this training process, the vectors ultimately position themselves as shown in (b).
Figure 3. Embedding vectors for StackOverflow's test data at a 25% sampling rate. Blue points represent Scala, orange points represent Oracle, green points represent Hibernate, red points represent Cocoa, purple points represent OSX, and brown points represent OOD samples.
Table 1. Statistics of the CLINC, StackOverflow (SO), and BANKING datasets.

Dataset | Train | Valid | Test | Intents
CLINC | 15,000 | 3000 | 5700 | 150
SO | 12,000 | 2000 | 6000 | 20
BANKING | 9003 | 1000 | 3080 | 77
Table 2. The performance of the Advanced Proxy-Anchor loss model and the baseline models. The best performances are indicated in bold and the second-best performances are underlined. Each dataset column reports Acc-ALL / F1-ALL / F1-OOD / F1-IND.

Known Ratio | Method | CLINC | StackOverflow | BANKING
25% | MSP | 47.02 / 47.62 / 50.88 / 47.53 | 28.67 / 37.85 / 13.03 / 42.82 | 43.67 / 50.09 / 41.43 / 50.55
25% | DOC | 74.97 / 66.37 / 81.98 / 65.96 | 42.74 / 47.73 / 41.25 / 49.02 | 56.99 / 58.03 / 61.42 / 57.85
25% | OpenMax | 68.50 / 61.99 / 75.76 / 61.62 | 40.28 / 45.98 / 36.41 / 47.89 | 49.94 / 54.14 / 51.32 / 54.28
25% | SCL | 75.01 / 65.45 / 81.92 / 65.01 | 62.08 / 61.01 / 67.99 / 59.61 | 70.82 / 64.82 / 77.28 / 64.17
25% | GOT | 72.63 / 64.01 / 72.63 / 64.01 | 65.02 / 62.26 / 65.02 / 62.26 | 63.05 / 63.49 / 63.05 / 63.49
25% | LMCL | 81.43 / 71.16 / 79.45 / 63.60 | 47.84 / 52.05 / 68.58 / 61.00 | 64.21 / 61.36 / 68.61 / 63.22
25% | ADB | 87.59 / 77.19 / 91.84 / 76.80 | 86.72 / 80.83 / 90.88 / 78.82 | 78.85 / 71.62 / 84.56 / 70.94
25% | Outlier | 88.44 / 80.73 / 92.35 / 80.43 | 68.74 / 65.64 / 74.86 / 63.80 | 74.11 / 69.93 / 80.12 / 69.39
25% | ODIST | 89.79 / 80.04 / 93.42 / 79.69 | 91.53 / 85.05 / 94.41 / 83.18 | 81.69 / 73.44 / 87.11 / 72.72
25% | FM + ASoul | 92.71 / 84.11 / 95.42 / 83.81 | 92.04 / 86.03 / 94.76 / 84.29 | 87.41 / 78.39 / 91.52 / 77.70
25% | Ours | 92.90 / 85.38 / 95.51 / 85.12 | 93.26 / 87.54 / 95.58 / 85.93 | 88.36 / 80.27 / 92.16 / 79.64
50% | MSP | 62.96 / 70.41 / 57.62 / 70.58 | 52.42 / 63.01 / 23.99 / 66.91 | 59.73 / 71.18 / 41.19 / 71.97
50% | DOC | 77.16 / 78.26 / 79.00 / 78.25 | 52.53 / 62.84 / 25.44 / 66.58 | 64.81 / 73.12 / 55.14 / 73.59
50% | OpenMax | 80.11 / 80.56 / 81.89 / 80.54 | 60.35 / 68.18 / 45.00 / 70.49 | 65.31 / 74.24 / 54.33 / 74.76
50% | SCL | 71.14 / 75.03 / 70.81 / 75.09 | 76.16 / 78.95 / 74.42 / 79.40 | 74.81 / 78.04 / 72.45 / 78.19
50% | GOT | 67.06 / 73.15 / 63.48 / 73.28 | 65.56 / 72.19 / 55.53 / 73.86 | 69.97 / 76.37 / 63.03 / 76.72
50% | LMCL | 83.35 / 82.16 / 85.85 / 82.11 | 58.98 / 68.01 / 43.01 / 70.51 | 72.73 / 77.53 / 69.53 / 77.74
50% | ADB | 86.54 / 85.05 / 88.65 / 85.00 | 86.40 / 85.83 / 87.34 / 85.68 | 78.86 / 80.90 / 78.44 / 80.96
50% | Outlier | 88.33 / 86.67 / 90.30 / 86.54 | 75.08 / 78.55 / 71.88 / 79.22 | 72.69 / 79.21 / 67.26 / 79.52
50% | ODIST | 88.61 / 86.57 / 90.62 / 86.52 | 88.52 / 87.35 / 89.57 / 87.13 | 80.90 / 81.78 / 81.32 / 81.79
50% | FM + ASoul | 89.96 / 88.20 / 91.72 / 88.15 | 88.92 / 88.24 / 89.69 / 88.09 | 81.98 / 83.96 / 81.65 / 84.03
50% | Ours | 90.38 / 88.62 / 92.08 / 88.57 | 89.14 / 88.44 / 90.02 / 88.28 | 83.70 / 84.93 / 83.87 / 84.96
75% | MSP | 74.07 / 82.38 / 59.08 / 82.59 | 72.17 / 77.95 / 33.96 / 80.88 | 75.89 / 83.60 / 39.23 / 84.36
75% | DOC | 78.73 / 83.59 / 72.87 / 83.69 | 68.91 / 75.06 / 16.76 / 78.95 | 76.77 / 83.34 / 50.60 / 83.91
75% | OpenMax | 76.80 / 73.16 / 76.35 / 73.13 | 74.42 / 79.78 / 44.87 / 82.11 | 77.45 / 84.07 / 50.85 / 84.64
75% | SCL | 76.50 / 82.65 / 66.90 / 82.79 | 79.91 / 84.41 / 63.79 / 85.78 | 78.45 / 84.09 / 56.19 / 84.57
75% | GOT | 72.65 / 81.49 / 54.11 / 81.73 | 77.76 / 81.85 / 52.80 / 83.79 | 77.11 / 83.36 / 48.30 / 83.97
75% | LMCL | 83.71 / 86.23 / 81.15 / 86.27 | 72.33 / 78.28 / 37.59 / 81.00 | 78.52 / 84.31 / 58.54 / 84.75
75% | ADB | 86.32 / 88.53 / 83.92 / 88.58 | 82.78 / 85.99 / 73.86 / 86.80 | 81.08 / 85.96 / 66.47 / 86.29
75% | Outlier | 88.08 / 89.43 / 86.28 / 89.46 | 81.71 / 85.85 / 65.44 / 87.22 | 81.07 / 86.98 / 60.71 / 87.47
75% | ODIST | 87.70 / 89.30 / 85.86 / 89.33 | 83.75 / 86.88 / 75.21 / 87.66 | 82.79 / 86.94 / 71.95 / 87.20
75% | FM + ASoul | 89.88 / 91.38 / 88.21 / 91.41 | 85.00 / 87.90 / 75.76 / 88.71 | 84.47 / 88.39 / 72.64 / 88.66
75% | Ours | 89.41 / 91.13 / 87.49 / 91.16 | 86.68 / 89.12 / 78.68 / 89.82 | 84.62 / 88.67 / 73.00 / 88.94
Table 3. The performances of the proposed method and the ablation models. The best performances are indicated in bold, and the second-highest performances are underlined. Each dataset column reports Acc-ALL / F1-ALL / F1-OOD / F1-IND.

Known Ratio | Method | CLINC | StackOverflow | BANKING
25% | w/o Shared | 87.24 / 78.31 / 91.53 / 77.96 | 89.19 / 81.78 / 92.72 / 79.60 | 76.79 / 72.75 / 82.51 / 72.23
25% | w/o Margin | 83.85 / 29.71 / 90.82 / 28.10 | 93.31 / 87.97 / 95.58 / 86.45 | 82.86 / 48.89 / 89.66 / 46.75
25% | w/o ALL | 91.74 / 83.19 / 94.76 / 82.89 | 74.57 / 70.46 / 80.60 / 68.43 | 79.47 / 71.86 / 85.04 / 71.16
25% | Ours | 92.90 / 85.38 / 95.51 / 85.12 | 93.26 / 87.54 / 95.58 / 85.93 | 88.36 / 80.27 / 92.16 / 79.64
50% | w/o Shared | 80.88 / 82.52 / 82.36 / 82.52 | 76.78 / 80.85 / 74.82 / 81.46 | 77.74 / 81.94 / 75.53 / 82.11
50% | w/o Margin | 64.27 / 15.16 / 77.19 / 14.33 | 70.98 / 56.30 / 77.40 / 54.18 | 57.12 / 21.57 / 70.23 / 20.29
50% | w/o ALL | 88.45 / 87.44 / 90.20 / 87.40 | 83.97 / 84.45 / 84.66 / 84.43 | 80.88 / 83.43 / 80.19 / 83.52
50% | Ours | 90.38 / 88.62 / 92.08 / 88.57 | 89.14 / 88.44 / 90.02 / 88.28 | 83.70 / 84.93 / 83.87 / 84.96
75% | w/o Shared | 85.35 / 88.89 / 81.35 / 88.95 | 83.35 / 86.86 / 69.66 / 88.01 | 82.54 / 87.72 / 65.27 / 88.10
75% | w/o Margin | 44.03 / 8.38 / 59.45 / 7.92 | 37.56 / 24.26 / 44.45 / 22.91 | 29.73 / 11.04 / 41.23 / 10.52
75% | w/o ALL | 88.01 / 90.50 / 85.33 / 90.55 | 84.76 / 87.37 / 76.63 / 88.09 | 82.43 / 87.93 / 64.78 / 88.33
75% | Ours | 89.41 / 91.13 / 87.49 / 91.16 | 86.68 / 89.12 / 78.68 / 89.82 | 84.62 / 88.67 / 73.00 / 88.94
Table 4. Clustering performances of the proposed method and the ablation models. The best performances are indicated in bold and the second-highest performances are underlined. Each dataset column reports Silhouette / Separation.

Known Ratio | Method | CLINC | StackOverflow | BANKING
25% | w/o Shared | −0.1315 / 0.2847 | −0.1240 / 0.4318 | −0.1682 / 0.3024
25% | w/o Margin | −0.1526 / 0.6618 | −0.0729 / 0.7193 | −0.1324 / 0.6441
25% | w/o ALL | −0.1395 / 0.6251 | 0.0535 / 0.4834 | −0.1358 / 0.5926
25% | Ours | −0.1137 / 0.2736 | 0.0770 / 0.3132 | −0.0735 / 0.3273
50% | w/o Shared | 0.0419 / 0.3509 | 0.2494 / 0.4749 | 0.1028 / 0.3306
50% | w/o Margin | 0.0471 / 0.6876 | 0.1887 / 0.6814 | 0.0152 / 0.6704
50% | w/o ALL | −0.0191 / 0.6457 | 0.2377 / 0.4417 | −0.0290 / 0.5934
50% | Ours | 0.0689 / 0.3096 | 0.2813 / 0.4267 | 0.1246 / 0.3277
75% | w/o Shared | 0.2024 / 0.4327 | 0.4515 / 0.4251 | 0.2974 / 0.4289
75% | w/o Margin | 0.2563 / 0.6993 | 0.3783 / 0.6556 | 0.2798 / 0.7062
75% | w/o ALL | 0.1122 / 0.6502 | 0.3428 / 0.3927 | 0.1184 / 0.5968
75% | Ours | 0.2759 / 0.3702 | 0.4748 / 0.4173 | 0.3655 / 0.3629