Research on Named Entity Recognition for Spoken Language Understanding Using Adversarial Transfer Learning

Guo, Yao; Li, Meng; Li, Yanling; Ge, Fengpei; Qi, Yaohui; Lin, Min

doi:10.3390/electronics12040884


by Yao Guo ¹, Meng Li ², Yanling Li ¹,³,*, Fengpei Ge ⁴,*, Yaohui Qi ⁵,* and Min Lin ¹

¹ College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China

² Inner Mongolia Big Data Center, Hohhot 010096, China

³ Inner Mongolia Discipline Inspection and Supervision Big Data Laboratory, Hohhot 010015, China

⁴ Library, Beijing University of Posts and Telecommunications, Beijing 100876, China

⁵ College of Physics, Hebei Normal University, Shijiazhuang 050024, China

* Authors to whom correspondence should be addressed.

Electronics 2023, 12(4), 884; https://doi.org/10.3390/electronics12040884

Submission received: 8 December 2022 / Revised: 1 February 2023 / Accepted: 6 February 2023 / Published: 9 February 2023

(This article belongs to the Special Issue Natural Language Processing and Information Retrieval)


Abstract:

In this paper, we propose an adversarial transfer learning method to address the lack of data resources for named entity recognition (NER) tasks in spoken language understanding. Within this framework, we use a bi-directional long short-term memory model with self-attention and a conditional random field (BiLSTM-Attention-CRF), which combines character and word information, as the baseline model and train it jointly on the source domain and target domain corpora. Shared features between domains are extracted by a shared feature extractor. This paper uses two different sharing modes: full sharing mode and private sharing mode. On this basis, an adversarial discriminator is added to the shared feature extractor to simulate a generative adversarial network (GAN) and eliminate domain-dependent features. We compare an ordinary adversarial discriminator (OAD) with a generalized resource-adversarial discriminator (GRAD) through experiments. The experimental results show that the transfer effect of GRAD is better than that of the other methods. The F1 score reaches 92.99% at its highest, and the relative increase reaches 12.89%. The method effectively improves the performance of NER tasks in resource-poor domains and alleviates the problem of negative transfer.

Keywords:

spoken language understanding; named entity recognition; transfer learning; adversarial discriminator

1. Introduction

Spoken language understanding, the leading task of a dialogue system, aims to transform users’ spoken language into structured information that the downstream tasks of the dialogue system can rely on. It consists of three key subtasks: domain determination (DD), intent determination (ID), and NER [1]. Among them, NER identifies specific entities in users’ spoken expressions, such as names, addresses, organizations, and institutions, to help the machine understand spoken language. Initially, NER mainly relied on rule-based and statistical machine learning methods [2,3,4]. Because deep learning has strong modeling ability and can automatically mine the regularities contained in data, many researchers began to apply it to NER [5,6,7], where it has shown excellent performance.

However, deep learning methods rely on large-scale, high-quality labeled data, while task-oriented dialogue systems usually have access to only a small amount of data in the development stage. In addition, the text data obtained is generally short, abbreviated, colloquial, and semantically ambiguous. For example, comments on social media such as Twitter or Weibo write “no idea” as “no eye deer”, and “xswl” means “laughing to death”. It is therefore difficult to understand such short texts from their literal meaning, and hard to collect enough labeled training data. Applying deep learning methods to NER in this setting may lead to poor model generalization and unstable performance. Transfer learning focuses on knowledge transfer across domains: it aims to improve the performance of target learners on the target domain by transferring knowledge contained in different but related source domains, which reduces the dependence of target learners on large amounts of target domain data. It is a promising machine learning method for the above problems [8]. This paper applies transfer learning to the NER task of spoken language understanding. The main work and contributions of this paper can be summarized as follows:

(1): For the first time, we apply transfer learning to the task of spoken language understanding. To address the shortage of labeled training data for NER in spoken language understanding, the source domain and target domain data are trained jointly, and the features shared across domains are extracted by a shared feature extractor. Building on [9], we extend the approach to two sharing modes: full sharing mode and private sharing mode.
(2): Following [10], we propose an NER method using adversarial transfer learning to address the negative impact that source-domain-specific features within the shared features have on the target task. To this end, an adversarial discriminator is added to the shared feature extractor. Two kinds of adversarial discriminators, OAD and GRAD, are investigated. Combining them with the two sharing modes yields four configurations in total. The experimental results show that adversarial transfer learning based on GRAD and private sharing mode effectively improves the performance of NER tasks in spoken language understanding.

2. Related Work

This paper mainly studies the application of transfer learning to the task of NER in spoken language understanding. The related content includes traditional NER methods and NER methods using transfer learning.

Traditional NER methods include the hidden Markov model (HMM) [11], maximum entropy model (MEM) [12], support vector machine (SVM) [13], and conditional random fields (CRF) [14]. In recent years, NER tasks have generally adopted deep learning models based on multi-layer neural networks. Since LSTM can alleviate the long-term dependency problem, some researchers have combined it with CRF and achieved high recognition accuracy on NER tasks. For example, Habibi [15] used LSTM-CRF for NER in the biomedical field, with the concatenation of generated character embeddings and word embeddings as the model input. On this basis, Wang [16] proposed using the BiLSTM-CRF model for entity recognition in the field of traditional Chinese medicine, and BiLSTM-CRF has since become one of the common methods for NER tasks. Zeng [17] incorporated a self-attention mechanism into the BiLSTM-CRF model to capture global dependencies across the entire sentence and learn its internal structural features for NER in electronic medical records, and obtained better results. Although the above methods achieve high recognition accuracy, they are trained in a supervised manner and require a large amount of labeled data. For some specific fields, such as the military and minority languages, it is very difficult to collect enough manually annotated data.

To address the scarcity of labeled data, some scholars have used transfer learning for NER tasks [18]. Traditional methods fall into two types: data-based methods and model-based methods. Data-based methods are mainly designed to reduce the distribution difference between samples in the source and target domains, but they are only used for cross-language transfer [18]. Model-based methods mainly use selected features or trained model parameters from the source domain to improve task performance in the target domain. To improve the performance of various tasks including NER, Ando and Zhang [19] proposed a transfer learning model that can share structural parameters among multiple tasks. For classification problems, Tommasi [20] proposed a single-model knowledge transfer (SMKL) method, which uses least squares SVM to select a previously learned binary decision function from the source domain and then transfers its parameters to the target domain model.

With the development of deep learning, many researchers have used it to build transfer learning models. Some methods reuse deep neural networks pre-trained on the source domain and transfer them to the target domain model. Wang [21] performed feature transfer through label-aware maximum mean discrepancy (MMD) and built an NER system that works across medical specialties. Yang [22] proposed an RNN-based transfer learning framework for sequence labeling (RNN-TL) that improves system performance by sharing feature representations and model parameters. Although these methods have achieved some success, resources differ between domains, and forcing the source and target domains to share the same features and model parameters may lead to poor model generalization. Inspired by GAN [23], researchers have introduced adversarial techniques that remove domain-dependent features to alleviate the negative transfer problem. Cao [9] applied adversarial transfer learning to the NER task for the first time. They exploited the richer word boundary information of Chinese word segmentation (CWS) and filtered out CWS-specific information through a task discriminator and an adversarial loss function to improve the performance of Chinese NER. Zhou [10] proposed a dual adversarial transfer network (DATNet), which introduces GRAD and adversarial training on top of a general deep transfer unit to handle the difference and imbalance of domain resources and improve the generalization performance of the model. Building on these methods, this paper applies adversarial transfer learning to spoken language understanding and explores a transfer learning method that addresses both the scarcity of labeled data and the irregular data representation in NER tasks in this field.

3. Method

This paper uses adversarial transfer learning to address data scarcity for NER tasks in spoken language understanding. We achieve knowledge transfer by extracting word-level features shared between domains. Figure 1 shows the two word-level feature sharing modes used in this paper: full sharing mode (MTL-F) and private sharing mode (MTL-P). Full sharing mode, shown in Figure 1a, is composed of five parts: the embedding layer, the shared word-level feature extractor, the self-attention mechanism, the adversarial discriminator, and the CRF label decoder. The model first feeds the text into the embedding layer of the source or target domain to obtain a hybrid representation of word embeddings and character-level feature representations. It then uses a BiLSTM as the word-level feature extractor to extract contextual word features. In full sharing mode, the two domains are trained jointly to learn task-shared features. An attention mechanism is then used to capture semantic and grammatical connections between different words in a sentence. To account for the difference between source domain and target domain data and for the features unique to the source domain, we add an adversarial discriminator to the shared feature extractor. Finally, the results are output through the CRF label decoder of the respective domain. Private sharing mode is shown in Figure 1b. To account for the different word-feature representations of the two resources, it adds a private word-level feature extractor for each domain, which is trained independently on that domain to extract task-specific features. The following sections describe each part of the model in detail.

3.1. Embedding Layer

To inject a richer semantic representation into the pre-training of the model, this study uses a hybrid distributed representation that includes word embeddings and character-level feature representations. Character-level feature representations are obtained from character embeddings by a CNN, which serves as the character-level feature extractor [7]. The process is shown in Figure 2. Given a Chinese sentence $s = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the number of its words, each word is composed of several characters, $w_i = \{c_1, \ldots, c_p\}$. We feed the word $w_i$ into the character-level CNN to extract its character-level feature representation $x_i^{\mathrm{char}}$, then concatenate the character-level feature representation and the word embedding of each word to obtain a hybrid representation $x_i = [x_i^{\mathrm{char}}, e_i^{\mathrm{word}}]$, and finally input these representations into the word-level feature extractor. Both word embeddings and character embeddings are initialized from word and character embedding tables pre-trained with the Word2Vec model.
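The construction of the hybrid representation can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the filter width of 2, the absence of a bias or non-linearity, and the toy numbers are hypothetical and are not taken from the paper, which does not specify the CNN configuration.

```python
def conv_max_pool(char_embs, filters):
    """Slide each filter over the character embeddings and max-pool over positions.

    char_embs: list of p character vectors, each of length d_char
    filters:   list of filters; each filter is `width` rows of d_char weights
    Returns one pooled feature per filter (the character-level representation).
    """
    features = []
    for filt in filters:
        width = len(filt)
        # Zero-pad so that words shorter than the filter still yield one window.
        pad = [[0.0] * len(char_embs[0])] * max(0, width - len(char_embs))
        chars = char_embs + pad
        scores = []
        for start in range(len(chars) - width + 1):
            window = chars[start:start + width]
            score = sum(w * c
                        for w_row, c_row in zip(filt, window)
                        for w, c in zip(w_row, c_row))
            scores.append(score)
        features.append(max(scores))  # max-over-time pooling
    return features


def hybrid_representation(char_embs, word_emb, filters):
    """x_i = [x_i^char, e_i^word]: concatenate char-level features and word embedding."""
    return conv_max_pool(char_embs, filters) + list(word_emb)


# Toy word with 3 characters (d_char = 2), 2 filters of width 2, word embedding of size 3.
chars = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters = [[[1.0, 0.0], [0.0, 1.0]],
           [[-1.0, 0.0], [0.0, -1.0]]]
word = [0.5, -0.5, 0.25]
x_i = hybrid_representation(chars, word, filters)
print(x_i)
```

The output vector has one entry per filter followed by the word embedding, so its length is the number of filters plus the word embedding dimension.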

3.2. Word-Level Feature Extractor

In this paper, we use BiLSTM as the word-level feature extractor to make full use of past and future information. The hidden layer vectors of BiLSTM are given by Equations (1)–(3), where $x_i$ is the output of Section 3.1, and $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the hidden vectors of the forward and backward LSTM at position $i$, which are concatenated to obtain the hidden layer vector $h_i$ of BiLSTM.

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{LSTM}}(\overrightarrow{h}_{i-1}, x_i) \tag{1}$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{LSTM}}(\overleftarrow{h}_{i+1}, x_i) \tag{2}$$

$$h_i = [\overrightarrow{h_i}, \overleftarrow{h_i}] \tag{3}$$
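The LSTM cell itself is not spelled out in the text. For reference, a commonly used formulation is given below; it is an assumption about the cell, not necessarily the exact variant used by the authors. Here $t$ stands for the position index $i$ of Equations (1)–(3), $\sigma$ is the logistic sigmoid, $\odot$ is element-wise multiplication, and the gate and weight names are illustrative.

```latex
\begin{aligned}
g^{\mathrm{in}}_t  &= \sigma\left(W^{\mathrm{in}} x_t + U^{\mathrm{in}} h_{t-1} + b^{\mathrm{in}}\right) \\
g^{\mathrm{fg}}_t  &= \sigma\left(W^{\mathrm{fg}} x_t + U^{\mathrm{fg}} h_{t-1} + b^{\mathrm{fg}}\right) \\
g^{\mathrm{out}}_t &= \sigma\left(W^{\mathrm{out}} x_t + U^{\mathrm{out}} h_{t-1} + b^{\mathrm{out}}\right) \\
\tilde{c}_t &= \tanh\left(W^{c} x_t + U^{c} h_{t-1} + b^{c}\right) \\
c_t &= g^{\mathrm{fg}}_t \odot c_{t-1} + g^{\mathrm{in}}_t \odot \tilde{c}_t \\
h_t &= g^{\mathrm{out}}_t \odot \tanh(c_t)
\end{aligned}
```

The backward LSTM applies the same recurrence from the end of the sentence toward its beginning, with $h_{t+1}$ in place of $h_{t-1}$.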

For the two sharing modes used in this paper, we introduce two different transferable word-level feature extractors: a shared word-level feature extractor and a private word-level feature extractor. The shared feature extractor is trained jointly on source and target domain data to learn the shared features of the task, while the private word-level feature extractor is trained independently for each domain to extract task-specific features. In full sharing mode, there is only one shared word-level feature extractor, while in private sharing mode, the source and target tasks share one word-level feature extractor and each additionally has its own private word-level feature extractor. For any sentence in the dataset of task $k \in \{\text{source domain}, \text{target domain}\}$, the hidden states $f_i^k$ and $p_i^k$ of the shared and private word-level feature extractors are given by Equations (4) and (5):

$$f_i^k = \mathrm{BiLSTM}(x_i^k, f_{i-1}^k; \theta_f) \tag{4}$$

$$p_i^k = \mathrm{BiLSTM}(x_i^k, p_{i-1}^k; \theta_p) \tag{5}$$

where $\theta_f$ and $\theta_p$ denote the shared BiLSTM parameters and the private BiLSTM parameters, respectively.

3.3. Self-Attention

An attention mechanism can take into account the semantic and grammatical connections between different words in a sentence. This paper adopts a multi-head self-attention mechanism, which is built on the attention function in Equation (6):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \tag{6}$$

where $Q$, $K$, and $V$ are the query matrix, key matrix, and value matrix. We set $Q = K = V = H$, where $H$ denotes the hidden vectors output by the word-level feature extractor. The multi-head self-attention mechanism first projects the query, key, and value matrices $h$ times using different linear projections. The $h$ projected attentions are then computed in parallel. Finally, the $h$ self-attention outputs are concatenated and re-projected to obtain the final result. Multi-head attention is given by Equations (7) and (8):

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \tag{7}$$

$$H' = (\mathrm{head}_1 \oplus \cdots \oplus \mathrm{head}_h) W_o \tag{8}$$

where $W_i^Q$, $W_i^K$, and $W_i^V$ are projection parameters and $W_o$ is a trainable parameter. For the sentences in the dataset of task $k$, the shared self-attention output ${F'}^k$ and the private self-attention output ${P'}^k$ are obtained by Equations (6)–(8). In full sharing mode, the shared self-attention output ${F'}^k$ is fed into the CRF layer. In private sharing mode, the two self-attention outputs ${F'}^k$ and ${P'}^k$ are concatenated to form the input of the CRF layer of the respective domain, as given by Equation (9).

$${H''}^k = {F'}^k \oplus {P'}^k \tag{9}$$
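Equations (6)–(8) can be illustrated with a small plain-Python sketch. It is a toy, not the authors' implementation: the helper names are hypothetical, matrices are nested lists, and the projection matrices in the usage example are identity matrices chosen only so that the result is easy to check.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Equation (6): softmax(Q K^T / sqrt(d)) V, with d the key dimension."""
    d = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head(h_mat, heads, w_o):
    """Equations (7)-(8): `heads` is a list of (W_q, W_k, W_v) projection triples."""
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = attention(matmul(h_mat, w_q), matmul(h_mat, w_k), matmul(h_mat, w_v))
        outputs.append(out)
    # Concatenate the heads position by position, then re-project with W_o.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(h_mat))]
    return matmul(concat, w_o)


# Three positions with hidden size 2; self-attention uses Q = K = V = H.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
single, weights = attention(H, H, H)

identity = [[1.0, 0.0], [0.0, 1.0]]
w_o = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]  # averages two heads
multi = multi_head(H, [(identity, identity, identity)] * 2, w_o)
```

With two identical identity-projected heads and an averaging output projection, the multi-head output coincides with the single-head output, which gives a quick sanity check of the concatenation and re-projection steps.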

3.4. Adversarial Discriminator

Inspired by Cao [9] and Zhou [10], we add an adversarial discriminator to the shared feature extractor to account for the difference between the data in the source and target domains and for the features unique to the source domain. We use the shared feature extractor as a generator and the adversarial discriminator as a discriminator to simulate GAN [23], so that the shared features obtained are pure and contain no source-domain-specific information. The core part is the adversarial discriminator, which identifies the domain that the shared features come from. Each input is marked with its domain label before entering the shared feature extractor, and the shared feature extractor is then trained by back-propagation according to how the discriminator identifies the source of the shared features. This produces shared features that the adversarial discriminator cannot identify. In this way, the adversarial discriminator and the shared feature extractor form an adversarial network. After multiple rounds of training, a balance is reached: the adversarial discriminator can no longer identify which domain the shared features come from, and the features obtained are pure and domain-independent. This method can alleviate the negative transfer problem. This paper uses two adversarial discriminators, OAD and GRAD, which are introduced in detail in the following two subsections.

3.4.1. Ordinary Adversarial Discriminator

OAD judges the source of a sentence in the manner of a GAN discriminator. The identification process is shown in Figure 3. OAD identifies the domain of the features through a max-pooling layer and a softmax layer, as given by Equations (10) and (11):

$$S = \mathrm{Maxpooling}({F'}^k) \tag{10}$$

$$D(S; \theta_d) = \mathrm{softmax}(W_d S + b_d) \tag{11}$$

where ${F'}^k$ represents the output of the shared multi-head self-attention in Section 3.3, $\theta_d$ represents the parameters of the adversarial discriminator, and $W_d$ and $b_d$ are trainable parameters. During training, gradient descent makes the adversarial discriminator better at identifying the domain. A gradient reversal layer then reverses the gradient, so that the discriminator's ability to identify the domain from the shared features becomes worse. Eventually, the adversarial discriminator is unable to identify the domain of the shared features.
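The forward pass of Equations (10) and (11) can be sketched as follows. The function names and the toy weights are hypothetical, and max pooling is assumed here to be taken element-wise over the sequence positions, which the text does not state explicitly.

```python
import math


def max_pool(features):
    """Equation (10): element-wise max over sequence positions."""
    return [max(column) for column in zip(*features)]


def domain_probabilities(features, w_d, b_d):
    """Equation (11): softmax(W_d S + b_d) over the two domains."""
    s = max_pool(features)
    logits = [sum(w * x for w, x in zip(row, s)) + b for row, b in zip(w_d, b_d)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Shared self-attention output for a sentence of 2 positions with 2 features each.
shared_output = [[1.0, 5.0], [3.0, 2.0]]
pooled = max_pool(shared_output)

# With all-zero weights the discriminator cannot tell the domains apart.
undecided = domain_probabilities(shared_output, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
```

The all-zero case corresponds to the balance point described in the text, where both domains receive probability 0.5.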

We introduce an adversarial loss function $l_{AD}$ to prevent information unique to the source domain from entering the shared space, as shown in Equation (12). $l_{AD}$ trains the shared feature extractor to generate shared features such that the adversarial discriminator cannot reliably judge the domain of the features.

$$l_{AD} = \min_{\theta_f}\left(\max_{\theta_d} \sum_{k=1}^{K} \sum_{i=1}^{T_k} \log D\left(E_s\left(x_k^{(i)}\right)\right)\right) \tag{12}$$

where $\theta_f$ denotes the trainable parameters of the shared feature extractor, $E_s$ represents the shared feature extractor, $T_k$ is the number of training instances for task $k$, and $x_k^{(i)}$ is the $i$th instance of task $k$. We carry out this optimization by adding a gradient reversal layer before the softmax layer and training the model parameters so that the discriminative ability of the adversarial discriminator becomes weaker. When the shared feature extractor and the adversarial discriminator reach a balance point, the adversarial discriminator cannot discriminate the domain of the shared features.
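The behaviour of a gradient reversal layer can be shown with a minimal sketch that tracks gradients by hand. The class name and the scaling factor `lam` are hypothetical; in a deep learning framework this would normally be written as a custom autograd operation, which is not shown here.

```python
class GradientReversal:
    """Identity in the forward pass; negates (and scales) the gradient in the backward pass."""

    def __init__(self, lam=1.0):
        self.lam = lam  # hypothetical scaling factor for the reversed gradient

    def forward(self, features):
        # The discriminator sees the shared features unchanged.
        return list(features)

    def backward(self, grad_output):
        # The shared feature extractor receives the negated gradient, so a step
        # that lowers the discriminator's loss raises it from the extractor's side.
        return [-self.lam * g for g in grad_output]


layer = GradientReversal(lam=0.5)
shared = [1.0, -2.0]
passed_on = layer.forward(shared)
reversed_grad = layer.backward([0.5, -1.0])
```

Because the forward pass is the identity, the min-max objective of Equation (12) can be optimized with an ordinary single backward pass.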

3.4.2. Generalized Resource Adversarial Discriminator

Although OAD can remove the impurities in shared features, it does not consider the imbalance in data scale between the source domain and the target domain. If this imbalance is not considered during training, stochastic gradient descent optimization will bias the model toward the source domain [24]. This paper uses GRAD [10] to solve this problem. GRAD is shown in Figure 4.

GRAD also uses a gradient reversal layer to reduce its judgment ability. As a result, the shared feature representations extracted from the source and target domains become more compatible, GRAD cannot distinguish the source of the features, and the output of the shared feature extractor becomes domain-independent. To compute the loss function of GRAD, the output sequence of the shared feature extractor is first encoded into a single vector via a self-attention mechanism [25] and then projected onto a scalar $r$ through a linear transformation. The GRAD loss $l_{GRAD}$ is given by Equation (13).

$$l_{GRAD} = -\sum_i \left\{ I_{i \in D_S}\, \alpha (1 - r_i)^{\gamma} \log r_i + I_{i \in D_T}\, (1 - \alpha)\, r_i^{\gamma} \log (1 - r_i) \right\} \tag{13}$$

The weight $\alpha$ balances the influence of the large difference in training data scale between the high-resource and low-resource domains, and GRAD additionally provides an adaptive weight for each sample so that model training focuses on hard samples. $I_{i \in D_S}$ indicates that the feature comes from the source domain and $I_{i \in D_T}$ indicates that it comes from the target domain; both are indicator functions. The parameter $\gamma$ controls the relative loss contribution of hard and easy samples, and $(1 - r_i)^{\gamma}$ (or $r_i^{\gamma}$) controls the loss contribution of each sample by measuring the difference between the predicted value and the true label. The weights $\alpha$ and $(1 - r_i)^{\gamma}$ (or $r_i^{\gamma}$) reduce the loss contribution of high-resource samples and easy samples, respectively.
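The GRAD loss of Equation (13) is simple to compute once the scalars $r_i$ are available. The sketch below assumes that each $r_i$ lies strictly between 0 and 1 (for example after a sigmoid, which the text does not state); the function name and the toy values are hypothetical, and the defaults 0.25 and 2 are the values of $\alpha$ and $\gamma$ reported in Section 4.2.

```python
import math


def grad_loss(r_values, is_source, alpha=0.25, gamma=2.0):
    """Equation (13): resource-weighted, focal-style domain discrimination loss.

    r_values:  one scalar in (0, 1) per sample (assumed range)
    is_source: True if the sample comes from the source domain, else False
    """
    total = 0.0
    for r, from_source in zip(r_values, is_source):
        if from_source:
            total -= alpha * (1.0 - r) ** gamma * math.log(r)
        else:
            total -= (1.0 - alpha) * r ** gamma * math.log(1.0 - r)
    return total


easy_source = grad_loss([0.9], [True])   # source sample the discriminator gets right
hard_source = grad_loss([0.1], [True])   # source sample the discriminator gets wrong
easy_target = grad_loss([0.1], [False])
hard_target = grad_loss([0.9], [False])
```

Easy samples contribute far less than hard ones, and with $\alpha = 0.25$ a hard source sample is weighted less than an equally hard target sample, which is how the loss counteracts the larger source dataset.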

3.5. Label Decoder

This paper uses a linear chain CRF [13] as the label decoder. Given a Chinese sentence $s = \{w_1, w_2, \ldots, w_n\}$ and the corresponding sequence of predicted labels $y = \{y_1, y_2, \ldots, y_n\}$, one label per word, the linear chain CRF can be written as Equation (14):

$$p(\tilde{y} \mid H'_{1:n}) = \frac{1}{Z(H'_{1:n})} \exp\left\{\sum_{t=2}^{n} \theta_{y_{t-1}, y_t} + \sum_{t=1}^{n} W_{y_t} H'_t\right\} \tag{14}$$

where $H'$ represents the output of the attention mechanism, $\tilde{y} = y_{1:n}$ is the sequence of predicted labels, $\theta_{y_{t-1}, y_t}$ is the transition parameter between the labels at positions $t-1$ and $t$, and $Z(H'_{1:n})$ is the normalization term. The normalization term is computed as in Equation (15), where $x$ denotes the input sequence of the CRF, and $t_k$ and $s_l$ are the transition feature function and the state feature function, respectively: $t_k$ is determined by the previous hidden node $y_{i-1}$ and the current hidden node $y_i$, while $s_l$ is determined by the current hidden node $y_i$ only. $\lambda_k$ and $\mu_l$ are the corresponding weights of the two feature functions.

$$Z(x) = \sum_{y} \exp\left(\sum_{i,k} \lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l s_l(y_i, x, i)\right) \tag{15}$$

The model parameters are optimized using the negative log-likelihood function [25] as the objective function of the model. The loss function $l$ is defined in Equation (16):

$$l = -\sum \log p(\tilde{y} \mid H'_{1:n}) \tag{16}$$
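The negative log-likelihood of a linear chain CRF can be computed with the forward algorithm instead of enumerating all label sequences. The sketch below is an illustration under assumptions: `emissions[t][y]` stands in for the state score of label `y` at position `t`, `transitions[a][b]` stands in for the transition score from label `a` to label `b`, and all names and numbers are hypothetical.

```python
import itertools
import math


def path_score(labels, emissions, transitions):
    """Unnormalized log-score of one label sequence (the exponent in the CRF)."""
    score = sum(emissions[t][y] for t, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """Forward algorithm: log of the normalization term over all label sequences."""
    num_labels = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            emissions[t][b]
            + log_sum_exp([alpha[a] + transitions[a][b] for a in range(num_labels)])
            for b in range(num_labels)
        ]
    return log_sum_exp(alpha)


def neg_log_likelihood(labels, emissions, transitions):
    """Negative log-probability of one label sequence under the CRF."""
    return log_partition(emissions, transitions) - path_score(labels, emissions, transitions)


# Toy sentence of 3 words with 3 possible labels.
emissions = [[1.0, 0.2, -0.5], [0.1, 0.9, 0.3], [-0.2, 0.4, 1.1]]
transitions = [[0.5, -0.3, 0.0], [0.2, 0.1, -0.4], [-0.1, 0.3, 0.6]]
gold = [0, 1, 2]
nll = neg_log_likelihood(gold, emissions, transitions)

# Brute-force check of the normalization term by enumerating every label sequence.
all_paths = list(itertools.product(range(3), repeat=3))
brute_log_z = log_sum_exp([path_score(p, emissions, transitions) for p in all_paths])
```

The forward algorithm needs a number of operations proportional to the sentence length times the square of the number of labels, whereas the brute-force sum grows exponentially with the sentence length.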

Since different domains have different types of labels, both sharing modes add a separate CRF layer for each domain. The source domain loss function $l_S$ and the target domain loss function $l_T$ are defined in Equations (17) and (18).

$$l_S = -\sum_i \log p(\tilde{y} \mid {H''}^{S}_{1:n}) \tag{17}$$

$$l_T = -\sum_i \log p(\tilde{y} \mid {H''}^{T}_{1:n}) \tag{18}$$

3.6. Training

Our model can be trained end to end with the backpropagation algorithm by minimizing the loss. For OAD, the loss function of the model is shown in Equation (19):

$$l = l_T \cdot I(i) + l_S \cdot (1 - I(i)) + \lambda\, l_{AD} \tag{19}$$

where $\lambda$ is a hyperparameter and $I(i)$ is a domain indicator function, defined in Equation (20), that identifies whether sentence $i$ comes from the source domain or the target domain, so that each sentence contributes only the CRF loss of its own domain. $D_S$ and $D_T$ are the source domain and target domain datasets, respectively.

$$I(i) = \begin{cases} 1, & \text{if } i \in D_T \\ 0, & \text{if } i \in D_S \end{cases} \tag{20}$$

For GRAD, the loss function of the model is shown in Equation (21), where $l_{GRAD}$ is the GRAD loss of Equation (13).

$$l = l_{GRAD} + l_S + l_T \tag{21}$$
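The two training objectives can be written down as small helper functions. This is a sketch under the assumption that each sentence contributes the CRF loss of its own domain; the function names and the toy loss values are hypothetical, and the default 0.06 is the value of $\lambda$ reported in Section 4.2.

```python
def oad_objective(from_target, source_loss, target_loss, adversarial_loss, lam=0.06):
    """OAD objective: the CRF loss of the sentence's own domain plus the weighted adversarial loss."""
    indicator = 1.0 if from_target else 0.0  # domain indicator of the sentence
    task_loss = target_loss * indicator + source_loss * (1.0 - indicator)
    return task_loss + lam * adversarial_loss


def grad_objective(grad_loss_value, source_loss, target_loss):
    """GRAD objective: the sum of the GRAD loss and both CRF losses."""
    return grad_loss_value + source_loss + target_loss


# Toy loss values for one target-domain sentence and one source-domain sentence.
target_step = oad_objective(True, source_loss=2.0, target_loss=3.0, adversarial_loss=1.0, lam=0.5)
source_step = oad_objective(False, source_loss=2.0, target_loss=3.0, adversarial_loss=1.0, lam=0.5)
joint_step = grad_objective(0.5, 2.0, 3.0)
```

In the OAD objective the domain indicator selects one CRF loss per sentence, while the GRAD objective adds both CRF losses to the discriminator loss.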

4. Experiments

In this section, we first introduce the dataset and preprocessing scheme used in the experiment, then give the parameter settings for model training and performance evaluation indicators, and finally compare and analyze the experimental results.

4.1. Data Set

This experiment uses the CLUENER2020 Chinese fine-grained NER dataset [26] and a flight information domain dataset [27] as source domain datasets. CLUENER is a news-domain NER dataset with fine-grained entity classification. The flight information domain dataset consists of flight information queries from a dialogue system, which are short texts. It includes 19 entity types, which are more fine-grained than those of the CLUENER dataset. Both datasets are annotated with the BIO scheme, a joint labeling scheme in which B, I, and O stand for Begin, Inside, and Outside, respectively. Each element is labeled “B-X”, “I-X”, or “O”: “B-X” indicates that the element belongs to a segment of type X and is at the beginning of that segment, “I-X” indicates that the element belongs to a segment of type X and is inside that segment, and “O” indicates that the element does not belong to any type. Details are shown in Table 1.
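As a small illustration of BIO labeling, the following sketch converts entity spans into tags. The function name, the English example sentence, and the entity type names `city` and `date` are hypothetical and are not taken from the datasets.

```python
def bio_tags(tokens, entities):
    """Turn (start, end_exclusive, type) token spans into a BIO tag sequence."""
    tags = ["O"] * len(tokens)
    for start, end, entity_type in entities:
        tags[start] = "B-" + entity_type          # first token of the segment
        for position in range(start + 1, end):
            tags[position] = "I-" + entity_type   # remaining tokens of the segment
    return tags


tokens = ["book", "a", "flight", "to", "New", "York", "tomorrow"]
entities = [(4, 6, "city"), (6, 7, "date")]
tags = bio_tags(tokens, entities)
for token, tag in zip(tokens, tags):
    print(token, tag)
```

Here "New" receives `B-city`, "York" receives `I-city`, "tomorrow" receives `B-date`, and every other token receives `O`.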

Our work uses the SMP2020-ECDT Chinese Human-Machine Dialogue Technology Evaluation Dataset [28] as the target domain dataset, which also consists of short texts. The dataset has a total of 5024 sentences, involving 45 domains, with 81 named entity types and 8787 entities. In addition, to verify the impact of different domains on the transfer effect, we select four sub-domains with an appropriate data volume from the SMP2020-ECDT dataset as target domain datasets: recipes, music, news, and trains. The amount of data in the other domains is too small, so overfitting is likely to occur during the experiment. Information on the four selected domain datasets is shown in Table 2.

We use the hold-out method to divide each dataset into training and test sets with a ratio of 7:3. We first clean the dataset and then use the Jieba tool for word segmentation. To prevent specific nouns from being incorrectly segmented, all entity words are stored in the user dictionary in advance. The data is then labeled according to the BIO scheme and finally fed into the model.

4.2. Parameter Settings

In this experiment, the dimension of the character embeddings and word embeddings is set to 300, the batch size is 16, the learning rate is 0.001, the learning rate decay is 0.7, the dropout rate is 0.5, and the Adam optimizer is used as the gradient descent algorithm. The dimension of the LSTM hidden layers is set to 200 for the baseline model and full sharing mode, and to 100 for each of the source, shared, and target LSTMs in private sharing mode. The hyperparameter $\lambda$ of the OAD model is 0.06, and the hyperparameters $\alpha$ and $\gamma$ of the GRAD model are 0.25 and 2, respectively.

4.3. Evaluation Metrics

This experiment uses the F1 score as the evaluation metric, which takes both precision and recall into account. The calculation is shown in Equations (22)–(24):

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{22}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{23}$$

$$\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{24}$$

where TP (true positive) is the number of entities identified correctly, FP (false positive) is the number of entities identified wrongly, and FN (false negative) is the number of entities that were not identified. For NER tasks, precision is the ratio of the number of correctly labeled entities to the total number of entities labeled by the model, and recall is the ratio of the number of correctly labeled entities to the total number of entities in the dataset.
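A sketch of how these metrics can be computed from BIO tag sequences is given below. It assumes exact-match scoring, in which a predicted entity counts as correct only if its boundaries and type both match a gold entity; the text does not state the matching criterion, and the function names and the example tags are hypothetical.

```python
def extract_entities(tags):
    """Collect (start, end_exclusive, type) spans from a BIO tag sequence."""
    spans, start, entity_type = [], None, None
    for position, tag in enumerate(list(tags) + ["O"]):  # sentinel flushes the last span
        continues = tag.startswith("I-") and entity_type == tag[2:]
        if start is not None and not continues:
            spans.append((start, position, entity_type))
            start, entity_type = None, None
        if tag.startswith("B-"):
            start, entity_type = position, tag[2:]
    return set(spans)


def precision_recall_f1(gold_tags, predicted_tags):
    gold = extract_entities(gold_tags)
    predicted = extract_entities(predicted_tags)
    tp = len(gold & predicted)   # correct entities
    fp = len(predicted - gold)   # wrongly identified entities
    fn = len(gold - predicted)   # entities that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-city", "I-city", "O", "B-date"]
predicted = ["B-city", "I-city", "O", "O"]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

In this example the model finds one of the two gold entities and predicts nothing wrong, so precision is 1, recall is 0.5, and the F1 score is their harmonic mean.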

4.4. Results and Analysis

In this paper, the BiLSTM-Attention-CRF baseline model, which fuses character and word information, is first tested on the five target domains, and its results serve as the baseline for measuring the transfer effect of the model. Then, private sharing mode (MTL-P) and full sharing mode (MTL-F) are used for transfer learning, transferring knowledge from the two source domains to the five target domains. These results are later compared with those of the adversarial models to assess the impact of the adversarial discriminators on transfer learning. Finally, we add an OAD and a GRAD to the shared feature extractor. The F1 scores of all experiments are compared in Table 3.

Because the SMP2020 dataset contains a large number of named entities, the same entity type can be used in different domains and the same word can have different entity types in different domains; moreover, the number of sentences in the four sub-domains is small. The baseline model is trained in a supervised manner and cannot converge well on such data, which results in a low F1 score. After adding a shared feature extractor, the F1 score obtained by transferring from the flight information domain dataset as the source domain is generally higher than that obtained from CLUENER. The transfer effect of MTL-P is generally better than that of MTL-F, probably because private sharing mode has additional private features, which provide a richer feature representation for the target domain. MTL-P therefore appears more suitable for transfer learning across domains. Although adding a shared feature extractor improves the performance of the model, the improvement is limited, and negative transfer even occurs in some cases. For example, when the flight information domain is used as the source domain, the F1 score on the music target domain is lower than the baseline. A likely reason is that the source domain and target domain data differ considerably, and the shared feature extractor contains a large number of features unique to the source domain, which have a negative effect.

Inspired by GAN, this experiment adds an adversarial discriminator to the shared feature extractor to ensure that the shared features contain no source-specific features. Both private sharing and full sharing modes are retained, and OAD and GRAD are added to each, giving four models denoted OAD-P, OAD-F, GRAD-P, and GRAD-F. Table 3 shows that, consistent with the previous analysis, private sharing mode still performs better. In addition, the F1 score of OAD exceeds the baseline in all target domains, which indicates that the adversarial discriminator can alleviate the negative transfer problem. GRAD achieves the best transfer effect: its F1 score exceeds those of the baseline model, the NER models using model transfer alone, and OAD. It can alleviate the data scale imbalance between the source domain and the target domain.

In terms of transfer effect, the music domain benefits the most: its F1 score increases by 12.89%, while its baseline F1 score is the lowest, at 74.29%. The smallest improvement is in the news domain, where the F1 score improves by 4.4% and the baseline F1 score is the highest, at 88.46%. The highest F1 score is obtained on the SMP2020 dataset. In summary, domains with a lower baseline show a better transfer effect; that is, domains with less data and less labeled data benefit more from transfer. This indicates that our method can address the resource shortage of NER tasks in spoken language understanding.

5. Conclusions

This paper proposes an adversarial transfer learning method to address the lack of resources for NER tasks in spoken language understanding. The method adds a shared feature extractor and an adversarial discriminator to a BiLSTM-Attention-CRF model that combines character and word information. The shared feature extractor acts as the generator and the adversarial discriminator as the discriminator, simulating a GAN. This removes source-domain-specific features from the shared features and improves their purity. The experimental results show that the method transfers better to domains where the baseline F1 score is lower, that is, domains with less data and less labeled data. The method can effectively improve the performance of NER tasks in resource-poor domains.

Although this research has made some progress, negative transfer remains the biggest obstacle to transfer learning. It can be alleviated by finding correlation metrics between the source and target domains or by introducing adversarial transfer learning, but both approaches have limitations, and research on them is ongoing. In addition, NER is only one subtask of spoken language understanding, which also includes intent determination and domain determination. These tasks are related, so considering a single task may yield only a limited improvement in model performance. In the future, we will consider combining multiple tasks for transfer learning to improve the performance of NER tasks in resource-poor domains.

Author Contributions

Conceptualization, methodology, software and data curation, Y.G. and M.L. (Meng Li); validation, formal analysis and investigation, Y.G., M.L. (Meng Li) and Y.L.; writing—original draft preparation, Y.G.; resources and funding acquisition, Y.L.; writing—review and editing, Y.G., M.L. (Meng Li), Y.L., F.G., Y.Q. and M.L. (Min Lin); visualization, Y.G., M.L. (Meng Li), Y.L., F.G. and Y.Q.; supervision and project administration, Y.L., F.G., Y.Q. and M.L. (Min Lin). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (12204062, 61806103, 61562068), the Youth Innovation and Entrepreneurship Talents of Inner Mongolia “Grassland Talents” (Q2017027), the Open Research Topic of Big Data Laboratory of Inner Mongolia Discipline Supervision and Investigation (IMDBD2020013), the Science and Technology Research Program for Colleges and Universities in Inner Mongolia of China (NJZY21578, NJZY21551), the Basic Scientific Research Business Project of Inner Mongolia Normal University (2022JBQN106, 2022JBQN111), and the Natural Science Foundation of Inner Mongolia, China (2022LHMS06001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. Adversarial transfer learning model diagram: (a) full sharing mode; (b) private sharing mode.

Figure 2. CNN extracts character-level features.

Figure 3. Ordinary adversarial discriminator.

Figure 4. Generalized resource adversarial discriminator.

Table 1. Source domain dataset.

Dataset	Sentences	Entities	Classes	Entity Types
CLUENER	12,091	25,244	10	Address, Book, Company, Game, Name, Government, Movie, Scene, Organization, Position
Flight Information	5,871	16,575	19	Departure, Destination, Time, Airport Name, Fare Range, Airline, Flight Number, Seat Class, etc.

Table 2. SMP2020 subdomain dataset.

Domain	Sentences	Entities	Classes	Entity Types
Cookbook	438	450	4	dishName, utensil, ingredient, keyword
Music	189	195	3	song, artist, category
News	197	254	8	datetime_time, datetime_date, category, country, province, city, area, keyword
Train	171	366	8	startLoc_province, startLoc_city, startLoc_area, endLoc_province, endLoc_city, endLoc_area, startDate_date, category

Table 3. Experimental results of NER for spoken language understanding using transfer learning (F1-score/%).

Source Domain	Model	SMP2020	Cookbook	Music	News	Train
None	Baseline	86.75	83.87	74.29	88.46	87.50
CLUENER	MTL-P	87.58	83.52	78.95	88.89	87.94
CLUENER	MTL-F	88.18	84.44	75.68	87.72	87.32
CLUENER	OAD-P	87.96	87.64	81.08	90.57	89.04
CLUENER	OAD-F	88.39	86.36	78.05	91.55	88.73
CLUENER	GRAD-P	92.59	88.17	86.42	92.57	90.78
CLUENER	GRAD-F	91.20	86.96	83.33	91.76	89.38
Flight Information	MTL-P	88.72	85.39	73.17	89.29	88.73
Flight Information	MTL-F	87.76	84.09	73.68	90.91	88.59
Flight Information	OAD-P	88.12	89.13	82.67	92.59	89.21
Flight Information	OAD-F	88.27	89.89	82.93	90.20	89.51
Flight Information	GRAD-P	92.99	91.40	87.18	92.86	90.14
Flight Information	GRAD-F	90.22	88.89	84.21	92.31	92.31

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
