Article

Chinese Named Entity Recognition Model Based on Multi-Task Learning

1
College of Mathematics and Computer Science, Zhejiang A&F University, Hangzhou 311300, China
2
Key Laboratory of Forestry Intelligent Monitoring and Information Technology of Zhejiang Province, Hangzhou 311300, China
3
China Key Laboratory of State Forestry and Grassland Administration on Forestry Sensing Technology and Intelligent Equipment, Hangzhou 311300, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(8), 4770; https://doi.org/10.3390/app13084770
Submission received: 13 March 2023 / Revised: 4 April 2023 / Accepted: 7 April 2023 / Published: 10 April 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Compared to English, Chinese named entity recognition achieves lower performance because entity boundaries in Chinese text are more ambiguous, which makes boundary prediction more difficult. Traditional models have attempted to sharpen Chinese entity boundaries by incorporating external features such as lexicons or glyphs, but they have rarely separated the entity boundary prediction problem out for dedicated study. To leverage entity boundary information, we decompose the named entity recognition task into two subtasks, boundary annotation and type annotation, and propose a multi-task learning network (MTL-BERT) built on a bidirectional encoder representations from Transformers (BERT) model. The network performs joint encoding and task-specific decoding of the subtasks, enhancing the model's feature extraction ability by reinforcing the feature associations between subtasks. Multiple sets of experiments conducted on the Weibo NER, MSRA, and OntoNote4.0 public datasets show that the F1 values of MTL-BERT reach 73.8%, 96.5%, and 86.7%, respectively, effectively improving the performance and efficiency of Chinese named entity recognition.

1. Introduction

Named Entity Recognition (NER) is a crucial task in natural language processing that involves detecting entity boundaries in a sequence and classifying them into predefined entity classes based on contextual semantic understanding. It has many applications in information extraction, automatic knowledge base construction, and question-answering systems, among others [1]. In recent years, deep learning models have shown great success in mining contextual semantic features, greatly advancing the field of named entity recognition.
Named entity recognition involves two subtasks, predicting entity boundaries and identifying entity types, and its evaluation requires both to be predicted correctly. In Chinese, individual lexical units are semantically denser and entity boundaries are more ambiguous than in English, which makes entity recognition more challenging. Previous research has proposed various solutions to this problem, including incorporating lexicon information into NER models. One approach is to perform word segmentation with an external tool and then annotate the resulting sequence of words. However, the effectiveness of this approach depends on the accuracy of word segmentation, particularly in specialized fields where segmentation errors propagate to the recognition stage. To address this limitation, researchers have proposed various methods for fusing lexical information with models. For example, Zhang et al. [2] proposed a lattice-structured long short-term memory network that combines character-level and potential word-level information in the text. Gui et al. [3] used convolutional neural networks and attention mechanisms to fuse word vectors at different layers, which improved computational efficiency and accelerated model training.
Although NER models that incorporate lexical information improve the recognition of Chinese named entities, they still follow a single-objective approach in which model parameters are shared between boundary prediction and type recognition, and this brings several drawbacks. One issue is that type recognition typically requires deeper and more extensive contextual semantic information than boundary prediction. Because the single-objective task encodes entity boundary and type information uniformly, the semantic representations of the two tasks are mixed together, making it difficult to learn from inter-task variability. Another drawback is the time spent on redundant label prediction. For example, with the BIESO sequence labeling scheme, the decoding labels for each word in a single-target network include five options, namely {'O', 'B-entity', 'I-entity', 'E-entity', 'S-entity'}; with x entity types, this yields 4x + 1 label combinations per word. By contrast, splitting the single-target task into two subtasks, entity boundary labeling and type labeling, reduces the number of decoded label combinations per word to x + 5, which is 3x − 4 fewer than in the single-target task. Because the NER task requires one annotation per word, the single-target task becomes increasingly disadvantageous as the number of entity types or the size of the dataset grows.
In this paper, we address these issues by separating entity boundary prediction from the single-objective task and designing a multi-task learning framework that trains it jointly with entity type recognition. Specifically, we decompose the entity recognition task into two subtasks, boundary labeling and type labeling, and construct a backbone network consisting of an encoding layer with shared parameters and decoding layers with task-specific parameters. Joint learning in the encoding layer enhances the model's generalization performance, while the decoding layers mine data features unique to each task. To enhance the feature interaction between subtasks, we design loss functions that link the two tasks closely. The final network outputs entity labels by combining the two kinds of labels after decoding the boundary and type labels of entities.
The main contributions of this paper include:
  • A general multi-task algorithmic framework is proposed for the Chinese named entity recognition task, based on which a traditional single-target task network can be designed as a multi-task network for joint learning.
  • The above framework is instantiated as a sequence annotation model for joint multi-task learning based on BERT [4], where specific decoding allows the feature representation of subtasks not to be restricted to the same expression space, and the interaction pattern is designed to strengthen the feature connections among subtasks.
  • Experiments on several Chinese datasets show that the proposed model significantly outperforms the single-target task model in terms of performance and efficiency.

2. Related Studies

In recent years, various neural network architectures have been proposed and successfully applied to named entity recognition tasks, which typically rely on the contextual representation of words in texts. In 2015, Huang et al. [5] introduced a bidirectional long short-term memory (BiLSTM) network to replace the hand-designed features of earlier works and reported that a conditional random field (CRF [6]) decoding layer effectively captures dependencies between labels. Later, Lample et al. [7] modeled character-level and word-level information with BiLSTMs and demonstrated the effectiveness of character-level information in solving the out-of-vocabulary (OOV) problem. In 2016, Ma and Hovy [8] extracted character-level features with convolutional structures and proposed the BiLSTM-CNN-CRF network with CRF decoding. In 2018, Devlin et al. [4] introduced bidirectional encoder representations from Transformers (BERT), built on the Transformer [9] network. BERT is pre-trained on a large corpus by masking words in the text, producing deep bidirectional language representations, and it achieved state-of-the-art performance on several natural language processing tasks, including NER. Since then, researchers have sought to improve the BERT model to obtain better results. In 2019, Cui et al. [10] proposed the BERT-WWM model, which improved several Chinese natural language processing tasks over the original BERT model. However, it has some limitations: it uses only Wikipedia as a pre-training corpus, which suits formal texts better than casual ones, and it does not consider features specific to Chinese, such as entities, phrases, and domains. In 2021, Sun et al. [11] proposed ChineseBERT, which incorporates both glyph and pinyin information about Chinese characters into language model pre-training and significantly improves performance with fewer training steps than the base model. In 2022, Li et al. [12] proposed W2NER, which effectively models the adjacency relations between entity words with NNW and THW-* relations, addressing the kernel bottleneck of unified NER and achieving state-of-the-art results on multiple datasets.
Multi-task learning can be seen as a way of constraining model induction through shared representations [13]. Multi-task learning frameworks are attractive in natural language processing because shared learning reduces the amount of training data required while reducing the risk of overfitting to a single task [14]. Kendall et al. [15] found that system performance depends heavily on the relative weights of each task's loss. Because adjusting these weights manually is difficult and time-consuming, they proposed weighing the loss functions by the homoscedastic uncertainty of each task, allowing the model to learn the multi-task weighting and outperform independent models trained on each task individually. Collobert et al. [16] applied the multi-task learning framework to sequence annotation and observed that multi-task learning can produce a single unified network, but it brought only minor improvements on the NER task. In addition, Søgaard and Goldberg [17] proposed modeling different task features at different layers of the network, but also stated that multi-task learning has difficulty improving the NER task and concluded that the auxiliary task should be sufficiently similar to the target task for significant gains to be observed. Furthermore, recent work on sequence labeling based on multi-task learning [18,19] also showed that unequal relationships between tasks may limit the generality and applicability of multi-task models.
Although the potential improvements brought by multi-task learning to the NER task are uncertain, joint learning provides valuable information to the different subtasks and allows them to learn from each other. Motivated by these findings, this paper proposes a multi-task learning network based on the BERT model for Chinese named entity recognition. The model does not rely on lexicon information and instead uses only character encoding, with the goal of enhancing the ability of the two subtasks, boundary prediction and type annotation, to capture text features and ultimately improve the performance of Chinese named entity recognition.

3. Chinese Named Entity Recognition Model Based on Multi-Task Learning

3.1. Multi-Task Learning Network Model

In this study, the multi-task setting is defined by two tasks: entity boundary labeling and entity type labeling. Based on these two tasks, a multi-task learning network framework called MTL-BERT is constructed using the BERT model, as illustrated in Figure 1. The network is divided into three parts: an input representation layer, a feature encoding layer with shared-parameter learning, and a label decoding layer with task-specific parameter learning. The two subtasks share a common input embedding and encoder, allowing the encoding parameters to be shared across tasks. Each subtask then connects to a specific decoding layer, where it learns task-specific parameters and performs label decoding. Finally, the network jointly outputs the predicted labels of both subtasks.
Assuming the input sentence's embedding vector is represented by X = [x1, x2, …, xn], the process illustrated in Figure 1 can be expressed mathematically by Equation (1). Following the shared contextual semantic encoding, the embedding representation X is transformed into [h1, h2, …, hn], which is then fed to the specific decoding layers. The decoding layers decode each word into two corresponding subtask labels, [p1, p2, …, pn] and [q1, q2, …, qn]. The final labeling result is obtained by combining the outputs of the two subtask labels.
$$
\begin{aligned}
[h_1, h_2, \ldots, h_n] &= \mathrm{Encoder}([x_1, x_2, \ldots, x_n]) \\
[p_1, p_2, \ldots, p_n] &= \mathrm{Decoder}_{T_1}([h_1, h_2, \ldots, h_n]) \\
[q_1, q_2, \ldots, q_n] &= \mathrm{Decoder}_{T_2}([h_1, h_2, \ldots, h_n]) \\
[y_1, y_2, \ldots, y_n] &= [p_1, p_2, \ldots, p_n] \oplus [q_1, q_2, \ldots, q_n]
\end{aligned}
\tag{1}
$$
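The forward pass of Equation (1) can be sketched in PyTorch as follows. This is a minimal illustration only; the class and attribute names (MTLTagger, boundary_head, type_head) and the use of the HuggingFace BertModel are assumptions, not the authors' released implementation.

```python
import torch.nn as nn
from transformers import BertModel  # assumed tooling: HuggingFace Transformers

class MTLTagger(nn.Module):
    """Shared BERT encoder with two task-specific linear decoders, following Equation (1)."""

    def __init__(self, num_boundary_labels: int, num_type_labels: int):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")  # shared parameters
        hidden = self.encoder.config.hidden_size  # 768 for the base model
        # Task-specific fully connected decoding layers (Section 3.4).
        self.boundary_head = nn.Linear(hidden, num_boundary_labels)  # Decoder_T1
        self.type_head = nn.Linear(hidden, num_type_labels)          # Decoder_T2

    def forward(self, input_ids, attention_mask):
        # Shared contextual encoding: [x1, ..., xn] -> [h1, ..., hn]
        h = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        p = self.boundary_head(h)  # per-character boundary logits
        q = self.type_head(h)      # per-character type logits
        return p, q                # combined into the final labels after decoding
```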

3.2. Input Representation Layer

The work of Collobert et al. [16] highlights the importance of word vectors in enhancing sequence annotation performance. Numerous studies have released pre-trained word vector matrices built on large text corpora, such as Word2Vec [20], GloVe [21], ELMo [22], and BERT [4]. Notably, the BERT pre-trained model generates a distinct vector representation for each word based on its context, which differs from traditional static word vector embeddings. This dynamic approach enables BERT to better account for the context of words, which is especially important in Chinese, where a single word can have multiple meanings. Therefore, the embedding vectors output by the BERT pre-trained model can significantly enhance downstream sequence labeling performance.
Chinese word segmentation tools can degrade in the presence of linguistic ambiguity; for example, "北京大学" (Peking University) may be split into "北京" (Beijing) and "大学" (University), so that "Peking University" can no longer be recognized as an entity. To prevent such error propagation, this model adopts a separate decoding branch for the boundary prediction task, which replaces the role of word segmentation in Chinese named entity recognition. As such, the network feeds the BERT Chinese pre-trained model with character-granularity input only, as shown in Equation (2). For a given sequence T = [t1, t2, …, tn], the special classification token ([CLS]) serves as the first token and the special separator token ([SEP]) is added as the final token; the token embedding, segment embedding (which carries global semantic information), and position embedding are summed to obtain the total character encoding.
$$E = \mathrm{TokenEmbeddings}(T) + \mathrm{SegEmbeddings}(T) + \mathrm{PosEmbeddings}(T) \tag{2}$$
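To illustrate the character-granularity input described above, the sketch below tokenizes a sample sentence with the HuggingFace tokenizer for Bert-base-Chinese; this tooling choice is an assumption (the paper only specifies PyTorch and the Bert-base-Chinese pre-trained model). The token, segment, and position embeddings of Equation (2) are then summed inside the model.

```python
from transformers import BertTokenizerFast  # assumed tooling: HuggingFace Transformers

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")

text = "茶园主要发生茶白星病"  # sample sentence: "tea white star disease mainly occurs in the tea plantation"
ids = tokenizer(text)["input_ids"]
# The Chinese BERT vocabulary operates at character granularity, so each character
# maps to one token, with [CLS] prepended and [SEP] appended automatically.
print(tokenizer.convert_ids_to_tokens(ids))
# ['[CLS]', '茶', '园', '主', '要', '发', '生', '茶', '白', '星', '病', '[SEP]']
```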

3.3. Feature Coding Layer

The MTL-BERT network is a standard encoder-decoder model in which the encoder learns shared parameters while the decoder learns subtask-specific parameters and generates annotated sequences.
The coding sequence matrix, containing word, position, and segment information, is first fed into the Transformer encoding layers of the BERT model for feature extraction. The Transformer is a powerful feature extractor composed entirely of attention mechanisms, which makes it well suited to handling longer sequences and enables parallel computation, thereby significantly improving training speed.
The attention mechanism enables the model to better encode the current word by attending to all words of the input sequence. The principle of attention is given by Equation (3), where Q, K, and V are the query, key, and value vectors, respectively, created by multiplying the word embedding with three weight matrices that are learned during network training. When encoding a word, the representations of all words (value vectors) are weighted and summed; each weight is obtained from the dot product of the key vectors of all input words with the representation of the word being encoded (the query vector), followed by a Softmax operation. Scaling by the factor √dk stabilizes the gradients, and Softmax normalizes the weight scores over all words.
$$\mathrm{Att}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$
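Equation (3) is the standard scaled dot-product attention; a minimal sketch is shown below, with illustrative tensor names and shapes.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Att(Q, K, V) = Softmax(Q K^T / sqrt(d_k)) V, as in Equation (3)."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarity, scaled for gradient stability
    weights = F.softmax(scores, dim=-1)                # normalized attention weights
    return weights @ V                                 # weighted sum of value vectors

# Example with a batch of 2 sentences, 10 characters, dimension 64.
Q = K = V = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(Q, K, V)  # shape (2, 10, 64)
```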

3.4. Label Decoding Layer

Label decoding is typically the final stage of the NER task. After transforming the word embedding representation into a contextually relevant representation, there is often a significant semantic gap between the hidden layer of the model and the outputs defined by the two tasks. To address this, the network includes a fully connected layer in the decoding layer of each of the two subtasks, enabling them to express their unique learning content more effectively.
Ultimately, label decoding utilizes the probability maximum of the topmost output of the network for prediction. This process can be expressed as:
$$p = f(WH) \in \mathbb{R}^{N \times Y} \tag{4}$$
where N is the sentence length, Y is the number of tags, H is the output of the fully connected layer, W is a learnable parameter matrix, and the function f represents the decoding method.

3.5. Loss Function Design

In addition to the multi-task learning design that represents the knowledge of all tasks in the same parameter space, it is equally important to consider how to design the interaction pattern between subtasks. As such, the loss function design takes into account the feature interactions between subtasks, and the auxiliary information of the boundary labeling task is incorporated into the entity type labeling learning.
Given the large number of non-entity words in a sequence (for example, 'O' labels account for 83% of all labels in the CoNLL-2003 dataset), and the fact that the type labeling task does not need to learn features for non-entity words, the network can greatly reduce computational cost by masking out the positions that the boundary labeling task has already identified as non-entities. This allows the type labeling task to focus on the small number of entity words. The loss functions are designed according to this idea as follows:
$$[y_1, y_2, \ldots, y_n] = \mathrm{Softmax}(\mathrm{Linear}([h_1, h_2, \ldots, h_n])) \tag{5}$$
$$M = \left([y_1, y_2, \ldots, y_n] \neq [\mathrm{O}, \mathrm{O}, \ldots, \mathrm{O}]\right) \tag{6}$$
$$L_{task1} = -\sum_{j=1}^{n} \log(p_j)\, y_j^{T} \tag{7}$$
$$L_{task2} = -\sum_{j=1}^{n} \log(p_j)\, y_j^{T} M_j \tag{8}$$
In Equation (5), [y1, y2, …, yn] denotes the boundary annotation result obtained using Softmax as the decoding method, where 'O' represents the non-entity annotation type. Equation (6) defines the mask matrix M generated from the decoding result of the boundary annotation. In Equations (7) and (8), Ltask1 and Ltask2 are the cross-entropy loss functions of entity boundary annotation and type annotation, respectively; pj and yj denote the network's predicted probability distribution and the label of each word, and n is the sentence length. The mask matrix M eliminates the loss contributions of non-entity positions in task 2, thereby preventing redundant learning of the entity type labeling task on non-entity words. Regarding the overall loss of the network, the simplest approach is to add the loss values of the two tasks directly. However, this is problematic: because different tasks have different learning difficulties, direct summation is likely to let one task dominate the learning of the others, which degrades the final result.
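A sketch of the loss computation in Equations (5)-(8) is given below, assuming per-character logits for each subtask and an index o_id for the non-entity boundary label 'O'; the function and argument names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def multitask_losses(boundary_logits, type_logits, boundary_gold, type_gold, o_id: int):
    """Cross-entropy losses of both subtasks; task 2 is masked on predicted non-entity positions.

    Logits have shape (batch, seq_len, num_labels); gold labels have shape (batch, seq_len).
    """
    # L_task1: boundary labeling loss over every character (Equation (7)).
    l_task1 = F.cross_entropy(boundary_logits.transpose(1, 2), boundary_gold)

    # Mask M: positions whose decoded boundary label is not 'O' (Equations (5) and (6)).
    boundary_pred = boundary_logits.argmax(dim=-1)
    mask = (boundary_pred != o_id).float()

    # L_task2: type labeling loss with non-entity positions removed (Equation (8)).
    per_token = F.cross_entropy(type_logits.transpose(1, 2), type_gold, reduction="none")
    l_task2 = (per_token * mask).sum() / mask.sum().clamp(min=1.0)

    return l_task1, l_task2
```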
In this paper, we propose a method to express the loss value of the total task as a weighted sum of the loss values of the two subtasks. Specifically, we set fixed weight parameters α and β for the loss value of each task, as shown in Equation (9). This approach enables manual adjustment of the importance ratio of the two tasks.
$$L = \alpha L_{task1} + \beta L_{task2} \tag{9}$$
However, these fixed hyperparameters remain constant throughout the training cycle, which may be disadvantageous for multi-task learning. Inspired by previous work, this paper also incorporates the dynamic weighted average [23] and uncertainty weighting [15]. These two dynamic weighting approaches are added to the experimental comparison as a way to explore the best solution to the multi-task learning balance problem.
The main idea of the dynamic weighted average (DWA) approach is that each task should converge at a similar rate: when a task's loss converges faster, the task is considered easier to learn and its weight is reduced, and vice versa. Equations (10) and (11) illustrate the idea of DWA, where wi(k) denotes the weight of task i at step k, Li(k−1) and θi(k−1) denote the loss value and training speed of task i at step k−1, respectively, and a smaller θi(k−1) indicates that the task trained faster at step k−1. N is the number of tasks, and R is a constant that controls the softness of the weight assignment.
$$\theta_i(k-1) = \frac{L_i(k-1)}{L_i(k-2)} \tag{10}$$
$$w_i(k) = \frac{N \exp\left(\theta_i(k-1)/R\right)}{\sum_{n} \exp\left(\theta_n(k-1)/R\right)} \tag{11}$$
In contrast, the uncertainty weighting (UW) approach models the uncertainty arising from the data itself (e.g., mislabeled examples) or from the nature of the task, and tasks that are noisy and difficult to learn receive smaller weights. Equation (12) illustrates this approach, where σi represents the uncertainty of task i: the larger σi is, the greater the task's uncertainty and the smaller its weight, while the log σi² term acts as a regularizer that prevents σi from growing too large during training.
$$L(k, \sigma_1, \sigma_2, \ldots, \sigma_i) = \sum_{i} \frac{1}{2\sigma_i^{2}} L_i(k) + \log \sigma_i^{2} \tag{12}$$
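The uncertainty-weighted total loss of Equation (12) can be sketched as below, with the log-variances treated as learnable parameters (a common implementation choice, assumed here rather than taken from the paper).

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Total loss = sum_i L_i / (2 * sigma_i^2) + log(sigma_i^2), following Equation (12)."""

    def __init__(self, num_tasks: int = 2):
        super().__init__()
        # Learn log(sigma_i^2) for numerical stability; initialized to 0, i.e. sigma_i = 1.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for loss, log_var in zip(task_losses, self.log_vars):
            # 0.5 * exp(-log_var) equals 1 / (2 * sigma^2); log_var acts as the regularizer.
            total = total + 0.5 * torch.exp(-log_var) * loss + log_var
        return total
```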

4. Experiment and Analysis

4.1. Dataset

The experiments were conducted on several publicly available datasets, including Weibo NER [24], MSRA [25], OntoNote4.0 Chinese, CCKS2019 (medical), and the self-built agricultural pest dataset AG001. Weibo NER is a Chinese named entity recognition dataset in the social media domain, consisting of geopolitical (GPE), person (PER), location (LOC), and organization (ORG) entity categories, further divided into specific entities (named entity, NE) and referential entities (nominal mention, NM) at a finer granularity, which poses a challenge to the entity identification task. The MSRA dataset, published by Microsoft Research Asia, contains three entity types: person (PER), organization (ORG), and location (LOC). The OntoNote4.0 Chinese data comes from a mixed English and Chinese corpus and contains four entity types: location (LOC), organization (ORG), geopolitical entity (GPE), and person (PER). CCKS2019 is a dataset designed for named entity recognition in Chinese electronic medical records, consisting of six types of entities: diseases and diagnoses (DISEASES), examinations (EXAMINATIONS), tests (TEST), surgeries (TREATMENT), drugs (DRUG), and anatomical sites (BODY).
The experiments also included comparisons of multitask learning in English and Chinese, with experiments conducted on the English corpus CoNLL-2003 [26] and the MIT-Movie dataset. Table 1 presents the basic information of the above datasets.

Self-Built Chinese Agricultural Dataset AG001

This dataset is designed for the named entity recognition task in agriculture, where accurate recognition with high recall is critical for subsequent tasks such as pest and disease warning and control recommendations. The diversity, specialization, and locality of agricultural entities make named entity recognition in agriculture an important and challenging task, and existing datasets often have limited coverage that cannot meet the requirements of different domains and scenarios.
To solve this problem, we extracted agriculture-related entities, such as crop names, pest names, and treatments, from pest intelligence in Changshan and Lin’an, and built a Chinese named entity recognition dataset. The dataset contains 13,000 texts, each of which is no more than 256 characters long.
To ensure that each entity in the dataset retains the semantic meaning of its context, the text was pre-processed at character granularity only, which reduces the input dimension of the model while retaining the semantic information of the text. Data cleaning was then performed to remove noise and irrelevant information, which improves the generalization ability of the model and reduces the risk of overfitting. Finally, the dataset was tagged with the BIO scheme, where the first character of each entity is tagged with a B-category label, the remaining characters with I-category labels, and non-entity characters with O, to facilitate learning the boundaries and categories of named entities. The dataset includes three entity categories: crop name (FIS), pest name (DIS), and treatment (TES).
We divided the dataset into training, validation, and test sets in the ratio of 7:1.5:1.5. Our dataset is sufficiently difficult and representative to effectively test the model for domain-specific named entity recognition and to demonstrate and evaluate the robustness and generality of the proposed model. Examples of the datasets are provided below.
Example Text: 临安市农业技术推广中心植保站站长王建军介绍, 目前, 临安市茶园主要发生的病虫害有茶白星病, 茶红蜘蛛, 茶尺蠖等, 建议茶农及时使用25%的咪鲜胺乳油等进行防治. (English gloss: Wang Jianjun, head of the plant protection station of the Lin'an Agricultural Technology Extension Center, explained that the main diseases and pests currently occurring in Lin'an tea plantations include tea white star disease, tea red spider mite, and tea geometrid, and advised tea farmers to promptly apply 25% prochloraz emulsifiable concentrate for control.)
Label: O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-FIS I-FIS O B-DIS I-DIS O B-DIS I-DIS O B-DIS I-DIS O O O O O O O O O O O O O O O O O O O O O O O O O O O O O B-TES I-TES I-TES I-TES I-TES O O O O O.

4.2. Experimental Setup

The experiment begins by splitting the labels of the original dataset into two label sets so that the network can backpropagate on each of the two subtasks. For instance, for the Weibo NER dataset, the boundary label set T1 includes {'B', 'I', 'O'}, and the type label set T2 includes {'GPE.NE', 'GPE.NM', 'PER.NE', 'PER.NM', 'LOC.NE', 'LOC.NM', 'ORG.NE', 'ORG.NM', ''}, where the empty label denotes non-entity characters. Each original character tag can be represented by a combination of one sub-tag from T1 and one from T2, as illustrated in the sketch below.
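The label split can be sketched as follows for BIO-style source tags; the helper name and the sample tags are illustrative.

```python
def split_label(tag: str):
    """Split an original character tag into a boundary sub-tag (T1) and a type sub-tag (T2)."""
    if tag == "O":
        return "O", ""                              # non-entity: empty type label
    boundary, entity_type = tag.split("-", 1)       # e.g. "B-GPE.NE" -> ("B", "GPE.NE")
    return boundary, entity_type

# The original tag sequence becomes two parallel sub-tag sequences.
tags = ["B-PER.NE", "I-PER.NE", "O", "B-LOC.NM"]
t1, t2 = zip(*(split_label(t) for t in tags))
print(t1)  # ('B', 'I', 'O', 'B')
print(t2)  # ('PER.NE', 'PER.NE', '', 'LOC.NM')
```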
The experiments were conducted on the PyTorch deep learning platform and accelerated with a GeForce RTX 3080 GPU. For the Chinese datasets, the model inputs are word vector embeddings from the pre-trained Bert-base-Chinese model, which consists of 12 encoding layers, 768 hidden nodes, and 12 attention heads. For the English datasets, the inputs are word vector embeddings from the pre-trained Bert-base model. The initial learning rate is set to 0.00005. Owing to the large number of BERT parameters, the network converges after only three rounds of training. Table 2 lists the experimental environment configuration.

4.3. Experimental Results and Analysis

4.3.1. Introduction to Quantitative Model Performance Metrics

In this experiment, we evaluate the model using precision (P), recall (R), and the comprehensive evaluation index F1. The F1 value provides a balanced overall assessment of the model's performance.
Precision (P) is calculated by dividing the number of correctly identified entities by the total number of entities identified by the model, as shown in Equation (13).
$$P = \frac{TP}{TP + FP} \tag{13}$$
Recall (R) is calculated by dividing the number of correctly identified entities by the total number of true entities, as shown in Equation (14).
$$R = \frac{TP}{TP + FN} \tag{14}$$
The F1 value is the composite evaluation index and is the harmonic mean of P and R, as shown in Equation (15).
$$F1 = \frac{2 \times P \times R}{P + R} \tag{15}$$
Here, TP (True Positive) represents the number of correctly identified entities, FP (False Positive) represents the number of incorrectly identified non-entities, and FN (False Negative) represents the number of unidentified true entities.
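For reference, the three metrics can be computed directly from these counts, as in the short sketch below.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Entity-level precision, recall, and F1 from Equations (13)-(15)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```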

4.3.2. Model Comparison

We conducted several experiments on multiple datasets to fine-tune the model and to evaluate it with respect to the characteristics of each dataset. We also compared MTL-BERT with several state-of-the-art models on the public datasets, including LR-CNN (Zhang et al., 2015), BiLSTM-CRF (Ma and Hovy, 2016), Lattice-LSTM (Liu et al., 2018), BERT (Devlin et al., 2018), BERT-WWM (Cui et al., 2020), ChineseBERT (Sun et al., 2021), and W2NER (Li et al., 2022). Table 3 presents a comparison of these models.

4.3.3. Model Performance Comparison Experiments

Table 4 and Table 5 present the comparative experimental results of the MTL-BERT model for the Chinese named entity recognition task on diverse datasets. Table 4 indicates that the MTL-BERT model achieves the highest F1 values of all compared models, with 73.8%, 96.5%, and 86.7% on the Weibo NER, MSRA, and OntoNote4.0 datasets, respectively. These results indicate that the MTL-BERT model is more effective than the other models at recognizing Chinese named entities; even without incorporating lexical features, it outperforms the ChineseBERT and W2NER networks, mainly owing to the powerful contextual feature extraction capability of the BERT model. Moreover, the MTL-BERT model outperforms the state-of-the-art W2NER by 3%, 0.5%, and 4.5% on the Weibo NER, MSRA, and OntoNote4.0 datasets, respectively, demonstrating that the multi-task learning model can effectively improve entity recognition, particularly on the Weibo NER dataset, which is smaller and noisier.
Table 5 shows the comparative experimental results of the MTL-BERT model for named entity recognition in specific domains. On the AG001 and CCKS2019 datasets, the F1 values of the MTL-BERT model reach 90.79% and 89.03%, respectively, which represent significant improvements over the single BERT model, and the number of training rounds is significantly reduced. These results demonstrate that the MTL-BERT model also performs well in domain-specific named entity recognition tasks.
Combining the experimental results presented in Table 4 and Table 5, we can conclude that the MTL-BERT model effectively improves entity recognition and performs well on regular and domain-specific datasets.

4.3.4. Performance Comparison on Chinese and English Datasets

In this study, we used the MTL-BERT model to verify the performance improvement brought by multi-task learning on Chinese and English datasets. The results in Table 6 show that MTL-BERT yields only a limited improvement in F1 values on the English datasets, whereas the gains over the single-target task on the Chinese datasets (Table 4) are significant. This indicates that multi-task learning is more helpful for languages such as Chinese that lack explicit word boundaries, and less effective for languages with explicit word boundaries such as English. Therefore, when dealing with different types of languages, appropriate learning methods and models need to be selected according to the specific situation to achieve better performance.

4.3.5. Training Efficiency Performance of the MTL-BERT Model

Finally, we conducted experiments to compare the training efficiency of the MTL-BERT and BERT models on two datasets. We randomly selected three training rounds as samples and trained the models under the same hardware environment and parameter settings. The results, shown in Table 7, indicate that the MTL-BERT model has a significantly shorter convergence time on both datasets, with a 1/3 reduction in training time on the larger MSRA dataset. This demonstrates the superiority of multi-task learning in terms of learning efficiency.
There are two main reasons for this. Firstly, as discussed in the introduction, the label categories of the MTL-BERT model are greatly reduced during multi-task learning. Secondly, the feature interactions between subtasks allow the boundary-labeled non-entity information to propagate to the type-labeled task, preventing ineffective learning on non-entity words. These two key factors help the network save computational costs during decoding and back-propagation, thus effectively improving the learning efficiency of the model.

5. Ablation Studies

In this section, we conduct ablation studies to understand the behavior of MTL-BERT. We use the Chinese named entity recognition dataset Weibo NER for analysis, and all models are based on the base version.

5.1. Individual Performance Comparison of the Two Subtasks

The purpose of this experiment is to investigate the performance improvement of multi-task learning by evaluating the two subtasks separately. In order to eliminate the influence of the two tasks on each other’s evaluations, the original true labels are split and the F1 values are recalculated based on the predicted labels of the two subtasks. The experimental results are presented in Table 8, and it is observed that the MTL-BERT model with the DWA task balancing approach does not significantly improve the performance of the two subtasks compared to the BERT model. This suggests that the key to multi-task learning lies not in the individual optimization of the two subtasks, but in the overall performance improvement achieved by information fusion and feature interaction during joint learning. Thus, multi-task learning achieves better performance by optimizing multiple tasks simultaneously, resulting in more comprehensive information interaction between different tasks.

Experiments on the Performance Impact of the Multi-Task Balancing Approach

This study presents two multi-task balancing approaches, namely fixed weighting and dynamic weighting, in Section 3.5, to address the balancing problem in multi-task learning. The effectiveness of these two approaches is evaluated on the Weibo NER dataset.
Firstly, Figure 2 illustrates the trend of the F1 value of the MTL-BERT model under different subtask weighting ratios. The results indicate that the fixed-weighting approach is sensitive to the choice of coefficients: once the weighting ratio is set appropriately, the F1 value fluctuates only slightly, and with equal weights for the boundary labeling and type labeling tasks (0.5 Ltask1 + 0.5 Ltask2) the accuracy approaches the best result achieved with the DWA balancing strategy.
Furthermore, Table 9 shows that the MTL-BERT model with either of the two dynamically weighted task balancing strategies outperforms the fixed-weighting strategy. Specifically, the DWA strategy achieves the best F1 value of 73.8%, followed by the UW strategy with 72.2%. These results suggest that adjusting task weights according to the real-time learning progress of each task is a more effective solution to the multi-task balancing problem.
In conclusion, this study experimentally investigates the balancing problem in multi-task learning and proposes two multi-task balancing approaches, fixed weighting and dynamic weighting. The results show that the method of dynamically adjusting task weights according to the real-time learning effects is more effective, but the multi-task balancing approach with fixed weighting coefficients performs well under appropriate weighting ratio coefficients.

6. Conclusions

In this paper, we propose MTL-BERT, a multi-task learning network for Chinese named entity recognition. The network decomposes the named entity recognition task into two subtasks, boundary annotation and type annotation, based on character encoding input only, without introducing additional features. The two tasks jointly learn semantic knowledge in the shared encoding layer before decoding independently. The experiments show that the method is effective and achieves state-of-the-art performance on three general Chinese datasets and two domain-specific Chinese datasets. The ablation experiments further confirm the effectiveness of the multi-task design. Notably, MTL-BERT significantly reduces redundant computation through multi-task label decomposition and information interaction, resulting in a substantial improvement in learning efficiency. Our framework and model are easy to use, facilitating further NER research. In future work, we will explore more suitable multi-task information fusion methods to further improve the effectiveness of Chinese named entity recognition.

Author Contributions

Conceptualization, Q.F.; Formal analysis, Q.F.; Investigation, Y.L.; Data curation, Y.R.; Funding acquisition, H.F.; Methodology, Q.F.; Resources, H.F.; Writing—original draft, Q.F. and Y.L.; Writing—Review, Y.L.; Visualization, Q.F. All authors have read and agreed to the published version of the manuscript.

Funding

Key R&D Projects in Zhejiang Province (2022C02009, 2022C02044, 2022C02020); Basic Public Welfare Project of Zhejiang Province (GN21F020001); Three agricultural nine-party science and technology collaboration projects of Zhejiang Province (2022SNJF036).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Li, M.Y.; Kong, F. Named Entity Recognition in Social Media Incorporating Self-Attention Mechanism. J. Tsinghua Univ. 2019, 59, 48–54. [Google Scholar]
  2. Zhang, Y.; Yang, J. Chinese NER using lattice LSTM. arXiv 2018, arXiv:1805.02023. [Google Scholar]
  3. Gui, T.; Ma, R.; Zhang, Q.; Zhao, L.; Jiang, Y.G.; Huang, X. CNN-Based Chinese NER with Lexicon Rethinking. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), Macao, China, 10–16 August 2019; pp. 4982–4988. [Google Scholar]
  4. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  5. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF models for sequence tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  6. Lafferty, J.; McCallum, A.; Pereira, F.C.N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data; Penn Libraries: Philadelphia, PA, USA, 2001. [Google Scholar]
  7. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural architectures for named entity recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar]
  8. Ma, X.; Hovy, E. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv 2016, arXiv:1603.01354. [Google Scholar]
  9. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  10. Cui, Y.; Che, W.; Liu, T.; Qin, B.; Yang, Z.; Wang, S.; Hu, G. Pre-training with whole word masking for chinese bert. arXiv 2019, arXiv:1906.08101. [Google Scholar] [CrossRef]
  11. Sun, Z.; Li, X.; Sun, X.; Meng, Y.; Ao, X.; He, Q.; Wu, F.; Li, J. ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 2065–2075. [Google Scholar]
  12. Li, J.; Fei, H.; Liu, J.; Wu, S.; Zhang, M.; Teng, C.; Ji, D.; Li, F. Unified Named Entity Recognition as Word-Word Relation Classification. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, Virtual Event, 22 February–1 March 2022; pp. 10965–10973. [Google Scholar]
  13. Caruana, R. Multitask learning. Mach. Learn. 1997, 28, 41–75. [Google Scholar] [CrossRef]
  14. Doersch, C.; Zisserman, A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2051–2060. [Google Scholar]
  15. Kendall, A.; Gal, Y.; Cipolla, R. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2800–2809. [Google Scholar]
  16. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
  17. Søgaard, A.; Goldberg, Y. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Short Papers. Volume 2, pp. 231–235. [Google Scholar]
  18. Lin, Y.; Yang, S.; Stoyanov, V.; Ji, H. A multi-lingual multi-task architecture for low-resource sequence labeling. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Long Papers. Volume 1, pp. 799–809. [Google Scholar]
  19. Changpinyo, S.; Hu, H.; Sha, F. Multi-task learning for sequence tagging: An empirical study. arXiv 2018, arXiv:1808.04151. [Google Scholar]
  20. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  21. Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543. [Google Scholar]
  22. Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep Contextualized Word representations. arXiv 2018, arXiv:1802.05365v2. [Google Scholar]
  23. Liu, S.; Johns, E.; Davison, A.J. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 1871–1880. [Google Scholar]
  24. Peng, N.; Dredze, M. Named entity recognition for chinese social media with jointly trained embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 548–554. [Google Scholar]
  25. Levow, G.A. The third international Chinese language processing bakeoff: Word segmentation and named entity recognition. In Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, Sydney, Australia, 22–23 July 2006; pp. 108–117. [Google Scholar]
  26. Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
  27. Huang, Z.; Xu, W.; Yu, K. Lattice LSTM for Chinese Word Segmentation. arXiv 2017, arXiv:1709.04185. [Google Scholar]
  28. Zhang, X. A Novel Neural Network Model Based on Convolutional Neural Network and Logistic Regression for Chinese Named Entity Recognition. J. Comput. Sci. 2019, 30, 61–68. [Google Scholar]
Figure 1. Multi-task learning network model.
Figure 2. Variation of F1 values in relation to subtask weighting ratio.
Table 1. Dataset information.
Data Set | Number of Entities | Training Set | Validation Set | Test Set
Weibo NER 1 | 8 | 1.4 k | 0.3 k | 0.3 k
MSRA 1 | 3 | 46.4 k | - | 4.4 k
OntoNote4.0 1 | 4 | 11.0 k | 2.3 k | 2.3 k
CCKS2019 1 | 6 | 6.5 k | 1.4 k | 1.4 k
AG001 1 | 3 | 2.1 k | 0.4 k | 0.4 k
CoNLL-2003 2 | 4 | 15.0 k | 3.5 k | 3.7 k
MIT-Movie 2 | 12 | 7.8 k | - | 2.0 k
1 Chinese dataset, 2 English dataset.
Table 2. Experimental environment configuration.
Item | Configuration
GPU | NVIDIA RTX 3080
CPU | AMD Ryzen 5 5600X
CUDA | 11.1
Memory | 16 GB
Operating System | Windows 11
Table 3. Model Comparison.
 | BERT | BERT-WWM | ChineseBERT | W2NER | MTL-BERT (ours)
Number of coding layers | 12 | 12 | 12 | 24 | 30 k
Hidden nodes | 768 | 768 | 768 | 1024 | char
Optimizer | AdamW | LAMB | AdamW | AdamW | AdamW
Masking | MLM | WWM | CCM | 16 | MLM
Multi-task learning | NO | NO | NO | NO | YES
Table 4. Comparative experimental results of Chinese named entity recognition task (%).
Model | Weibo NER (P / R / F1) | MSRA (P / R / F1) | OntoNote4.0 (P / R / F1)
LR-CNN (Zhang et al., 2015) | 57.1 / 66.7 / 58.9 | 94.5 / 92.9 / 93.7 | 80.12 / 84.32 / 82.11
BiLSTM-CRF (Ma and Hovy, 2016) | 60.8 / 52.9 / 56.6 | 92.3 / 92.4 / 92.4 | 79.27 / 76.34 / 78.22
Lattice-LSTM (Liu et al., 2018) | 53.0 / 62.3 / 59.8 | 93.6 / 92.8 / 93.2 | 80.56 / 83.22 / 81.93
BERT (Devlin et al., 2018) | 66.4 / 67.2 / 66.8 | 94.0 / 94.1 / 94.0 | 80.94 / 85.95 / 83.28
BERT-WWM (Cui et al., 2020) | - / - / 68.5 | - / - / 95.3 | 81.21 / 84.39 / 83.23
ChineseBERT (Sun et al., 2021) | 68.27 / 69.78 / 69.02 | 94.6 / 95.4 / 94.6 | 80.77 / 83.65 / 82.18
W2NER (Li et al., 2022) | 70.84 / 70.84 / 70.84 | 96.08 / 96.10 / 96.12 | 82.31 / 83.36 / 83.08
MTL-BERT (ours) | 74.9 / 72.7 / 73.8 | 96.5 / 96.6 / 96.5 | 85.9 / 87.7 / 86.7
Table 5. Domain-specific Named Entity Recognition Comparison Experiment Results (%).
Model | AG001 (Epoch / P / R / F1) | CCKS2019 (Epoch / P / R / F1)
BERT (Devlin et al., 2018) | 10 / 89.64 / 80.23 / 84.5 | 10 / 78.34 / 83.22 / 81.56
MTL-BERT (ours) | 3 / 92.19 / 88.90 / 90.79 | 3 / 89.33 / 88.92 / 89.03
Table 6. Comparison of the performance of the BERT, W2NER, and MTL-BERT models on the English datasets (%).
Models | CoNLL-2003 (P / R / F1) | MIT-Movie (P / R / F1)
BERT | 90.71 / 92.10 / 91.42 | 67.5 / 72.5 / 69.9
W2NER | 92.71 / 93.44 / 93.07 | 69.1 / 74.2 / 71.2
MTL-BERT | 91.91 / 93.73 / 92.82 | 68.1 / 73.8 / 70.8
Table 7. Comparison of the training time per round of the BERT and MTL-BERT models on the Weibo NER and MSRA datasets.
Models | Weibo NER | MSRA
BERT | 51.5 s/iter | 1742.1 s/iter
MTL-BERT | 37.2 s/iter | 1164.8 s/iter
Table 8. Comparison of the performance of the BERT and MTL-BERT models regarding the separate F1 values of the two subtasks (%).
Models | Boundary Labeling | Type Labeling | Overall Entity Labeling
BERT | 84.4 | 75.2 | 66.8
MTL-BERT | 83.3 (-0.9) | 75.2 (-) | 73.8 (+7)
Table 9. Comparison of experimental results of fixed weighting and two kinds of dynamic weighting (%).
Methods | P | R | F1
0.5 Ltask1 + 0.5 Ltask2 | 86.8 | 70.9 | 72.1
DWA | 74.9 | 72.7 | 73.8
UW | 73.4 | 70.9 | 72.2
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
