3.1. Sampling Algorithm
In this paper, the uncertainty-based AL framework was used to train the NER model. Classical sampling algorithms include Token Entropy (TE) [29] and Least Confidence (LC) [30]. However, HAZOP reports are written in a normative, professional register, so these algorithms are poorly suited to the field of chemical safety and bring no obvious improvement in model performance. To solve this problem, three novel sampling algorithms based on TE and LC are proposed in this study and analyzed below.
In machine learning (ML), entropy [31] is a measure of disorder: it quantifies the amount of information carried by an event with multiple possible states and the expected information of the event's probability distribution. For a discrete random variable Y, the information entropy can be calculated by Formula (1).
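Formula (1) is not reproduced in this excerpt; in its standard (Shannon) form, the information entropy of a discrete random variable Y with possible states y1, …, yn is

$$H(Y) = -\sum_{i=1}^{n} P(y_i)\,\log P(y_i).$$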
According to the meaning of entropy, the amount of information is inversely related to the probability of an event. Based on the posterior entropy of the model, the TE algorithm was proposed by Settles et al. [29]. As Formula (1) shows, TE relies on the predicted probability P(yi) of the sample label; because the initial model trained on HAZOP text is not accurate enough, the algorithm does not perform well. Therefore, a variation of token entropy (VTE) is proposed in this study based on TE.
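TE is the classical token-entropy strategy for sequence labeling; for reference, here is a minimal sketch of its usual form (the per-position entropy of the marginal label distribution, averaged over the sequence). The function name and the (T, L) marginal-array convention are illustrative, not from the original.

```python
import numpy as np

def token_entropy(probs: np.ndarray) -> float:
    """Token Entropy (TE) score for one sequence, in its usual form: the
    per-position entropy of the label distribution, averaged over the T
    positions. probs has shape (T, L) with entries P(y_t = l)."""
    eps = 1e-12                                       # avoid log(0)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)  # entropy at each position
    return float(ent.mean())                          # average over the sequence
```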
Formula (2) is the calculation formula of VTE, where T is the length of the sequence and l is the label type. P(yt = l) is shorthand for the marginal probability that label l occurs at position t, and Q(yt = l) is the probability that label l does not occur at position t. The VTE algorithm introduces the variable Q(yt = l) so that the model considers Q(yt = l) and P(yt = l) simultaneously: the false prediction value of a sample constrains its real prediction value, which frees the algorithm from over-reliance on the predictive ability of the initial model. The value of φVTE is compared with a given threshold τ, and the samples whose φVTE exceeds τ (that is, the information-rich samples) are selected. The curve of the VTE algorithm is shown in Figure 3, where the shaded layer represents the set threshold τ, and the first derivative of VTE is shown in Figure 4. As Figures 3 and 4 show, φVTE decreases monotonically as the predicted value of the label increases: the lower the predicted value, the richer the information of the corresponding sample, the higher φVTE, and the larger its gap to the threshold, so the sampling error is reduced. When the predicted value of the label lies in the range 0 to 0.2, φVTE changes dramatically and samples of different quality are separated effectively, which helps the algorithm select high-quality samples; when the predicted value lies in the range 0.8 to 1, low-quality samples are filtered out well. Therefore, the VTE algorithm improves the iterative efficiency of model training, weakens the negative impact of the initial model's accuracy on the sampling process, and reduces the sampling error.
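Formula (2) is not reproduced in this excerpt, so the sketch below only illustrates the idea described above, not the paper's definition: a score that combines P(yt = l) with its complement Q(yt = l) = 1 − P(yt = l), is steep for small predicted values and flat near 1, and keeps samples whose score exceeds τ. The functional form −Q·log P is an assumption chosen to match that described behavior.

```python
import numpy as np

def vte_score(probs: np.ndarray) -> float:
    """Illustrative VTE-style score (assumed form, not Formula (2) verbatim).

    probs: (T, L) array of marginals P(y_t = l) for a sequence of length T
    over L label types. Q = 1 - P is the 'false prediction value' used to
    constrain the real prediction value. The term -(Q * log P) decreases
    monotonically in P, changes steeply for P in (0, 0.2), and flattens for
    P in (0.8, 1), matching the behavior described for Figures 3 and 4.
    """
    eps = 1e-12                                  # avoid log(0)
    p, q = probs, 1.0 - probs
    return float(-(q * np.log(p + eps)).sum(axis=1).mean())

def select_informative(pool, marginals_fn, tau: float):
    """Keep the samples whose VTE score exceeds the threshold tau."""
    return [x for x in pool if vte_score(marginals_fn(x)) > tau]
```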
In machine learning, cross-entropy measures the difference between two probability distributions and can be used to reduce the uncertainty of a system. For a sample x, the cross-entropy based on the current model can be calculated by Formula (3),
where T is the length of the sequence and l is the label type. The posterior entropy and the cross-entropy of the current model are considered together, and the cross-entropy term is used to reduce the sampling uncertainty and improve the sampling efficiency. To reconcile the two different entropies and make the sampling algorithm fit HAZOP text, the coefficients α and β were introduced in this study, as shown in Formula (4). α and β take values in the interval [0, 1] and control the posterior entropy term and the cross-entropy term, respectively, so that the two entropies remain in a relatively balanced state.
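Formula (4) is likewise not reproduced in this excerpt; based on the description above, it combines the two terms in the weighted form sketched below, where H_post stands for the posterior-entropy term of Formula (1) and H_cross for the cross-entropy term of Formula (3) (the exact arrangement is an assumption):

$$\varphi(x) = \alpha\, H_{\text{post}}(x) + \beta\, H_{\text{cross}}(x), \qquad \alpha, \beta \in [0, 1].$$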
where P(yt = l) is the probability that the label at position t is l. Extensive experiments show that the algorithm performs best when α = 1/3 and β = 2/3: the cross-entropy term is given the higher weight, and the system uncertainty is reduced by measuring both the positive and negative prediction values of the sample label, as shown in Formula (5).
Formula (5) is simplified to Formula (6).
The correlation between entities in a HAZOP report is relatively weak, and the length of the sequence would otherwise bias the sampling results. Formula (6) is therefore normalized and its coefficients are removed; the final result is shown in Formula (7).
Formula (7) is the calculation formula of the HAZOP Confusion Entropy (HCE) algorithm proposed in this paper. HCE takes the characteristics of HAZOP text into account: the posterior entropy is given a low weight to reduce the dependence of active learning on the recognition performance of the initial model, and the cross-entropy term is introduced to improve the stability of sampling.
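To make the construction concrete, here is a minimal sketch of an HCE-style score under the description above: a posterior-entropy term weighted by α = 1/3, a cross-entropy term over the positive and negative prediction values weighted by β = 2/3, and normalization by the sequence length T. Since Formulas (5)-(7) are not reproduced here, this is an illustration of the described combination, not the paper's exact formula.

```python
import numpy as np

def hce_score(probs: np.ndarray, alpha: float = 1/3, beta: float = 2/3) -> float:
    """Illustrative HCE-style score (assumed form, not Formula (7) verbatim).

    probs: (T, L) array of marginals P(y_t = l). The posterior-entropy term
    gets the low weight alpha; the cross-entropy term, computed over the
    positive (P) and negative (Q = 1 - P) prediction values, gets the high
    weight beta. Dividing by T removes the bias toward long sequences.
    """
    eps = 1e-12                                  # avoid log(0)
    T = probs.shape[0]
    p, q = probs, 1.0 - probs
    posterior = -(p * np.log(p + eps)).sum()     # posterior entropy of the model
    cross = -(p * np.log(q + eps)).sum()         # cross-entropy over P and Q
    return float((alpha * posterior + beta * cross) / T)
```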
Culotta et al. [32] used least confidence (LC) to measure the informativeness of text and applied LC to NER: samples are sorted in descending order of the most uncertain probability predicted by the current model. Formula (8) is the calculation formula of LC. However, for some sequences the differences between samples are small. For example, in the text of “light oil level is too low, light oil tank is in trouble”, “light oil” and “light oil tank” should be classified into different entity types; the current model may make errors when predicting the sample labels, which would cause the LC algorithm to fail. In this paper, the amplification of least confidence (ALC) algorithm is proposed. ALC first makes the differences between samples more pronounced through an exponential operation; the samples are then sorted in ascending order of the ALC value output by the model. The calculation formula of ALC is shown in Formula (9); a brief proof and description follow.
For the constructed function in Formula (10), it is obvious that the function is monotonically increasing on the interval (0, 1). For any parameters φi and φj in (0, 1), inequality (11) is obtained. The predicted value P(y|{xij}) of the sample label produced by the SoftMax module also lies in the interval (0, 1), so relationship (12) can be obtained.
It can be seen from inequality (12) that the ALC algorithm amplifies the probability values calculated by the current model for each sequence and widens the differences between sequences, so that sequence samples are selected more effectively.
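A minimal sketch contrasting LC and ALC follows, assuming the usual least-confidence form 1 − P(y*|x) for Formula (8) and exp(·) as the monotonically increasing function of Formula (10). Formulas (9)-(12) are not reproduced in this excerpt, so the amplification function here is an assumption that merely satisfies the stated monotonicity property.

```python
import math

def lc_score(seq_prob: float) -> float:
    """Least Confidence: 1 - P(y* | x), where seq_prob is the probability the
    current model assigns to its most likely label sequence y*.
    Samples are ranked in descending order of this score."""
    return 1.0 - seq_prob

def alc_score(seq_prob: float) -> float:
    """ALC-style score (assumed form): apply a monotonically increasing
    function on (0, 1) -- exp(.) here, whose derivative exceeds 1 on that
    interval -- so small gaps between sequence probabilities are widened.
    Samples are ranked in ascending order of this score."""
    return math.exp(seq_prob)

# Usage sketch with illustrative sequence probabilities:
pool = [("s1", 0.41), ("s2", 0.43), ("s3", 0.90)]
ranked = sorted(pool, key=lambda s: alc_score(s[1]))  # most uncertain first
```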