Article
Peer-Review Record

Geographic Named Entity Recognition by Employing Natural Language Processing and an Improved BERT Model

ISPRS Int. J. Geo-Inf. 2022, 11(12), 598; https://doi.org/10.3390/ijgi11120598
by Liufeng Tao 1,2, Zhong Xie 1,2, Dexin Xu 3, Kai Ma 4,5, Qinjun Qiu 1,2,6,*, Shengyong Pan 7 and Bo Huang 7
Reviewer 1:
Reviewer 2:
Reviewer 4: Anonymous
Submission received: 15 September 2022 / Revised: 16 November 2022 / Accepted: 24 November 2022 / Published: 28 November 2022

Round 1

Reviewer 1 Report

The main contributions pointed out by the authors are:

a) TPCNER — a corpus of texts annotated with toponyms,

b) "A novel Chinese NER (CNER) model for the geographic domain via the improved ALBERT pretraining model and BiLSTM-CRF"

c) Evaluation of the ALBERT-BiLSTM-CRF on other corpora.

I have doubts related to the novelty of the model. I am not sure if the novelty is the fact that the authors published a new model for toponym detection or if they used a novel approach to train the model.

In the first case, the article misses an extensive comparison with other models trained on similar corpora. What is the improvement compared to models trained on Boson, MSRA, or RenMinRiBao?

In the second case, the article misses a comparison with other similar works, among other things. Below I list the questions that also arise:

1. "via the improved ALBERT pretraining" — what is the improvement?

2. Did you use an existing ALBERT model for Chinese, or did you train one yourself? If the latter, what was the procedure, and what data did you use?

3. In Table 6 you should include results for the vanilla BERT/ALBERT models, as they provide good results for many tasks.

4. In Table 6 the differences between some models are below 1 pp. Are the results calculated from single runs, or are they averages over more than one run? What was the deviation between runs? Is the difference statistically significant?

5. In Tables 6 and 7 the results for the BERT and ALBERT models are counterintuitive. ALBERT is a lightweight variant of BERT. It is supposed to be faster at the cost of accuracy. In your results, the ALBERT models have a higher F-score than the BERT models. Why is that? Could you discuss the difference?

 

"Supplementary Materials: All original data can be found in the Zenodo (https://zenodo.org/rec-ord/6482711#.YmZxWMjAiAc)." — this link does not work.

Author Response

Response:

Water systems and pipelines are less frequently seen in social media, but we included them to provide a more comprehensive description of geographically related entities from the perspective of the overall completeness of geographically named entities.

We first refer to some standards, including the rules for classifying and coding geographical names (GB/T18521-2001) and classifying and coding basic geographic information elements (GB/T13923-2006).

Therefore, we also include water systems and pipelines as a category, mainly considering that there may be some entities such as rivers, lakes and seas in social media, and these entities have an important indicator role in natural disasters, emergency response and rescue.

The above is our analysis, which is not included in the manuscript, and we look forward to receiving your understanding.

Author Response File: Author Response.docx

Reviewer 2 Report

1. The second contribution in the first section (Introduction) mentions the "improved" ALBERT pre-training model. However, there is no explanation anywhere in the paper of how this improvement is done. "Fine-tuning" might be a better term in this context.

2. In Table 1, the abbreviations stand for the shortened forms of the words. Although they represent the initial letters of the words listed in the column, it would prevent confusion to make them consistent with Figure 2 (Waterways -> WAT) and to use the actual abbreviations in the column.

3. Figure 5 (left figure) needs reconsideration. The forget gate does not use c_{t-1} as input, but in the figure it is shown as an input to the forget gate along with x_t and h_{t-1}.

4. In Section 4.4 (CRF Layer), the formula in the second paragraph is not rendered in LaTeX format.

5. In Figure 7, the legend is blocking the last data point. It would be a good idea to move the legend to the top-left corner.

6. Section 5.4 discusses different results, although there is no mention of which dataset is used to report the results. Are these results based on the gathered dataset (TPCNER)?

7. Section 5.4.2, in the second line of the first paragraph, mentions Figure 5, which seems to be incorrect. Did you mean Figure 8?

8. As you mentioned in the manuscript, in Figure 8 red characters denote errors. However, almost all entities (except Case 1 / ALBERT-BiLSTM-CRF), even the gold labels, are highlighted in red. Marking all of the entities in red leads to confusion.

9. In Section 5.4.3 (Annotated quality analysis), it would be a good idea to support the explanations with a confusion matrix.

 

Author Response

Dear Editors and Reviewers:

Thank you for your letter and for the reviewers' comments concerning our manuscript entitled "Geographic named entity recognition by employing natural language processing and an improved BERT model" (No. ijgi-1945643). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. The revised parts of the paper have been marked in red. Please let us know if any further changes are required. We hope that the revised version of the manuscript is now acceptable for publication in your journal. The main corrections in the paper and the responses to the reviewers' comments are as follows:

 

Comments and Suggestions for Authors

1. The second contribution in the first section (Introduction) mentions the "improved" ALBERT pre-training model. However, there is no explanation anywhere in the paper of how this improvement is done. "Fine-tuning" might be a better term in this context.

Response: We thank the reviewer for the very valuable comment.

The ALBERT used in this paper has several design features that enhance its performance on the task of toponym recognition from social media messages. First, our presented ALBERT uses pre-trained word embeddings that are specifically derived from social media messages. We use the GloVe word embeddings that were trained on 2 billion texts with 11 billion tokens and 1.8 million vocabulary items collected from Baidu Encyclopedia, Weibo, WeChat, etc. These word embeddings, trained specifically on a large corpus of social media messages, include many vernacular words and unregistered words used by people in social media messages. Previous geoparsing and NER models typically use word embeddings trained on well-formatted text, such as news articles, and many vernacular words are not covered by those embeddings. When that happens, an embedding for a generic unknown token is usually used to represent the vernacular word and, as a result, the actual semantics of the word are lost. Second, compared with the basic BiLSTM-CRF model, our presented model adds an ALBERT layer to capture the dynamic and contextualized semantics of words.
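To illustrate the unknown-token issue concretely, below is a minimal Python sketch of loading GloVe-style text embeddings and falling back to a generic unknown vector for out-of-vocabulary vernacular words. The file name and the 300-dimensional size are illustrative assumptions, not the actual resources used in the paper.

```python
import numpy as np

def load_glove_embeddings(path):
    """Load GloVe-style text embeddings: one token per line, token followed by its vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Hypothetical file name; the actual social-media GloVe vectors are only described above.
vectors = load_glove_embeddings("glove_social_media_300d.txt")
unk = np.zeros(300, dtype=np.float32)  # generic unknown-token vector

def lookup(token):
    # Vernacular words missing from embeddings trained on formal text fall back to
    # the unknown vector, so their actual semantics are lost.
    return vectors.get(token, unk)
```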

 

The revised details can be found in Lines 343-356.

 

2. In Table 1, the abbreviations stand for the shortened forms of the words. Although they represent the initial letters of the words listed in the column, it would prevent confusion to make them consistent with Figure 2 (Waterways -> WAT) and to use the actual abbreviations in the column.

Response: We thank the reviewer for the very valuable comment.

As the reviewer said, the previous abbreviations were not standardized enough, and we have reworked them. The detailed descriptions are as follows:

Table 1. Details of entity categories in TPCNER.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | Water System | WAT | A man-made building or natural structure associated with water in nature. | Tongji Canal, Huaihe River Basin |
| 2 | Residential land and facilities | RLF | A place where human beings live or engage in productive life. | Shaanxi Kiln |
| 3 | Transportation | TRA | Human-built buildings related to transportation. | Longxia Railway |
| 4 | Pipelines | PIP | Pipelines laid by humans. | Natural gas pipeline |
| 5 | Boundaries, Regions, and Other Areas | BRO | The corresponding boundaries that humans have drawn on the land to facilitate management. | Hubei Province |
| 6 | Landforms | LAN | Includes natural and artificial landforms. | Himalayas |
| 7 | Organization | ORG | Includes the names of relevant organizations. | Wuhan Zhongdi Digital Technology Co. |

 

The revised details can be found in Lines 208-209.

 

3. Figure 5 (left figure) needs reconsideration. The forget gate does not use c_{t-1} as input, but in the figure it is shown as an input to the forget gate along with x_t and h_{t-1}.

Response: We thank the reviewer for the very valuable comment.

We have modified Figure 5, and the detailed description is as follows:

Figure 5. Neuron Structure of LSTM.

 

The revised details can be found in Lines 379-380.

 

4. In Section 4.4 (CRF Layer), the formula in the second paragraph is not rendered in LaTeX format.

Response: We thank the reviewer for the very valuable comment.

We are very ashamed to have caused such a minor error, and we have corrected it.

where $P_{i,y_i}$ is the probability of the $y_i$ label of the character and $A$ is the transfer probability matrix. The CRF score vector is normalized and trained using the log-likelihood function as the loss function, as shown in Equation (8):
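For reference, the standard linear-chain CRF formulation consistent with these definitions is sketched below; it is only an illustration, and the manuscript's Equation (8) may use slightly different notation. For an input sequence $X$ and a label sequence $y = (y_1, \ldots, y_n)$, the sequence score and the log-likelihood used as the training objective are

$$s(X, y) = \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \qquad \log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y}} e^{s(X, \tilde{y})}.$$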

 

The revised details can be found in Lines 393-395.

 

5. In Figure 7, the legend is blocking the last data point. It would be a good idea to move the legend to the top-left corner.

Response: We thank the reviewer for the very valuable comment.

We are very ashamed to have caused such a minor error, and we have corrected it.

Figure 7. Average F1-score of the presented and baseline NER algorithms with different sizes of labeled data.

The revised details can be found in Lines 536-538.

 

6. Section 5.4 discusses different results, although there is no mention of which dataset is used to report the results. Are these results based on the gathered dataset (TPCNER)?

Response: We thank the reviewer for the very valuable comment.

As the reviewer stated, our section is focused on our own constructed dataset.

 

We focus on analyzing the constructed dataset (TPCNER). We examined an example from the TPCNER corpus to see whether the presented model could better detect items in the geographic domain. In this example, the entity "Gulou Hospital of Harbin Engineering University" appeared just twice in the training set. The entity "Gulou Hospital of Harbin Engineering University" is recognized by the BERT-BiLSTM-CRF model as two entities, "Harbin Engineering University" and "Gulou Hospital", as shown in Table 10. Because these two items are more abundant in the training set, recognition without augmentation information is misleading. Because of inaccurate boundary information, the BERT-BiLSTM-CRF model fails to recognize "Gulou Hospital of Harbin Engineering University" as a single entity. Because more extensive augmentation information is incorporated into our presented model, it makes accurate predictions. Furthermore, the terms "Harbin Engineering University" and "Gulou Hospital" in the sample are similar, implying a tighter relationship between the entity's characteristics.

 

The revised details can be found in Lines 561-562.

 

7. Section 5.4.2, in the second line of the first paragraph, mentions Figure 5, which seems to be incorrect. Did you mean Figure 8?

Response: We thank the reviewer for the very valuable comment.

We are very ashamed to have caused this small mistake and we have checked and revised it.

 

8. As you mentioned in the manuscript, in Figure 8 red characters denote errors. However, almost all entities (except Case 1 / ALBERT-BiLSTM-CRF), even the gold labels, are highlighted in red. Marking all of the entities in red leads to confusion.

Response: We thank the reviewer for the very valuable comment.

We have reworked Figure 8 as follows:

Figure 8. Error analysis of some typical cases. Blue represents the gold-standard place name labels, and red represents the place names identified by the model.

 

9. In Section 5.4.3 (Annotated quality analysis), it would be a good idea to support the explanations with a confusion matrix.

Response: We thank the reviewer for the very valuable comment.

We have added the confusion matrix to support the explanation as follows:

The confusion matrices in Figures 9 and 10 show the number of toponyms extracted from the dataset by the proposed algorithm, as well as the number of gold-standard annotations, for each toponym class. Figure 10 shows that the proposed algorithm has relatively lower precision for the TRA toponym class. This could be attributed to data imbalance: the imbalance in entity numbers causes the algorithm to focus on minimizing classification errors for the classes with more entities, while insufficiently considering the errors for the classes with fewer entities.
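As an illustration of how such a per-class confusion matrix and the per-class precision can be computed, a minimal Python sketch follows; the gold and predicted labels are invented examples, not values from TPCNER.

```python
from sklearn.metrics import confusion_matrix
import numpy as np

CLASSES = ["WAT", "RLF", "TRA", "PIP", "BRO", "LAN", "ORG"]

# Illustrative gold and predicted class labels for matched toponym mentions.
gold = ["BRO", "TRA", "ORG", "WAT", "TRA", "LAN"]
pred = ["BRO", "ORG", "ORG", "WAT", "TRA", "LAN"]

cm = confusion_matrix(gold, pred, labels=CLASSES)
print(cm)  # rows = gold classes, columns = predicted classes

# Per-class precision: correct predictions of a class / all predictions of that class.
precision = np.diag(cm) / np.maximum(cm.sum(axis=0), 1)
for cls, p in zip(CLASSES, precision):
    print(f"{cls}: precision={p:.2f}")
```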

Figure 9. Confusion matrix for all extracted and gold-standard toponyms from the constructed dataset based on ALBERT-BiLSTM-CRF. WAT = Water System; RLF = Residential land and facilities; TRA = Transportation; PIP = Pipelines; BRO = Boundaries, Regions, and Other Areas; LAN = Landforms; ORG = Organization.

Figure 10. Confusion matrix for all extracted and gold-standard toponyms from the constructed dataset based on ALBERT-BiLSTM-CRF. WAT = Water System; RLF = Residential land and facilities; TRA = Transportation; PIP = Pipelines; BRO = Boundaries, Regions, and Other Areas; LAN = Landforms; ORG = Organization.

 

The revised details can be found in Lines 618-635.

Author Response File: Author Response.docx

Reviewer 3 Report

The authors describe a neural network architecture for geographic entity recognition. The architecture is composed of three main layers: ALBERT, BiLSTM, and CRF. Its evaluation has been addressed with different datasets and its accuracy has been compared to other neural architectures. Although the design and the study are quite interesting, there are some major comments that need to be addressed.  

 

* Major comments

 

Corpus preparation and annotation. The experiments selected several datasets, but only TPCNER is analysed in detail. Why? For example, for this dataset the tags and their distribution are described in detail; what about the others?

The hybrid deep learning model. Only the BiLSTM layer is described formally; what about the others? Moreover, the components of the equations should be defined too.

Results and discussion. As mentioned above, there is not enough information about the datasets utilised; only TPCNER is analysed in detail.

In the experiments, the authors proposed 10 models to test the datasets selected, but the discussion presented is not clear. For instance, why do BiLSTM-CRF and IDCNN-CRF, two different models, obtain exactly the same precision? Is it because of the BiLSTM or the CRF? How do the authors select the way of combining methods? What is the difference between IDCNN-CRF and IDCNN-CRF2?

If we compare Table 7 to Table 6, why do BiLSTM-CRF and IDCNN-CRF behave differently? In Table 6 they obtain the best results in precision; however, on other datasets they are far from the best results. Why? Is it due to the samples in the datasets?

To sum up, a more detailed analysis is required to understand the results obtained. 

In line 447, the authors analyse the reason for the low accuracy of the model, but in which experiment? In both experiments, the precision and recall values obtained are over 0.96. Therefore it is not clear what the purpose of this comment is. 

Looking at the results in Tables 6 and 7, the accuracy yielded by the selected combination is very close to that of the other models. Therefore, it would be interesting to conduct an ablation study to analyse how each component of the designed architecture influences the final results.

 

 

 

* Spelling/grammar

Line 352. There must be an issue with the subscript.

  

Author Response

Dear Editors and Reviewers:

Thank you for your letter and for the reviewers' comments concerning our manuscript entitled "Geographic named entity recognition by employing natural language processing and an improved BERT model" (No. ijgi-1945643). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. The revised parts of the paper have been marked in red. Please let us know if any further changes are required. We hope that the revised version of the manuscript is now acceptable for publication in your journal. The main corrections in the paper and the responses to the reviewers' comments are as follows:

 

Comments and Suggestions for Authors

The authors describe a neural network architecture for geographic entity recognition. The architecture is composed of three main layers: ALBERT, BiLSTM, and CRF. Its evaluation has been addressed with different datasets and its accuracy has been compared to other neural architectures. Although the design and the study are quite interesting, there are some major comments that need to be addressed. 

* Major comments

Corpus preparation and annotation. The experiments selected several datasets, but only TPCNER is analysed in detail. Why? For example, for this dataset the tags and their distribution are described in detail; what about the others?

Response: We thank the reviewer for the very valuable comment.

We have described the situation for the other data sets with the following additions:

 

Table 2. Details of entity categories in Boson.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | Location | LOC | A spatial distribution, location, or place occupied. | China |
| 2 | Org_name | ORG | Includes the names of relevant organizations. | Wuhan Zhongdi Digital Technology Co. |

Table 3. Details of entity categories in MSRA.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | NS | NS | A spatial distribution, location, or place occupied. | Yufeng Mountain |
| 2 | NT | NT | Includes the names of relevant organizations. | China University of Geosciences |

Table 4. Details of entity categories in RenMinRiBao.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | NS | NS | A spatial distribution, location, or place occupied. | Hubei |
| 2 | NT | NT | Includes the names of relevant organizations. | China University of Geosciences |

 

The revised details can be found in Lines 210-215.

 

The hybrid deep learning model. Only the BiLSTM layer is described formally; what about the others? Moreover, the components of the equations should be defined too.

Response: We thank the reviewer for the very valuable comment. We have added the related formal descriptions as follows:

 

The Transformer structure of the BERT model is composed of an encoder and a decoder. The encoder consists of 6 identical layers, and each layer contains two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network. Since a residual connection and layer normalization are added around each sub-layer, the output of a sub-layer can be represented as shown in the following equation:

$$\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big) \tag{1}$$

The multi-head self-attention mechanism projects the three matrices Q, K, and V with h different linear transformations and finally splices the different attention results. The main calculation equations are shown below:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$

$$\mathrm{head}_i = \mathrm{Attention}\big(QW_i^{Q}, KW_i^{K}, VW_i^{V}\big) \tag{3}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)W^{O} \tag{4}$$

For the decoder part, the basic structure is similar to the encoder part, but with an additional attention sub-layer.
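A minimal NumPy sketch of the scaled dot-product attention and the multi-head splicing described by Equations (2)-(4) is given below; the matrix shapes and the number of heads are illustrative and not the settings used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation (2): softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(Q, K, V, W_q, W_k, W_v, W_o):
    # Equations (3)-(4): project Q, K, V with per-head matrices, then splice the heads.
    heads = [attention(Q @ wq, K @ wk, V @ wv) for wq, wk, wv in zip(W_q, W_k, W_v)]
    return np.concatenate(heads, axis=-1) @ W_o

# Illustrative shapes: sequence length 5, model dimension 8, 2 heads of dimension 4.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
W_q = [rng.normal(size=(8, 4)) for _ in range(2)]
W_k = [rng.normal(size=(8, 4)) for _ in range(2)]
W_v = [rng.normal(size=(8, 4)) for _ in range(2)]
W_o = rng.normal(size=(8, 8))
print(multi_head(x, x, x, W_q, W_k, W_v, W_o).shape)  # (5, 8)
```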

 

The revised details can be found in Lines 312-322.

 

Results and discussion. As mentioned above, there is not enough information about the datasets utilised; only TPCNER is analysed in detail.

Response: We thank the reviewer for the very valuable comment.

We have added the related descriptions about datasets as follows:

 

Table 2. Details of entity categories in Boson.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | Location | LOC | A spatial distribution, location, or place occupied. | China |
| 2 | Org_name | ORG | Includes the names of relevant organizations. | Wuhan Zhongdi Digital Technology Co. |

Table 3. Details of entity categories in MSRA.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | NS | NS | A spatial distribution, location, or place occupied. | Yufeng Mountain |
| 2 | NT | NT | Includes the names of relevant organizations. | China University of Geosciences |

Table 4. Details of entity categories in RenMinRiBao.

| Id | Entity Tags | Abbreviation | Description | Example |
|----|-------------|--------------|-------------|---------|
| 1 | NS | NS | A spatial distribution, location, or place occupied. | Hubei |
| 2 | NT | NT | Includes the names of relevant organizations. | China University of Geosciences |

 

The revised details can be found in Lines 210-215.

 

In the experiments, the authors proposed 10 models to test the datasets selected, but the discussion presented is not clear. For instance, why do BiLSTM-CRF and IDCNN-CRF, two different models, obtain exactly the same precision? Is it because of the BiLSTM or CRF? How do the authors select the way for combining methods? What is the difference between IDCNN-CRF, and IDCNN-CRF2?

If we compare Table 7 to Table 6, why do BiLSTM-CRF and IDCNN-CRF behave differently? In Table 6 they obtain the best results in precision; however, on other datasets they are far from the best results. Why? Is it due to the samples in the datasets?

To sum up, a more detailed analysis is required to understand the results obtained.

Response: We thank the reviewer for the very valuable comment.

We have redesigned the experiments and added a large number of comparison experiments (comparing with published work), and have analyzed the results in detail.

5.2. Baselines

To evaluate the effect of our presented model, we empirically compare our method (ALBERT-BiLSTM-CRF) with six strong baselines (DBN, DM_NLP, NeuroTPR, ChineseBERTTP, ChineseTR, GazPNE2). In order to guarantee a relatively fair comparison, for these baselines we employ their publicly released source codes and follow the parameter settings reported in their papers.

  • DBN is an adapted toponym recognition approach based on a deep belief network (DBN) that explores two key issues, word representation and model interpretation, proposed by [38].
  • DM_NLP is a general model based on BiLSTM-CRF proposed by [39].
  • NeuroTPR is a Neuro-net ToPonym Recognition model designed specifically with the linguistic irregularities of social media text in mind, proposed by [40].
  • ChineseBERTTP is a deep neural network named BERT-BiLSTM-CRF, which extends a basic bidirectional recurrent neural network model (BiLSTM) with the pretrained bidirectional encoder representation from transformers (BERT) to handle the toponym recognition task in Chinese text [41].
  • ChineseTR is a weakly supervised Chinese toponym recognition architecture proposed by [42]. It leverages a training dataset creator that automatically generates training datasets based on word collections and associated word frequencies from various texts, and an extension recognizer that employs a basic bidirectional recurrent neural network with features designed specifically for toponym recognition.
  • GazPNE2 is a general approach for extracting place names from tweets. It combines global gazetteers (i.e., OpenStreetMap and GeoNames), deep learning, and pretrained transformer models (i.e., BERT and BERTweet), and requires no manually annotated data [43].

 

5.3. Experiments on TPCNER

In this paper, the HMM, CRF, BiLSTM-CRF, IDCNN-CRF, IDCNN-CRF2, BiLSTM-Attention-CRF, BERT-BiLSTM-CRF, BERT-BiGRU-CRF, ALBERT-BiLSTM, ALBERTold-BiLSTM-CRF (original ALBERT) and ALBERTours-BiLSTM-CRF (our presented ALBERT) models are tested on the TPCNER dataset, and the named entity recognition performance is evaluated with four indices: accuracy, precision, recall, and F1-score. The experimental results are shown in Table 6. The following results can be observed:

(1) Compared with the non-neural-network models (i.e., HMM, CRF), neural network models improve the performance significantly as the performance of the former deteriorates quickly, while the latter can maintain a reasonable performance. This is due to the fact that most of the features used in non-neural-network models come from human-designed features, which suffer from accumulated errors that may lead to performance degradation.

(2) We can see that these eleven models achieve good performance on the TPCNER dataset, and their accuracy, precision, recall, and F1-scores frequently exceed 80%. Among them, the ALBERTours-BiLSTM-CRF model has the best test effect, and its accuracy, precision, recall and F1-score are 97.8%, 96.1%, 96.2%, and 96.1%, respectively. Compared with the other nine models, this model has a better named entity recognition effect on the TPCNER dataset. In particular, our re-trained ALBERT model improves by 7.8% compared to the original ALBERT model.

(3) In addition, IDCNN-CRF2 achieves better performance than IDCNN-CRF, and IDCNN-CRF and BiLSTM-CRF obtain almost the same performance; both results indicate that IDCNN uses dilated convolutions to speed up training but does not enhance sequence features to improve performance.
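For clarity on how the precision, recall, and F1-score reported in Table 6 can be computed at the entity level, a minimal Python sketch follows; the entity spans are illustrative and not taken from TPCNER.

```python
def prf(gold_entities, pred_entities):
    """Entity-level precision/recall/F1 over sets of (start, end, type) spans."""
    gold, pred = set(gold_entities), set(pred_entities)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Illustrative spans: (start index, end index, entity class).
gold = [(0, 3, "BRO"), (14, 27, "ORG")]
pred = [(0, 3, "BRO"), (14, 22, "ORG")]  # one boundary error
print(prf(gold, pred))  # (0.5, 0.5, 0.5)
```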

 

Table 6. Results of different models on TPCNER.

| Model | Accuracy | Precision | Recall | F1-score |
|-------|----------|-----------|--------|----------|
| HMM | 0.809 | 0.804 | 0.815 | 0.807 |
| CRF | 0.838 | 0.838 | 0.841 | 0.84 |
| BiLSTM-CRF | 0.861 | 0.979 | 0.766 | 0.860 |
| IDCNN-CRF | 0.865 | 0.979 | 0.771 | 0.862 |
| IDCNN-CRF2 | 0.882 | 0.980 | 0.795 | 0.878 |
| BiLSTM-Attention-CRF | 0.891 | 0.975 | 0.728 | 0.834 |
| BERT-BiLSTM-CRF | 0.911 | 0.928 | 0.915 | 0.921 |
| BERT-BiGRU-CRF | 0.934 | 0.939 | 0.949 | 0.944 |
| ALBERTold-BiLSTM | 0.881 | 0.912 | 0.907 | 0.909 |
| ALBERTours-BiLSTM | 0.927 | 0.941 | 0.945 | 0.943 |
| ALBERTold-BiLSTM-CRF | 0.905 | 0.925 | 0.844 | 0.883 |
| ALBERTours-BiLSTM-CRF | 0.978 | 0.961 | 0.962 | 0.961 |

 

We continue our experiments by comparing ALBERT-BiLSTM-CRF with six deep learning-based models. The performance of these models on the TPCNER data set is reported in Table 7. As can be seen, we have the following observations:

(1) ALBERT-BiLSTM-CRF yields the highest precision with the same recall. Moreover, ALBERT-BiLSTM-CRF obtains a constant and substantial improvement over ChineseBERTTP, which currently has the best results reported on this dataset, with higher precision for the same recall. We believe that the combination of specially designed ALBERT features constitutes more significant features and promotes the extractor to make accurate predictions.

(2) Compared with the basic BiLSTM-CRF model, ALBERT-BiLSTM-CRF performs better in all three metrics, which demonstrates the value of our improved designs, including the specially designed ALBERT layers. Compared with DM_NLP and NeuroTPR, ALBERT-BiLSTM-CRF shows higher precision and F1-score and similar recall.

Table 7. Comparison with previous works on TPCNER.

| Model | Precision | Recall | F1-score |
|-------|-----------|--------|----------|
| DBN | 0.781 | 0.774 | 0.78 |
| DM_NLP | 0.838 | 0.841 | 0.84 |
| NeuroTPR | 0.871 | 0.872 | 0.87 |
| ChineseBERTTP | 0.89 | 0.894 | 0.89 |
| ChineseTR | 0.85 | 0.86 | 0.85 |
| GazPNE2 | 0.835 | 0.849 | 0.84 |
| ALBERTours-BiLSTM-CRF | 0.961 | 0.962 | 0.961 |

 

As expected from Tables 6 and 7, for all datasets, ALBERTours-BiLSTM-CRF achieves the best F1-score of 96.1%. Compared with two weakly supervised deep-learning models (NeuroTPR and GazPNE2), our presented model performs better in all three metrics, which demonstrates the value of our improved design, including the fine-tuned ALBERT. The reason is that Chinese texts often include a considerable number of location names that may not be covered by the basic BERT, including many vernacular words (e.g., "Mengliang Mountains" and "Plateau") and abbreviations (e.g., "Dida" and "CUG") used by people. When this happens, a generic unknown-token embedding is usually used to represent the vernacular word, and the actual semantics of the word are lost.

 

The revised details can be found in Lines 432-507.

 

In line 447, the authors analyse the reason for the low accuracy of the model, but in which experiment? In both experiments, the precision and recall values obtained are over 0.96. Therefore it is not clear what the purpose of this comment is.

Response: We thank the reviewer for the very valuable comment.

We are very ashamed to have caused such a problem, and we have fixed it.

By analyzing the recognition results, we found that the reason affecting the accuracy of the model is that some of the names in the data contain toponymic words, resulting in incorrect recall.

 

The revised details can be found in Lines 589-591.

 

Looking at the results in Tables 6 and 7, the accuracy yielded by the selected combination is very close to that of the other models. Therefore, it would be interesting to conduct an ablation study to analyse how each component of the designed architecture influences the final results.

Response: We thank the reviewer for the very valuable comment.

We have added ablation experiments with the following experimental results:

5.5. Ablation analysis

To verify the effectiveness of pre-training on our approach, we design the following variant models and conduct experiments on the constructed dataset (See Table 9).

 

Table 9. Experimental performance of variant models on the TPCNER dataset.

Table 9 shows the experimental results with BiLSTM-CRF as the baseline method. In Table 9, the performance of BiLSTM-CRF on all evaluation metrics is poor compared with the other models. Besides, from the overall model F1-scores in Table 9, we found that the models using a BERT layer or an ALBERT layer score higher than the baseline method. This phenomenon shows the effectiveness of combining a pre-training model.

Compared with BiLSTM-CRF, the F1 value of the model is improved by using a pre-training model (BERT), as shown in Table 9. The reason may be that pre-training enables better characterization of text sequence features. This phenomenon shows the effectiveness of using a pre-training model.

Compared with BiLSTM-CRF, the model using spatial attention shows improved P, R, and F1-score in Table 9. The reason may be that domain pre-trained models can better characterize geographic text features, which then improves the extraction ability of the BiLSTM-encoded features. This phenomenon shows the effectiveness of using a geographic pre-training model.

 

* Spelling/grammar

Line 352. There must be an issue with the subscript.

Response: We thank the reviewer for the very valuable comment.

We have modified this issue as follows:

where $P_{i,y_i}$ is the probability of the $y_i$ label of the character and $A$ is the transfer probability matrix. The CRF score vector is normalized and trained using the log-likelihood function as the loss function, as shown in Equation (8):

The revised details can be found in Lines 393-395.

 

Author Response File: Author Response.docx

Reviewer 4 Report

The authors provide a new method by combining three neural networks ALBERT, BiLSTM, and CRF for solving the named entity recognition (NER) problem in the corpus, especially for finding Chinese landmarks and geographical places. Interesting research has been done and satisfactory results have been obtained by testing on different data sets. My comments are in two parts, minor and major, as follows:

- Lines 41 and 42- Two sequential sentences begin with ‘As a result’

- line 47- ‘the location of the mentioned locations’, reword it

- line 97- ‘The experimental results show…’; The experimental results should not be provided in the Introduction.

- Line 151- n-gram itself is not a machine learning method but a feature extraction strategy

- Line 160-164: ‘Different from traditional machine…’; very long and vague sentence.

- Line 205 -206: ‘Furthermore, this article exclusively considers...’ I really did not understand what you mean. You need to explain it more clearly.

- Line 231: “Geographic entity names…”; long sentence, break it down into shorter ones.

- Line 420: “We looked examined an example” -- > We examined an example

 

- At the beginning of the introduction, it is better to use sentences that are related to the topic of your research. Voice assistants and navigation systems are very irrelevant.

- I did not understand why groups like water systems and pipelines were considered for NER. Are such things really talked about in social media? I find such a thing unlikely or rare.

- Page 8. In my opinion, it is necessary to give a convincing example that the designed network actually works properly. For example, what can be the input and output of each of the three networks? How are they encoded?

- You should describe the implementation in more detail. In what environment has it been implemented, and what platform, programming language, and libraries have been used? Maybe you can share some source code. You need to explain in such a way that the audience finds solutions to implement your approach or even finds practical ideas for developing your work.

- The results of Table 8 are somewhat ambiguous and should be explained more clearly. Why are there items in the labels section (for example, Zhong, bian, bing, etc.) that do not exist in the original sentence? The signs used, such as I-L, B-L, E-L, O, etc., must be described.

Author Response

Dear Editors and Reviewers:

Thank you for your letter and for the reviewers' comments concerning our manuscript entitled "Geographic named entity recognition by employing natural language processing and an improved BERT model" (No. ijgi-1945643). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. The revised parts of the paper have been marked in red. Please let us know if any further changes are required. We hope that the revised version of the manuscript is now acceptable for publication in your journal. The main corrections in the paper and the responses to the reviewers' comments are as follows:

 

Comments and Suggestions for Authors

The authors provide a new method by combining three neural networks ALBERT, BiLSTM, and CRF for solving the named entity recognition (NER) problem in the corpus, especially for finding Chinese landmarks and geographical places. Interesting research has been done and satisfactory results have been obtained by testing on different data sets. My comments are in two parts, minor and major, as follows:

 

- Lines 41 and 42- Two sequential sentences begin with ‘As a result’

Response: We thank the reviewer for the very valuable comment.

We have deleted the excess content, and the detailed description is as follows:

As a result, the geocoder merely needs to look up the supplied address's coordinates in a gazetteer. Geocoding is difficult due to the fact that it deals with raw natural language data.

The revised details can be found in Lines 42-44.

 

- line 47- ‘the location of the mentioned locations’, reword it

Response: We thank the reviewer for the very valuable comment.

We have modified this sentence, and the detailed description is as follows:

To achieve this goal, the first subprocess of our approach is to identify the location of the mentioned contents, which is called named entity recognition (NER) in NLP [10-13].

 

The revised details can be found in Lines 45-47.

 

- line 97- ‘The experimental results show…’; The experimental results should not be provided in the Introduction.

Response: We thank the reviewer for the very valuable comment.

We have deleted this sentence.

 

- Line 151- n-gram itself is not a machine learning method but a feature extraction strategy

Response: We thank the reviewer for the very valuable comment.

We have addressed this issue, and the detailed description is as follows:

Common machine learning algorithms include the hidden Markov model (HMM), maximum entropy Markov model (MEMM) and CRF model.

The revised details can be found in Lines 148-150.

 

- Line 160-164: ‘Different from traditional machine…’; very long and vague sentence.

Response: We thank the reviewer for the very valuable comment.

We have addressed this issue, and the detailed description is as follows:

Different from traditional machine learning algorithms, the model trained by a deep neural network has the characteristic of end-to-end data input and output. This makes the training process less dependent on artificial interference, as the model directly completes specific tasks from the original data input, and there is no need to manually design data features [22].

 

The revised details can be found in Lines 158-162.

 

- Line 205 -206: ‘Furthermore, this article exclusively considers...’ I really did not understand what you mean. You need to explain it more clearly.

Response: We thank the reviewer for the very valuable comment.

We have addressed this issue, and the detailed description is as follows:

Furthermore, this article exclusively considers entity types that are relevant to the geographic domain (e.g., toponym, organization), rather than generic entities such as individuals (e.g., personal name).

 

The revised details can be found in Lines 203-205.

 

- Line 231: “Geographic entity names…”; Long sentence, broke down it to shorter ones.

Response: We thank the reviewer for the very valuable comment.

We have addressed this issue, and the detailed descriptions are as follows:

Geographic entity names are the most important distinction between geographic entities in the field of geographic information. They can be seen as entity names with location information, mainly composed of basic geographic information elements, and are characterized by ambiguity and diversity.

 

The revised details can be found in Lines 237-239.

 

- Line 420: “We looked examined an example” -- > We examined an example

Response: We thank the reviewer for the very valuable comment.

We have modified this sentence.

 

At the beginning of the introduction, it is better to use sentences that are related to the topic of your research. Voice assistants and navigation systems are very irrelevant.

Response: We thank the reviewer for the very valuable comment.

We have modified the beginning of the introduction as follows:

Online social media platforms, especially microblog platforms such as WeChat and Weibo, are responsive to real-world events and are useful for gathering situational information in real time [1-4]. Geographic locations are often described in these messages. For example, Weibo is widely used in disaster response and rescue, such as for earthquakes, floods, fires, and terrorist attacks.

 

The revised details can be found in Lines 32-36.

 

I did not understand why groups like water systems and pipelines were considered for NER. Are such things really talked about in social media? I find such a thing unlikely or rare.

Response: We thank the reviewer for the very valuable comment.

Water systems and pipelines are less frequently seen in social media, but we included them to provide a more comprehensive description of geographically related entities from the perspective of the overall completeness of geographically named entities.
We first refer to some standards, including the rules for classifying and coding geographical names (GB/T18521-2001) and classifying and coding basic geographic information elements (GB/T13923-2006). Therefore, we also include water systems and pipelines as a category, mainly considering that there may be some entities such as rivers, lakes, and seas in social media, and these entities have an important indicator role in natural disasters, emergency response, and rescue. The above is our analysis, which is not included in the manuscript, and we look forward to receiving your understanding.

 

Page 8. In my opinion, it is necessary to give a convincing example that the designed network actually works properly. For example, what can be the input and output of each of the three networks? How are they coding?

Response: We thank the reviewer for the very valuable comment.

We have described the network architecture in detail, including the inputs and outputs between the layers therein. The modifications are as follows:

We present our model from bottom to top, characterizing the layers of the neural network. The input layer contains the individual words of a message, which are used as the input to the model.

The next layer represents each word as a vector using a pre-training approach. It uses pre-trained word embeddings to represent the words in the input sequence. In particular, we use ALBERT, which captures the different semantics of a word in varied contexts. Note that the pre-trained word embeddings capture the semantics of words based on their typical usage contexts and therefore provide static representations of words; by contrast, ALBERT provides a dynamic representation of a word by modeling the particular sentence within which the word is used. This layer captures four different aspects of a word, and their representation vectors are concatenated together into a large vector to represent each input word. These vectors are then used as the input to the next layer, which is a BiLSTM layer consisting of two layers of LSTM cells: one forward layer capturing information before the target word and one backward layer capturing information after the target word.

The BiLSTM layer combines the outputs of the two LSTM layers and feeds the combined output into a fully connected layer. The next layer is a CRF layer, which takes the output from the fully connected layer and performs sequence labeling. The CRF layer uses the standard BIEO scheme from NER research to label each word but focuses on locations. Thus, a word is annotated as either "B-L" (the beginning of a location phrase), "I-L" (inside a location phrase), "E-L" (the end of a location phrase), or "O" (outside a location phrase).
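A minimal sketch of how such an ALBERT-BiLSTM-CRF stack could be wired up in TensorFlow is given below. The checkpoint name, hidden sizes, and the use of tensorflow_addons for the CRF are illustrative assumptions; the paper's re-trained ALBERT and exact hyperparameters are not shown here.

```python
import tensorflow as tf
import tensorflow_addons as tfa
from transformers import TFAlbertModel

NUM_TAGS = 4  # B-L, I-L, E-L, O

# Hypothetical checkpoint name; the paper uses an ALBERT model re-trained on
# geographic / social-media text, which is not publicly named in this response.
albert = TFAlbertModel.from_pretrained("albert_chinese_base")
bilstm = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128, return_sequences=True))
dense = tf.keras.layers.Dense(NUM_TAGS)  # per-character tag scores (CRF emissions)
transitions = tf.Variable(tf.random.normal([NUM_TAGS, NUM_TAGS]), name="crf_transitions")

def emissions(input_ids, attention_mask):
    # ALBERT layer: contextualized character representations.
    hidden = albert(input_ids, attention_mask=attention_mask).last_hidden_state
    # BiLSTM layer: forward and backward context, then projection to tag scores.
    return dense(bilstm(hidden))

def crf_loss(input_ids, attention_mask, tag_ids, lengths):
    scores = emissions(input_ids, attention_mask)
    log_likelihood, _ = tfa.text.crf_log_likelihood(scores, tag_ids, lengths, transitions)
    return -tf.reduce_mean(log_likelihood)

def predict(input_ids, attention_mask, lengths):
    scores = emissions(input_ids, attention_mask)
    tags, _ = tfa.text.crf_decode(scores, transitions, lengths)
    return tags  # one B-L/I-L/E-L/O tag id per character
```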

 

The revised details can be found in Lines 275-294.

 

You should describe the implementation in more detail. In what environment has it been implemented, and what platform, programming language, and libraries have been used? Maybe you can share some source code. You need to explain in such a way that the audience finds solutions to implement your approach or even find practical ideas for developing your work.

Response: We thank the reviewer for the very valuable comment.

We describe the implementation environment of the algorithmic model proposed in this paper in our manuscript, and we also share our code and dataset.

The experimental results were evaluated and discussed from many aspects. TensorFlow was used to implement the models on a single NVIDIA GeForce RTX 3090 GPU.

 

The revised details can be found in Lines 420-422.

 

The results of Table 8 are somewhat ambiguous and should be explained more clearly. Why are there items in the labels section (for example, Zhong, bian, bing, etc.) that do not exist in the original sentence? Used signs like I-L, B-L, E-L, O, etc. must be described.

Response: We thank the reviewer for the very valuable comment.

We have added the missing information and give the content in Chinese, in English translation, and in Pinyin:

Table 10. Results of an instance being predicted by different models. B represents begin, I represents inside, E represents end, and O represents other.

| Item | Content |
|------|---------|
| Original sentence | 黑龙江中部出现强降雨,其中哈尔滨工程大学古楼医院周边伴有冰雹。 |
| Sentence translation | Heavy rainfall in central Heilongjiang, including hail in Harbin Yilan County. |
| Sentence pinyin (Chinese romanization) | Hei Long Jiang Zhong Bu Chu Xian Qiang Jiang Yu, Qi Zhong Ha Er Bin Gong Cheng Da Xue Gu Lou Yi Yuan Zhou Bian Ban You Bing Bao. |
| Correct label | Hei/B-L long/I-L jiang/E-L zhong/O bu/O di/O qu/O chu/O xian/O qiang/O jiang/O yu/O ,/O qi/O zhong/O ha/B-L er/I-L bin/I-L gong/I-L cheng/I-L da/I-L xue/I-L gu/I-L lou/I-L yi/I-L yuan/E-L zhou/O bian/O ban/O you/O bing/O bao/O 。/O |
| BERT-BiLSTM-CRF prediction | Hei/B-L long/I-L jiang/E-L zhong/O bu/O di/O qu/O chu/O xian/O qiang/O jiang/O yu/O ,/O qi/O zhong/O ha/B-L er/I-L bin/I-L gong/I-L cheng/I-L da/I-L xue/E-L gu/B-L lou/I-L yi/I-L yuan/E-L zhou/O bian/O ban/O you/O bing/O bao/O 。/O |
| BERT-BiGRU-CRF prediction | Hei/B-L long/I-L jiang/E-L zhong/O bu/O di/O qu/O chu/O xian/O qiang/O jiang/O yu/O ,/O qi/O zhong/O ha/B-L er/I-L bin/I-L gong/I-L cheng/I-L da/I-L xue/E-L gu/B-L lou/I-L yi/I-L yuan/E-L zhou/O bian/O ban/O you/O bing/O bao/O 。/O |
| ALBERTours-BiLSTM prediction | Hei/B-L long/I-L jiang/E-L zhong/O bu/O di/O qu/O chu/O xian/O qiang/O jiang/O yu/O ,/O qi/O zhong/O ha/B-L er/I-L bin/I-L gong/I-L cheng/I-L da/I-L xue/E-L gu/B-L lou/I-L yi/I-L yuan/E-L zhou/O bian/O ban/O you/O bing/O bao/O 。/O |
| ALBERTours-BiLSTM-CRF prediction | Hei/B-L long/I-L jiang/E-L zhong/O bu/O di/O qu/O chu/O xian/O qiang/O jiang/O yu/O ,/O qi/O zhong/O ha/B-L er/I-L bin/I-L gong/I-L cheng/I-L da/I-L xue/I-L gu/I-L lou/I-L yi/I-L yuan/E-L zhou/O bian/O ban/O you/O bing/O bao/O 。/O |
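As an illustration of how the B-L/I-L/E-L/O character tags in Table 10 translate into place-name spans, a minimal Python decoder follows; the input is a toy fragment of the example sentence, and the decoding rule (close a span at E-L, abandon it at O) is a simplifying assumption.

```python
def decode_bieo(chars, tags):
    """Turn per-character B-L/I-L/E-L/O tags into (start, end, text) place-name spans."""
    spans, start = [], None
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        if tag == "B-L":
            start = i
        elif tag == "E-L" and start is not None:
            spans.append((start, i, "".join(chars[start:i + 1])))
            start = None
        elif tag == "O":
            start = None  # abandon an unclosed span
    return spans

chars = list("黑龙江中部")
tags = ["B-L", "I-L", "E-L", "O", "O"]
print(decode_bieo(chars, tags))  # [(0, 2, '黑龙江')]
```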

 

The revised details can be found in Lines 594-596.

 

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Thank you for the explanations. Some of the doubts were clarified.

"In particular, our re-trained ALBERT model improves by 7.8% compared to the original ALBERT model."

It looks like the main improvement was achieved by training a custom ALBERT model. However, in section 4.2 you mentioned only training the static embeddings using the GloVe method. What was the size of the dictionary of static embeddings?

The reader will benefit if Section 4.2 is supplemented with more information about re-training the ALBERT model. What collection of texts did you use? For how many steps/epochs was the model trained? What was the learning rate? How long does it take to train the model?

"Our experiments were cross-validated and, from this perspective, met statistical significance."

Could you provide the standard deviation for each configuration?

"Training process: in this paper, we train word embeddings on Wikipedia corpus by using word2vec tool in advance, and we concatenate consecutive words to represent an entity when the entity has multiple words"

The statement regarding concatenating consecutive words is confusing. One of the challenges in NER is finding the correct boundaries of multi-word entities. If you preprocessed the texts by merging multi-word entities into single entities, then you simplified the task. Please clarify this statement.

Regarding Table 6: how do you explain that adding a CRF layer to ALBERTold-BiLSTM decreased the performance? At the same time, the CRF added to ALBERTours-BiLSTM improved the performance.

 

 

Author Response

Dear Editors and Reviewers:

Thank you for your letter and for the reviewers' comments concerning our manuscript entitled "Geographic named entity recognition by employing natural language processing and an improved BERT model" (No. ijgi-1945643). Those comments are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and have made corrections that we hope meet with approval. The revised parts of the paper have been marked in red. Please let us know if any further changes are required. We hope that the revised version of the manuscript is now acceptable for publication in your journal. The main corrections in the paper and the responses to the reviewers' comments are as follows:

 

Thank you for the explanations. Some of the doubts were clarified.

"In particular, our re-trained ALBERT model improves by 7.8% compared to the original ALBERT model."

It looks like the main improvement was achieved by training a custom ALBERT model. However, in section 4.2 you mentioned only training the static embeddings using the GloVe method. What was the size of the dictionary of static embeddings?

Response: We thank the reviewer for the very valuable comment.

We use the GloVe word embeddings (the number of tokens is 54,238 and the dictionary size is 399 KB) that were trained on 2 billion texts with 11 billion tokens and 1.8 million vocabulary items collected from Baidu Encyclopedia, Weibo, WeChat, etc.

The revised details can be found in Lines 372-374.

 

The reader will benefit if Section 4.2 is supplemented with more information about re-training the ALBRT model. What collection of texts did you use? For how many steps/epochs the model was trained? What was the learning rate? How long does it take to train the model?

Response: We thank the reviewer for the very valuable comment.

We performed the following steps on the collected text data: (1) cleaning the data — we removed garbled characters and incomplete sentences to ensure that the sentences were fluent; (2) cutting the sentences — we added [CLS], [SEP], [MASK], etc. to each text item to obtain 25.6 GB of training data; (3) training on the corpus — we trained on an RTX 3090 GPU for 4 days with the number of epochs set to 100,000 and the learning rate set to 5e-5.
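A minimal Python sketch of the data-preparation steps described above (cleaning, sentence cutting, and adding special tokens with random masking) is shown below; the character filter, the 15% masking rate, and the character-level tokenization are illustrative assumptions rather than the exact pretraining recipe.

```python
import random
import re

def clean(text):
    # Step (1): drop garbled characters and keep only fluent text.
    return re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。！？、]", "", text)

def cut_sentences(text):
    # Step (2): split the cleaned text into sentences on Chinese end punctuation.
    return [s for s in re.split(r"[。！？]", text) if s]

def make_masked_example(sentence, mask_rate=0.15):
    # Step (3): wrap each sentence with [CLS]/[SEP] and randomly mask characters.
    tokens = ["[CLS]"] + list(sentence) + ["[SEP]"]
    masked = [
        "[MASK]" if t not in ("[CLS]", "[SEP]") and random.random() < mask_rate else t
        for t in tokens
    ]
    return masked, tokens  # model input and the original tokens as targets

raw = "黑龙江中部出现强降雨。哈尔滨伴有冰雹！"
for sent in cut_sentences(clean(raw)):
    print(make_masked_example(sent))
```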

The revised details can be found in Lines 367-371.

 

"Our experiments were cross-validated and, from this perspective, met statistical significance."

Could you provide the standard deviation for each configuration?

Response: We thank the reviewer for the very valuable comment.

We have provided the standard deviation for each configuration as follows:

Table 6. Results of different models on TPCNER.

| Model | Accuracy | Precision | Recall | F1-score |
|-------|----------|-----------|--------|----------|
| HMM | 80.9%-0.16% | 80.4% | 81.5% | 80.7% |
| CRF | 83.8%+0.03% | 83.8% | 84.1% | 84% |
| BiLSTM-CRF | 86.1%-0.02% | 97.9% | 76.6% | 86.0% |
| IDCNN-CRF | 86.5%+0.11% | 97.9% | 77.1% | 86.2% |
| IDCNN-CRF2 | 88.2%+0.25% | 98.0% | 79.5% | 87.8% |
| BiLSTM-Attention-CRF | 89.1%-0.09% | 97.5% | 72.8% | 83.4% |
| BERT-BiLSTM-CRF | 91.1%-0.08% | 92.8% | 91.5% | 92.1% |
| BERT-BiGRU-CRF | 93.4%-0.28% | 93.9% | 94.9% | 94.4% |
| ALBERTold-BiLSTM | 88.1%+0.12% | 91.2% | 90.7% | 90.9% |
| ALBERTours-BiLSTM | 92.7%+0.17% | 94.1% | 94.5% | 94.3% |
| ALBERTold-BiLSTM-CRF | 90.5%+0.03% | 92.5% | 94.4% | 93.4% |
| ALBERTours-BiLSTM-CRF | 97.8%+0.07% | 96.1% | 96.2% | 96.1% |

 

The revised details can be found in Lines 508-509.

 

"Training process: in this paper, we train word embeddings on Wikipedia corpus by using word2vec tool in advance, and we concatenate consecutive words to represent an entity when the entity has multiple words"

The statement regarding concatenating consecutive words is confusing. One of the challenges in NER is founding the correct boundaries of multi-word entities. If you preprocessed the texts by merging multi-word entities into single entities then you simplified the task. Please, clarify this statement.

Response: We thank the reviewer for the very valuable comment.

On the one hand, we train the corresponding word embeddings for the words in the corpus; on the other hand, we process the geographic entities. Specifically, when constructing word vectors, if a geographic entity consists of multiple words, we merge the word embeddings of the multiple words to form an entity-level word vector. The main motivation is to address some nested entity recognition problems, since nested entities are a challenging problem in entity recognition tasks.
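A minimal sketch of forming an entity-level vector from the word2vec vectors of an entity's constituent words, as described above, is shown below; the 4-dimensional vectors and the choice between concatenation and averaging are illustrative.

```python
import numpy as np

# Illustrative 4-dimensional word2vec-style vectors for the words of one entity.
word_vectors = {
    "哈尔滨": np.array([0.1, 0.3, -0.2, 0.5]),
    "工程":   np.array([0.0, 0.1,  0.4, 0.2]),
    "大学":   np.array([0.2, -0.1, 0.3, 0.1]),
}

def entity_vector(words, mode="concat"):
    vecs = [word_vectors[w] for w in words]
    if mode == "concat":
        return np.concatenate(vecs)   # entity vector grows with the number of words
    return np.mean(vecs, axis=0)      # fixed-size alternative

print(entity_vector(["哈尔滨", "工程", "大学"]).shape)          # (12,)
print(entity_vector(["哈尔滨", "工程", "大学"], "mean").shape)  # (4,)
```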

 

The above is our analysis, and we look forward to receiving your understanding.

 

Regarding Table 6. How do you explain, that adding a CRF layer to ALBERTold-BiLSTM decreased the performance? At the same time, CRF added to ALBERTours-BiLSTM improved the performance.

Response: We thank the reviewer for the very valuable comment.

We are very ashamed to have caused the misspelling. We have retrained and tested this algorithm, and the results of the test are that the model's precision, recall, and F1-score are 92.5%, 94.4%, and 93.4%, respectively.

Table 6. Results of different models on TPCNER.

| Model | Accuracy | Precision | Recall | F1-score |
|-------|----------|-----------|--------|----------|
| HMM | 80.9%-0.16% | 80.4% | 81.5% | 80.7% |
| CRF | 83.8%+0.03% | 83.8% | 84.1% | 84% |
| BiLSTM-CRF | 86.1%-0.02% | 97.9% | 76.6% | 86.0% |
| IDCNN-CRF | 86.5%+0.11% | 97.9% | 77.1% | 86.2% |
| IDCNN-CRF2 | 88.2%+0.25% | 98.0% | 79.5% | 87.8% |
| BiLSTM-Attention-CRF | 89.1%-0.09% | 97.5% | 72.8% | 83.4% |
| BERT-BiLSTM-CRF | 91.1%-0.08% | 92.8% | 91.5% | 92.1% |
| BERT-BiGRU-CRF | 93.4%-0.28% | 93.9% | 94.9% | 94.4% |
| ALBERTold-BiLSTM | 88.1%+0.12% | 91.2% | 90.7% | 90.9% |
| ALBERTours-BiLSTM | 92.7%+0.17% | 94.1% | 94.5% | 94.3% |
| ALBERTold-BiLSTM-CRF | 90.5%+0.03% | 92.5% | 94.4% | 93.4% |
| ALBERTours-BiLSTM-CRF | 97.8%+0.07% | 96.1% | 96.2% | 96.1% |

The revised details can be found in Lines 508-509.

 

Reviewer 3 Report

The authors have addressed my comments properly.

 

Reviewer 4 Report

The authors have addressed my comments properly.

Author Response

The authors have addressed my comments properly.

 

Response: Thank you for giving us valuable comments.

 
