Next Article in Journal
Enhanced Arbovirus Surveillance with High-Throughput Metatranscriptomic Processing of Field-Collected Mosquitoes
Next Article in Special Issue
#Stayathome If You Have a Cold: High SARS-CoV-2 Salivary Viral Loads in Pediatric Patients with Nasopharyngeal Symptoms
Previous Article in Journal
In Vitro and Pre-Clinical Evaluation of Locally Isolated Phages, vB_Pae_SMP1 and vB_Pae_SMP5, Formulated as Hydrogels against Carbapenem-Resistant Pseudomonas aeruginosa
Previous Article in Special Issue
CGRP Plasma Levels Correlate with the Clinical Evolution and Prognosis of Hospitalized Acute COVID-19 Patients
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach

1
Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
2
Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
3
Data Science, John Snow Labs Inc., Lewes, DE 19958, USA
4
Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
*
Author to whom correspondence should be addressed.
Viruses 2022, 14(12), 2761; https://doi.org/10.3390/v14122761
Submission received: 28 November 2022 / Accepted: 8 December 2022 / Published: 11 December 2022
(This article belongs to the Special Issue COVID-19 Diagnostics in Clinical Applications and Pandemic Control)

Abstract

:
The clinical application of detecting COVID-19 factors is a challenging task. The existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical, the non-clinical factors, such as social determinant of health (SDoH), are also important to study the infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1–5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.

1. Background

The COVID-19 pandemic (coronavirus disease 2019) has had a significant impact on society, due to the severity of the disease and the slow implementation of public health measures [1]. Many of these challenges stem from the information overload problem, which is exacerbated by the growing understanding of the disease and a plethora of literature on the subject [2]. COVID-19 Open Research Dataset (CORD19) [3] and LitCOVID [4] are among the pioneering data sources made available by the research community to aid collaboration between the computing community and the many stakeholders in the COVID-19 pandemic. These data sources contain hundreds of thousands of articles, and new articles are added regularly [1,5]. In its current state, it is difficult for researchers, clinical experts, and practitioners to obtain up-to-date information on the most recent findings.
To study the risk factors associated with COVID-19, government organizations and health sectors can always arrange for human resources to convert the pools of information from the literature into a structured format. However, by the time this data is made accessible to the research community, much of the earlier information is outdated. Natural Language Processing (NLP), a branch of artificial intelligence (AI), allows automated processing and analysis of unstructured texts, such as extracting key information and representing it in a structured format appropriate for computational analysis [6].
The goal of this research is to study the clinical factors, such as disease, drugs, treatments, procedures, and non-clinical factors, such as social determinants of health (SDoH) from the biomedical texts. In terms of methodology, we employ the named entity recognition (NER) [7] task of NLP to extract the biomedical factors from the free texts.
Despite being highly useful, the state-of-the-art work [8,9,10,11] in biomedical NER is primarily focused on a small number of entities (disease, chemicals, genes, etc.). There are numerous other clinical factors to consider, such as diagnosis, therapies, medical concepts, risks, and vital signs, as well as non-clinical factors such as SDoH. Extracting these biomedical entities (clinical and non-clinical) is important to study the predictors of COVID-19, which is a motivation for this research. Usually, scientific texts, such as clinical reports, medical notes, and Electronic Health Records (EHR) consist of sensitive patient information that must be de-identified. In this work, we also preserve patients’ private information through the data obfuscation process.
We have extended our previous work [12] in this paper and performed a more detailed analysis. We also fine-tune a transformer module to create the task-specific embeddings in this work. Our contributions are listed below:
  • We develop a biomedical NER pipeline to identify clinical as well as non-clinical named entities from the COVID-19 texts. We attempt to consolidate and explain data science best practices through this pipeline, with numerous convenient features that can be used as it is or as a starting point for further customization and improvement.
  • We develop a new dataset by curating a large number of scientific publications and case reports on COVID-19, and we scientifically parse the text from these scientific articles and prepare a dataset from it. We annotate a part of this dataset on biomedical-named entities to prepare a gold-standard dataset to train the NER pipeline. A portion of the gold-standard dataset is also reserved as a test set.
  • We de-identify the patients’ personal information after identifying the named entities, thus adhering to the Health Insurance Portability and Accountability Act (HIPAA) [13].
  • We demonstrate the efficacy and utility of this pipeline by comparing it with the state-of-the-art methods on public benchmark datasets. We also show the key findings related to COVID-19 in the analysis.

2. Previous Work

Named Entity Recognition (NER) is the task of identifying a named entity (a real-world object or concept) in unstructured text and then classifying the entity into a standard category [7]. In the field of biomedicine, NER is the task of identifying entities such as genes, diseases, chemicals, and proteins [11]. Several datasets are proposed for the NER task. These datasets are prepared usually in the CONLL-2003 format [14], a prototypical format for NER datasets. Many machine learning and deep learning based NER models have also been released in the past few years. Below, we summarize the benchmark datasets and methods used for NER in Table 1:
According to the Healthy People 2030 initiative, SDoHs related to population health [31] have a major impact on people’s health, well-being, and quality of life, and are related to health outcomes; this is a rather underexplored area of research in biomedicine and clinical research. In this work, we mention some SDOH in our dataset.
In a 2016 survey, nearly 95% of eligible hospitals in the United States use EHRs [32], with that figure expected to rise in these years. The standard EHRs contain 18 categories of critical private information about patients (e.g., name, age, and address), which must be de-identified before they are made public, as required by HIPPA [33]. For the de-identification purpose, researchers used a variety of methods, including rule-based, machine learning-based, and hybrid [34]. The CRF models [35], and Structured Support Vector (SVM) [36] are some of the commonly used models for NER and de-identification tasks. Deep learning models based on recurrent neural networks (RNN) and CNN models are also used for the de-identification of clinical notes [37]. The BioBERT [9], SciBERT [29], and recent Transformer-based models are also used to identify the named entities from biomedical texts.
In this work, we also use deep learning-based methods to build a pipeline for the biomedical NER and de-identification tasks. We identify many biomedical named entities including SDOH from COVID-19 texts.

3. Materials and Methods

3.1. Data Cohort

We have collected the scientific articles and clinical case reports from different journals (Lancet, BMJ, AMJ, Clinical Medicine and related) through LitCOVID [4] API, a resource of scholarly articles. The inclusion and exclusion criteria for data collection are given below:
  • We specify the timeline between November 2021 and March 2022 for data collection.
  • We specify English as the language to get the publications.
  • We exclude many early-pandemic scientific articles, the intuition being that the disease symptoms and diagnosis, drugs and vaccination information were not clear during that time.
  • We specify the population groups in adults: 19–44 years, middle-aged: 45–64 years, aged: 65+ years, during data collection.
After obtaining the scientific articles from these sources, we use the Spark OCR [38] library to automatically extract content from the PDF files and convert them into dataframes, where each row corresponds to one document (publication). After all these steps and filtration criteria, we acquired around 15 k scientific articles. Because we specify limited age groups in the population setting, English as the only language, and a time period of 5 months, the number of articles obtained here is lower than those obtained in the actual repository (LitCOVID) during that time period.
Gold-standard dataset: We annotated around 200 scientific articles from our collected dataset using the JohnSnowLabs annotation lab [39], and prepare a gold-standard dataset. A gold-standard dataset [40] means a corpus of text or a set of documents, annotated or tagged with the desired labels by expert annotators. We use the application of active learning [41] to re-annotate a larger portion of the data, where we specified the gold-standard data as the seed. By the end of this step, we acquired around 500 articles that were annotated. According to research [42], this amount of data is sufficient to begin training an NLP model. We used the following named entities, shown in Table 2, as the gold labels. We saved this data in CONLL [14] format.

3.2. Biomedical Named Entity Recognition Pipeline Structure

In this study, we propose a trainable ML pipeline that includes a pre-processor, tokenizer, embedding component, a deep neural network based on BiLSTM, CNN and CRF models, and a de-identifier. The novelty of this approach lies in the subtle integration of different components that are stacked together to train the pipeline. We build this pipeline following the Spark ML pipeline [43], which provides a default scalable solution without requiring much computation power [44]. The workflow of this pipeline is shown in Figure 1.
Next, we explain each component of this pipeline.
Data Collection: The input to the pipeline can be any raw textual data. We provided the data from our data cohort for this purpose.
Pre-processor: The preprocessor takes the text data as an input that comes from the data collection phase, pre-processes it, and detects the sentence boundaries in each record (document). Then, it transforms the data into a format that is readable by the next stage in the pipeline. The output from the pre-processor is the set of records that are pre-processed.
Tokenizer: The tokenizer takes the pre-processed data from the pre-processor as input. Tokenization is the process of breaking the input text into smaller chunks (words, or sentences) called tokens [45]. These tokens aid in comprehending the context and in developing the NLP model. The output from the tokenizer is transformed data, containing the tokens (words) corresponding to each document (scientific article, case report and so on).
Embedding: The tokenized data from the tokenizer goes into the embedding component, which maps tokens to vectors. We have fine-tuned the pre-trained BlueBERT model [46] that is trained on PubMed abstracts and MIMIC-III [47] on our gold-data to provide task-specific embeddings.
Named Entity Recognizer: This component identifies biomedical entities in the text. This is an algorithm based on the BiLSTM-CNN-CRF [48] model. We modify the vanilla BiLSTM-CNN-CRF for the task-specific embeddings and make our modifications. We introduce our NER model in Figure 2 and explain its working below.
As shown in Figure 2, the algorithm takes as input the sequence of words or a sentence. This sequence is represented as s = [ w 1 ,   w 2 ,   ,   w N ] , where N is the sentence length and w i     R V is the i t h token in the sequence. This input goes to the embedding layer.
The embedding layer is the first layer in the model that converts a sentence from a sequence of tokens into a sequence of dense vectors. In this work, we use our fine-tuned transformer model for the embeddings. The output of this layer is a sequence of vectors x = [ x 1 ,   x 2 ,   ,   x N ] , where x i = E w i   R D , E is for embedding and x i is the dense vector representation of word w i .
The second layer in this model is a CNN layer that is used to capture local information within given words in a biomedical context. The CNN is just for char embeddings to represent letters. The main feature is word embeddings coming from the embedding layer (BERT-based embeddings). The output of the CNN layer is c = [ c 1 ,   c 2 ,   ,   c N ] , where c i R M , M is the number of filters. The contextual representation c i of the i t h character is the concatenation of the outputs of all filters at this position
The third layer in the model is the Bi-LSTM network, which is used to learn hidden representations of characters or tokens in a sequence using all of the previous contexts (in both directions). The output of the Bi-LSTM layer is h = [ h 1 ,   h 2 ,   ,   h N ] , where h i = R 2 S and S is the dimension of hidden states in LSTM.
The fourth layer on the top of the Bi-LSTM network is the CRF layer [49]. The input to the CRF layer is the hidden representations of characters h = [ h 1 ,   h 2 ,   ,   h N ] generated by the Bi-LSTM layer. To ensure that the predicted labels are valid, the CRF layer captures the dependency relationship between the named tags and constrains them to the final predicted labels [22]. The output of the CRF layer is y = [ y 1 ,   y 2 ,   ,   y N ] , which is a label sequence of sentence s, where y i R L is the one-hot representation of the i t h character’s label and L is the number of labels. In this work, the biomedical entities are the labels. A tanh layer on top of the BiLSTM layer is added to predict the confidence scores (CS) for the word with each of the possible labels as the output score of the network.
De-identifier: We use the data obfuscation technique, which is a process that obscures (masks) the meaning of data [50]. For example, to replace identified names with different fake names or to mask some data, value <02-02-2022> with <DATE> is used. This component provides HIPAA [13] compliance when dealing with text documents containing any protected health information. We use the pre-trained de-identification model from Johnsnowlabs [51] and embed it inside the pipeline to de-identify the personal records of the patients.
Biomedical Named Entities: The output of the pipeline is the biomedical entities, shown in Table 2.

3.3. Evaluation

We adopted a two-fold evaluation technique: (1) to evaluated the accuracy of the proposed approach, and (2) to analyze the results of our approach for pandemic surveillance. To evaluate the accuracy of the proposed approach, we considered a number of baseline methods and benchmark datasets including our test set. To evaluate the pandemic surveillance, we analyzed the results of our model and summarized the key findings.
Benchmark datasets: We used the JNLPBA [19] for chemical entities, NCBI-Disease [15] for disease entities, BC5CDR [16] dataset for chemical and disease mentions, BC2GM [18] for genes, and i2b2-Clinical [20] for clinical entities. From here, we obtained datasets that were already available in CoNLL-2003 format [52]. We performed further processing to convert them into IOB (Inside-Outside-Before) [53] scheme. All the datasets were divided into training, validation, and test sets, with a 70:15:15 ratio for all experiments. The Stratified 5-Folds cross-validation (CV) strategy was used for train/test split if original datasets did not have an official train/test split. We also set aside 30% of our gold dataset as a test set.
Baseline Methods: We compared the performance of our approach against the following state-of-the-art baseline methods: BiLSTM-CRF [54], BiLSTM-CRF-MTL [24], CT-BERT [55], SciBERT [29], and BioBERT [9] (v1.0, v1.1, v1.2). All of the baselines were trained on the aforementioned datasets. Each baseline was tuned to its optimal hyperparameter setting and the best results were reported for each method.
Training environment: All the experiments were run on Google Colab Pro (NVIDIA P100 or T4, 24 GB RAM, 2 x vCPU). The grid search was used to get optimal values for the hyperparameters and early stopping was performed to overcome overfitting. We specified the following hyperparameters as shown in Table 3.
Evaluation metrics: Following the standard practice [41,56] to evaluate NER tasks, we used the following metrics:
-
Micro-average F1 to measures the F1-score of aggregated contributions of all classes.
-
Macro-average F1that adds all the measures (Precision, Recall, or F-Measure) and divides with the number of labels, which is more like an average.

4. Results

The results and analysis are given below.

4.1. Comparison with Baseline Methods

We show the performance of our approach for accuracy in Table 4.
Overall, these results show that our approach achieved state-of-the-art performance on five public biomedical benchmarks, as well as on our dataset designed specifically for biomedical named entities. This demonstrates the generalizability of our methodology across different domains.
Our approach achieved the best micro F1 score of 94.78 on our dataset (52 entities), 90.58 on NCBI Disease (disease entity), 89.90 on BC5CDR (chemicals), 89.15 on BC2GM (gene/proteins), 79.92 on JNLBPA (chemical) and 89.10 on the i2b2 (clinical) dataset. We see similar patterns and higher performances in our pipeline for macro F1 scores.
The BioBERT model shows competitive performance in these results. Among the variants of the BioBERT, we see overall better performance of BioBERT v1.2 than its other variants, except for a few places, where BioBERT v1.1 marginally outperforms BioBERT v1.2. The better performance of BioBERT v1.2 is attributed to its training method, which is the same method as BioBERT v1.1 but includes an LM head [57]. Among the BERT-based models (BioBERT, SciBERT, CT-BERT), BioBERT performs best. The BioBERT is quite generalizable compared to other BERT-based methods, the SciBERT is initially trained on scientific data (not clinical) [29], and, CT-BERT is pre-trained on social media data, so they perform differently with different entity types.
Among the BiLSTM-based models (BiLSTM-CRF, BiLSTM-CRF-MTL), we observe the good performance of the BiLSTM-CRF model in identifying many diseases, chemicals, and gene/protein entities in these experiments. Our algorithm (BiLSTM-CNN-CRF) performs better than the BiLSTM-CRF baseline, probably because we are using biomedical embeddings on top of char-level embeddings. The fine-tuned transformer model’s embeddings enhance the performance of our model.
Although we fine-tuned each baseline method to its optimal hyperparameter settings, we anticipate that the relatively low scores of these baselines on our dataset can be attributed to the following: (i) the absence of an annotated dataset for training new biomedical entities, and (ii) different training/test set splits used in previous works that were unavailable.
Ablation Study: We performed an ablation experiment in which we evaluated the component of our pipeline. This component was based on our modified BiLSTM-CNN-CRF model. We replaced the standard BiLSTM-CNN-CRF in the sequence labeling architecture (Figure 2) with a direct feedforward map with and without a CRF decoder. We used a simple linear map over the embeddings to determine their direct information content. The results of this ablation study on our test set, based on macro average F1-score, are shown in Table 5.
The results, in Table 5, show that the effect of removing the BiLSTM layer is far more than removing the CRF layer from BiLSTM-CNN-CRF. This is shown with a dropped macro F1 of more than 15% when we remove the BiLSTM layer, compared to removing only the CRF layer. The most impacted performance is seen with Map-CNN where we removed these two layers (BiLSTM and CRF). With all these results, we find that our default settings are best in this setup.

4.2. Pandemic Surveillance

In this section, we demonstrate the effectiveness of our approach in demonstrating the key findings on pandemic surveillance. First, we show the most common entity types predicted by our approach after parsing 500 case reports, and show the performance of the model in terms of precision, recall, F1-score (F1), micro-average and macro-average in Table 6. The formulae for these performance metrics are based on true positives (TP), false positives (FP) and false negatives (FN).
As seen in Table 6, we can accurately predict a large number of entities with quite a high score. We also show the prevalence of the most common symptoms observed in our data in Figure 3.
The results in Figure 3 show that fever, nasal congestion, pains, a running nose, and sore throat are among the most common COVID-19 symptoms. Next, we show the most occurring named entities (occurrence > 70%) under the prominent entity types (drugs, vaccines, treatments) and show the results in Table 7.
We also gave a snippet from a COVID-19 related case report to our pipeline and show the confidence score for the predicted entities. The results are shown in Table 8.
The result in Table 8 shows that our model can predict many named entities with a high level of confidence score.
We take the nominal race groups [58] and report the results where the race group accounts for more than 5% of the population. This finding shown in Figure 4 is based on a subset of available data from a specific time period, so it may not be an accurate representation of racial groups as a whole during the COVID-19 outbreak.
We show a sample prediction of our model on a case report [59] in Figure 5, where we can see that many clinical and SDOH are being detected.

5. Discussion

5.1. Implications in Healthcare

There are many different ways that this pipeline can be used in healthcare settings. These biomedical entity types can assist physicians, nurses, and other healthcare professionals in matching symptoms to a diagnosis, a course of treatment, and follow-up. Health disparities can be decreased by tracking social determinants [60]. The clinical data can be converted into knowledge, evidence, and clinical impact using this research as well. This pipeline emphasizes best practices, openness, reproducibility, automation, and the capacity to recognize complex named entities from biomedical texts. With little to no code modification, this pipeline can also be applied to any other domain.

5.2. Transfer Learning

The advantages of transfer learning in detecting COVID-19-named entities become clearer because of this work. The proposed approach (combining BiLSTM-CRF-CNN with Transformer-based embeddings) achieves a performance comparable to pure Transformer-based models (BioBERT), and performs at least 1 to 5% better compared to conventional BiLSTM models. In the future, it would be beneficial to have our own pre-trained embeddings that can be used to study a large number of clinical and non-clinical entities.

5.3. Limitations

Although the BiLSTM-CNN-CRF model that we used for this approach showed good results and outperformed the current state-of-the-art solutions, there is still room for improvement, and the following points are what we would consider implementing in the future: first, we plan to increase the number of layers in this deep neural network. We intend to pre-train a transformer-based model. In this regard, one approach would be to first prepare more data for annotation and then pre-train the model on the annotated data.
So far, we have annotated a portion of the dataset, which suffices for the purpose of model training. In the future, we strongly encourage the inclusion of medical professionals in the annotation guideline. We also plan to annotate a large number of documents for this type of study.
We also plan to test the model on additional benchmark datasets. Furthermore, we intend to curate more clinical data; in particular, getting real-time access to EHRs would be helpful. Since we are already providing a de-identifier to de-identify patients’ personal information through this pipeline, we hope to gain access to such a dataset soon while adhering to HIPAA guidelines. Lastly, due to the black-box nature of most deep neural networks, we also plan to handle bias or systematic error in research methods, which may influence disease associations and predictions.

6. Conclusions

In conclusion, this paper presents a pipeline that consists of a number of ML components stacked together. We used an approach to train models for the biomedical named entities using the BiLSTM-CNN-CRF model plus BERT-based embeddings. This paper shows that using contextualized word embedding, pre-trained on biomedical corpora, significantly improves the results of biomedical NER tasks. We evaluated the performance of this approach on benchmark datasets and our own test set, and our approach achieved the state-of-the-art results compared to the baselines. This pipeline can be used in different health science settings, provided that the annotated data to train the model and the pipeline is available.

Author Contributions

Conceptualization, S.R.B., S.R. and V.K.; methodology, S.R.B., S.R. and V.K.; software, V.K., Johnsnowlabs; validation, S.R.K, S.R. and V.K.; formal analysis, S.R.B.; investigation, S.R.; resources, S.R.B. and S.R.; data curation, S.R.; writing—original draft preparation, S.R.B. and S.R.; writing—review and editing, S.R.B. and S.R.; visualization, S.R. and V.K.; supervision, S.R.; project administration, S.R.B.; funding acquisition, S.R.B. and U.Q. performed additional experiments, proof-read the document and validated the analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be made available upon request from corresponding author.

Acknowledgments

The earlier version of this work is published in Proceedings of Machine Learning Research [12].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Chen, Q.; Leaman, R.; Allot, A.; Luo, L.; Wei, C.-H.; Yan, S.; Lu, Z. Artificial Intelligence in Action: Addressing the COVID-19 Pandemic with Natural Language Processing. Annu. Rev. Biomed. Data Sci. 2021, 4, 313–339. [Google Scholar] [CrossRef] [PubMed]
  2. Raza, S.; Schwartz, B.; Rosella, L.C. CoQUAD: A COVID-19 Question Answering Dataset System, Facilitating Research, Benchmarking, and Practice. BMC Bioinform. 2022, 23, 210. [Google Scholar] [CrossRef] [PubMed]
  3. Allen Institute. COVID-19 Open Research Dataset Challenge (CORD-19). 2020. Available online: https://www.kaggle.com/datasets/allen-institute-for-ai/CORD-19-research-challenge (accessed on 27 November 2022).
  4. Chen, Q.; Allot, A.; Lu, Z. LitCovid: An Open Database of COVID-19 Literature. Nucleic Acids Res. 2021, 49, D1534–D1540. [Google Scholar] [CrossRef] [PubMed]
  5. Wang, L.L.; Lo, K. Text Mining Approaches for Dealing with the Rapidly Expanding Literature on COVID-19. Brief. Bioinform. 2021, 22, 781–799. [Google Scholar] [CrossRef]
  6. Reeves, R.M.; Christensen, L.; Brown, J.R.; Conway, M.; Levis, M.; Gobbel, G.T.; Shah, R.U.; Goodrich, C.; Rickets, I.; Minter, F.; et al. Adaptation of an NLP System to a New Healthcare Environment to Identify Social Determinants of Health. J. Biomed. Inform. 2021, 120, 103851. [Google Scholar] [CrossRef]
  7. Nadeau, D.; Sekine, S. A Survey of Named Entity Recognition and Classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
  8. Boudjellal, N.; Zhang, H.; Khan, A.; Ahmad, A.; Naseem, R.; Shang, J.; Dai, L. ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition. Complexity 2021, 2021, 6633213. [Google Scholar] [CrossRef]
  9. Dmis-Lab. DMIS-Lab/Biobert: Bioinformatics’2020: BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. GitHub. 2020. Available online: https://github.com/dmis-lab/biobert (accessed on 27 November 2022).
  10. Perera, N.; Dehmer, M.; Emmert-Streib, F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front. Cell Dev. Biol. 2020, 8, 673. [Google Scholar] [CrossRef]
  11. Cho, H.; Lee, H. Biomedical Named Entity Recognition Using Deep Neural Networks with Contextual Information. BMC Bioinform. 2019, 20, 735. [Google Scholar] [CrossRef]
  12. Raza, S.; Schwartz, B. Detecting Biomedical Named Entities in COVID-19 Texts. In Proceedings of the 1st Workshop on Healthcare AI and COVID-19, ICML 2022, Baltimore, MA, USA, 17–23 July 2022. [Google Scholar]
  13. Nosowsky, R.; Giordano, T.J. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) Privacy Rule: Implications for Clinical Research. Annu. Rev. Med. 2006, 57, 575–590. [Google Scholar] [CrossRef]
  14. Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. arXiv 2003, arXiv:cs/0306050. [Google Scholar]
  15. Doğan, R.I.; Leaman, R.; Lu, Z. NCBI Disease Corpus: A Resource for Disease Name Recognition and Concept Normalization. J. Biomed. Inform. 2014, 47, 1–10. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Li, J.; Sun, Y.; Johnson, R.J.; Sciaky, D.; Wei, C.-H.; Leaman, R.; Davis, A.P.; Mattingly, C.J.; Wiegers, T.C.; Lu, Z. BioCreative V CDR Task Corpus: A Resource for Chemical Disease Relation Extraction. Database 2016, 2016, baw068. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Krallinger, M.; Rabal, O.; Leitner, F.; Vazquez, M.; Salgado, D.; Lu, Z.; Leaman, R.; Lu, Y.; Ji, D.; Lowe, D.M.; et al. The CHEMDNER Corpus of Chemicals and Drugs and Its Annotation Principles. J. Cheminform. 2015, 7, S2. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Smith, L.; Tanabe, L.K.; Kuo, C.-J.; Chung, I.; Hsu, C.-N.; Lin, Y.-S.; Klinger, R.; Friedrich, C.M.; Ganchev, K.; Torii, M.; et al. Overview of BioCreative II Gene Mention Recognition. Genome Biol. 2008, 9, S2. [Google Scholar] [CrossRef]
  19. Collier, N.; Kim, J.-D. Introduction to the Bio-Entity Recognition Task at JNLPBA. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP), Geneva, Switzerland, 28–29 August 2004; pp. 73–78. [Google Scholar]
  20. Uzuner, Ö.; South, B.R.; Shen, S.; DuVall, S.L. 2010 I2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. J. Am. Med. Inform. Assoc. 2011, 18, 552–556. [Google Scholar] [CrossRef] [Green Version]
  21. Sun, W.; Rumshisky, A.; Uzuner, O. Evaluating Temporal Relations in Clinical Text: 2012 I2b2 Challenge. J. Am. Med. Inform. Assoc. 2013, 20, 806–813. [Google Scholar] [CrossRef] [Green Version]
  22. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar]
  23. Chiu, J.P.C.; Nichols, E. Named Entity Recognition with Bidirectional LSTM-CNNs. Trans. Assoc. Comput. Linguist. 2016, 4, 357–370. [Google Scholar] [CrossRef]
  24. Wang, X.; Zhang, Y.; Ren, X.; Zhang, Y.; Zitnik, M.; Shang, J.; Langlotz, C.; Han, J. Cross-Type Biomedical Named Entity Recognition with Deep Multi-Task Learning. Bioinformatics 2019, 35, 1745–1752. [Google Scholar] [CrossRef] [Green Version]
  25. Luo, L.; Yang, Z.; Yang, P.; Zhang, Y.; Wang, L.; Lin, H.; Wang, J. An Attention-Based BiLSTM-CRF Approach to Document-Level Chemical Named Entity Recognition. Bioinformatics 2018, 34, 1381–1388. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Xu, K.; Yang, Z.; Kang, P.; Wang, Q.; Liu, W. Document-Level Attention-Based BiLSTM-CRF Incorporating Disease Dictionary for Disease Named Entity Recognition. Comput. Biol. Med. 2019, 108, 122–132. [Google Scholar] [CrossRef] [PubMed]
  27. Zhao, Z.; Yang, Z.; Luo, L.; Wang, L.; Zhang, Y.; Lin, H.; Wang, J. Disease Named Entity Recognition from Biomedical Literature Using a Novel Convolutional Neural Network. BMC Med. Genom. 2017, 10, 75–83. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  28. Yoon, W.; So, C.H.; Lee, J.; Kang, J. Collabonet: Collaboration of Deep Neural Networks for Biomedical Named Entity Recognition. BMC Bioinform. 2019, 20, 55–65. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  29. Beltagy, I.; Lo, K.; Cohan, A. SCIBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2020; pp. 3615–3620. [Google Scholar]
  30. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef] [Green Version]
  31. Office of Disease Prevention and Health Promotion Social Determinants of Health. Healthy People 2030. Available online: https://health.gov/healthypeople/objectives-and-data/social-determinants-health (accessed on 7 October 2021).
  32. Toscano, F.; O’Donnell, E.; Unruh, M.A.; Golinelli, D.; Carullo, G.; Messina, G.; Casalino, L.P. Electronic Health Records Implementation: Can the European Union Learn from the United States? Eur. J. Public Health 2018, 28, cky213.401. [Google Scholar] [CrossRef]
  33. Fernández-Calienes, R. Health Insurance Portability and Accountability Act of 1996. In Encyclopedia of the Fourth Amendment; CQ Press: Washington, DC, USA, 2013. [Google Scholar]
  34. Meystre, S.M.; Friedlin, F.J.; South, B.R.; Shen, S.; Samore, M.H. Automatic De-Identification of Textual Documents in the Electronic Health Record: A Review of Recent Research. BMC Med. Res. Methodol. 2010, 10, 70. [Google Scholar] [CrossRef] [Green Version]
  35. Lafferty, J.; Mccallum, A.; Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data Abstract. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), Williamstown, MA, USA, June 28–1 July 2001; pp. 282–289. [Google Scholar]
  36. Tsochantaridis, I.; Joachims, T.; Hofmann, T.; Altun, Y.; Singer, Y. Large Margin Methods for Structured and Interdependent Output Variables. J. Mach. Learn. Res. 2005, 6, 1453–1484. [Google Scholar]
  37. Yang, X.; Lyu, T.; Li, Q.; Lee, C.Y.; Bian, J.; Hogan, W.R.; Wu, Y. A Study of Deep Learning Methods for De-Identification of Clinical Notes in Cross-Institute Settings. BMC Med. Inform. Decis. Mak. 2019, 19, 232. [Google Scholar] [CrossRef] [Green Version]
  38. John Snow Labs. Spark OCR. Available online: https://nlp.johnsnowlabs.com/docs/en/ocr (accessed on 27 November 2022).
  39. Annotation Lab; John Snow Labs: Lewes, DE, USA, 2022.
  40. Ogren, P.V.; Savova, G.K.; Chute, C.G. Constructing Evaluation Corpora for Automated Clinical Named Entity Recognition. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, 26 May–1 June 2008; pp. 3143–3150. [Google Scholar]
  41. Chen, Y.; Lasko, T.A.; Mei, Q.; Denny, J.C.; Xu, H. A Study of Active Learning Methods for Named Entity Recognition in Clinical Text. J. Biomed. Inform. 2015, 58, 11–18. [Google Scholar] [CrossRef] [Green Version]
  42. Snow, R.; O’Connor, B.; Jurafsky, D.; Ng, A.Y. Cheap and Fast—But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; pp. 254–263. [Google Scholar]
  43. ML Pipelines—Documentation. Available online: https://spark.apache.org/docs/latest/ml-pipeline.html (accessed on 27 November 2022).
  44. Kocaman, V.; Talby, D. Spark NLP: Natural Language Understanding at Scale. Softw. Impacts 2021, 8, 100058. [Google Scholar] [CrossRef]
  45. Webster, J.J.; Kit, C. Tokenization as the Initial Phase in NLP. In Proceedings of the COLING 1992. The 14th International Conference on Computational Linguistics, Nantes, France, 23–28 August 1992; Volume 4. [Google Scholar]
  46. Peng, Y.; Yan, S.; Lu, Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
  47. Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Lehman, L.W.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Anthony Celi, L.; Mark, R.G. MIMIC-III, a Freely Accessible Critical Care Database. Sci. Data 2016, 3, 160035. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  48. Huang, Z.; Xu, W.; Yu, K. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv 2015, arXiv:1508.01991. [Google Scholar]
  49. Ma, X.; Hovy, E. End-to-End Sequence Labeling via Bi-Directional Lstm-Cnns-Crf. arXiv 2016, arXiv:1603.01354. [Google Scholar]
  50. Bakken, D.E.; Rarameswaran, R.; Blough, D.M.; Franz, A.A.; Palmer, T.J. Data Obfuscation: Anonymity and Desensitization of Usable Data Sets. IEEE Secur. Priv. 2004, 2, 34–41. [Google Scholar] [CrossRef]
  51. Medical Data De-Identification—John Snow Labs; John Snow Labs: Lewes, DE, USA, 2022.
  52. GitHub. ay94 NER-Datasets. 2022. Available online: https://github.com/ay94/NER-datasets (accessed on 27 November 2022).
  53. Sexton, T. IOB Format Intro; Nestor: Gaithersburg, MA, USA, 2022. [Google Scholar]
  54. Akbik, A.; Blythe, D.; Vollgraf, R. Contextual String Embeddings for Sequence Labeling. In Proceedings of the 27th international Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 1638–1649. [Google Scholar]
  55. Müller, M.; Salathé, M.; Kummervold, P.E. Covid-Twitter-Bert: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter. arXiv 2020, arXiv:2005.07503. [Google Scholar]
  56. Tsai, R.T.-H.; Wu, S.-H.; Chou, W.-C.; Lin, Y.-C.; He, D.; Hsiang, J.; Sung, T.-Y.; Hsu, W.-L. Various Criteria in the Evaluation of Biomedical Named Entity Recognition. BMC Bioinform. 2006, 7, 92. [Google Scholar] [CrossRef] [Green Version]
  57. Perone, C.S.; Silveira, R.; Paula, T.S. Evaluation of Sentence Embeddings in Downstream and Linguistic Probing Tasks. arXiv 2018, arXiv:1806.06259. [Google Scholar]
  58. Abdi, S.; Bennett-AbuAyyash, C.; MacDonald, L.; Hohenadel, K.; Johnson, K.O.; Leece, P. Provincial Implementation Supports for Socio-Demographic Data Collection during COVID-19 in Ontario’s Public Health System. Can. J. Public Health 2021, 112, 853–861. [Google Scholar] [CrossRef]
  59. Ortiz-Egea, J.M.; Sánchez, C.G.; López-Jiménez, A.; Navarro, O.D. Herpetic Anterior Uveitis Following Pfizer–BioNTech Coronavirus Disease 2019 Vaccine: Two Case Reports. J. Med. Case Rep. 2022, 16, 127. [Google Scholar] [CrossRef] [PubMed]
  60. Raza, S. A Machine Learning Model for Predicting, Diagnosing, and Mitigating Health Disparities in Hospital Readmission. Healthc. Anal. 2022, 2, 100100. [Google Scholar] [CrossRef]
Figure 1. Biomedical Pipeline.
Figure 1. Biomedical Pipeline.
Viruses 14 02761 g001
Figure 2. Named Entity Recognition algorithm.
Figure 2. Named Entity Recognition algorithm.
Viruses 14 02761 g002
Figure 3. Most common symptoms of COVID-19 patients. Number at the top of each bar represents the number of times the symptoms were mentioned in test set.
Figure 3. Most common symptoms of COVID-19 patients. Number at the top of each bar represents the number of times the symptoms were mentioned in test set.
Viruses 14 02761 g003
Figure 4. COVID-19 hospitalization by race and ethnicity.
Figure 4. COVID-19 hospitalization by race and ethnicity.
Viruses 14 02761 g004
Figure 5. Biomedical entities recognized by proposed pipeline.
Figure 5. Biomedical entities recognized by proposed pipeline.
Viruses 14 02761 g005
Table 1. Biomedical NER datasets and methods.
Table 1. Biomedical NER datasets and methods.
Benchmark Datasets
CorpusEntity TypesData Size
NCBI-Disease [15]Diseases793 PubMed abstracts
BC5CDR [16]Diseases 1500 PubMed articles
BC5CDR [16]Chemicals1500 PubMed articles
BC4CHEMD [17]Chemicals10,000 PubMed abstracts
BC2GM [18]Gene/Proteins20,000 sentences
JNLPBA [19]Genes, proteins2404 abstracts
i2b2-Clinical [20]Problem, Treatment, and Test.426 discharge summaries
I2b2 2012 [21]Clinical (problems, tests, treatments, clinical departments, occurrences (admission, discharge) and evidence).310 discharge summaries
Benchmark methods
MethodDescription
BiLSTM-CRF [22]Bidirectional Long short-term memory (LSTM) and Conditional random field (CRF) architecture for NER.
BiLSTM-CNN-Char [23] A hybrid LSTM and Convolutional Neural Network (CNN) architecture that learns both character-level and word-level features for the NER task.
BiLSTM-CRF-MTL [24]A multi-task learning (MTL) framework with a BiLSTM-CRF model to collectively use the training data of different types of entities.
Att-BiLSTM-CRF [25],Attention (Att) based BiLSTM model with a CRF layer for chemical NER task.
Doc-Att-BiLSTM-CRF [26]Document (Doc)-level Attention (Att)-based BiLSTM-CRF network for disease NER task.
MCNN [27]A multiple (M) label CNN-based network for disease NER from biomedical literature.
CollaboNet [28]A collaboration of deep neural networks, i.e., BiLSTM-CRF with a single task model trained for each specific entity type.
SciBERT [29]A pre-trained language model based on Bidirectional Encoder Representations from Transformers (BERT) pretrained on a large multi-domain corpus of scientific publications to improve performance on downstream scientific tasks including NER.
BioBERT [30]A pre-trained biomedical language representation model based on BERT for biomedical text mining
Table 2. Biomedical entities used in this study.
Table 2. Biomedical entities used in this study.
Entity TypeEntities
Clinical name entitiesAdmission (patient admission status), oncology (tumor/cancer), blood pressure, respiration (e.g., shortness of breath), dosage (amount of medicine/drug taken), vital signs, symptoms, kidney disease, temperature (body), diabetes, vaccine, time (days, weeks or so), obesity (status), BMI, height (of patient), heart disease, pulse, hypertension, drug name, cerebrovascular disease, disease, treatment, clinical department, weight (of patient), admission/discharge (from hospital), modifier (modifies the current state), external body part, test, strength, route, test result.
Non-clinical entitiesName (of patient), location, date, relative date, duration, relationship status, social status, family history (family members, alone, with family, homeless), employment status, race/ethnicity, gender, social history, sexual orientation, diet (food type, nutrients, minerals), alcohol, smoking.
Table 3. Hyperparameters used—optimal parameter (range of values).
Table 3. Hyperparameters used—optimal parameter (range of values).
HyperparameterOptimal Value (Values Used)
Learning rate1 × 10−3 (1 × 10−2, 1 × 10−3, 1 × 10−5, 2 × 10−5, 5 × 10−5, 3 × 10−4)
Batch size64 (8, 16, 32, 64, 128)
Epochs 30 ({2, 3, …, 30})
LSTM state size200 (200, 250)
Dropout rate0.5 ({0.3, 0.35, …, 0.7})
OptimizerAdam
CNN filters2 (2, 3, 4, 5)
Hidden Size 768
Embedding Size 128
Max Seq Length512
Warmup Steps3000
Table 4. Test results using macro-average F1 (macro) and micro-average F1 (micro) scores on all datasets using different methods. The best scores are in bold and the second-best in italic.
Table 4. Test results using macro-average F1 (macro) and micro-average F1 (micro) scores on all datasets using different methods. The best scores are in bold and the second-best in italic.
Methods/ DatasetMetricNCBIBC5CDRBC2GMJNLPBAi2b2-ClinicalOur Dataset
BiLSTM-CRFmicro85.8084.2278.4674.2983.6687.10
macro86.1285.0980.0175.1084.0188.01
BiLSTM-CRF-MTLmicro86.4684.9480.3477.0382.3888.39
macro88.0185.0081.1277.1483.9688.97
CT-BERTmicro77.5076.8574.1068.0077.0778.10
macro78.5077.9675.3768.9878.0178.98
SciBERTmicro82.8882.9484.0875.7778.1980.95
macro83.3283.1385.8477.0179.1081.14
BioBERT-Base v1.0micro84.0186.5678.6886.2885.8784.01
macro79.1078.9079.0078.1372.1879.10
BioBERT-Base v1.1micro88.5287.1579.3976.1686.2788.52
macro85.8987.1087.1875.4587.7885.89
BioBERT-Base v1.2micro89.1287.8183.3476.4586.8889.12
macro86.7887.8986.0775.1586.9886.78
Our approachmicro90.5889.9089.1579.9289.1094.78
macro91.8390.3490.3880.9490.4895.37
Table 5. Ablation study of the model. Bold shows best macro-average F1 score.
Table 5. Ablation study of the model. Bold shows best macro-average F1 score.
ModelMacro
BiLSTM-CNN-CRF94.18 ± 0.12
BiLSTM-CNN87.37 ± 0.02
Map-CNN-CRF80.55 ± 0.03
Map-CNN69.25 ± 0.04
Table 6. Performance of most used entity from random 500 case reports.
Table 6. Performance of most used entity from random 500 case reports.
EntityTPFPFNPrecRecallF1
Disease818981120.890.880.89
Gender390781010.830.790.81
Employment234291320.890.640.74
Race_Ethnicity33465960.840.780.81
Smoking30924970.930.760.84
Psychological_Condition21829580.880.790.83
Death_Entity387341030.920.790.85
BMI14612290.920.830.88
Diabetes15710280.940.850.89
Macro-average29933797560.890.790.84
Micro-average29933797560.890.800.84
Table 7. Most prevalent named entities under entity types (drugs, vaccine, treatments).
Table 7. Most prevalent named entities under entity types (drugs, vaccine, treatments).
DrugsVaccineNon-Medical Treatments
Hydroxychloroquine Pfizer-BioNTechIsolation
PaxlovidModernaWear masks
ActemraAstraZeneca Vaccination
ImmunomodulatorsCoronaVac Oxygen support
SteroidBBIBP-CorVMedication
AmoxicillinJanssenHand sanitization
Table 8. Test Results on all Datasets using different Methods.
Table 8. Test Results on all Datasets using different Methods.
SentenceBeginEndChunksBiomedical EntityConfidence
021273-year-oldAge1.00
01418womanGender1.00
03243Fever ClinicClinical Department0.98
05265First HospitalClinical Department0.51
0109134Fever, temperatureSymptom0.80
0156160CoughSymptom0.99
0163175ExpectorationSymptom1.00
0178196Shortness of breathSymptom0.39
0203218General weaknessSymptom0.77
0233244Prior 5 daysRelative Date0.42
1247249SheGender1.00
1261264MildModifier0.90
1266273DiarrheaSymptom1.00
1280289Stools/daySymptom0.85
12923032 days priorRelative Date0.68
1322329HospitalClinical Department1.00
1386402COVID-19 positiveDisease Syndrome0.90
1436454Healthcare providerEmployment0.94
2486494CirrhosisDisease Syndrome0.96
2500514Type 2 diabetesDiabetes0.95
2535541SmokingSmoking1.00
2546553DrinkingAlcohol0.93
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Bashir, S.R.; Raza, S.; Kocaman, V.; Qamar, U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022, 14, 2761. https://doi.org/10.3390/v14122761

AMA Style

Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses. 2022; 14(12):2761. https://doi.org/10.3390/v14122761

Chicago/Turabian Style

Bashir, Syed Raza, Shaina Raza, Veysel Kocaman, and Urooj Qamar. 2022. "Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach" Viruses 14, no. 12: 2761. https://doi.org/10.3390/v14122761

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop