Article

Forecast of the COVID-19 Epidemic Based on RF-BOA-LightGBM

School of Life Sciences, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Healthcare 2021, 9(9), 1172; https://doi.org/10.3390/healthcare9091172
Submission received: 21 July 2021 / Revised: 26 August 2021 / Accepted: 30 August 2021 / Published: 6 September 2021

Abstract

In this paper, we use an Internet big-data tool, the Baidu Index, to forecast the development trend of the COVID-19 epidemic. By selecting appropriate keywords, we collected data on COVID-19 cases in China between 1 January 2020 and 1 April 2020. After preprocessing the data set, an optimal sub-data set was obtained with a random forest feature selection method. We then compared the results of optimizing seven hyperparameters of the LightGBM model with grid search, random search, and a Bayesian optimization algorithm. The experimental results show that applying the data set obtained from the Baidu Index to the Bayesian-optimized LightGBM model better predicts the growth in the number of COVID-19 patients and helps people judge the development trend of the epidemic accurately.

1. Introduction

During an outbreak of infectious disease, social media is usually the most active platform for exchanging information about the disease, and the information released is often highly timely. Using Internet information to predict infectious-disease epidemics is therefore a current research hotspot. L. Lu et al. used the Baidu Index and the Weibo micro-index to conduct a comparative study of influenza surveillance in China [1]. J. H. Lu of the School of Public Health, Sun Yat-sen University, and colleagues studied the use of Internet search queries and social media data to monitor the temporal and spatial trends of avian influenza (H7N9) in China; their results show that the number of H7N9 cases is positively correlated in space and time with Baidu Index and Weibo Index search results [2]. J. X. Feng of the University of South Georgia and colleagues studied the reflection of the Middle East respiratory syndrome coronavirus and avian influenza on Chinese social networks [3]; such correlations demonstrate the effectiveness of using social media to predict infectious diseases. H. G. Gu et al. collected Internet data on H7N9 avian influenza cases in the Chinese urban population, together with geographic and meteorological data for the same period, and established an early-warning risk model for human infection with H7N9 that can identify high-risk areas of avian influenza outbreaks and issue early warnings [4]. In most of these studies, however, keywords for the network-data searches were selected manually from experience, and the choice of keywords often has a large impact on the search results.
At present, the world's attention is focused mainly on changes in the COVID-19 epidemic. In the four months after the outbreak of the novel coronavirus in Wuhan, Hubei in December 2019, epidemic information was widely disseminated on social media such as Baidu, Sina, 360, Sogou, WeChat, and QQ; Google, Weibo, Zhihu, Dingxiangyuan, Twitter, Facebook, and others also released a great deal of information about the epidemic, which spread worldwide especially through the Google platform. On 31 March 2020, Google launched a project called "COVID-19 Public Datasets", providing an epidemic-related public database that is open to the public for free, which means that people can freely access and analyze the relevant data and information [5]. How to use this information to predict the spread of COVID-19 in time is an urgent research topic. X. M. Zhao and others have proposed using big-data retrospective technology to study the spreading trend and epidemic control of COVID-19 [6]. B. McCall et al. used artificial-intelligence methods to predict COVID-19, thereby protecting medical staff and controlling the spread of the epidemic [7]. These studies are still at a preliminary stage, and the use of network data to predict COVID-19 is not yet ideal.
In this article, we consider the amount of data indexed by Baidu to be large enough for our purposes. On this basis, we use the first feature of the search index, the Baidu Index [8], to study prediction of the COVID-19 epidemic. We collected data on COVID-19 cases in China from 1 January 2020 to 1 April 2020, used the random forest feature selection method to select the optimal sub-data set, and used grid search, random search, and a Bayesian optimization algorithm to optimize seven hyperparameters of the LightGBM (light gradient boosting machine) model. The results show that applying the data set obtained from the Baidu Index to the Bayesian-optimized LightGBM model better predicts the growth in the number of COVID-19 patients.
This paper is organized as follows. In Section 2, we introduce the data set and analysis methods in detail: Baidu Index searches and actual case counts are compared in time and space, and the impact of the keywords and the selected index on the results is analyzed; the model structure, data-set preprocessing methods, and tuning algorithms are also described. In Section 3, the experimental results are shown and discussed. Finally, the conclusion is drawn in Section 4.

2. Materials and Methods

2.1. COVID-19 Dataset

In order to standardize prevention and treatment, on 11 February 2020 the World Health Organization named the pneumonia caused by the novel coronavirus "COVID-19" (Coronavirus Disease 2019). In this study, we first obtained data on COVID-19 cases that occurred in China from 1 January 2020 to 1 April 2020 by searching the COVID-19 Public Datasets on the Google platform, mainly the diagnosis number and death toll, and used them as the actual data. These data are released by the Centers for Disease Control (CDC), so we identify them as CDC data, namely the CDC-Diagnosis and CDC-Death toll mentioned in this paper. We then collected keywords related to COVID-19 through commonly used social networking sites, such as Baidu, Sina, 360, Sogou, WeChat, QQ, Google, Weibo, Zhihu, Dingxiangyuan, Twitter, and Facebook, and formed a keyword library. We then used the Baidu Index platform (http://index.baidu.com, (accessed on 1 April 2020)) to retrieve the relevant keywords and took the statistics of the average daily search volume of the relevant Chinese keywords as the social-network mining data for prediction. In this article, this part of the data is identified as Baidu Index data.
By searching for the name and clinical symptoms of new coronavirus pneumonia on social networking sites, we can get the following keywords: new coronavirus, fever, dry cough, fatigue, dyspnea and cough. Using the Baidu index platform to retrieve the above keywords, we can get the average daily search volume of each keyword from 1 January 2020 to 1 April 2020, that is, Baidu index data. Table 1 shows part of the data of the CDC data set and the Baidu index data set. See Appendix A for all the data.

2.1.1. Time and Space Comparative Analysis of Baidu Index Search and Actual Cases

Based on the data obtained in the collection phase, we plotted the trends of the CDC data and the Baidu Index data over time, as shown in Figure 1. From Figure 1a–g, it can be seen that "dry cough" is the keyword most commonly used by Chinese netizens when searching for symptoms of COVID-19, followed by fever, dyspnea, and fatigue. In the Baidu Index method, the keywords "new coronavirus" and "dry cough" are the best choices: the data extracted for them have the strongest spatio-temporal positive correlation with the actual number of cases. A website search shows that these two keywords appear mainly in the Baidu Baike and Baidu Health Pharmacopoeia columns, so it is recommended to search these two columns first when choosing keywords in the future. Conversely, when the Baidu Index method is used to predict the trend of COVID-19, poorly chosen keywords not only lower the accuracy of the prediction but may even make advance prediction impossible.
In addition, both the CDC diagnosis numbers and the Baidu Index data have peak times, so we can compare the correlation between the Baidu Index data and the CDC-Diagnosis numbers from the perspective of the time of the first peak and the time difference, as shown in Table 2. From the comparative analysis of Figure 1 and Table 2, we can draw the following conclusions. The actual number of COVID-19 cases in China reached its highest value, 15,152, on 12 February 2020, while the Baidu Index data all reached their peaks before this date; the average first-peak time difference between the Baidu Index data based on the six keywords and the newly diagnosed CDC cases is 18 days. This is mainly because, during the COVID-19 outbreak, people liked to discuss it on social media networks, and the information released about the epidemic was often highly timely, whereas the CDC data come from the national infectious-disease surveillance system, where pneumonia often requires a longer process from onset to diagnosis, usually 7–14 days.

2.1.2. The Influence of the Selected Index on the Result

In order to explore the impact of the selected index on the results, we first examine the distribution of the number of confirmed COVID-19 cases in the data, as shown in Figure 2. It can be seen from the figure that the overall distribution of the target variable deviates from the normal distribution and needs later adjustment. The skewness and kurtosis were then calculated as 10.72 and 140.84, respectively, so it can basically be determined that the skewness of the data in this paper is relatively large and needs to be adjusted.
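As an illustration, the skewness and kurtosis check can be reproduced with `scipy.stats`. This is a minimal sketch on synthetic right-skewed counts (hypothetical data, not the actual case series from this study):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed daily counts, standing in for the real case series
counts = rng.lognormal(mean=3.0, sigma=1.2, size=92)

skewness = stats.skew(counts)
kurt = stats.kurtosis(counts)  # Fisher definition: a normal distribution gives 0
print(f"skew={skewness:.2f}, kurtosis={kurt:.2f}")
```

Large positive values for both statistics, as in the paper's data, signal a heavy right tail that a log transform can compress.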
Figure 3 shows the Q-Q plot of the COVID-19 data set, which judges whether the data conform to the normal distribution by comparing the quantiles of the data with those of the normal distribution. The red line represents the normal distribution, and the blue line represents the sample data; the closer the blue line is to the red reference line, the closer the data are to the expected distribution. The plot further verifies that the data distribution has a large skewness and that data transformation is needed to make it conform to the normal distribution.
Figure 4 shows the relationship between Diagnosis Numbers and other attributes. It can be seen from the figure that the attributes in the data set are basically positively correlated with the attributes of Diagnosis Numbers. Figure 5 shows the relationship between all attributes, which can be represented by a heat map. The heat map uses different colors to intuitively show the relationship between different attributes, which is a very simple way of data interpretation. The values in the figure are calculated using Pearson’s correlation coefficient. The calculation formula of Pearson’s correlation coefficient is
r(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}[X]\operatorname{Var}[Y]}}.
It can be seen from the heat map that the attribute of month is negatively correlated with Diagnosis Numbers. It can be seen from the above analysis that the collected data set has a certain influence on Diagnosis Numbers and can be used for the numerical prediction of Diagnosis Numbers.
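The heat-map values can be checked by hand. The following sketch computes Pearson's r on two hypothetical columns, both from the definition above and with `numpy.corrcoef` (the numbers are illustrative stand-ins, not the paper's data):

```python
import numpy as np

# Toy stand-ins for two columns of the data set (hypothetical values)
diagnosis = np.array([10., 20., 35., 60., 90., 70., 40.])
searches  = np.array([100., 210., 340., 590., 880., 720., 410.])

# Pearson's r from the definition: Cov(X, Y) / sqrt(Var[X] * Var[Y])
cov = np.cov(diagnosis, searches, ddof=1)[0, 1]
r_manual = cov / (np.std(diagnosis, ddof=1) * np.std(searches, ddof=1))

# Same quantity via the library routine
r_numpy = np.corrcoef(diagnosis, searches)[0, 1]
```

An r close to +1, as here, corresponds to the strong positive correlations seen between the search-volume attributes and Diagnosis Numbers.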

2.2. RF-BOA-LightGBM

As a new cutting-edge technology, predictive models based on machine learning have been widely used in various fields of medicine. For example, Y. D. Zhang et al. proposed a new attention network model, namely ANC (attention network for COVID-19) model, which can diagnose COVID-19 more effectively and accurately [9]. X. Zhang et al. enhanced the deep learning network AlexNet to achieve a more effective classification of new coronary pneumonia [10]. Here, we consider using the RF-BOA-LightGBM (random forest-Bayesian optimization algorithm-light gradient boosting machine) model to predict the development trend of the COVID-19.

2.2.1. Model Structure

Figure 6 shows the model structure used in this article. After the data are collected, they need simple processing so that the model can "learn" from them. A LightGBM model is then built for training; however, because LightGBM has many parameters, the effect of training on this data set with the default parameters is not necessarily good, so three hyperparameter tuning algorithms are introduced to tune the LightGBM model parameters. After a parameter combination suitable for this data set has been found, training and prediction are carried out.

2.2.2. Dataset Preprocessing

In order to enable the model to learn fully from the data obtained from the Baidu Index, this article first puts considerable effort into preprocessing the data. As shown above, the distribution of the data in this paper deviates from the normal distribution, so a logarithmic transformation is first applied to bring the data closer to normality. The transformation formula is
y = \log_c(1 + \lambda x).
Then the missing data in the data set are handled by deleting the samples with missing values (there are few such samples, so this has little effect on the results). Subsequently, the date is split into three attributes, year, month, and day, and the year attribute is deleted (it is a fixed value with little effect on the result); this avoids the problem that the model cannot process dates directly. Finally, min–max normalization maps the data into the (0, 1) range, which eliminates the influence of samples of different orders of magnitude. The min–max normalization formula is as follows
X_{\mathrm{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}.
The distribution graph and Q-Q graph of the processed data are shown in Figure 7 and Figure 8 respectively. As can be seen from the figure, the data has basically satisfied the normal distribution.
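The preprocessing steps above (logarithmic transformation followed by min–max normalization) can be sketched as follows; the base c = e and λ = 1 are illustrative assumptions, and the raw values are hypothetical:

```python
import numpy as np

def log_transform(x, lam=1.0):
    # y = log(1 + lambda * x); natural log and lambda = 1 assumed here
    return np.log1p(lam * np.asarray(x, dtype=float))

def min_max(x):
    # Map values into [0, 1]: (X - X_min) / (X_max - X_min)
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical daily counts spanning several orders of magnitude
raw = np.array([0., 5., 40., 300., 2000., 15000.])
scaled = min_max(log_transform(raw))
```

The log step compresses the heavy right tail; the min–max step then removes scale differences between attributes while preserving order.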
This data set contains feature data that are relevant to the number of COVID-19 cases, feature data that are irrelevant, and feature data that are relevant but redundant. Relying only on expert experience and simple correlation analysis is no longer sufficient to identify the important features, so this article uses random forest (RF) out-of-bag estimation to rank the importance of the COVID-19-related features; random forest feature selection then eliminates the features that have little influence on the prediction results.
RF is a combined classifier based on decision trees that can also be used for feature selection [11]. RF uses the bagging method to draw samples with replacement from the original sample set to train each base learner; about 1/3 of the samples are never selected [12], and these are called the out-of-bag (OOB) data. To calculate the importance of a feature, the OOB data are used as a test set for the corresponding base learner, and the test error rate is recorded as the out-of-bag error (errOOB). Noise is then added to the feature in question in the OOB samples, and errOOB is recalculated. The mean decrease in accuracy (MDA), averaged over all base learners, is used as the indicator of feature importance, namely
\mathrm{MDA} = \frac{1}{n} \sum_{t=1}^{n} \left( \mathrm{errOOB}'_t - \mathrm{errOOB}_t \right),
where n is the number of base learners, errOOB'_t is the out-of-bag error of the t-th base learner after adding noise, and errOOB_t is the error before adding noise.
The larger the MDA value, the greater the impact of the corresponding feature on the prediction result, and the higher its importance. This feature-importance calculation is called random forest out-of-bag estimation. Using it, the importance of the COVID-19-related features is ranked and feature selection is performed.
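The permutation idea behind MDA can be approximated with scikit-learn. Note that `permutation_importance` shuffles features on a supplied data set rather than on the OOB samples of each tree, so this is a sketch of the same principle rather than the exact OOB procedure; the data are synthetic, with signal only in the first two features:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 4))
# Target depends mainly on columns 0 and 1; columns 2-3 are pure noise
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + 0.1 * rng.normal(size=n)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Shuffle one feature at a time and measure the score drop
# (the same idea as the out-of-bag MDA described above)
imp = permutation_importance(rf, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]
```

Features whose shuffling barely degrades the score, like columns 2 and 3 here, are the candidates for elimination.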

2.2.3. Tuning Algorithm

For the LightGBM model, many internal hyperparameters affect the prediction results, and the default values may not form the optimal combination for the COVID-19 case-number prediction data set [13]. Therefore, this paper introduces three tuning algorithms, namely grid search, random search, and Bayesian optimization, to optimize some important hyperparameters of LightGBM [14]. Before tuning, the optimization range of each hyperparameter is generally set first. The three algorithms are briefly described below.
Grid search divides the search range into a grid and steps through it according to the set step size, training the model until all possible parameter combinations have been verified, and finally outputs the combination that gives the best result [15]. Because every combination of hyperparameters must be evaluated separately, grid search becomes very slow when the number of hyperparameters and the search range are large.
Random search is similar to grid search, but instead of verifying all possible parameter combinations, it samples a random value for each parameter, so random search is faster than grid search [16]. However, random search may miss the parameter combination that maximizes the prediction performance.
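A minimal sketch of grid search versus random search with scikit-learn follows. `GradientBoostingRegressor` stands in for LightGBM so that the example runs without the `lightgbm` package, and the tiny two-parameter grid is purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}

# Grid search: evaluates every combination (4 candidates here)
grid = GridSearchCV(GradientBoostingRegressor(random_state=0),
                    param_grid, cv=3).fit(X, y)

# Random search: samples a fixed number of combinations instead
rand = RandomizedSearchCV(GradientBoostingRegressor(random_state=0),
                          param_distributions=param_grid, n_iter=3, cv=3,
                          random_state=0).fit(X, y)
```

With a grid this small the two searches cost about the same; the speed advantage of random search appears once the grid grows to hundreds of combinations.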
The Bayesian optimization algorithm (BOA) can quickly find optimal parameters for the problem at hand based on historical evaluations [17]. The problem scenario for Bayesian optimization is
x^* = \operatorname*{argmax}_{x \in S} f(x),
where x is the parameter to be optimized and S is the candidate set of x, that is, the set of possible values of x. The goal is to select an x from S such that the value of f(x) is largest (or smallest). The specific formula of f(x) may be unknown, i.e., it is a black-box function, but one can choose an x and obtain the value of f(x) through experiment or observation [18].
BOA has two core components: the prior function (PF) and the acquisition function (AC), also called the utility function. Under the framework of Bayesian decision theory, many acquisition functions can be interpreted as evaluating the expected loss associated with evaluating f at a point x, and the point with the lowest expected loss is then selected [19]. The PF mainly uses Gaussian process regression; common acquisition functions include EI (expected improvement), PI (probability of improvement), and UCB (upper confidence bound). This article uses the EI function, which can find the global optimum without becoming trapped in a local optimum. The improvement function is
u(x) = \max\left(0, f' - f(x)\right),
where f' is the best function value observed so far and f(x) is the value of the objective at x.
The resulting acquisition function for a candidate x is
a_{EI}(x) = \mathbb{E}[u(x) \mid x, D] = \int_{-\infty}^{f'} (f' - f)\, N\!\left(f; \mu(x), K(x, x)\right) df = \left(f' - \mu(x)\right) \Phi\!\left(f'; \mu(x), K(x, x)\right) + K(x, x)\, N\!\left(f'; \mu(x), K(x, x)\right).
The point at which a_{EI} attains its maximum is the next point to evaluate. Formula (7) has two components; to maximize their sum, the left term favors reducing the posterior mean μ(x) as much as possible (exploitation), while the right term favors a larger variance K(x, x) (exploration). This is the typical exploration and exploitation trade-off.
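Under a Gaussian posterior, the EI integral above has the closed form (f' − μ)Φ(z) + σφ(z) with z = (f' − μ)/σ. The following sketch evaluates it with `scipy.stats.norm` and illustrates the exploration and exploitation behavior; the numeric inputs are hypothetical:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization: E[max(0, f_best - f)]
    where f ~ N(mu, sigma^2)."""
    sigma = max(sigma, 1e-12)            # guard against zero variance
    z = (f_best - mu) / sigma
    return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# A low predicted mean (exploitation) and a high variance (exploration)
# both raise EI; a point that is confidently worse gets almost none.
ei_exploit = expected_improvement(mu=0.5, sigma=0.1, f_best=1.0)
ei_explore = expected_improvement(mu=1.0, sigma=0.5, f_best=1.0)
ei_poor    = expected_improvement(mu=2.0, sigma=0.1, f_best=1.0)
```

BOA repeatedly maximizes this quantity over the candidate set to pick the next hyperparameter combination to try.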
The upper confidence bound (UCB) can be understood simply as the upper boundary of the confidence region; it is usually described for maximization of f. In the minimization case, the acquisition function takes the form
a_{UCB}(x) = \mu(x) - \beta \sigma(x),
where β > 0 is a strategy parameter and σ(x) = \sqrt{K(x, x)} is the marginal standard deviation of f(x). UCB likewise combines an exploitation term (μ(x)) and an exploration term (σ(x)), and under certain conditions it converges to the global optimum.
Table 3 shows the hyperparameter combinations selected in this article and the corresponding descriptions.

2.2.4. LightGBM

LightGBM is an open-source decision-tree-based gradient boosting framework proposed by Microsoft. As an improved version of Gradient Boosting, it features high accuracy, high training efficiency, support for parallelism and GPUs, low memory requirements, and the ability to handle large-scale data [20].
According to how the base learners are generated, ensemble learning can be divided into parallel and serial learning. As the most typical representative of serial learning, the Boosting family includes AdaBoost and Gradient Boosting. The main difference between them is that the former improves the model by increasing the weights of misclassified data points, while the latter improves the model by computing negative gradients. The core idea of Gradient Boosting is to use the negative gradient of the loss function, evaluated at the current model f(x) = f_{j-1}(x), as an approximation to the residual. Suppose the training samples are indexed by i (i = 1, 2, …, n), the iteration number is j (j = 1, 2, …, m), and the loss function is L(y_i, f(x_i)); then the negative gradient r_{ij} can be expressed as
r_{ij} = -\left[ \frac{\partial L\left(y_i, f(x_i)\right)}{\partial f(x_i)} \right]_{f(x) = f_{j-1}(x)}.
A base learner h_j(x) is fitted to the negative gradients r_{ij} of the loss function, and the step size r_j that minimizes the loss is found:
r_j = \operatorname*{argmin}_{r} \sum_{i=1}^{n} L\left(y_i, f_{j-1}(x_i) + r\, h_j(x_i)\right).
Model update:
f j ( x ) = f j 1 ( x ) + r j h j ( x ) .
Gradient Boosting generates one base learner in each round of iteration; after m rounds, the final strong learner F(x) is obtained by linearly adding the base learners generated in each round:
F(x) = f_m(x).
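The derivation above can be sketched as a from-scratch Gradient Boosting loop with squared loss, where the negative gradient reduces to the plain residual. Depth-2 `DecisionTreeRegressor` trees stand in for the base learners, and the data are synthetic:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Gradient Boosting with squared loss: the negative gradient of
# 0.5*(y - f)^2 is the residual y - f_{j-1}(x), which each tree fits.
f = np.full_like(y, y.mean())            # f_0: constant initial model
learning_rate, trees = 0.1, []
for _ in range(100):
    residual = y - f                     # r_{ij}: negative gradient
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    f += learning_rate * tree.predict(X) # update: f_j = f_{j-1} + r * h_j
    trees.append(tree)                   # kept so F(x) can score new data

mse = np.mean((y - f) ** 2)
```

After 100 rounds the additive model F(x) = f_m(x) fits the noisy sine curve far better than the constant baseline f_0.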
As an improved, lightweight Gradient Boosting algorithm, LightGBM rests on the following core ideas: the histogram algorithm, a leaf-wise growth strategy with a depth limit, direct support for categorical features, histogram feature optimization, multithreading optimization, and cache-hit-rate optimization. The first two effectively control the complexity of the model and make the algorithm lightweight, so they are of particular interest in this article.
The histogram algorithm discretizes continuous floating-point features into L integers to construct a histogram of width L. When traversing the data, the discretized value is used as an index to accumulate statistics in the histogram. After one pass over the data, the histogram has accumulated the necessary statistics, and the optimal split point is then found among the discrete values of the histogram.
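The histogram trick can be sketched in a few lines of NumPy. The quantile-based bin edges and the accumulation of per-bin gradient sums below are illustrative simplifications of what LightGBM does internally:

```python
import numpy as np

def build_histogram(feature, gradient, n_bins=16):
    """Discretize a continuous feature into n_bins integer bins and
    accumulate per-bin gradient sums, as in the histogram algorithm."""
    # Interior quantile edges give n_bins bins over the feature range
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(feature, edges)       # integer bin index per sample
    grad_sum = np.bincount(bins, weights=gradient, minlength=n_bins)
    count = np.bincount(bins, minlength=n_bins)
    return bins, grad_sum, count

rng = np.random.default_rng(3)
x = rng.normal(size=1000)                    # one continuous feature
g = rng.normal(size=1000)                    # per-sample gradients
bins, grad_sum, count = build_histogram(x, g, n_bins=16)
```

Split-point search then scans only the 16 bin boundaries instead of all 1000 distinct feature values, which is the source of the speed and memory savings.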
The traditional (level-wise) growth strategy splits all leaves of the same level at the same time; in fact, the splitting gain of many leaves is low and there is no need to split them, which brings a lot of unnecessary overhead. Instead, LightGBM uses a more efficient leaf-wise strategy: at each step it finds the leaf with the largest splitting gain among all current leaves and splits it, subject to a maximum depth limit. This ensures high efficiency while preventing the model from overfitting.

3. Results and Discussion

3.1. Performance Predictor

All models are cross-validated, and the coefficient of determination (R²), mean absolute error (MAE), relative absolute error (RAE), relative square root error (RRSE), and root mean square error (RMSE) are calculated as follows:
R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\mathrm{MAE}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,
\mathrm{RAE}(y, \hat{y}) = \frac{\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|}{\sum_{i=1}^{n} \left| y_i - \bar{y} \right|},
\mathrm{RRSE}(y, \hat{y}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
where y represents the true values, ŷ the predicted values, ȳ the mean of the true values, and n the number of test samples.
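The five metrics can be implemented directly from the formulas above. This sketch also checks the identity RRSE² = 1 − R², which follows from the definitions; the sample values are made up for illustration:

```python
import numpy as np

def regression_metrics(y, y_hat):
    """R2, RMSE, MAE, RAE, and RRSE, computed from their definitions."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    err, dev = y - y_hat, y - y.mean()
    return {
        "R2":   1 - np.sum(err**2) / np.sum(dev**2),
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAE":  np.mean(np.abs(err)),
        "RAE":  np.sum(np.abs(err)) / np.sum(np.abs(dev)),
        "RRSE": np.sqrt(np.sum(err**2) / np.sum(dev**2)),
    }

# Hypothetical true and predicted case counts
m = regression_metrics([3., 5., 8., 12.], [2.5, 5.5, 7.0, 13.0])
```

Note that RAE and RRSE normalize by the error of the naive mean predictor, so values below 1 indicate the model beats that baseline.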

3.2. Experiment Results

Figure 9 shows the result of feature selection using the random forest, with the features output in descending order of importance. It can be seen from the figure that Death Toll has the greatest impact on Diagnosis Numbers, while Month has the least. Finally, we selected the seven most influential attributes for predicting Diagnosis Numbers.
According to the optimal parameter set of the model, the Diagnosis Numbers prediction model of COVID-19 is constructed. In this paper, LightGBM, GridSearch-LightGBM, RandomSearch-LightGBM, and BOA-LightGBM models are used for Diagnosis Numbers prediction. Table 4 shows the specific values of the optimal parameter combinations found by the three tuning algorithms.
Table 5 shows the evaluation indicators of the prediction results of the four models, evaluated by R², RMSE, MAE, RAE, and RRSE. From the values of the five indicators, BOA-LightGBM outperforms the other models, while RandomSearch-LightGBM and GridSearch-LightGBM each have their own advantages and disadvantages. It can also be seen that the default LightGBM hyperparameters are not suitable for predicting the Diagnosis Numbers of COVID-19 in this article. Overall, BOA-LightGBM best captures the relationships in the historical data and can effectively predict the Diagnosis Numbers of COVID-19, which demonstrates the superiority of the model.
Figure 10 is a line chart of the Diagnosis Numbers predicted by the four algorithms; only part of the data is shown on the abscissa. The line chart shows the prediction effect of the models more intuitively. In most cases, the BOA-LightGBM model fits the fluctuation trend of Diagnosis Numbers well, and its predicted values are very close to the actual values. The points predicted by GridSearch-LightGBM are largely covered by the other curves and so are hard to see in the figure, which indicates that its predictions are not especially distinctive. The prediction of plain LightGBM is occasionally better than that of the other models but inferior most of the time. Overall, the BOA-LightGBM model best follows the changing trend of the real values.

4. Conclusions

This study uses an Internet big-data tool, the Baidu Index, to forecast the development trend of the COVID-19 epidemic. By selecting appropriate keywords, data on COVID-19 cases in China from 1 January 2020 to 1 April 2020 were collected. After the data set was preprocessed, the random forest feature selection method was used to obtain the optimal sub-data set, and the results of optimizing seven hyperparameters of the LightGBM model with grid search, random search, and Bayesian optimization were compared and analyzed. We conclude that applying the data set obtained from the Baidu Index to the Bayesian-optimized LightGBM model better predicts the increase in the number of COVID-19 cases, and it can serve as a good aid for medical institutions to predict new COVID-19 case numbers in the future.

Author Contributions

Conceptualization, D.H.H.; methodology, Z.L. and D.H.H.; formal analysis, Z.L.; data curation, Z.L.; writing—original draft preparation, Z.L.; writing—review and editing, D.H.H.; validation, D.H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available at School of Life Sciences, Central South University, China.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Datasets from CDC and Baidu Index Search

Table A1. Datasets from CDC and Baidu Index search.
Date | CDC-Diagnosis | Baidu-New Coronavirus | Baidu-Fever | Baidu-Dry Cough | Baidu-Fatigue | Baidu-Dyspnea | Baidu-Cough | CDC-Death Toll
1 January 2020004001110025648158850
2 January 2020004323120627860264480
3 January 2020104212117326265463920
4 January 2020004309110927062165700
5 January 2020504327111827159165640
6 January 2020004324122631069364040
7 January 2020003920117528863358750
8 January 2020003803112427262253540
9 January 2020088123693113127057951820
10 January 2020020323700109526353550220
[Appendix table: full daily data from the CDC and Baidu Index keyword searches, 11 January–1 April 2020, with the same columns as Table 1 (date; CDC new diagnoses; Baidu Index for "New coronavirus", "Fever", "Dry cough", "Fatigue", "Dyspnea", and "Cough"; CDC death toll). The numeric columns were fused together during extraction and individual values are not recoverable.]
Note: CDC = Centers of Disease Control.

References

  1. Lu, L.; Zou, Y.Q.; Peng, Y.S.; Li, K.L.; Jiang, T.J. Comparison of Baidu index and Weibo index in surveillance of influenza virus in China. Appl. Res. Comput. 2016, 33, 392–395.
  2. Chen, Y.; Zhang, Y.Z.; Xu, Z.W.; Wang, X.Z.; Lu, J.H.; Hu, W.B. Avian Influenza A (H7N9) and related Internet search query data in China. Sci. Rep. 2019, 9, 10434.
  3. Fung, I.C.H.; Fu, K.W.; Ying, Y.C.; Schaible, B.; Hao, Y.; Chan, C.H.; Tse, Z.T.H. Chinese social media reaction to the MERS-CoV and avian influenza A(H7N9) outbreaks. Infect. Dis. Poverty 2013, 2, 31.
  4. Gu, H.G.; Zhang, W.J.; Xu, H.; Li, P.Y.; Wu, L.L.; Guo, P.; Hao, Y.T.; Lu, J.H.; Zhang, D.M. Predicating risk area of human infection with avian influenza A (H7N9) virus by using early warning model in China. Chin. J. Epidemiol. 2015, 36, 470–475.
  5. COVID-19 Coronavirus Data. Available online: https://data.europa.eu/euodp/en/data/dataset/covid-19-coronavirus-data (accessed on 14 December 2020).
  6. Zhao, X.M.; Li, X.H.; Nie, C.H. Retrospecting the spread of new coronary pneumonia based on big data and China's control of the epidemic. Bull. Chin. Acad. Sci. 2020, 35, 248–255.
  7. McCall, B. COVID-19 and artificial intelligence: Protecting health-care workers and curbing the spread. Lancet Digit. Health 2020, 2, 166–167.
  8. Baidu Index. Available online: http://index.baidu.com/ (accessed on 1 April 2020).
  9. Zhang, Y.D.; Zhang, X.; Zhu, W.G. ANC: Attention network for COVID-19 explainable diagnosis based on convolutional block attention module. Cmes-Comp. Model. Eng. 2021, 127, 1037–1058.
  10. Zhang, X.; Lu, S.Y.; Wang, S.H.; Yu, X.; Wang, S.J.; Yao, L.; Pan, Y.; Zhang, Y.D. Diagnosis of COVID-19 pneumonia via a novel deep learning architecture. J. Comput. Sci. Tech. 2021, 1.
  11. Sylvester, E.V.A.; Bentzen, P.; Bradbury, I.R.; Clement, M.; Pearce, J.; Horne, J.; Beiko, R.G. Applications of random forest feature selection for fine-scale genetic population assignment. Evol. Appl. 2018, 11, 153–165.
  12. Li, X.K.; Chen, W.; Zhang, Q.R.; Wu, L.F. Building auto-encoder intrusion detection system based on random forest feature selection. Comput. Secur. 2020, 95, 101851.
  13. Al Daoud, E. Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset. Int. J. Comput. Inf. Eng. 2019, 13, 6–10.
  14. Frazier, P.I. A tutorial on Bayesian optimization. arXiv 2018, arXiv:1807.02811.
  15. Liashchynskyi, P.; Liashchynskyi, P. Grid search, random search, genetic algorithm: A big comparison for NAS. arXiv 2019, arXiv:1912.06059.
  16. Wang, Y.; Wang, T. Application of improved LightGBM model in blood glucose prediction. Appl. Sci. 2020, 10, 3227.
  17. Liang, X. Image-based post-disaster inspection of reinforced concrete bridge systems using deep learning with Bayesian optimization. Comput.-Aided Civ. Inf. 2019, 34, 415–430.
  18. Jones, D.R.; Schonlau, M.; Welch, W.J. Efficient global optimization of expensive black-box functions. J. Global Optim. 1998, 13, 455–492.
  19. Sameen, M.I.; Pradhan, B.; Lee, S. Application of convolutional neural networks featuring Bayesian optimization for landslide susceptibility assessment. Catena 2020, 186, 104249.
  20. Liang, W.Z.; Luo, S.Z.; Zhao, G.Y.; Wu, H. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics 2020, 8, 765.
Figure 1. The numbers released by the CDC and the Baidu Index data based on keyword searches. (a) shows the number of new diagnoses released by the CDC. (b–g) show the Baidu Index data for the keywords "New coronavirus", "Fever", "Dry cough", "Fatigue", "Dyspnea", and "Cough", respectively. (h) shows the death toll released by the CDC. CDC = Centers of Disease Control.
Figure 2. Original diagnosis numbers distribution diagram.
Figure 3. Original diagnosis numbers Q-Q diagram.
Figure 4. The impact of all attributes on diagnosis numbers. (a,d) show the trend of new diagnosis numbers by month and by day, respectively. (b) shows the relationship between the diagnosis numbers and the death toll. (e) shows the relationship between the diagnosis numbers and the new diagnoses released by the CDC. (c,f,g,h,i) show the relationships between the diagnosis numbers and the Baidu Index data based on keyword search, corresponding to the keywords "Fatigue", "Fever", "Dry cough", "Dyspnea", and "Cough", respectively.
Figure 5. Heat map between variables.
Figure 6. RF-BOA-LightGBM structure. BOA = Bayesian optimization algorithm.
Figure 7. Distribution of diagnosis numbers after data conversion.
Figure 8. Diagnosis numbers Q-Q diagram after data conversion.
Figure 9. Feature selection results.
Figure 10. Comparison of predicted and true values of the four models. BOA, Bayesian optimization algorithm; GBM, gradient boosting machine.
Table 1. Partial data from CDC and Baidu Index search.
Date | CDC-Diagnosis | Baidu-New Coronavirus | Baidu-Fever | Baidu-Dry Cough | Baidu-Fatigue | Baidu-Dyspnea | Baidu-Cough | CDC-Death Toll
[Rows for 1 January–15 January 2020; the numeric columns were fused together during extraction and individual values are not recoverable.]
Note: CDC = Centers of Disease Control.
Table 2. Comparison of peak time between Baidu Index data based on different keywords and CDC new diagnostic data.
Category | First Peak Time | Time Difference (Days)
CDC-Diagnostic | 12 February 2020 | -
Baidu-New coronavirus | 25 January 2020 | +18
Baidu-Fever | 26 January 2020 | +17
Baidu-Dry cough | 23 January 2020 | +20
Baidu-Fatigue | 25 January 2020 | +18
Baidu-Dyspnea | 25 January 2020 | +18
Baidu-Cough | 25 January 2020 | +18
Arithmetic mean | - | +18
Note: + indicates the number of days by which the Baidu Index peak preceded the CDC peak; - indicates the number of days by which it lagged. CDC = Centers of Disease Control.
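The lead times in Table 2 can be computed mechanically from two daily series by locating each series' first peak and differencing the dates. A minimal sketch in pure Python (the function name and the toy series are illustrative, not from the paper; the peak dates match Table 2's CDC and "New coronavirus" rows):

```python
from datetime import date

def peak_lead_days(reference, candidate):
    """Days by which `candidate`'s first peak precedes `reference`'s first peak.

    Each series is a list of (date, value) pairs; the first peak is taken as
    the earliest date attaining the series' maximum value.
    """
    def first_peak(series):
        peak_value = max(v for _, v in series)
        return next(d for d, v in series if v == peak_value)

    return (first_peak(reference) - first_peak(candidate)).days

# Toy series peaking on the dates reported in Table 2:
cdc = [(date(2020, 2, 11), 10), (date(2020, 2, 12), 100), (date(2020, 2, 13), 60)]
baidu = [(date(2020, 1, 24), 5), (date(2020, 1, 25), 80), (date(2020, 1, 26), 40)]
print(peak_lead_days(cdc, baidu))  # 18
```

A positive result corresponds to a "+" entry in Table 2, i.e., the search-index peak leads the case-count peak.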
Table 3. The LightGBM hyperparameters selected in this article and their functions.
Parameter | Style | Search Scope | Effect
learn_rate | float | (0.001, 0.3) | improve accuracy
max_depth | int | (3, 10) | prevent overfitting
num_leaves | int | (3, 1024) | improve accuracy
min_data_in_leaf | int | (0, 80) | prevent overfitting
feature_fraction | float | (0.2, 0.9) | accelerate training
bagging_fraction | float | (0.2, 0.9) | accelerate training
lambda_l1 | float | (0, 10) | prevent overfitting
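The search scopes in Table 3 can be encoded as a plain dictionary and sampled uniformly, which is exactly what random search does; grid search would enumerate a lattice over the same bounds, and a Bayesian optimizer would propose points from a surrogate model instead. A minimal sketch (the names and structure are our own, not the paper's code):

```python
import random

# Search space mirroring Table 3: name -> (style, lower bound, upper bound).
SEARCH_SPACE = {
    "learn_rate":       ("float", 0.001, 0.3),
    "max_depth":        ("int",   3, 10),
    "num_leaves":       ("int",   3, 1024),
    "min_data_in_leaf": ("int",   0, 80),
    "feature_fraction": ("float", 0.2, 0.9),
    "bagging_fraction": ("float", 0.2, 0.9),
    "lambda_l1":        ("float", 0, 10),
}

def sample_params(space, rng=random):
    """Draw one candidate configuration uniformly, as random search would."""
    params = {}
    for name, (style, lo, hi) in space.items():
        params[name] = rng.randint(lo, hi) if style == "int" else rng.uniform(lo, hi)
    return params

candidate = sample_params(SEARCH_SPACE)
print(sorted(candidate))
```

Each call yields one hyperparameter configuration to train and score; the tuning loop keeps the configuration with the best validation metric.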
Table 4. Specific parameter values found by three tuning algorithms.
Parameter | GridSearch | RandomSearch | BOA
learn_rate | 0.632 | 0.828 | 0.355
max_depth | 7 | 8 | 5
num_leaves | 225 | 237 | 249
min_data_in_leaf | 33 | 27 | 30
feature_fraction | 0.7 | 0.7 | 0.8
bagging_fraction | 0.7 | 0.7 | 0.8
lambda_l1 | 2.34 | 3.45 | 1.80
Note: BOA = Bayesian optimization algorithm.
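The BOA column of Table 4 maps onto a LightGBM training configuration roughly as follows. A sketch under two assumptions: the `objective` and `metric` keys are illustrative (the paper does not state them), and LightGBM's own name for the learning rate is `learning_rate` rather than Table 4's `learn_rate`:

```python
# BOA-selected values from Table 4, arranged as a LightGBM params dict.
boa_params = {
    "objective": "regression",   # assumed; the paper predicts case counts
    "metric": "rmse",            # assumed, matching the RMSE index in Table 5
    "learning_rate": 0.355,
    "max_depth": 5,
    "num_leaves": 249,
    "min_data_in_leaf": 30,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "lambda_l1": 1.80,
}

# Training would then look like (requires the lightgbm package):
#   import lightgbm as lgb
#   booster = lgb.train(boa_params, lgb.Dataset(X_train, label=y_train))
print(boa_params["num_leaves"])
```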
Table 5. Model evaluation index.
Models | R² | RMSE | MAE | RAE | RRSE
LightGBM | 0.820 | 354.945 | 138.939 | 0.535 | 0.424
GridSearch-LightGBM | 0.865 | 311.918 | 145.266 | 0.548 | 0.368
RandomSearch-LightGBM | 0.861 | 316.217 | 137.621 | 0.533 | 0.373
BOA-LightGBM | 0.879 | 295.686 | 124.911 | 0.508 | 0.348
Note: GBM, gradient boosting machine; BOA, Bayesian optimization algorithm; R², coefficient of determination; RMSE, root mean square error; MAE, mean absolute error; RAE, relative absolute error; RRSE, root relative squared error.
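The five indices in Table 5 follow standard definitions (assumed here, since the paper does not spell out the formulas). Under those definitions RRSE = sqrt(1 − R²), which the table's LightGBM row is consistent with (0.424² ≈ 1 − 0.820). A self-contained sketch with toy values:

```python
from math import sqrt

def evaluate(y_true, y_pred):
    """Compute the five evaluation indices of Table 5 (standard definitions)."""
    n = len(y_true)
    mean = sum(y_true) / n
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # sum of squared errors
    sst = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    sae = sum(abs(t - p) for t, p in zip(y_true, y_pred))    # sum of absolute errors
    sad = sum(abs(t - mean) for t in y_true)                 # total absolute deviation
    return {
        "R2": 1 - sse / sst,
        "RMSE": sqrt(sse / n),
        "MAE": sae / n,
        "RAE": sae / sad,
        "RRSE": sqrt(sse / sst),
    }

scores = evaluate([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.9])
print(round(scores["RMSE"], 3))  # 0.132
```

Higher R² and lower RMSE/MAE/RAE/RRSE are better, so Table 5 ranks BOA-LightGBM first on every index.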
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Li, Z.; Hu, D. Forecast of the COVID-19 Epidemic Based on RF-BOA-LightGBM. Healthcare 2021, 9, 1172. https://doi.org/10.3390/healthcare9091172