Article

Multiclassification Prediction of Clay Sensitivity Using Extreme Gradient Boosting Based on Imbalanced Dataset

1 College of Environment and Civil Engineering, Chengdu University of Technology, Chengdu 610059, China
2 State Key Laboratory of Mountain Bridge and Tunnel Engineering, Chongqing Jiaotong University, Chongqing 400074, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(3), 1143; https://doi.org/10.3390/app12031143
Submission received: 4 December 2021 / Revised: 30 December 2021 / Accepted: 19 January 2022 / Published: 21 January 2022
(This article belongs to the Topic Artificial Intelligence (AI) Applied in Civil Engineering)

Abstract

Predicting clay sensitivity is important to geotechnical engineering design related to clay. Classification charts and field tests have been used to predict clay sensitivity, but the imbalanced distribution of sensitivity classes is often neglected, leaving room for improvement in predictive accuracy. The purpose of this study was to investigate the performance of extreme gradient boosting (XGBoost) in predicting multiple classes of clay sensitivity, and the ability of the synthetic minority over-sampling technique (SMOTE) to address imbalanced sensitivity categories. Six clay parameters were used as the input parameters of XGBoost, and SMOTE was used to deal with the imbalanced classes. The dataset was then divided using the cross-validation (CV) method. Finally, XGBoost, an artificial neural network (ANN), and Naive Bayes (NB) were used to classify clay sensitivity. The F1 score, the receiver operating characteristic (ROC), and the area under the ROC curve (AUC) were considered as performance indicators. The results revealed that XGBoost showed the best performance in the multiclassification prediction of clay sensitivity, with an F1 score of 0.72 and a mean AUC of 0.89. SMOTE was useful in addressing imbalanced classes, and XGBoost proved an effective and reliable method for classifying clay sensitivity.

1. Introduction

Soft clays are widely distributed near lakes, rivers, and coastal areas in countries such as Sweden, Norway, Canada, Thailand, and China [1,2,3]. In terms of grain size, clay is a fine-grained mineral (<2 µm in size) and a main component of soil [4]. Clay minerals belong to the family of phyllosilicates and provide information on formation conditions and diagenesis [4]. Additionally, clay can be used as an additive in green processing technologies and sustainable development, such as medical materials and treatments, agriculture, building materials, and adsorbents of organic pollutants in soil, water, and air [5,6,7,8,9,10]. For engineers, clays are characterized by high compressibility, low shear strength, and high sensitivity. Sensitivity is defined as the ratio of the unconfined compressive strength of an undisturbed sample to that of the remolded sample [11,12,13].
Nowadays, in situ and laboratory tests and classification charts are often used to predict clay sensitivity. Piezocone penetration tests (CPTu) and field vane tests (FVT) are commonly carried out to obtain the shear strength and classify clay sensitivity [14,15,16,17,18]. Yafrate et al. [19] employed full-flow penetrometers to evaluate remolded soil strength and clay sensitivity. Abbaszadeh Shahri et al. [20] used the Unified Soil Classification System (USCS) to classify soils and high-resolution data to detect potentially sensitive clays. Different soil classification charts are also widely used to determine clay sensitivity or soil type [13,21]. For example, Robertson [22] proposed a few updated charts to predict soil type based on CPTu data, and Gylland et al. [23] used the pore pressure ratio and modified cone resistance to build a set of diagrams identifying sensitive and quick clays.
However, in situ tests are costly and time-consuming [14], and construction conditions are complicated, so it is difficult to accurately measure the sensitivity of highly sensitive and quick clays. Classification charts also pose a great challenge when the site clays are not textbook soils, which makes clay sensitivity difficult to determine [13]. Therefore, advanced methods are required to resolve this issue.
Artificial intelligence (AI) approaches, such as machine learning (ML) and deep learning, are developing rapidly. Machine learning methods are data-driven tools that learn from the relationships within existing data [24]; hence, ML does not assume a statistical model [25,26,27]. These techniques have been widely applied in engineering [28,29,30,31,32,33]. For example, Zhang et al. [34] used XGBoost and Bayesian optimization to predict the shear strength of soft clays. Machine learning methods often outperform traditional methods [27,28]. Moreover, XGBoost is an excellent ensemble method that outperforms conventional ML approaches and shows great potential in geotechnical engineering [34].
The high sensitivity of clays is one of the main properties in soft clay engineering and considerably influences the safety of structures built on such soils [35]. For example, the shear strength decreases due to the excavation disturbance of soft clays, which is related to the sensitivity [3]. Landslides are also related to the sensitivity of geomaterials [36,37]. Therefore, clay sensitivity must be predicted to ensure the safety of geotechnical engineering. There are a few different classification schemes for clay sensitivity [11,38,39], as shown in Figure 1; the scheme of the Canadian Foundation Engineering Manual [38] is adopted here to simplify the classification issue.
Few studies have focused on sensitivity classification of soft clay using machine learning methods. Shahri et al. [37] used an optimized ANN to predict clay sensitivity, which was the first application of machine learning to this problem. However, because clay sensitivity is a categorical value (Figure 1), it would be more convenient and direct if the predicted result were a category rather than a continuous value. Godoy et al. [13] used different machine learning methods to identify quick and highly sensitive clays and found that classification accuracy improved significantly despite limited training data. These approaches nevertheless have a few shortcomings. First, there are more than two sensitivity categories in the different classification charts (Figure 1), but few studies have investigated the multiclassification prediction of sensitivity; the previous study [13] only considered two categories, namely quick and highly sensitive clays. Therefore, the multiclassification of clay sensitivity should be further investigated. Second, the distribution of sensitivity categories is imbalanced, which influences machine learning methods and reduces prediction accuracy [40]. Therefore, SMOTE is used here to address the imbalanced classes of the dataset and improve the accuracy of the results [40]; SMOTE has already been successfully applied in geotechnical engineering [41].
The objective of this study was to investigate the potential of XGBoost and SMOTE with regard to predicting the categories of clay sensitivity based on imbalanced datasets. First, the considered dataset, XGBoost, and SMOTE are introduced. Next, SMOTE is used to address the imbalanced categories, and the input parameters include the vertical pre-consolidation pressure (VPP), liquid limit (LL), plastic limit (PL), effective vertical pressure (EVP), depth (Dep), and moisture content (W). Then, the F1 score and the area under the receiver operating characteristic (ROC) curve (AUC) are considered as performance indicators to evaluate the proposed model. Finally, the predictive accuracy of XGBoost is compared with that of other methods.

2. Materials and Methods

2.1. XGBoost

The XGBoost method provides an advanced boosted tree model [42] and is a common machine learning method with high accuracy. It implements a regularized learning objective that is simpler than the regularized greedy forest model. The prediction function of XGBoost is defined as follows:
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \Gamma \qquad (1)
where ŷ_i is the predicted output value of the ith instance; K is the number of regression trees; f_k is the kth tree structure; x_i is the feature vector of the ith sample; and Γ is the space of regression trees.
To overcome overfitting problems, a penalty function is used to smooth the learning weights as follows:
Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \qquad (2)
\Omega(f_k) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 \qquad (3)
where Obj is the regularized objective function; the first term is the loss function l(y_i, ŷ_i), which measures the model accuracy; Ω(f_k) is the penalty function that handles overfitting; y_i is the real value of the ith sample; γ is the complexity cost of each additional leaf; T is the number of leaves; λ is a regularization parameter; and w_j is the weight of the jth leaf node.
The additive method is used to train the model as follows [42]:
Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \qquad (4)
where ŷ_i^(t−1) is the prediction of the ith sample at iteration t − 1, and f_t is added to minimize the objective. To rapidly optimize the loss function of the first term in Equation (4), a second-order Taylor expansion can be written as
Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (5)
where g_i = ∂_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) and h_i = ∂²_{ŷ^(t−1)} l(y_i, ŷ^(t−1)) are the first- and second-order gradient statistics of the loss function, respectively. The constant term is removed to obtain the simplified objective at step t as follows:
Obj^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \qquad (6)
The parameters in Equation (6) can be continuously updated until the stopping conditions are satisfied. More details on XGBoost can be found in ref. [42]. XGBoost has been used to predict the shear strength of clay, and the results demonstrate that it is a promising tool for predicting geotechnical parameters [34]. Its potential for predicting multiple categories of clay sensitivity is investigated here.
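To make the additive step concrete, minimizing the simplified objective in Equation (6) for a fixed tree structure yields a closed-form optimal weight for each leaf, w* = −G/(H + λ), where G and H sum the g_i and h_i over the instances falling in that leaf [42]. The numpy sketch below (an illustration, not the paper's code) evaluates this weight for a squared loss, where g_i = ŷ_i − y_i and h_i = 1:

```python
import numpy as np

def optimal_leaf_weight(g, h, lam):
    """Weight minimizing sum_i(g_i*w + 0.5*h_i*w^2) + 0.5*lam*w^2 for one leaf."""
    G, H = np.sum(g), np.sum(h)
    return -G / (H + lam)

# For squared loss l = 0.5*(y - y_hat)^2: g_i = y_hat_i - y_i, h_i = 1.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.zeros(3)            # predictions before this boosting round
g = y_hat - y                  # first-order gradients
h = np.ones_like(g)            # second-order gradients
w = optimal_leaf_weight(g, h, lam=1.0)   # -(-2) / (3 + 1) = 0.5
```

XGBoost repeats this weight computation for every candidate split when growing each tree f_t.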

2.2. SMOTE

The synthetic minority over-sampling technique (SMOTE) was first proposed by Chawla et al. [40] to solve the classification problems of imbalanced datasets. A dataset is imbalanced if the distribution of categories is unequal, which results in low classification accuracy. A few methods have been proposed to address the imbalance issue, such as re-sampling the original dataset, over-sampling the minority categories, and under-sampling the majority class [43]. However, these methods do not considerably enhance the accuracy of minority-class recognition [20]. SMOTE instead creates synthetic examples rather than performing typical over-sampling. First, for each sample x_l in the minority class, the Euclidean distance between x_l and the other minority-class samples is calculated, and the k nearest neighbors of x_l are obtained. Then, a sample x_m is randomly selected from the k neighbors. Finally, the new sample x_n is expressed as follows:
x_n = x_l + \lambda (x_m - x_l) \qquad (7)
where λ is a random number in the range of 0–1.
SMOTE has been utilized to solve imbalanced rock mass classification in tunneling engineering [41]. The categories of clay sensitivity comprise more than two classes (Figure 1), which may cause imbalanced problems. Therefore, SMOTE could be used to address imbalanced classes.
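As an illustration of the interpolation above (not the scikit-learn/imbalanced-learn implementation), the following numpy sketch synthesizes one minority-class sample: it finds the k nearest minority neighbors of x_l by Euclidean distance, picks one neighbor x_m at random, and interpolates with a random λ in [0, 1). The toy 2D points are assumptions for the example:

```python
import numpy as np

def smote_sample(minority, idx, k, rng):
    """Synthesize one minority sample per x_n = x_l + lam * (x_m - x_l)."""
    x_l = minority[idx]
    d = np.linalg.norm(minority - x_l, axis=1)   # Euclidean distances
    d[idx] = np.inf                              # exclude x_l itself
    neighbors = np.argsort(d)[:k]                # k nearest minority neighbors
    x_m = minority[rng.choice(neighbors)]        # random neighbor
    lam = rng.random()                           # lam ~ U(0, 1)
    return x_l + lam * (x_m - x_l)               # point on segment x_l -> x_m

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
x_new = smote_sample(minority, idx=0, k=2, rng=rng)
```

Because the new point lies on the segment between x_l and a near neighbor, it stays inside the minority-class region rather than duplicating an existing sample.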

3. Preprocessing Data

3.1. Description of Data

The clay dataset was obtained from F-CLAY/7/216 and S-CLAY/7/168 [14,44]. F-CLAY/7/216 consists of 216 samples compiled through field vane tests (FVT) in Finland; S-CLAY/7/168 comprises 168 FVT samples from Sweden and Norway, giving 384 samples in total. Each sample contains six attributes, namely, LL, PL, W, EVP, Dep, and VPP, while the sensitivity (St) is the predicted value. The distribution of the six attributes is shown in Figure 2, and the statistical information of the input attributes is listed in Table 1. Python 3.6 and the scikit-learn 0.23.2 library [45] were used to prepare the data and train the models.

3.2. Data Preparation and Performance

3.2.1. Analysis of Clay Dataset

This study adopted the Canadian Foundation Engineering Manual [38] classification of clay sensitivity. The category distribution is shown in Figure 3, where the proportion of low sensitivity is close to that of medium sensitivity. However, the proportion of high sensitivity is much smaller than the other categories and only accounts for 9.54%; hence, it is difficult to classify this minority class [41]. To deal with the imbalanced classes, SMOTE is used to over-sample the high sensitivity category.
A heat map is used to show the correlation coefficients among the attributes [34], calculated using the Pearson coefficient [24]. The heat map of the clay attributes is shown in Figure 4; there is no strong linear relationship between the clay attributes and the sensitivity. However, machine learning methods can effectively handle such nonlinear relationships [26].
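The correlation step behind such a heat map can be sketched as follows; the attribute names match the paper's inputs, but the data below are randomly generated placeholders rather than the clay dataset:

```python
import numpy as np

attributes = ["LL", "PL", "W", "EVP", "Dep", "VPP"]   # the six input attributes
rng = np.random.default_rng(42)
X = rng.normal(size=(50, len(attributes)))   # placeholder data, 50 samples

corr = np.corrcoef(X, rowvar=False)          # 6 x 6 Pearson correlation matrix
# A plotting call such as seaborn.heatmap(corr, xticklabels=attributes,
# yticklabels=attributes) would render this matrix as a Figure 4-style heat map.
```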

3.2.2. Cross-Validation

The processed data used in machine learning are commonly separated into training and validation datasets [46]: the machine learning method is trained on the training dataset, and its accuracy is then evaluated on the validation dataset. However, small datasets may introduce bias, so k-fold cross-validation (CV) is used to divide the data [47,48]. CV not only increases the number of training runs but also improves the robustness of the model. The CV method divides the dataset into k mutually exclusive subsets, that is, D = D_1 ∪ D_2 ∪ … ∪ D_k with D_i ∩ D_j = ∅ (i ≠ j). Next, k − 1 subsets are used as the training set, the remaining subset acts as the validation set, and training and validation are conducted k times. The validation result is then the mean of the k results. Figure 5 shows the 5-fold CV.
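The k-fold split described above can be sketched in a few lines (shown here with plain numpy instead of scikit-learn's KFold; the dataset size of 20 is a toy assumption):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k mutually exclusive folds; each fold
    serves once as the validation set, the rest as the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(n=20, k=5))   # 5-fold CV, as in Figure 5
```

Averaging a model's score over the five validation folds gives the CV estimate used in this study.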

3.3. Performances

The confusion matrix, F1 score, ROC, and AUC are considered as the evaluation indicators.

3.3.1. Confusion Matrix

The confusion matrix consists of the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) [49], as shown in Figure 6. TP and TN are correct classifications, while FP and FN are erroneous ones. A diagonal value closer to 1 indicates higher accuracy.

3.3.2. F1 Score

The F1 score is the harmonic mean of Precision (P) and Recall (R). Precision and Recall are shown in Figure 7. Precision, Recall, and F1 score are defined as follows [50]:
F1 = \frac{2 \times P \times R}{P + R} \qquad (8)
P = \frac{TP}{TP + FP} \qquad (9)
R = \frac{TP}{TP + FN} \qquad (10)
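A worked example of these three definitions, using illustrative confusion-matrix counts rather than the paper's results:

```python
def precision_recall_f1(tp, fp, fn):
    p = tp / (tp + fp)              # Precision
    r = tp / (tp + fn)              # Recall
    f1 = 2 * p * r / (p + r)        # harmonic mean of P and R
    return p, r, f1

p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
# p = 0.8, r = 0.8, f1 = 0.8
```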

3.3.3. AUC and ROC

The ROC graph helps visualize the classification algorithms [51]. The y-axis of the ROC graphs is the true positive rate (TPR), while the x-axis is the false positive rate (FPR) (Figure 8). The TPR and FPR are defined as follows:
TPR = \frac{TP}{TP + FN} \qquad (11)
FPR = \frac{FP}{FP + TN} \qquad (12)
Because the ROC curve is two-dimensional, a single scalar value such as the AUC is convenient for evaluating algorithms [51,52]. The AUC value is in the range of 0–1, and a value closer to 1 indicates a better model fit.
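The ROC points and the AUC can be computed directly from the TPR and FPR definitions by sweeping a score threshold and integrating with the trapezoidal rule. The sketch below uses toy labels and scores and is an illustration, not the scikit-learn routine:

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC points from the TPR/FPR definitions, AUC by the trapezoidal rule."""
    order = np.argsort(-np.asarray(scores))      # sort by descending score
    y = np.asarray(y_true)[order]
    tps = np.cumsum(y)                           # TP count at each threshold
    fps = np.cumsum(1 - y)                       # FP count at each threshold
    tpr = np.concatenate([[0.0], tps / y.sum()])
    fpr = np.concatenate([[0.0], fps / (len(y) - y.sum())])
    # trapezoidal integration of TPR over FPR
    return np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2)

auc = roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1])   # perfectly ranked scores
```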

3.3.4. Evaluation Methods

In this study, the XGBoost method was compared with ANN and NB. To further investigate SMOTE, the data without SMOTE were also used as training data for XGBoost, which is referred to as XGBoost_NoSmote. The dataset is divided into a training–validation set of 260 samples (70%) and a test set of 112 samples (30%). Grid search is a common method for optimizing parameters [45]. To ensure that each model could achieve its own best performance, CV and grid search are used to optimize the hyperparameters on the training–validation set, and the final performance of a given model is evaluated on the test set. This differs from previous studies [13,37], which did not incorporate parameter optimization. The flow chart of the method is shown in Figure 9, and Table 2 lists the optimal parameters of the models.
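The grid-search step can be sketched as an exhaustive loop over parameter combinations; here the model training is replaced by a stub scoring function, and the parameter names and values are hypothetical examples rather than the settings in Table 2:

```python
import itertools

# Hypothetical hyperparameter grid (illustrative, not the paper's grid).
param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}

def cv_score(params):
    # Stub standing in for "train on k-1 folds, return the mean validation
    # F1 over the k folds"; it peaks at max_depth=5, learning_rate=0.1
    # purely so the example has a well-defined optimum.
    return -abs(params["max_depth"] - 5) - abs(params["learning_rate"] - 0.1)

candidates = [dict(zip(param_grid, vals))
              for vals in itertools.product(*param_grid.values())]
best = max(candidates, key=cv_score)
```

In the real pipeline, scikit-learn's grid search with CV performs the same exhaustive evaluation and keeps the best-scoring parameter set.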

4. Results

4.1. Confusion Matrix and F1 Score Results

Figure 10 shows the confusion matrices of the different machine learning methods, where labels 0–2 represent high, low, and medium sensitivity, respectively. The closer a diagonal value is to 1, the better the model fits. The diagonal values of the XGBoost matrix are larger than those of the other models, which indicates that XGBoost outperforms ANN and NB. Furthermore, the values for XGBoost without SMOTE are smaller than those for XGBoost with SMOTE.
Figure 11 presents the F1 score of different machine learning methods. An F1 score closer to 1 indicates that the model has better performance. Figure 11 shows that XGBoost achieved the best F1 score, Recall, and Precision (0.72, 0.72, and 0.73, respectively), followed by NB (0.68, 0.69, and 0.70, respectively), and XGBoost_NoSmote (0.68, 0.66, and 0.71, respectively). ANN had the worst performance in terms of the F1 score, Recall, and Precision (0.61, 0.62, and 0.63, respectively). Using SMOTE, the performance of XGBoost on the F1 score, Recall, and Precision improved by 6.9%, 9.1%, and 2.8%, respectively.

4.2. ROC and AUC Results

Figure 12 shows the ROC and AUC of the different machine learning methods. The XGBoost method achieved the best AUC score for all classes, whereas XGBoost trained without SMOTE achieved the worst. All indices of the models are listed in Table 3. The XGBoost method achieved the highest F1 score, followed by NB, ANN, and XGBoost_NoSmote. For all AUC evaluations, XGBoost performed best: compared with XGBoost_NoSmote, its AUC scores improved by 5.43%, 14.47%, and 10.81% for the three sensitivity classes, respectively. For medium and low sensitivity classification, XGBoost performed best, with ANN and NB slightly inferior, while XGBoost_NoSmote performed poorly.

4.3. Comparison with Previous Studies

The results in Section 4.1 and Section 4.2 indicate that XGBoost performs best on all evaluation indices, consistent with refs. [33,53], which shows that XGBoost is a powerful tool for classifying the properties of engineering materials. This study also produces some valid new results that differ from other studies. First, the results in this study are category values and are consistent with the clay sensitivity standards [37], as shown in Figure 1. Second, the results of XGBoost are better than those of XGBoost_NoSmote, which indicates that SMOTE is good at solving imbalanced problems; this is the first time that SMOTE has been adopted to address imbalanced clay sensitivity problems. Moreover, the results of Godoy et al. [13] are a binary classification, i.e., the clays are divided into quick and highly sensitive groups, whereas in this study multiple classes of clay sensitivity are predicted using XGBoost, with more satisfactory results.

5. Discussion

Clay sensitivity is not only important to the safety of geotechnical engineering but also influences the feasibility of such projects. However, it is difficult to classify clay sensitivity, which is greatly influenced by disturbance. There are more than two sensitivity categories under the different classification charts, and the distribution of clay sensitivity categories is often imbalanced. Therefore, new methods are needed: XGBoost has rarely been used for multiclassification problems such as predicting the sensitivity categories of soft clays, and the imbalance problem also needs to be dealt with.
In this study, NB, ANN, and XGBoost were used to predict the multiple classes of clay sensitivity. SMOTE was applied to address the imbalanced classes of data. Additionally, a set of performance indices were developed to evaluate accuracy.
Based on the classification results, the evaluation indices of XGBoost incorporating SMOTE were better than those of the other machine learning methods. Compared with XGBoost_NoSmote, the mean AUC of classification for XGBoost improved by 9.9%, which indicates that SMOTE improves model performance on imbalanced datasets. The best classification performance was achieved for high sensitivity, followed by low sensitivity and medium sensitivity. This study not only investigated the multiclassification of clay sensitivity but also employed SMOTE to handle the imbalance issue. The results show that the combination of XGBoost and SMOTE is a simple and quick way to classify imbalanced clay datasets. Furthermore, more informative indices for evaluating model performance, such as the AUC and F1 score, were applied to assess the models.
However, this study has a few limitations. For the AUC of medium sensitivity (Table 3), all models performed slightly worse compared with other categories. The possible reason is that the medium sensitivity is between the low sensitivity and high sensitivity categories, which results in a particularly unclear boundary and affects the model performance. Other studies have also proven that the overlap between different classes can influence classification performance [54,55]. Therefore, new methods should be developed to solve these issues.

6. Conclusions

In this study, NB, ANN, and XGBoost were employed to predict the multiple classes of clay sensitivity, and their performance was evaluated. SMOTE was applied for the first time to address the imbalanced classes of clay sensitivity. XGBoost achieved the best performance according to all performance indices, and its results were better than those of XGBoost_NoSmote, which means that SMOTE can improve model performance. Therefore, the proposed XGBoost is an effective and low-cost method for the multiclassification prediction of clay sensitivity, and SMOTE is useful for addressing the imbalanced classes of clay datasets. However, all models could perform better on the AUC of medium sensitivity. It is recommended that SMOTE be improved according to the distribution of clay sensitivity, and that XGBoost be extended to predict more than three clay sensitivity categories, which would make the classification results more refined.

Author Contributions

Conceptualization, T.M. and L.W.; methodology, T.M.; software, T.M. and S.Z.; validation, T.M.; formal analysis, T.M.; writing—original draft preparation, T.M. and L.W.; writing—review and editing, L.W. and H.Z.; visualization, T.M. and S.Z.; supervision, L.W. and H.Z.; funding acquisition, L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 41790432 and 51908093), and the National Key Research and Development Program of China (Grant No. 2021YFB2600026).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

References

  1. Likitlersuang, S.; Surarak, C.; Wanatowski, D.; Oh, E.; Balasubramaniam, A. Finite element analysis of a deep excavation: A case study from the Bangkok MRT. Soils Found. 2013, 53, 756–773. [Google Scholar] [CrossRef] [Green Version]
  2. Arasan, S.; Akbulut, R.K.; Isik, F.; Bagherinia, M.; Zaimoglu, A.S. Behavior of polymer columns in soft clayey soil: A preliminary study. Geomech. Eng. 2016, 10, 95–107. [Google Scholar] [CrossRef]
  3. Hu, J.; Ma, F. Failure Investigation at a Collapsed Deep Open Cut Slope Excavation in Soft Clay. Geotech. Geol. Eng. 2018, 35, 665–683. [Google Scholar] [CrossRef]
  4. Jlassi, K.; Krupa, I.; Chehimi, M.M. Overview: Clay Preparation, Properties, Modification. Clay-Polym. Nanocomposites 2017, 1–28. [Google Scholar] [CrossRef]
  5. Paiva, L.B.; Morales, A.R.; Valenzuela Díaz, F.R. Organoclays: Properties, preparation and applications. Appl. Clay Sci. 2008, 42, 8–24. [Google Scholar] [CrossRef]
  6. Zhou, C.H.; Zhao, L.Z.; Wang, A.Q.; Chen, T.H.; He, H.P. Current fundamental and applied research into clay minerals in China. Appl. Clay Sci. 2016, 119, 3–7. [Google Scholar] [CrossRef]
  7. Zahid, I.; Ayoub, M.; Abdullah, B.B.; Nazir, M.H.; Zulqarnain; Kaimkhani, M.A.; Sher, F. Activation of nano kaolin clay for bio-glycerol conversion to a valuable fuel additive. Sustainability 2021, 13, 2631. [Google Scholar] [CrossRef]
  8. Doğan-Sağlamtimur, N.; Bilgil, A.; Szechyńska-Hebda, M.; Parzych, S.; Hebda, M. Eco-friendly fired brick produced from industrial ash and natural clay: A study of waste reuse. Materials 2021, 14, 877. [Google Scholar] [CrossRef] [PubMed]
  9. Otunola, B.O.; Ololade, O.O. A review on the application of clay minerals as heavy metal adsorbents for remediation purposes. Environ. Technol. Innov. 2020, 18, 100692. [Google Scholar] [CrossRef]
  10. Abdallah, Y.K.; Estévez, A.T. 3d-printed biodigital clay bricks. Biomimetics 2021, 6, 59. [Google Scholar] [CrossRef]
  11. Skempton, A.W.; Northey, R.D. The sensitivity of clays. Geotechnique 1952, 3, 30–53. [Google Scholar] [CrossRef]
  12. Terzaghi, K.; Peck, R.B.; Mesri, G. Soil Mechanics in Engineering Practice, 3rd ed.; John Wiley & Sons: New York, NY, USA, 2016; pp. 20–25. [Google Scholar]
  13. Godoy, C.; Depina, I.; Thakur, V. Application of machine learning to the identification of quick and highly sensitive clays from cone penetration tests. J. Zhejiang Univ. Sci. A 2020, 21, 445–461. [Google Scholar] [CrossRef]
  14. D’Ignazio, M.; Phoon, K.K.; Tan, S.A.; Länsivaara, T.T. Correlations for undrained shear strength of Finnish soft clays. Can. Geotech. J. 2016, 53, 1628–1645. [Google Scholar] [CrossRef] [Green Version]
  15. Eslami, A.; Fellenius, B.H. Pile capacity by direct CPT and CPTu methods applied to 102 case histories. Can. Geotech. J. 1997, 34, 886–904. [Google Scholar] [CrossRef]
  16. Gao, Y.B.; Ge, X.N. On the sensitivity of soft clay obtained by the field vane test. Geotech. Test. J. 2016, 39, 282–290. [Google Scholar] [CrossRef]
  17. Meijer, G.; Dijkstra, J. A novel methodology to regain sensitivity of quick clay in a geotechnical centrifuge. Can. Geotech. J. 2013, 50, 995–1000. [Google Scholar] [CrossRef]
  18. Schneider, J.A.; Randolph, M.F.; Mayne, P.W.; Ramsey, N.R. Analysis of Factors Influencing Soil Classification Using Normalized Piezocone Tip Resistance and Pore Pressure Parameters. J. Geotech. Geoenviron. Eng. 2008, 134, 1569–1586. [Google Scholar] [CrossRef]
  19. Yafrate, N.; DeJong, J.; DeGroot, D.; Randolph, M. Evaluation of Remolded Shear Strength and Sensitivity of Soft Clay Using Full-Flow Penetrometers. J. Geotech. Geoenviron. Eng. 2009, 135, 1179–1189. [Google Scholar] [CrossRef]
  20. Abbaszadeh Shahri, A.; Malehmir, A.; Juhlin, C. Soil classification analysis based on piezocone penetration test data-A case study from a quick-clay landslide site in southwestern Sweden. Eng. Geol. 2015, 189, 32–47. [Google Scholar] [CrossRef]
  21. Moreno-Maroto, J.M.; Alonso-Azcárate, J.; O’Kelly, B.C. Review and critical examination of fine-grained soil classification systems based on plasticity. Appl. Clay Sci. 2021, 200, 105955. [Google Scholar] [CrossRef]
  22. Robertson, P.K. Cone penetration test (CPT)-based soil behaviour type (SBT) classification system—An update. Can. Geotech. J. 2016, 53, 1910–1927. [Google Scholar] [CrossRef]
  23. Gylland, A.S.; Sandven, R.; Montafia, A.; Pfaffhuber, A.A.; Kåsin, K.; Long, M. Cptu classification diagrams for identification of sensitive clays. In Advances in Natural and Technological Hazards Research; Springer: Cham, Switzerland, 2017. [Google Scholar]
  24. Zhang, W.; Wu, C.; Li, Y.; Wang, L.; Samui, P. Assessment of pile drivability using random forest regression and multivariate adaptive regression splines. Georisk 2021, 15, 27–40. [Google Scholar] [CrossRef]
  25. Dickson, M.E.; Perry, G.L.W. Identifying the controls on coastal cliff landslides using machine-learning approaches. Environ. Model. Softw. 2016, 76, 117–127. [Google Scholar] [CrossRef]
  26. Pourghasemi, H.R.; Rahmati, O. Prediction of the landslide susceptibility: Which algorithm, which precision? Catena 2018, 162, 177–192. [Google Scholar] [CrossRef]
  27. Li, S.H.; Wu, L.Z.; Luo, X.H. A novel method for locating the critical slip surface of a soil slope. Eng. Appl. Artif. Intell. 2020, 94, 103733. [Google Scholar] [CrossRef]
  28. Zhu, S.; Wu, L.; Huang, J. Application of an improved P(m)-SOR iteration method for flow in partially saturated soils. Comput. Geosci. 2021, 1–15. [Google Scholar] [CrossRef]
  29. Pham, B.T.; Son, L.H.; Hoang, T.A.; Nguyen, D.M.; Tien Bui, D. Prediction of shear strength of soft soil using machine learning methods. Catena 2018, 166, 181–191. [Google Scholar] [CrossRef]
  30. Mishra, P.; Samui, P.; Mahmoudi, E. Probabilistic design of retaining wall using machine learning methods. Appl. Sci. 2021, 11, 5411. [Google Scholar] [CrossRef]
  31. Huang, Z.; Zhang, D.; Zhang, D. Application of ANN in Predicting the Cantilever Wall Deflection in Undrained Clay. Appl. Sci. 2021, 11, 9760. [Google Scholar] [CrossRef]
  32. Li, S.H.; Luo, X.H.; Wu, L.Z. A new method for calculating failure probability of landslide based on ANN and a convex set model. Landslides 2021, 18, 2855–2867. [Google Scholar] [CrossRef]
  33. Wu, L.Z.; Li, S.H.; Huang, R.Q.; Xu, Q. A new grey prediction model and its application to predicting landslide displacement. Appl. Soft Comput. J. 2020, 95, 106543. [Google Scholar] [CrossRef]
  34. Zhang, W.; Wu, C.; Zhong, H.; Li, Y.; Wang, L. Prediction of undrained shear strength using extreme gradient boosting and random forest based on Bayesian optimization. Geosci. Front. 2021, 12, 469–477. [Google Scholar] [CrossRef]
  35. Chen, R.P.; Li, Z.C.; Chen, Y.M.; Ou, C.Y.; Hu, Q.; Rao, M. Failure Investigation at a Collapsed Deep Excavation in Very Sensitive Organic Soft Clay. J. Perform. Constr. Facil. 2015, 29, 04014078. [Google Scholar] [CrossRef]
  36. Gylland, A.; Long, M.; Emdal, A.; Sandven, R. Characterisation and engineering properties of Tiller clay. Eng. Geol. 2013, 164, 86–100. [Google Scholar] [CrossRef] [Green Version]
  37. Abbaszadeh Shahri, A. An Optimized Artificial Neural Network Structure to Predict Clay Sensitivity in a High Landslide Prone Area Using Piezocone Penetration Test (CPTu) Data: A Case Study in Southwest of Sweden. Geotech. Geol. Eng. 2016, 34, 86–100. [Google Scholar] [CrossRef]
  38. Canadian Foundation Engineering Manual, 4th ed.; Canadian Geotechnical Society: Prince George, BC, Canada, 2006; pp. 17–20.
  39. Das, B.; Sobhan, K. Principles of Geotechnical Engineering, 8th ed.; CENGAGE Learning: Stanford, CA, USA, 2014; pp. 450–466. [Google Scholar]
  40. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
  41. Liu, Q.; Wang, X.; Huang, X.; Yin, X. Prediction model of rock mass class using classification and regression tree integrated AdaBoost algorithm based on TBM driving data. Tunn. Undergr. Sp. Technol. 2020, 106, 103595. [Google Scholar] [CrossRef]
  42. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016. [Google Scholar] [CrossRef] [Green Version]
  43. Japkowicz, N. The Class Imbalance Problem: Significance and Strategies. In Proceedings of the 2000 International Conference on Artificial Intelligence, Acapulco, Mexico, 11–14 April 2000. [Google Scholar]
  44. Ching, J.; Phoon, K.K. Transformations and correlations among some clay parameters-The global database. Can. Geotech. J. 2014, 51, 663–685. [Google Scholar] [CrossRef]
  45. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Du-bourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  46. Alkroosh, I.; Nikraz, H. Predicting axial capacity of driven piles in cohesive soils using intelligent computing. Eng. Appl. Artif. Intell. 2012, 25, 618–627. [Google Scholar] [CrossRef]
  47. Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Int. Jt. Conf. Artif. Intell. 1995, 14, 1137–1145. [Google Scholar]
  48. Wong, T.T. Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognit. 2015, 48, 2839–2846. [Google Scholar] [CrossRef]
  49. Trajdos, P.; Kurzynski, M. Weighting scheme for a pairwise multi-label classifier based on the fuzzy confusion matrix. Pattern Recognit. Lett. 2018, 103, 60–67. [Google Scholar] [CrossRef] [Green Version]
  50. Xue, Y.; Bai, C.; Qiu, D.; Kong, F.; Li, Z. Predicting rockburst with database using particle swarm optimization and extreme learning machine. Tunn. Undergr. Sp. Technol. 2020, 98, 103287. [Google Scholar] [CrossRef]
  51. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  52. Huang, J.; Ling, C.X. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310. [Google Scholar] [CrossRef] [Green Version]
  53. Pei, L.; Sun, Z.; Yu, T.; Li, W.; Hao, X.; Hu, Y.; Yang, C. Pavement aggregate shape classification based on extreme gradient boosting. Constr. Build. Mater. 2020, 256, 119356. [Google Scholar] [CrossRef]
  54. García, V.; Mollineda, R.A.; Sánchez, J.S. On the k-NN performance in a challenging scenario of imbalance and overlapping. Pattern Anal. Appl. 2008, 11, 269–280. [Google Scholar] [CrossRef]
  55. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical re-sults and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
Figure 1. Different methods for clay sensitivity classification.
Figure 2. Distribution of input parameters.
Figure 3. Distribution of sensitivity categories.
Figure 4. Heat map of clay parameters.
Figure 5. Schematic diagram of 5-fold CV.
Figure 6. Confusion matrix.
Figure 7. Precision and recall (adapted from [50]).
Figure 8. AUC and ROC diagram.
Figure 9. Flow chart of proposed method.
Figure 10. Confusion matrix of different machine learning methods: (a) ANN; (b) NB; (c) XGBoost; (d) XGBoost_NoSmote.
Figure 11. F1 score of different machine learning methods.
Figure 12. ROC and AUC of different machine learning methods: (a) ANN; (b) NB; (c) XGBoost; (d) XGBoost_NoSmote.
Table 1. Statistical information of clay parameters.

        Depth (m)  LL (%)   PL (%)  W (%)    EVS (kPa)  VPP (kPa)  St
mean    6.97       68.37    28.49   76.47    48.72      79.82      16.27
std     3.95       23.86    7.97    23.32    27.33      48.54      13.26
min     0.50       22.00    2.73    17.27    6.86       15.20      2.00
50%     6.00       68.72    27.02   75.00    43.08      64.88      11.00
max     24.00      201.81   73.92   180.11   212.87     315.64     64.00
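The summary rows of Table 1 (mean, std, min, median, max) are the standard output of a pandas describe() call. The snippet below is only a sketch: the few rows of data are synthetic stand-ins, since the article's dataset is not reproduced here.

```python
import pandas as pd

# Hypothetical sample rows standing in for the real clay records;
# only three of the seven columns are shown for brevity.
df = pd.DataFrame({
    "Depth (m)": [0.50, 6.00, 24.00, 6.97, 3.20],
    "LL (%)":    [22.00, 68.72, 201.81, 68.37, 55.00],
    "St":        [2.00, 11.00, 64.00, 16.27, 8.00],
})

# describe() produces the same statistics reported in Table 1.
summary = df.describe().loc[["mean", "std", "min", "50%", "max"]]
print(summary)
```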
Table 2. Optimal parameters of models.

Model             Parameter            Value
XGBoost           n_estimators         360
                  learning_rate        0.002
                  max_depth            6
                  min_child_weight     1
                  gamma                0.2
                  colsample_bytree     0.5
                  subsample            0.8
ANN               learning_rate_init   0.0001
                  activation           tanh
                  hidden_layer_sizes   (100, 100, 100)
                  max_iter             260
NB                priors               3
                  var_smoothing        10^−9
XGBoost_NoSmote   n_estimators         360
                  learning_rate        0.005
                  max_depth            5
                  min_child_weight     1
                  gamma                0.3
                  colsample_bytree     0.7
                  subsample            0.8
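As a sketch of how the tuned settings in Table 2 map onto code: the ANN and NB rows correspond directly to scikit-learn [45] estimator arguments, while the XGBoost rows are keyword arguments for xgboost.XGBClassifier (the xgboost package is assumed and only shown here as a parameter dictionary).

```python
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB

# XGBoost rows of Table 2; would be passed as
# xgboost.XGBClassifier(**xgb_params) if the xgboost package is available.
xgb_params = dict(n_estimators=360, learning_rate=0.002, max_depth=6,
                  min_child_weight=1, gamma=0.2,
                  colsample_bytree=0.5, subsample=0.8)

# ANN row of Table 2: a three-hidden-layer perceptron with tanh activation.
ann = MLPClassifier(learning_rate_init=0.0001, activation="tanh",
                    hidden_layer_sizes=(100, 100, 100), max_iter=260)

# NB row of Table 2: Gaussian Naive Bayes with variance smoothing 1e-9.
nb = GaussianNB(var_smoothing=1e-9)
```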
Table 3. Evaluation measures of different models.

Evaluation Measure            XGBoost   ANN    NB     XGBoost_NoSmote
Precision                     0.73      0.63   0.70   0.71
Recall                        0.72      0.62   0.69   0.66
F1 score                      0.72      0.61   0.68   0.68
AUC of high sensitivity       0.97      0.95   0.93   0.92
AUC of medium sensitivity     0.82      0.72   0.72   0.74
AUC of low sensitivity        0.87      0.78   0.84   0.76
Mean AUC of classification    0.89      0.82   0.83   0.81
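The measures in Table 3 (macro-averaged precision, recall, F1, and per-class one-vs-rest AUC) can be computed with scikit-learn's metrics module. The labels and class probabilities below are toy stand-ins, not the article's model outputs.

```python
import numpy as np
from sklearn.metrics import (f1_score, precision_score,
                             recall_score, roc_auc_score)

# Toy 3-class example (0 = low, 1 = medium, 2 = high sensitivity);
# illustrative values only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 2, 0, 1, 0, 2])
y_prob = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.2, 0.7, 0.1],
                   [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8],
                   [0.5, 0.2, 0.3], [0.2, 0.6, 0.2], [0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])

precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
# One-vs-rest ROC AUC per class, macro-averaged, mirroring
# the "Mean AUC of classification" row of Table 3.
mean_auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
```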
Ma, T.; Wu, L.; Zhu, S.; Zhu, H. Multiclassification Prediction of Clay Sensitivity Using Extreme Gradient Boosting Based on Imbalanced Dataset. Appl. Sci. 2022, 12, 1143. https://doi.org/10.3390/app12031143