Article

Prediction and Factor Analysis of Liquefaction Ground Subsidence Based on Machine-Learning Techniques

Graduate School of Engineering, Chiba University, 1-33 Yayoi-cho, Inage-ku, Chiba 263-8522, Japan
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(7), 2713; https://doi.org/10.3390/app14072713
Submission received: 28 February 2024 / Revised: 14 March 2024 / Accepted: 15 March 2024 / Published: 23 March 2024

Abstract

Liquefaction is a significant challenge in the fields of earthquake risk assessment and soil dynamics, as it has the potential to cause extensive damage to buildings and infrastructure through ground failure. During the 2011 Great East Japan Earthquake, Urayasu City in Chiba Prefecture experienced severe soil liquefaction, and evacuation was impeded by the effects of the liquefaction on roads. Therefore, developing quantitative predictions of ground subsidence caused by liquefaction and understanding its contributing factors are imperative in preparing for potential future mega-earthquakes. This research is novel because previous research primarily focused on developing predictive models for determining the presence or absence of liquefaction, and few studies have quantified the magnitude of ground deformation once liquefaction has occurred. This research study extracts features from existing datasets and builds a predictive model, supplemented by factor analysis. Using the Cabinet Office of Japan’s Nankai Trough Megathrust Earthquake model, liquefaction-induced ground subsidence was designated as the dependent variable. A gradient-boosted decision-tree (GBDT) prediction model was then developed. Additionally, the Shapley additive explanations (SHAP) method was employed to analyze the contribution of each feature to the prediction results. The study found that the XGBoost model outperformed the LightGBM model in terms of predictive accuracy, with the predicted values closely aligned with the actual measurements, thereby proving its effectiveness in predicting ground subsidence due to liquefaction. Furthermore, it was demonstrated that liquefaction assessments, which were previously challenging to interpret, can now be explained using SHAP factors. This enables accountable wide-area prediction of liquefaction-induced ground subsidence.

1. Introduction

Liquefaction, a phenomenon in which the bearing capacity of soil is suddenly reduced, causes catastrophic damage to buildings and infrastructure, and accounts for a large proportion of earthquake damage [1]. Diverse research on liquefaction is underway in various fields, including civil engineering, urban disaster prevention, and disaster risk management. Seed and Idriss [2] identified the importance of soil compaction, soil particle size, and groundwater level as conditions for liquefaction and proposed a simplified liquefaction assessment method. Research based on seismic data has been conducted in Japan since 2000. For example, the 2004 Niigata Chuetsu earthquake [3], the 2011 Tohoku earthquake [4], the 2016 Kumamoto earthquake [5], and the 2018 Hokkaido Eastern Iburi earthquake [6] all caused extensive liquefaction damage. Many studies have been conducted based on these cases, contributing to improving the accuracy of liquefaction prediction models and developing effective countermeasures.
One type of damage caused by liquefaction is ground subsidence. The 2011 earthquake off the Pacific coast of Tohoku caused large-scale liquefaction in many areas in Japan, resulting in road subsidence and road-surface uplift caused by sand boils and lateral spreading [7,8,9], which hindered the passage of emergency vehicles and reduced evacuation speed and effectiveness. In Urayasu City, Chiba Prefecture, where the damage caused by liquefaction was the most severe, unequal road subsidence and sand boils delayed restoration work, forcing overall ground improvement and reinforcement [10]. In the coastal areas of Tokyo Bay, extensive liquefaction damage occurred in many houses and facilities, with serious social consequences [11,12]. Liquefaction was also observed during the Canterbury earthquake in New Zealand in February 2011, when subsidence and lateral flow damaged 60,000 homes and caused extensive damage to key parts of the urban infrastructure [13]. New Zealand is characterized by both seismic activity and loose, young ground deposits over a wide area, which makes it prone to concentrated damage from liquefaction [14]. Fundamentally, liquefaction is a phenomenon caused by the breakdown of the physical soil particle structure based on the principle of effective stress [15,16] and volumetric shrinkage caused by negative dilatancy, which causes soil particles and pore water to leak out of the ground surface, resulting in muddy-soil-like behavior. Gabel et al. [17] noted that liquefaction of an evacuation route, or lateral ground flow resulting in horizontal displacement of the soil, can render evacuation vehicles and even pedestrian evacuees unable to traverse the affected areas.
In anticipation of major Nankai Trough earthquakes and the resultant tsunamis expected to impact Kochi Prefecture in Japan before anywhere else [18], an experiment was conducted to model the liquefaction conditions that occur immediately after an earthquake, to verify the effects of liquefaction on the ground surface, and to determine steps for evacuation behavior in the aftermath. The experiments demonstrated that walking speed was reduced on liquefied ground compared with asphalt under normal conditions. Additionally, wheelchair evacuation was virtually impossible because the wheels became stuck in muddy water, rendering them immobile. These results show that the damage caused by liquefaction is not limited to buildings; physical and mental fatigue can easily accumulate during evacuations on liquefied terrain.
The Ministry of Land, Infrastructure, Transport, and Tourism [19] published a portal site that compiles liquefaction risk maps for each municipality in Japan in preparation for such a situation. These maps categorize liquefaction susceptibility into distinct classes and use color-coding to depict liquefaction risk across the target area. This enables intuitive examination of the liquefaction risk from any vantage point, with excellent legibility and visibility. However, using the liquefaction risk as a categorical variable has the disadvantage that it cannot be treated as a quantitative indicator. From a practical perspective, this is considered inadequate for quantitative liquefaction analysis. In particular, when designing appropriate evacuation routes in the event of a major earthquake, specific values are required to estimate the points where evacuation routes are likely to be disrupted, as well as the delay in evacuation time. Usually, the amount of subsidence is determined by the standard penetration test N-values (SPT-N values), which indicate the ground strength, or soil physical parameters based on the factor of safety against liquefaction (FL) method [20]. Although this method can achieve high accuracy over a limited area, it relies on point data and is thus not suitable for wide-area estimation. This is because of the enormous amount of experimental data that is required to determine liquefaction over an extensive area, such as a city, town, or village. Studies have been conducted to determine liquefaction over extensive areas using geographic information systems (GIS); the method proposed by Matsuoka et al. [21] for estimating liquefaction risk can predict liquefaction over extensive areas based on geomorphological classifications. Additionally, Zhu et al. [22] developed an assessment model to predict the probability of liquefaction on a global scale using a global geospatial liquefaction model (GGLM).
It is important to predict the areas in which liquefaction will occur in preparation for future large earthquakes, and it is desirable to be able to calculate liquefaction in a simple and highly accurate manner over an extensive area. Therefore, there is an increasing need for liquefaction risk assessment and countermeasures. The dominant factors that govern liquefaction include earthquake seismic intensity, duration of shaking, ground type, and groundwater level [23,24]. However, it is difficult to express these as regression equations and adequately capture the interactions. In this context, advances in machine-learning technology have created new possibilities. Machine learning is a technique used to learn complex patterns from large amounts of data and to construct predictive models. Liquefaction has been assessed using machine learning, with studies evaluating liquefaction based on the physical parameters of the ground [25]. Recently, a novel semi-active control algorithm that combines a fuzzy control strategy and long short-term memory (LSTM) was proposed with the aim of mitigating structural responses under earthquake excitations [26]. Galupino and Dungca [27] applied machine-learning algorithms to estimate liquefaction susceptibility in Metro Manila, Philippines. Kuwabara and Matsuoka [28] used random forest (RF), a machine-learning algorithm, as an estimation model, and input earthquake scenario ground motion data to develop a Japanese liquefaction risk map. Jas and Dodagoudar [29] successfully applied explainable AI (XAI) to perform factor analysis based on seismic parameters, ground parameters, and site conditions. This suggests that factor analysis of liquefaction scenarios can elucidate the interactions and nonlinear relationships between multiple factors that have not been captured by previous statistical methods.
However, these previous studies only developed predictive models for the presence or absence of liquefaction, and there are few examples of studies that have revealed the quantitative magnitude of liquefaction and the factors that influence the amount of ground subsidence. Quantitative predictive data are more important than qualitative data for estimating the degree of damage after liquefaction has occurred.
In this study, a model for quantitatively predicting ground subsidence due to liquefaction was developed by integrating existing seismic data, topographical data, and geotechnical information, with the aim of improving resilience against liquefaction damage. Similarly, factor analysis of the explanatory variables was performed to identify the factors that cause ground subsidence. In this paper, we first provide an overview of the algorithms utilized for machine learning models and factor analysis. Subsequently, we outline the steps involved in constructing the predictive model, including the description of the dataset used, and finally present the results of model accuracy evaluation and factor analysis, followed by a discussion.

2. Machine-Learning Algorithm and Factor Analysis

2.1. Overview of Gradient-Boosting Decision Tree

The gradient-boosting decision-tree algorithm [30] combines weak learners sequentially to build a single predictive model. During the training process, weak learners are connected to minimize the overall loss of the model. This is achieved by repeatedly adjusting the weights at each stage to reduce the error between the predicted and actual values, thereby explicitly fitting the relationship between the explanatory and target variables. Through this process, the model performs feature extraction of explanatory variables and learns data-trend patterns, enabling it to make predictions and provide explanations when applied to new datasets. Notably, it can capture interaction effects among explanatory variables, allowing relationships involving multiple variables to be considered. This enables effective feature extraction, because combining multiple features allows higher-order interactions to be represented.
Gradient-boosting decision-tree algorithms are often used to solve regression problems due to their dual advantages of low computational cost during training and high prediction accuracy. These advantages make them extremely effective when handling large amounts of data, and thus suitable for building robust regression-prediction models.
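The boosting loop described above can be illustrated with a minimal, dependency-free sketch: each round fits a depth-1 regression stump to the residuals of the current ensemble and adds a damped copy of it. The data and hyperparameters below are invented for illustration and are not those used in the study.

```python
# Minimal gradient-boosting sketch for squared-error regression.
# Each round fits a one-split "stump" to the current residuals
# (the negative gradient of the squared-error loss).

def fit_stump(xs, residuals):
    """Find the single-feature split that best reduces squared error."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    _, threshold, lm, rm = best
    return lambda x: lm if x <= threshold else rm

def fit_gbdt(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)              # initial prediction: the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # illustrative single feature
ys = [0.1, 0.0, 0.2, 1.9, 2.1, 2.0]       # step-like target
model = fit_gbdt(xs, ys)
mae = sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)
```

Production libraries such as XGBoost and LightGBM follow this same additive scheme but grow deeper trees, regularize the splits, and parallelize the search.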

2.2. Factor Analysis through Exploratory Data Analysis

2.2.1. Overview of Exploratory Data Analysis

Exploratory data analysis (EDA) [31] is a method used to understand relationships in data. This statistical method is widely used in various fields, including medicine, finance, social media, and crime prevention, to reduce the complexity of data and reveal underlying patterns [32,33,34,35]. EDA is an essential tool for understanding the structure of underlying data and enhancing model interpretability. Recently, the emergence of XAI in machine-learning models has enabled the automatic identification of correlations between target variables and features in existing datasets [36,37]. This is particularly anticipated to address the “black-box problem” often faced by machine-learning models in clarifying the causal relationships between explanatory and target variables. The “black-box problem” refers to a situation in which the internal structure of a machine-learning model is too complex to explain the model’s predicted results. Consequently, it is difficult for users to understand the specific basis of the prediction results, and visualization techniques are required to increase the interpretability of machine-learning models.
This opacity is especially pronounced in advanced models, particularly those based on deep learning, owing to their numerous internal parameters and complex structures. However, factor analysis makes it possible to extract the main factors from a group of variables with many correlations and to interpret how these factors contribute to the overall output. This helps to clarify the internal workings of the model, revealing how individual explanatory variables affect the target variable. Consequently, this aids in understanding the causal mechanisms behind the model rather than merely capturing correlations.

2.2.2. Description of SHapley Additive exPlanations

SHapley Additive exPlanations (SHAP) [38] is a visualization method used for interpreting and understanding the outputs of predictive models. It is based on the concept of Shapley values from cooperative game theory [39]. Originally formulated to fairly distribute payouts among players in a coalition, Shapley values are used in machine learning to determine the contribution of each feature to the predictions made by a model. The SHAP method is versatile as it is independent of the type of machine-learning model, and it also guarantees consistency when using SHAP values to verify feature importance. This method enables interpretability in practical applications of machine-learning models, making explicit the reasoning behind why a model produces certain predictions. To date, the visualization of variables using the SHAP method has been applied to address the black-box problem in machine-learning models, and most of these efforts have achieved excellent results [40,41,42,43,44]. The SHAP method fundamentally decomposes the predictions to indicate the influence of each explanatory variable. Specifically, it breaks down the difference between the prediction for a given instance and the average prediction into the contributions of each feature. Whether a feature increases or decreases the model output, and to what extent, becomes clear by converting the values of the explanatory variables into SHAP values. The strength of the SHAP method is that the attribution framework does not change across different models [38], making it applicable to any machine-learning model. Therefore, the SHAP method was selected in this study to analyze the influence of the various variables on a common scale.

2.2.3. Components of EDA

When conducting EDA, it is necessary to employ analytical algorithms to extract the features included in the dataset and tools to visualize these features. In this study, we implemented EDA using the gradient-boosting decision-tree (GBDT) approach with SHAP. This combination of techniques enabled deep insight into the data, particularly in understanding the contribution and importance of various features in the trained model. The integration of GBDT and SHAP in EDA allows for the visualization of interactions between explanatory variables and interpretation of how much influence they have on model predictions.

3. Development of Prediction Model of Ground Subsidence

3.1. Description of Database

3.1.1. Data on Ground Subsidence Caused by Liquefaction

In this study, a prediction model was constructed based on the amount of liquefied ground subsidence assumed by the Nankai Trough Megathrust Earthquake Model Study Group [45]. The dataset was compiled on a MESH5 mesh unit (in which the width/height of each cell is approximately 250 m) released by the Ministry of Land, Infrastructure, Transport, and Tourism in Japan. The training set comprised 1,326,588 meshes. The Nankai Trough Megathrust Earthquake Model Study Group [45] assumed different scenarios with different hypocenter locations, including the Tokai, Tonankai, and Nankai Earthquakes, and calculated the amount of ground subsidence caused by liquefaction for each earthquake scenario. Figure 1 shows maps of the calculated liquefaction ground subsidence in the Nankai Trough Megathrust Earthquake. These results were used as the objective variables in this study.

3.1.2. Explanatory Variables

For the ground motion index, the Japan Meteorological Agency (JMA) seismic intensity [46] at the engineering bedrock (Ia_min_dI) was selected; this can be calculated from the JMA seismic intensity at the surface (Ia) and the amplification of the JMA seismic intensity between the engineering bedrock and ground surface (dI). Furthermore, the maximum elevation (Max_Elev), minimum elevation (Min_Elev), relative relief (Relief), maximum slope angle (Max_Slop_Deg), and minimum slope angle (Min_Slop_Deg) were calculated from the Digital National Land Information [47], and the minimum distances from the center point of the MESH5 to the coastline (Coast_Dist) and the river (River_Dist) were calculated to consider the degree of ground saturation. The average S-wave velocity (AVS30) of the ground to a depth of 30 m and the geomorphologic classification (JCODE) were derived from the dataset developed by Wakamatsu and Matsuoka [48]. The height above nearest drainage (HAND) was employed as a parameter to indicate the height from the channel [49,50].
To prevent feature competition due to multicollinearity, a correlation analysis was performed, and explanatory variables with high correlations were excluded from the training set. The correlation between each pair of variables was estimated by analyzing Pearson correlation coefficients (PCCs) [51]. If the absolute value of the PCC between two features exceeds 0.95, retaining both may negatively affect the accuracy of the ML model [51,52,53]. Figure 2 shows the PCCs for each explanatory variable.
Figure 2 shows that there were significant positive correlations between the highest elevation (Max_Elev) and the lowest elevation (Min_Elev), as well as between the minimum slope angle (Min_Slop_Deg) and the relative relief (Relief). Based on the findings, the lowest elevation (Min_Elev) and relative relief (Relief) were removed from the training set after trial and error.
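The multicollinearity screen described above can be sketched as follows. The feature names echo those in the text, but the values are invented for illustration; the 0.95 threshold follows the text.

```python
# Sketch of the PCC-based multicollinearity screen: compute pairwise
# Pearson correlation coefficients and flag one feature of any pair
# whose |PCC| exceeds 0.95.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def drop_correlated(features, threshold=0.95):
    """Return feature names to drop: the second of each highly correlated pair."""
    names = list(features)
    dropped = set()
    for i, ni in enumerate(names):
        for nj in names[i + 1:]:
            if ni in dropped or nj in dropped:
                continue
            if abs(pearson(features[ni], features[nj])) > threshold:
                dropped.add(nj)
    return dropped

features = {
    "Max_Elev": [12.0, 15.0, 20.0, 30.0, 44.0],
    "Min_Elev": [10.0, 13.5, 18.0, 28.5, 41.0],   # nearly duplicates Max_Elev
    "Coast_Dist": [500.0, 120.0, 900.0, 60.0, 300.0],
}
to_drop = drop_correlated(features)   # flags "Min_Elev" on this toy data
```

In practice, which member of a correlated pair to drop is a modeling decision; as the text notes, the study settled on removing Min_Elev and Relief after trial and error.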
In addition to the explanatory variables mentioned above, the SPT-N value, which is an indicator of ground strength, was downloaded from the Kunijiban National Ground Information Search site [54]. The obtained data are represented by the orange line in Figure 3. Specifically, SPT-N values were added as explanatory variables for every 1 m of depth from the ground surface to a depth of 10 m, as shown by the blue arrow line in Figure 3. However, as shown in Figure 3, the depth at which the SPT-N value was recorded was not always an integer value. Therefore, data interpolation was performed according to the input layer. For the SPT-N profile down to a depth of 10 m obtained by interpolation, the SPT-N values at the borehole point nearest to the target mesh were assigned as N1–N10. The shortest distance from the center point of MESH5 to the boring point was also added as an explanatory variable (B_Dist). Data were aggregated into MESH5 units.

3.2. Description of Prediction Model

In this study, we constructed a model for predicting the amount of ground subsidence due to liquefaction using extreme gradient boosting (XGBoost) [55] and LightGBM [56] algorithms on existing data. Specifically, we utilized XGBoost version 1.7.3 and LightGBM version 4.1.0, operating on a 64-bit Windows 11 Pro system. Gradient boosting reduces loss at each step based on the gradient vector function [30]. In this process, gradient-boosting decision trees are integrated into XGBoost and LightGBM, functioning to create an ensemble method that combines multiple weak learners to create a strong model [57].
XGBoost is an ensemble-learning algorithm based on decision trees and is known for its high accuracy. It employs a level-wise approach to tree growth, simultaneously expanding all tree levels. This method ensures balanced growth of trees at each step, leading to more evenly developed structures. XGBoost offers numerous features, including handling of missing values and regularization, to prevent overfitting. It supports distributed and parallel computing, making it suitable for large datasets, although the computation time tends to increase proportionally with the data size.
LightGBM, developed by Microsoft, is an efficient gradient-boosting technique. It adopts a leaf-wise growth approach, prioritizing branches from the leaves that contribute the most to losses. This method can create imbalanced trees, but often leads to faster and more accurate model development owing to greater reductions in loss during splits. LightGBM is specialized for the efficient processing of large datasets and enhances computational efficiency using techniques such as gradient-based one-side sampling (GOSS) and exclusive feature bundling (EFB).

3.3. Adjustment of Hyperparameters for Machine-Learning Models

In machine-learning models, tuning the hyperparameters is crucial for optimizing the model performance [55,58]. Hyperparameters define the structure and learning process of the model, directly affecting its learning and generalization abilities [59]; if they are not set correctly, the model’s performance can be significantly compromised. For example, excessively deep trees or overly high learning rates can lead to overfitting, whereas overly conservative settings may result in underfitting.
In this study, we used the hyperparameter optimization tool Optuna [60] to optimize the hyperparameters. Optuna is based on Bayesian optimization [61,62,63] and learns from past trials to suggest better parameter combinations for subsequent trials. This approach enables the identification of optimal parameters more rapidly than methods such as random or grid searches. Moreover, Optuna is highly flexible and compatible with diverse machine-learning algorithms and frameworks. Thus, Optuna is vital for complex models or large datasets, and can improve model performance while accelerating and streamlining the optimization process. Because the optimal hyperparameters vary depending on the dataset used, the search range for each hyperparameter was adjusted manually. The hyperparameters of the model were tuned with 5-fold cross-validation to prevent overfitting. Table 1 lists the results of parameter tuning for the machine-learning model using XGBoost conducted with Optuna. Similarly, Table 2 presents the results of parameter tuning for the machine-learning model using LightGBM, which was also performed using Optuna.
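The trial loop that Optuna automates can be pictured with a dependency-free stand-in: the sketch below substitutes plain random search for Optuna's Bayesian sampler and a synthetic objective for a real cross-validated training run. The parameter names and ranges are illustrative assumptions, not the study's actual search space.

```python
# Random-search stand-in for Optuna-style hyperparameter tuning.
# Optuna's trial.suggest_* API follows the same pattern: sample from
# declared ranges, evaluate an objective, keep the best trial.
import random

random.seed(0)

SEARCH_SPACE = {
    "learning_rate": (0.01, 0.3),   # Optuna would typically sample this log-uniformly
    "max_depth": (3, 10),           # integer range
}

def objective(params):
    """Hypothetical validation loss; a real objective would train a model
    with these params and return its cross-validated MAE."""
    lr, depth = params["learning_rate"], params["max_depth"]
    return (lr - 0.1) ** 2 + 0.01 * abs(depth - 6)

def random_search(n_trials=200):
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "learning_rate": random.uniform(*SEARCH_SPACE["learning_rate"]),
            "max_depth": random.randint(*SEARCH_SPACE["max_depth"]),
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss

best_params, best_loss = random_search()
```

Unlike this uniform sampler, Optuna's tree-structured Parzen estimator concentrates later trials around regions that performed well, which is why it typically needs fewer trials than random or grid search.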

3.4. Dataset Splitting

Training, validation, and test sets are required to construct a predictive model. The training set was used as the basic dataset to fit the predictive model, and the magnitude of loss for each iteration (number of weight updates) was evaluated using the validation set. The test set was used to measure the prediction accuracy of the final model. In this study, we split the data in the ratio 70:20:10 for the training, validation, and test sets, respectively, and conducted learning, evaluation and inference accordingly.
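The 70:20:10 split described above can be sketched as a shuffled index partition. This is an illustrative implementation only; the source does not specify the exact splitting procedure beyond the ratios.

```python
# Shuffle row indices, then cut them into 70% train, 20% validation,
# and 10% test partitions.
import random

def split_indices(n, ratios=(0.7, 0.2, 0.1), seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]          # remainder goes to the test set
    return train, val, test

train, val, test = split_indices(1000)    # hypothetical dataset size
```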

3.5. Evaluation Metrics

Evaluation metrics were established to evaluate the performance of the machine-learning models. These metrics are important for understanding the accuracy of model predictions and classifications. They are used to check the prediction accuracy of a model in each iteration and reduce the loss value, thereby minimizing the error between the actual and predicted values. This helps in assessing the effectiveness of the model and contributes to preventing overfitting. The mean absolute error (MAE) [64], a commonly used evaluation metric in regression analysis, was adopted for evaluation. Metrics based on absolute or squared errors are called scale-dependent metrics [65]. MAE is a metric used to measure the average magnitude of the absolute errors between the predicted value and the actual value [66]. The MAE is often called the mean absolute deviation (MAD) [67,68,69].
MAE is calculated as follows [70]:
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| ,
where $N$ is the number of validation samples, $y_i$ is the actual value in the validation set, and $\hat{y}_i$ is the predicted value of the model. This measurement allowed us to determine how far the model’s predicted values were from the actual values without considering the direction of error.
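The metric translates directly into code; the values below are illustrative only.

```python
# Mean absolute error: average magnitude of prediction errors,
# ignoring the direction (sign) of each error.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])  # (0.5 + 0.0 + 1.0) / 3
```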

3.6. Additive Feature Attribution Methods

3.6.1. SHAP Values

Recently, a novel unified approach for interpreting predictions from machine-learning models, SHAP, was introduced, applying explanation models as a method for understanding these predictions [38]. This study employed SHAP values to uniformly quantify feature importance, aligning features more appropriately with human intuition and effectively distinguishing between different model outputs. This approach, based on game theory [39], enabled the visualization of the influence of the features included in each explanatory variable on the prediction results.
In additive feature attribution methods, the explanation model defines a linear function of binary variables as follows [38,71]:
g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i ,
where $M$ is the number of input features; the $z'_i$ variables typically represent a feature being observed ($z'_i = 1$) or unknown ($z'_i = 0$); and the $\phi_i$ are the feature attribution values, with $\phi_i \in \mathbb{R}$ and $z' \in \{0, 1\}^M$. Equation (2) suggests that the Shapley value has three properties: local accuracy, missingness, and consistency. By contrast, SHAP values are derived by combining these conditional expectations with the classic Shapley values from game theory [71].
The SHAP values used to fairly distribute the values and assess the contribution of each explanatory variable were determined by the following equation [38,71]:
\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|! \, (M - |S| - 1)!}{M!} \left[ f_x(S \cup \{i\}) - f_x(S) \right] ,
where $N$ is the set of all input features, $S$ is any subset of $N$ excluding feature $i$, $|S|$ is the number of features in subset $S$, and $f_x(S)$ is the predicted value of the model using the feature set $S$. This formula represents the average additional value provided by a feature.
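For a small number of features, the formula above can be evaluated by brute force. In the sketch below, $f_x(S)$ is approximated by replacing "missing" features with a fixed baseline value of 0; this is a simplifying assumption (SHAP instead estimates conditional expectations), and the toy linear model is illustrative only.

```python
# Brute-force Shapley values via the exact formula: enumerate every
# subset S of the other features and weight the marginal contribution
# of feature i by |S|! (M - |S| - 1)! / M!.
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    M = len(x)
    features = range(M)

    def f_S(S):
        # Evaluate the model with features outside S set to the baseline.
        z = [x[i] if i in S else baseline[i] for i in features]
        return f(z)

    phis = []
    for i in features:
        others = [j for j in features if j != i]
        phi = 0.0
        for size in range(M):
            for S in combinations(others, size):
                weight = factorial(len(S)) * factorial(M - len(S) - 1) / factorial(M)
                phi += weight * (f_S(set(S) | {i}) - f_S(set(S)))
        phis.append(phi)
    return phis

# Additive toy model: each feature's attribution equals its own term.
f = lambda z: 2.0 * z[0] + 3.0 * z[1] - z[2]
phis = shapley_values(f, x=[1.0, 2.0, 3.0], baseline=[0.0, 0.0, 0.0])
# For this linear model with a zero baseline, phi_i = w_i * x_i: [2.0, 6.0, -3.0]
```

The attributions sum to the difference between the prediction and the baseline prediction (local accuracy), which is why the enumeration cost of $2^M$ subsets forces practical tools such as TreeSHAP to use model-specific shortcuts.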

3.6.2. SHAP Interaction Values

In machine-learning models, new features can be generated by multiplying the input explanatory variables to enhance the predictive accuracy of the target variable. This type of feature, expressed as the product of two input features, is referred to as a pairwise interaction feature. When interpreting a model using Shapley values, impacts can be categorized into main and interaction effects. These represent the contributions of individual features to the model’s predictions and the influence of the interactions between features, respectively. This distinction is valuable for examining the relationships between explanatory variables. Computation of the SHAP interaction values can be achieved using the following [71,72]:
\Phi_{i,j} = \sum_{S \subseteq N \setminus \{i,j\}} \frac{|S|! \, (M - |S| - 2)!}{2(M - 1)!} \, \nabla_{ij}(S) , \qquad i \neq j ,
where
\nabla_{ij}(S) = f_x(S \cup \{i, j\}) - f_x(S \cup \{i\}) - f_x(S \cup \{j\}) + f_x(S) .
The SHAP interaction value between feature $i$ and feature $j$ is split equally between the two features, so $\Phi_{i,j} = \Phi_{j,i}$, and the total interaction effect is $\Phi_{i,j} + \Phi_{j,i}$ [71]. The main effect for a prediction can then be defined as the difference between the SHAP value and the SHAP interaction values for a feature [71,72]:
\Phi_{i,i} = \phi_i - \sum_{j \neq i} \Phi_{i,j} .
This approach allows for the differentiation and assessment of the individual impacts of features and the impact of their interactions on a model’s predictions.

4. Results and Discussion

4.1. Model Training Results

The results of the machine-learning models are presented and considered in this section. The learning curve illustrates the evolution of the loss values over the number of iterations (the number of times the weights were updated) and provides an overall graph visualizing the predictive performance of the model. The learning curve allowed us to assess whether the predictive model experienced over- or underfitting. Figure 4 shows the training results of the prediction models used in this study.
The blue line in the figure represents the MAE loss values of the training data during model learning, whereas the orange line indicates the MAE loss values of the validation set. As shown in Figure 4, the MAE loss values decrease as the model training progresses. This indicates that proper feature extraction is performed through weight updates in the gradient-boosting decision tree, and that the machine-learning model used in this study fits well with the training set. Additionally, the “early_stopping” configuration for preventing overfitting effectively concluded the training for both XGBoost and LightGBM models. Comparing the two models, the XGBoost model exhibited a smaller evaluation loss than the LightGBM model. This suggests that the XGBoost model is well-suited to the dataset used in this study.

4.2. Model Performance

To assess the performance of the trained model, we verified the predictive accuracy of the machine-learning model on unknown data. In addition to validating the overall predictive accuracy using a loss function, we assessed the predictive accuracy of the individual data points. The test set was used for validation, and the absolute errors between the predicted and true values were calculated. A Pareto chart of the absolute errors with cumulative frequency curves is shown in Figure 5. A scatter plot illustrating the relationship between the predicted and true values is shown in Figure 6. Through these processes, we evaluated the generalization performance of each model. As shown in Figure 5, the XGBoost and LightGBM models exhibit different prediction accuracies. Examining the cumulative frequency for the XGBoost model, 50, 70, and 80% of the test set had absolute errors within ±0.14, ±0.41, and ±0.71 cm, respectively. By contrast, for the LightGBM model, the cumulative frequency indicated absolute errors within ±0.26, ±0.67, and ±1.07 cm for 50, 70, and 80% of the test set, respectively. In this case, the XGBoost model has better accuracy.
The scatter plot in Figure 6 shows the relationship between the true and predicted values. The XGBoost model appears to fit well with the $y_{\mathrm{true}} = y_{\mathrm{pred}}$ line. To quantify how well the values predicted from the regression model aligned with the true values, the coefficient of determination $R^2$ was defined as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2},$$
where $N$ is the number of test-data points, $y_i$ is the actual value of the test data, $\hat{y}_i$ is the predicted value of the model for the test data, and $\bar{y}$ is the mean of the actual values. The $R^2$ score takes a maximum value of 1, with values closer to 1 indicating a better fit of the regression model to the actual data. The results show an $R^2$ of 0.97 for the XGBoost model and 0.94 for the LightGBM model, suggesting that the former provides a better explanation for the amount of ground subsidence. This study uses ground subsidence data based on numerical analysis, which minimizes the effect of outliers. Additionally, missing values were removed during data preprocessing, and an ample amount of data was used. The forecasting model is therefore highly reliable, as it satisfies requirements for both data quality and quantity.
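The $R^2$ definition above translates directly into code; a minimal numpy implementation with toy values:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination:
    R^2 = 1 - sum((y_i - yhat_i)^2) / sum((y_i - ybar)^2)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)   # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]))  # 0.98
```

A perfect fit gives exactly 1; the residual term grows with prediction error, lowering the score.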

4.3. Global Explanations of Prediction Model Based on SHAP

4.3.1. Feature Importance

Feature importance is a measure of the extent to which the features of each explanatory variable affect its predicted value. This analysis clarified the process and basis of the prediction results, allowing for an easy technical explanation. The evaluation of explanatory variables for the entire instance is distinguished as a global explanation.
In this study, we evaluated the impacts of explanatory variables on liquefaction-induced ground subsidence. In the data analysis, stratified sampling was performed for each geomorphological classification, because data imbalances can occur during random sampling. Geomorphological classifications were added to the explanatory variables through dummy-variable transformation using one-hot encoding, and an overrepresentation of a single geomorphological classification under random sampling can bias the results. Stratified sampling mitigates categorical data biases and facilitates precise feature-importance analysis by ensuring proportional representation across classes. For stratified sampling, 20 data points for each relevant geomorphological classification were randomly extracted from the training, evaluation, and test data; applying this to the 25 categories resulted in a sample size of 500.
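This per-class sampling scheme can be sketched with pandas; the column name and class labels below are hypothetical placeholders, not the study's actual categories:

```python
import pandas as pd

# Hypothetical frame: one row per mesh, with a geomorphological
# classification column "geo_class" (labels are illustrative only).
df = pd.DataFrame({
    "geo_class": ["delta", "sand_bar", "reclaimed"] * 100,
    "subsidence": range(300),
})

# Draw a fixed number of rows per class so that no class dominates
# the sample used for the SHAP analysis.
n_per_class = 20
sample = df.groupby("geo_class", group_keys=False).sample(
    n=n_per_class, random_state=0
)
print(len(sample))  # n_per_class * number of classes = 60
```

With the study's 25 categories, the same call would return 20 × 25 = 500 rows.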
The average absolute SHAP value is an indicator of feature importance obtained by taking the absolute SHAP value for each instance and averaging it over the number of samples. Notably, the average absolute SHAP value does not directly indicate an increase or decrease in the relationship with the predicted value; rather, it visualizes the absolute amount of information contained in the explanatory variables. For example, ICE (individual conditional expectation) and PD (partial dependence) plots allow us to visualize whether the predictions change as the values of the features vary. However, the complexity of the data makes such plots difficult to interpret and explain, and the impact on the predicted value cannot be visualized directly, so different methods offer different merits depending on what is being evaluated. Figure 7, Figure 8 and Figure 9 show the SHAP and average absolute SHAP values for each instance of the 44 explanatory variables used in this study, covering the GIS data, borehole data, and explanatory variables related to geomorphological classification, respectively, all in descending order. In addition, the distribution of the features and the SHAP values for each explanatory variable in each instance are shown. Although the average absolute SHAP value evaluates the impact of an explanatory variable as a single feature-importance value, it does not provide a view of the liquefaction ground subsidence trend for each instance. Therefore, by using summary plots to visually represent each instance's features and SHAP values on a plane, the liquefaction ground subsidence factors were interpreted.
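The aggregation itself is a one-liner once per-instance SHAP values are available. A sketch with a small synthetic SHAP matrix (the values and the selection of feature names are illustrative only):

```python
import numpy as np

# Hypothetical SHAP matrix: rows = instances, columns = features.
shap_values = np.array([
    [ 0.04, -0.01,  0.002],
    [-0.03,  0.02, -0.001],
    [ 0.05, -0.02,  0.003],
])
features = ["Ia_min_dI", "AVS30", "River_Dist"]

# Mean absolute SHAP value per feature = global feature importance;
# the sign is discarded, so only the magnitude of influence remains.
importance = np.abs(shap_values).mean(axis=0)
ranking = [features[i] for i in np.argsort(importance)[::-1]]
print(ranking)
```

Sorting by this quantity reproduces the descending-order bar charts of Figures 7–9; the summary plots then restore the per-instance sign information that the averaging discards.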
Figure 7 shows the contribution of each explanatory variable to the predicted value in the GIS data. The variable with the most significant impact on the amount of ground subsidence due to liquefaction was the JMA seismic intensity at the engineering bedrock (Ia_min_dI). Liquefaction phenomena depend on the intensity of seismic motion and occur because of shear stress within the ground. Examining the factor plot for the JMA seismic intensity on the engineering bedrock (Ia_min_dI), it can be seen that as the explanatory variable for the base seismic intensity increased, the impact on the predicted value (SHAP value) also increased, indicating a positive correlation. Geotechnically, when shear stress acts strongly within the ground, the voids in the soil change spatially, causing a rapid increase in pore-water pressure. This theory is derived from Terzaghi’s principle of effective stress and is considered geotechnically sound. Consequently, the contact force between the soil particles decreases, causing the ground to liquefy. Therefore, the high feature importance of the JMA seismic intensity at the engineering bedrock (Ia_min_dI) can be a major cause of liquefaction, owing to the rapid increase in pore-water pressure caused by the seismic motion introduced into the ground.
For AVS30, the amplification of the ground is considered to affect the amount of ground subsidence owing to liquefaction. AVS30 represents the average S-wave velocity from the surface to 30 m underground, a parameter dependent on ground density and elastic properties. Generally, firmer ground transmits waves faster, whereas looser ground transmits them more slowly; therefore, areas with looser ground are more prone to noticeable subsidence. It is commonly believed that liquefaction is more likely to occur on weak ground, as indicated by AVS30. The factor plot shows that as the AVS30 value increases, the impact on the predicted value (SHAP value) decreases. Thus, there is a negative correlation between AVS30 and the predicted value, with a lower AVS30 indicating a higher likelihood of weaker ground, and therefore, a greater impact on the amount of ground subsidence due to liquefaction.
Regarding topographic data, the highest elevation (Max_Elev), steepest slope angle (Max_Slop_Deg), and minimum slope angle (Min_Slop_Deg) are included. The highest elevation indicates the highest point within a mesh; however, most liquefaction phenomena occur in flat or low-lying areas. This is believed to be influenced by the groundwater level, which is an important factor in the occurrence of liquefaction. Greater elevation generally corresponds to a deeper groundwater table relative to the surface and therefore to reduced ground saturation. Thus, it seems likely that the higher the maximum elevation, the lower the probability of liquefaction. The summary plot shows no clear correlation, but a trend is observed in which the higher the elevation, the smaller the impact on the predicted value.
The distances to the coast (Coast_Dist) and to the nearest river (River_Dist), together with HAND, were also considered. These explanatory variables represent the linear distance from the mesh centroid to the nearest coastline and river, and the relative height from the nearest river, respectively, adding information related to the degree of ground saturation. The impact of each variable on the predicted value was lower than that of the seismic motion and topographic data, confirming a minor influence. The summary plot shows little correlation between these explanatory variables and the predicted value, but it indicates that the farther the centroid is from a river, the more significant the ground subsidence due to liquefaction. This result suggests that the degree of ground saturation does not depend on distance alone, implying that other factors are at play. Coast and river distances are highly similar as explanatory variables, both representing distance from a water body, and meshes closer to the coast generally have higher ground saturation. However, a mesh that is close to the coast but far from a river may be judged by the learning model to have significant ground subsidence due to liquefaction because of its distance from a river, even though the coastal distance is the stronger influence. Because both explanatory variables carry similar features related to ground saturation, it is unclear which is the primary factor, leading to the perception that liquefaction is more likely to occur farther from a river. In other words, we believe that the model learns terrain features rather than extracting the dominant factors of liquefaction. However, as shown in Figure 5 and Figure 6, the predictive accuracy of the model was high, suggesting that overall feature extraction for liquefaction was performed successfully.
When a model is trained to align with data trends, it can produce good accuracy; however, if the model does not correctly grasp the overall trend of the data, this is referred to as overfitting. Therefore, when conducting factor analysis using machine-learning models, it is necessary to consider the impact of overfitting.
Figure 8 shows the contribution of each explanatory variable to the predicted value for the borehole data. Borehole data served as indicators for determining the compactness and strength of the ground, represented by the SPT-N value. This indicator was obtained through standard penetration tests and was crucial for quantitatively determining the amount of ground subsidence caused by liquefaction. Figure 8 shows the feature importance of the SPT-N value at different depths and the summary plot, with SPT-N values at depths greater than 5 m from the surface having a strong impact on the predicted value, and higher SPT-N values (firmer ground) tending to make a smaller contribution (SHAP value). The pore-water pressure generated deep in the ground dissipates towards the surface, but the supply of pore-water pressure to the ground near the surface can cause liquefaction owing to infiltration failure. Therefore, when liquefied ground is near the surface, the pore-water pressure dissipates rapidly, resulting in only minor ground subsidence; however, more significant ground subsidence occurs as the drainage distance increases. Thus, it appears that the SPT-N value in deeper ground has a relatively significant impact on the predicted value of ground subsidence owing to liquefaction, generally providing a better explanation for the degree of liquefaction. B_Dist was added as a variable indicating the distance from the nearest borehole, and feature extraction related to the reliability of the ground information is thought to have been performed. The borehole data used in this study were obtained from the nearest boreholes; however, if the corresponding mesh was far from the borehole point, the data lose credibility, because the SPT-N values may differ from those at the actual location. Because ground information is important for predicting ground subsidence, this feature was considered highly important.
Finally, Figure 9 shows the contribution of each explanatory variable to the predicted value in the geomorphological classification. Categorical variables were converted into quantitative variables through one-hot encoding, with the corresponding category being true and all other categories being false. As confirmed by the feature importance, geomorphological classification, reflecting a high tendency for liquefaction occurrence, occupied the top ranks, suggesting a high impact on the predicted value. Conversely, a low impact of factors on the predicted value suggests that these factors are not related to ground subsidence, with other features having a significant influence. Therefore, the geomorphological classifications with low feature importance in Figure 9 are likely to be geologically stable and less prone to liquefaction. However, the summary plot shows instances with high SHAP values in geomorphological classifications, such as volcanoes and volcanic foot-slopes. In terms of geomorphological classification, liquefaction is less likely to occur in volcanic areas; however, extensive liquefaction is possible when the soil in these areas contains a liquefaction layer or meets the conditions for liquefaction. Therefore, when quantitatively calculating the amount of ground subsidence owing to liquefaction, it is difficult to make a unique assessment based on the geomorphological classification because of varying conditions such as soil, topography, and saturation. However, as indicated by Kuwabara and Matsuoka [28], this is useful for predicting the probability of widespread liquefaction.

4.3.2. Effect of Soft Ground on the Amount of Subsidence

Ground softness amplifies shear stress. In particular, areas with shallow groundwater levels and slow shear wave velocities tended to amplify the strength of the seismic motion. In such situations, interactive effects may occur when multiple factors are combined. Interaction refers to a synergistic effect that appears as a combination of multiple factors. This occurs when there is a causal relationship between multiple characteristics and one characteristic may change because of other explanatory variables. Visualization of interactions using SHAP can illustrate nonlinear relationships and dependencies among these features. Figure 10 shows the dependence plot of the JMA seismic intensity at the engineering bedrock (Ia_min_dI) for AVS30 and HAND. Here, we focus on AVS30 as a variable related to the degree of ground amplification and HAND as a variable related to groundwater-level height, and consider the causal relationship with base seismic intensity.
The vertical axis represents the SHAP values of the JMA seismic intensity at the engineering bedrock (Ia_min_dI), which contributes to ground subsidence. The horizontal axis shows the values of the JMA seismic intensity at the engineering bedrock (Ia_min_dI). As shown in Figure 10, even with identical base seismic intensity values, the SHAP values of the base seismic intensity vary depending on the AVS30 and HAND values. Considering a base seismic intensity of 5.5 as a reference, the SHAP value when AVS30 exceeded 300 m/s was approximately 0.02 m. In contrast, when AVS30 was below 150 m/s, the SHAP value was approximately 0.04 m. This indicates that a lower AVS30 value results in a greater impact of the base seismic intensity on ground subsidence. Similarly, for HAND, the SHAP value is approximately 0.02 m when HAND exceeds 7.5 m, but it increases to about 0.03 m when HAND is 7.5 m or less. This suggests that a lower relative height from the water body (or a shallower groundwater level) increases the impact of the base seismic intensity. Therefore, the varying contributions of the base seismic intensity influenced by the AVS30 and HAND values suggest a causal relationship between these features.
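The interaction reading of Figure 10 amounts to comparing the SHAP contribution of Ia_min_dI across AVS30 groups at a comparable base seismic intensity. A sketch with hypothetical per-instance values chosen to roughly match the figure readings (all numbers below are illustrative):

```python
import numpy as np

# Hypothetical instances at a base seismic intensity of about 5.5:
# SHAP contribution of Ia_min_dI (m) paired with each mesh's AVS30 (m/s).
avs30   = np.array([120, 140, 100, 320, 350, 400])
shap_ia = np.array([0.041, 0.039, 0.044, 0.019, 0.021, 0.018])

soft  = shap_ia[avs30 < 150].mean()   # softer ground (low AVS30)
stiff = shap_ia[avs30 > 300].mean()   # stiffer ground (high AVS30)
print(round(soft, 3), round(stiff, 3))
```

A larger mean contribution in the soft-ground group reflects the dependence-plot observation that low AVS30 amplifies the impact of the base seismic intensity on the predicted subsidence.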

4.4. Local Explanations of Prediction Model Based on SHAP

The constructed prediction model was applied to a specific area, and the prediction results were visualized. Whereas global explanations focus on the contribution to the target variable, this approach enables explanations of targeted phenomena from a micro perspective. Combining this with domain knowledge allows for a logical interpretation.
The model was applied to Urayasu City in Chiba Prefecture, which suffered extensive liquefaction damage in the 2011 Great East Japan Earthquake. The selected validation mesh for the area near the Urayasu City Hall was MESH5, with a MESH_CODE of 5339378124. The location of the target regional mesh is shown in Figure 11.
The attribute data of the target mesh were used as the input explanatory variables. For the JMA seismic intensity at the engineering bedrock, I = 5.5 was used, corresponding to a JMA seismic intensity class 6 lower [46]. Figure 12 shows the results of the local explanation using the SHAP method.
Local explanations can elucidate why liquefaction occurred in a particular area on a factor-by-factor basis. In this case, the JMA seismic intensity at the engineering bedrock (Ia_min_dI), AVS30, and B_Dist contributed to ground subsidence of 0.040, 0.009, and 0.008 m, respectively. Urayasu City in Chiba Prefecture consists of reclaimed land in a coastal area; therefore, its distance to the coast is short. For the same reason, it has sandy soil, so the ground bearing capacity is low and the SPT-N value is correspondingly low. Whereas previous evaluations of liquefaction were conducted through numerical simulations and ground surveys, the machine-learning model constructed in this study allows for simple calculation of how much liquefied ground subsidence takes place in any regional mesh. Similarly, by changing the input seismic motion, liquefaction evaluation assuming a major earthquake is possible, and the factors contributing to ground subsidence can be interpreted from the explanatory variables used.
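A local SHAP explanation of this kind rests on the additivity property: the base value plus the per-feature contributions reconstructs the model's prediction for that instance. The first three contributions below are the values reported in the text; the base value and the lumped remainder are hypothetical illustrations:

```python
# SHAP additivity for one mesh: base value + sum of per-feature
# contributions equals the model's predicted subsidence.
base_value = 0.012           # hypothetical mean prediction over background data
contributions = {
    "Ia_min_dI": 0.040,      # reported in the text
    "AVS30":     0.009,      # reported in the text
    "B_Dist":    0.008,      # reported in the text
    "others":   -0.004,      # remaining features, lumped (hypothetical)
}
prediction = base_value + sum(contributions.values())
print(round(prediction, 3))  # predicted subsidence in metres
```

This decomposition is what Figure 12 visualizes: each bar shifts the prediction away from the base value, and the bars sum exactly to the model output for the Urayasu mesh.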

5. Conclusions

In this study, a factor analysis of ground subsidence was conducted using seismic motion, GIS, and ground data. Machine-learning models were constructed using two ensemble learning algorithms, XGBoost and LightGBM, and feature extraction was performed. The XGBoost model demonstrated higher predictive accuracy than the LightGBM model, with absolute errors of ±0.14, ±0.41, and ±0.71 cm at cumulative frequencies of 50, 70, and 80% on the test set, respectively. The coefficient of determination R 2 for the XGBoost model was 0.97, and there was a good fit between the predicted and actual values, proving its effectiveness in predicting the amount of ground subsidence due to liquefaction. Subsequently, using the XGBoost model, the contribution of each explanatory variable to the prediction results was determined using SHAP impact analysis.
In terms of global explanations, the JMA seismic intensity at the engineering bedrock and AVS30 were the key variables affecting the amount of liquefied ground subsidence, suggesting a significant impact on liquefaction phenomena. In addition, borehole data represented by the SPT-N value, which indicates the hardness of the ground, were confirmed to contribute significantly to ground subsidence. The limitations of GIS data and geomorphological classification in representing ground information suggest that including borehole information in the machine-learning model could improve prediction accuracy.
In terms of local explanations, the model was able to provide an engineering explanation for the cause of ground subsidence in any given instance. In this study, factor analysis was conducted for Urayasu City, Chiba Prefecture, and it was confirmed that the contribution to the amount of ground subsidence could be evaluated using the SHAP method. Consequently, this suggests that predictions with both flexibility and interpretability are possible.

Author Contributions

Conceptualization, K.K.; Formal analysis, K.K.; Funding acquisition, Y.M.; Project administration, W.L. and Y.M.; Writing—original draft, K.K.; Writing—review and editing, W.L. and Y.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by JSPS KAKENHI Grant Number 23H01652 and 22K18303.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Seed, R.B.; Cetin, K.O.; Moss, R.E.S.; Kammerer, A.M.; Wu, J.; Pestana, J.M.; Reimer, M.F. Recent Advances in Soil Liquefaction Engineering and Seismic Site Response Evaluation. In Proceedings of the 4th International Conference on Recent Advances in Geotechnical Earthquake Engineering and Soil Dynamics, San Diego, CA, USA, 31 March 2001. [Google Scholar]
  2. Seed, H.B.; Idriss, I.M. Simplified Procedure for Evaluating Soil Liquefaction Potential. J. Soil Mech. Found. Div. 1971, 97, 1249–1273. [Google Scholar] [CrossRef]
  3. Wakamatsu, K.; Yoshida, N.; Kiku, H. Liquefaction during the 2004 Niigata-Ken Chuetsu Earthquake—General Aspect and Geotechnical and Geomorphic Conditions. Doboku Gakkai Ronbunshuu C 2006, 62, 263–276. [Google Scholar] [CrossRef]
  4. Wakamatsu, K.; Senna, S. Liquefaction and Their Site Conditions in Kanto Region during the 2011 Off the Pacific Coast of Tohoku Earthquake. J. JAEE 2015, 15, 2_25–2_44. [Google Scholar] [CrossRef]
  5. Wakamatsu, K.; Senna, S.; Ozawa, K. Liquefaction and its Characteristics during the 2016 Kumamoto Earthquake. J. JAEE 2017, 17, 4_81–4_100. [Google Scholar] [CrossRef]
  6. Watabe, Y.; Nishimura, S. Ground Movements and Damage in Satozuka District, Sapporo Due to 2018 Hokkaido Eastern Iburi Earthquake. Soils Found. 2020, 60, 1331–1356. [Google Scholar] [CrossRef]
  7. Kazama, M. Overview of the Damages of the 2011 Off the Pacific Coast of Tohoku Earthquake and Its Geotechnical Problems. Jpn. Geotech. J. 2012, 7, 1–11. [Google Scholar] [CrossRef]
  8. Yasuda, S.; Harada, K.; Ishikawa, K.; Kanemaru, Y. Characteristics of Liquefaction in Tokyo Bay Area by the 2011 Great East Japan Earthquake. Soils Found. 2012, 52, 793–810. [Google Scholar] [CrossRef]
  9. Yamaguchi, A.; Mori, T.; Kazama, M.; Yoshida, N. Liquefaction in Tohoku District during the 2011 off the Pacific Coast of Tohoku Earthquake. Soils Found. 2012, 52, 811–829. [Google Scholar] [CrossRef]
  10. Abe, H.; Higashi, M.; Higuchi, S.; Inada, A.; Ito, A.; Iwamoto, H.; Kamikaseda, S.; Kawasaki, K.; Kusunoki, K.; Sato, S.; et al. Ground Failures on Reclaimed Land during the 2011 Tohoku Earthquake: A Case Study in Urayasu City, Japan. Quat. Int. 2016, 397, 555–562. [Google Scholar] [CrossRef]
  11. Yasuda, S.; Ishikawa, K. Several Features of Liquefaction-Induced Damage to Houses and Buried Lifelines During the 2011 Great East Japan. In Proceedings of the International Symposium on Engineering Lessons Learned from the 2011 Great East Japan Earthquake, Tokyo, Japan, 1–4 March 2012; pp. 825–836. [Google Scholar]
  12. Towhata, I.; Kiku, H.; Taguchi, Y. Technical and Societal Problems to Be Solved in Geotechnical Issues. In Proceedings of the International Symposium on Engineering Lessons Learned from the 2011 Great East Japan Earthquake, Tokyo, Japan, 1–4 March 2012; pp. 837–848. [Google Scholar]
  13. Cubrinovski, M. Liquefaction-Induced Damage in the 2010–2011 Christchurch (New Zealand) Earthquakes. In Proceedings of the 7th International Conference on Case Histories in Geotechnical Engineering, Chicago, IL, USA, 4 May 2013. [Google Scholar]
  14. Cubrinovski, M.; Henderson, D.; Bradley, B. Liquefaction Impacts in Residential Areas in the 2010–2011 Christchurch Earthquakes. In Proceedings of the International Symposium on Engineering Lessons Learned from the 2011 Great East Japan Earthquake, Tokyo, Japan, 1–4 March 2012. [Google Scholar]
  15. Von Terzaghi, K. Die Berechnung der Durchassigkeitsziffer des Tones Aus Dem Verlauf der Hydrodynamischen Spannungs. Erscheinungen. Sitzungsber. Akad. Wiss. Math. Naturwiss. Kl. Abt. 2A 1923, 132, 105–124. [Google Scholar]
  16. Von Terzaghi, K. The Shearing Resistance of Saturated Soils and the Angle between the Planes of Shear. First Int. Conf. Soil Mech. 1936, 1, 54–59. [Google Scholar]
  17. Gabel, L.L.S.; O’Brien, F.E.; Allan, J.C.; Bauer, J.M. Tsunami Evacuation Analysis of Communities Surrounding the Coos Bay Estuary: Building Community Resilience on the Oregon Coast (Technical Report O-19-07); Oregon Department of Geology and Mineral Industries: Portland, OR, USA, 2019; p. 65.
  18. Guidelines for Considering Liquefaction Countermeasures for Evacuation Routes in Kochi Prefecture. Available online: https://www.pref.kochi.lg.jp/soshiki/010201/files/2021041300261/tebiki.pdf (accessed on 16 December 2023).
  19. Waga Machi Hazard Map. Available online: https://disaportal.gsi.go.jp/hazardmapportal/hazardmap/index.html (accessed on 18 December 2023).
  20. Boulanger, R.W.; Idriss, I.M.; Boulanger, R.W. CPT and SPT Based Liquefaction Triggering Procedures; Report No. UCD/CGM-14/01; Center for Geotechnical Modeling, Department of Civil and Environmental Engineering, College of Engineering, University of California: Davis, CA, USA, 2014. [Google Scholar]
  21. Matsuoka, M.; Wakamatsu, K.; Hashimoto, M. Liquefaction Potential Estimation Based on the 7.5-Arc-Second Japan Engineering Geomorphologic Classification Map. J. Jpn. Assoc. Earthq. Eng. 2011, 11, 2_20–2_39. [Google Scholar] [CrossRef]
  22. Zhu, J.; Baise, L.G.; Thompson, E.M. An Updated Geospatial Liquefaction Model for Global Application. Bull. Seismol. Soc. Am. 2017, 107, 1365–1385. [Google Scholar] [CrossRef]
  23. Demir, S. Numerical Investigation of the Effects of Ground Motion Characteristics on the Seismic Behavior of Liquefiable Soil. Period. Polytech. Civ. Eng. 2023, 67, 24–35. [Google Scholar] [CrossRef]
  24. Ghani, S.; Sapkota, S.C.; Singh, R.K.; Bardhan, A.; Asteris, P.G. Modelling and Validation of Liquefaction Potential Index of Fine-Grained Soils Using Ensemble Learning Paradigms. Soil Dyn. Earthq. Eng. 2024, 177, 108399. [Google Scholar] [CrossRef]
  25. Ozsagir, M.; Erden, C.; Bol, E.; Sert, S.; Özocak, A. Machine Learning Approaches for Prediction of Fine-Grained Soils Liquefaction. Comput. Geotech. 2022, 152, 105014. [Google Scholar] [CrossRef]
  26. Zhang, H.; Wang, L.; Shi, W. Seismic Control of Adaptive Variable Stiffness Intelligent Structures Using Fuzzy Control Strategy Combined with LSTM. J. Build. Eng. 2023, 78, 107549. [Google Scholar] [CrossRef]
  27. Galupino, J.; Dungca, J. Estimating Liquefaction Susceptibility Using Machine Learning Algorithms with a Case of Metro Manila, Philippines. Appl. Sci. 2023, 13, 6549. [Google Scholar] [CrossRef]
  28. Kuwabara, K.; Matsuoka, M. Estimation of Liquefaction Susceptibility in Japan Using Machine Learning Approach. J. Jpn. Assoc. Earthq. Eng. 2021, 21, 2_70–2_89. [Google Scholar] [CrossRef]
  29. Jas, K.; Dodagoudar, G.R. Explainable Machine Learning Model for Liquefaction Potential Assessment of Soils Using XGBoost-SHAP. Soil Dyn. Earthq. Eng. 2023, 165, 107662. [Google Scholar] [CrossRef]
  30. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
  31. Carnevale, L.; Floramo, G.; Di Fabrizio, D.; Arena, S.; Montalto, A.S.; Impellizzeri, P.; Romeo, C.; Villari, M. Towards a Precision Medicine Solution for Optimal Pediatric Laparoscopy: An Exploratory Data Analysis for Features Selections. Biomed. Signal Process. Control. 2024, 88, 105321. [Google Scholar] [CrossRef]
  32. Sharma, A.; Sharma, M.K.; Dwivedi, R.K. Exploratory Data Analysis and Deception Detection in News Articles on Social Media Using Machine Learning Classifiers. Ain Shams Eng. J. 2023, 14, 102166. [Google Scholar] [CrossRef]
  33. Gupta, U.; Sharma, R. Analysis of Criminal Spatial Events in India Using Exploratory Data Analysis and Regression. Comput. Electr. Eng. 2023, 109, 108761. [Google Scholar] [CrossRef]
  34. Indrakumari, R.; Poongodi, T.; Jena, S.R. Heart Disease Prediction Using Exploratory Data Analysis. Procedia Comput. Sci. 2020, 173, 130–139. [Google Scholar] [CrossRef]
  35. Chakri, P.; Pratap, S.; Lakshay; Gouda, S.K. An Exploratory Data Analysis Approach for Analyzing Financial Accounting Data Using Machine Learning. Decis. Anal. J. 2023, 7, 100212. [Google Scholar] [CrossRef]
  36. Doran, D.; Schulz, S.; Besold, T.R. What Does Explainable AI Really Mean? A New Conceptualization of Perspectives. arXiv 2017, arXiv:1710.00794. [Google Scholar] [CrossRef]
  37. Futia, G.; Vetrò, A. On the Integration of Knowledge Graphs into Deep Learning Models for a More Comprehensible AI—Three Challenges for Future Research. Information 2020, 11, 122. [Google Scholar] [CrossRef]
  38. Lundberg, S.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar] [CrossRef]
  39. Štrumbelj, E.; Kononenko, I. Explaining Prediction Models and Individual Predictions with Feature Contributions. Knowl. Inf. Syst. 2014, 41, 647–665. [Google Scholar] [CrossRef]
  40. Pathy, A.; Meher, S.; P, B. Predicting Algal Biochar Yield Using EXtreme Gradient Boosting (XGB) Algorithm of Machine Learning Methods. Algal Res. 2020, 50, 102006. [Google Scholar] [CrossRef]
  41. Prasertpong, P.; Onsree, T.; Khuenkaeo, N.; Tippayawong, N.; Lauterbach, J. Exposing and Understanding Synergistic Effects in Co-Pyrolysis of Biomass and Plastic Waste via Machine Learning. Bioresour. Technol. 2023, 369, 128419. [Google Scholar] [CrossRef]
  42. Woldesellasse, H.; Tesfamariam, S. Prediction of Lateral Spreading Displacement Using Conditional Generative Adversarial Network (CGAN). Soil Dyn. Earthq. Eng. 2022, 156, 107214. [Google Scholar] [CrossRef]
  43. Mangalathu, S.; Hwang, S.-H.; Jeon, J.-S. Failure Mode and Effects Analysis of RC Members Based on Machine-Learning-Based SHapley Additive ExPlanations (SHAP) Approach. Eng. Struct. 2020, 219, 110927. [Google Scholar] [CrossRef]
  44. Somala, S.N.; Chanda, S.; Karthikeyan, K.; Mangalathu, S. Explainable Machine Learning on New Zealand Strong Motion for PGV and PGA. Structures 2021, 34, 4977–4985. [Google Scholar] [CrossRef]
  45. Nankai Trough Megathrust Earthquake Model Study Group. Available online: https://www.bousai.go.jp/jishin/nankai/model/ (accessed on 18 December 2023).
  46. Tables Explaining the JMA Seismic Intensity Scale. Available online: https://www.data.jma.go.jp/eqev/data/kyoshin/kaisetsu/calc_sindo.html (accessed on 20 December 2023).
  47. Digital National land Information Download Sites. Available online: http://nlftp.mlit.go.jp/ksj/ (accessed on 18 December 2023).
  48. Wakamatsu, K.; Matsuoka, M. Nationwide 7.5-Arc-Second Japan Engineering Geomorphological Classification Map and Vs30 Zoning. J. Disaster Res. 2013, 8, 904–911. [Google Scholar] [CrossRef]
  49. Rennó, C.D.; Nobre, A.D.; Cuartas, L.A.; Soares, J.V.; Hodnett, M.G.; Tomasella, J.; Waterloo, M.J. HAND, a New Terrain Descriptor Using SRTM-DEM: Mapping Terra-Firme Rainforest Environments in Amazonia. Remote Sens. Environ. 2008, 112, 3469–3481. [Google Scholar] [CrossRef]
  50. Yamazaki, D.; Togashi, S.; Takeshima, A.; Sayama, T. High-Resolution Flow Direction Map of Japan. J. JSCE 2020, 8, 234–240. [Google Scholar] [CrossRef] [PubMed]
  51. Lee, M.-H. Predicting and Analyzing the Fill Factor of Non-Fullerene Organic Solar Cells Based on Material Properties and Interpretable Machine-Learning Strategies. Sol. Energy 2024, 267, 112191. [Google Scholar] [CrossRef]
  52. Zhu, X.-W.; Xin, Y.-J.; Ge, H.-L. Recursive random forests enable better predictive performance and model interpretation than variable selection using LASSO. J. Chem. Inf. Model. 2015, 55, 736–746. [Google Scholar] [CrossRef]
  53. Jiang, X.; Wang, Y.; Jia, B.; Qu, X.; Qin, M. Prediction of Oxygen Evolution Activity of NiCoFe-oxide Catalysts via Machine Learning. ACS Omega 2022, 7, 14160–14164. [Google Scholar] [CrossRef] [PubMed]
  54. National Ground Information Search Sites. Available online: http://www.kunijiban.pwri.go.jp (accessed on 18 December 2023).
  55. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2016, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
  56. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154. [Google Scholar]
  57. Tarwidi, D.; Pudjaprasetya, S.R.; Adytia, D.; Apri, M. Optimized XGBoost-Based Machine Learning Method for Predicting Wave Run-up on a Sloping Beach. MethodsX 2023, 10, 102119. [Google Scholar] [CrossRef] [PubMed]
  58. Probst, P.; Bischl, B.; Boulesteix, A.-L. Tunability: Importance of Hyperparameters of Machine Learning Algorithms. arXiv 2018, arXiv:1802.09596. [Google Scholar] [CrossRef]
  59. Zhang, W.; Wu, C.; Zhong, H.; Li, Y.; Wang, L. Prediction of Undrained Shear Strength Using Extreme Gradient Boosting and Random Forest Based on Bayesian Optimization. Geosci. Front. 2021, 12, 469–477. [Google Scholar] [CrossRef]
  60. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. arXiv 2019, arXiv:1907.10902. [Google Scholar] [CrossRef]
  61. Bischl, B.; Richter, J.; Bossek, J.; Horn, D.; Thomas, J.; Lang, M. mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions. arXiv 2017, arXiv:1703.03373. [Google Scholar] [CrossRef]
  62. Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian Optimization of Machine Learning Algorithms. Adv. Neural Inf. Process. Syst. 2012, 4, 2951–2959. [Google Scholar]
  63. Hutter, F.; Hoos, H.H.; Leyton-Brown, K. Sequential Model-Based Optimization for General Algorithm Configuration. In Proceedings of the Learning and Intelligent Optimization: 5th International Conference, LION 5, Rome, Italy, 17–21 January 2011; Springer: Berlin/Heidelberg, Germany, 2011; Volume 6683, pp. 507–523. [Google Scholar]
  64. Nguyen, T.; Tran, N.; Nguyen, B.M.; Nguyen, G. A Resource Usage Prediction System Using Functional-Link and Genetic Algorithm Neural Network for Multivariate Cloud Metrics. In Proceedings of the 2018 IEEE 11th Conference on Service-Oriented Computing and Applications (SOCA), Paris, France, 20–22 November 2018; pp. 49–56. [Google Scholar]
  65. Hyndman, R. Another Look at Forecast Accuracy Metrics for Intermittent Demand. Foresight Int. J. Appl. Forecast. 2006, 4, 43–46. [Google Scholar]
  66. Jierula, A.; Wang, S.; Oh, T.M.; Wang, P. Study on Accuracy Metrics for Evaluating the Predictions of Damage Locations in Deep Piles Using Artificial Neural Networks with Acoustic Emission Data. Appl. Sci. 2021, 11, 2314. [Google Scholar] [CrossRef]
  67. Flores, B.E. Pragmatic View of Accuracy Measurement in Forecasting. Omega 1986, 14, 93–98. [Google Scholar] [CrossRef]
  68. Sanders, N.R. Measuring Forecast Accuracy: Some Practical Suggestions. Prod. Inventory Manag. J. 1997, 38, 43–46. [Google Scholar]
  69. Rakicevic, Z.; Vujosevic, M. Focus Forecasting in Supply Chain: The Case Study of Fast Moving Consumer Goods Company in Serbia. Serbian J. Manag. 2015, 10, 3–17. [Google Scholar] [CrossRef]
  70. Hodson, T.O. Root-mean-square error (RMSE) or Mean Absolute Error (MAE): When to Use Them or Not. Geosci. Model. Dev. 2022, 15, 5481–5487. [Google Scholar] [CrossRef]
  71. Lundberg, S.M.; Erion, G.G.; Lee, S.-I. Consistent Individualized Feature Attribution for Tree Ensembles. arXiv 2018, arXiv:1802.03888. [Google Scholar] [CrossRef]
  72. Fujimoto, K.; Kojadinovic, I.; Marichal, J.-L. Axiomatic Characterization of Probabilistic and Cardinal–Probabilistic Interaction Indices. Games Econ. Behav. 2006, 55, 72–99. [Google Scholar] [CrossRef]
Figure 1. Maps of calculated liquefaction ground subsidence for different scenarios related to the Nankai Trough Megathrust Earthquake: (a) Fundamental case; (b) hypocenter located near the land; (c) hypocenter located in the east; (d) hypocenter located in the west; (e) hypocenter assumed empirically.
Figure 2. Heatmap of the Pearson correlation coefficients (PCCs) among the explanatory variables in the GIS data.
Figure 3. Schematic diagram for extracting SPT-N values from borehole data.
Figure 4. Learning curve of the prediction models using machine learning: (a) XGBoost model; (b) LightGBM model.
Figure 5. The Pareto chart of absolute errors and cumulative frequency curves: (a) XGBoost model; (b) LightGBM model.
Figure 6. Scatter plots of predicted versus true values; 20,000 samples were randomly extracted from the test data: (a) XGBoost model; (b) LightGBM model.
Figure 7. The average absolute SHAP values for the eight explanatory variables in GIS data and the SHAP values for each instance: (a) Feature importance plot; (b) summary plot.
Figure 8. The average absolute SHAP values for the 11 explanatory variables in borehole data and the SHAP values for each instance: (a) Feature importance plot; (b) summary plot.
Figure 9. The average absolute SHAP values for the 25 explanatory variables in geomorphologic classification and the SHAP values for each instance: (a) Feature importance plot; (b) summary plot.
Figure 10. SHAP dependence plot of the JMA seismic intensity at the engineering bedrock (Ia_min_dI): (a) Impact of causality for AVS30; (b) impact of causality for HAND.
Figure 11. The location of the target regional mesh for local explanations based on SHAP.
Figure 12. Local explanation results for the XGBoost model (MESH_CODE: 5339378124).
Table 1. Hyperparameter tuning results using Optuna for prediction model using XGBoost.
Parameter          Range          Value
colsample_bytree   [0.8, 1.0]     8.41 × 10⁻¹
gamma              [0, 1]         3.61 × 10⁻⁵
learning_rate      [0.01, 0.3]    9.16 × 10⁻²
max_depth          [2, 128]       102
min_child_weight   [0.1, 10]      3.17
reg_alpha          [10⁻⁸, 10]     4.38 × 10⁻⁵
reg_lambda         [10⁻⁸, 10]     8.60 × 10⁻⁷
subsample          [0.8, 1.0]     9.82 × 10⁻¹
Table 2. Hyperparameter tuning results using Optuna for prediction model using LightGBM.
Parameter          Range          Value
bagging_fraction   [0.8, 1.0]     8.75 × 10⁻¹
feature_fraction   [0.8, 1.0]     9.78 × 10⁻¹
lambda_l1          [10⁻⁸, 10]     4.67 × 10⁻⁸
lambda_l2          [10⁻⁸, 10]     6.69
learning_rate      [0.01, 0.3]    2.74 × 10⁻¹
max_depth          [2, 128]       28
min_child_weight   [0.1, 10]      5.98
num_leaves         [2, 1024]      961
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Karimai, K.; Liu, W.; Maruyama, Y. Prediction and Factor Analysis of Liquefaction Ground Subsidence Based on Machine-Learning Techniques. Appl. Sci. 2024, 14, 2713. https://doi.org/10.3390/app14072713
