1. Introduction
Concrete structures are subjected to repeated loading (N) from many sources, such as dead and live loads in buildings, traffic loads in civil structures, or environmental loads such as temperature and humidity changes. It is commonly known that concrete strength under repeated loading is lower than that under static loading [1,2]. Concrete structures subjected to many repeated loadings experience increases in deflections and crack widths, eventually leading to reduced durability and fatigue failure [3].
A classic fatigue equation for plain concrete is typically represented by an S-N diagram, where the stress level (S), defined as a percentage of the static strength, is plotted against the logarithm of N. Most previous fatigue research results have been analyzed with a simple linear equation. However, it is well known that a single S-N curve (known as a Wöhler curve) is inadequate to describe fatigue behavior [1], as it is affected by other factors.
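For reference, the simple linear form mentioned above is typically written, schematically, as

```latex
S = a - b \log N
```

where $a$ and $b$ are empirical constants fitted to the test data; this generic form is given here only for illustration and is not an equation taken from the referenced studies.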
In addition to S, concrete fatigue is affected by various factors, such as the concrete compressive strength, the concrete mix proportions, and the loading parameters [1,4–33]. As stated in [1], while concrete fatigue is relatively insensitive to the details of the mix design and the compressive strength, it is highly sensitive to the fatigue loading parameters, such as the maximum stress level (Smax), the minimum to maximum stress ratio (R), the frequency (f), and the fatigue loading history [1].
Moreover, high-strength concrete yields a different fatigue pattern, and mixes proportioned with different water-binder ratios, including the use of fibers, also produce different fatigue patterns [1]. Recently, incorporating supplementary cementitious materials (SCMs), such as slag, fly ash, metakaolin, and silica fume, into the concrete mix has become widely regarded as the most economical means of improving durability and reducing CO2 emissions [34,35]. Thus, in the near future, it will be essential to understand the fatigue behavior of concrete containing waste materials as well as SCMs. However, the fatigue behavior of innovative concrete materials combining the above-mentioned mixture constituents is difficult to estimate. In addition, concrete structures are exposed to diverse fatigue loading parameters, such as different stress levels and frequencies, as mentioned before. Therefore, the traditional statistical treatment for accurately predicting concrete fatigue behavior has reached its limit, due to its inability to consider the complicated combined effects of these influential parameters.
To overcome this limitation inherent in traditional regression-based statistical methods, machine learning (ML) methods have been introduced to solve complex concrete material property problems, in terms of durability as well as mechanical strength [4,36–62]. In recent years, ML methods have become more widely used for structural and material design in civil engineering. Various ML methods have been frequently used since 2020 for predicting basic mechanical strength properties [36–61] and for mixture optimization [36,39,62]. The main concrete property predicted using ML methods is the compressive strength of various concretes, such as normal concrete [36–41], high-performance concrete (HPC) [38,42–44], concrete with industrial wastes, including supplementary cementitious materials (SCMs) [45–54], recycled aggregate (RA) concrete [52,55–57], geopolymer concrete [58,59], and concrete with fibers [53]. In addition, the splitting tensile strength [45,57,60] and modulus of elasticity [61] of concrete have been predicted using ML techniques.
Among the ML methods, artificial neural network (ANN) models are widely used [36–38,42,43,45–50,52,54,55,58–60,62]. In addition to ANN, predicting the mechanical strength properties and mix proportions of concrete with other regression models has recently gained popularity, including support vector regression [39,47,52], decision trees [40,57,61,62], random forests [36,44,47,56], AdaBoost [40,41,52,57,59,61], gradient boosting [40,53], and ensemble algorithms [51,61].
In 2019, an ANN-based concrete fatigue strength model was proposed by Abambres and Lantsoght [63]. They used 203 data points gathered from the literature, and the predictions of their ANN model were compared to existing code expressions. Their ANN model includes the compressive strength of concrete, the maximum stress level, and the minimum stress level. In 2021, a strength degradation model of concrete under fatigue loading was proposed by Zhang et al. [4] using several ML algorithms: random forest, support vector machine, and artificial neural network models. About 1000 experimental data points were collected from various independent experiments [5–33]. Seven independent variables were chosen in their study: the compressive strength of concrete, the sustained strength of concrete, the height-to-width ratio and shape of the test specimens, the maximum stress level, the minimum to maximum stress ratio, and the loading frequency. The analysis results revealed that the random forest model produced the highest correlation coefficient, at 0.85.
Due to the nature of the fatigue strength test, outliers occur remarkably often in this test compared with other material strength tests. In statistics, an outlier is a data point that differs significantly from other observations [64,65]. An outlier may be due to variability in the measurement, or it may indicate experimental error; the latter is sometimes excluded from the data set. There are various methods of outlier detection, such as Grubbs's test [64], Chauvenet's criterion [66], Peirce's criterion [67], Dixon's Q-test [68], the generalized extreme studentized deviate test [69], the Thompson–Tau test [70], and the IQR test [71,72].
In this study, 1300 samples of experimental concrete fatigue test data [5–33], originally compiled by Zhang et al. [4], were treated using four kinds of machine learning models (artificial neural network, random forest, gradient boosting, and AdaBoost). Unlike previous studies, this research adopts six independent variables, excluding only the sustained-strength variable used in the work of Zhang et al. [4]. For our approach, three data files were generated to compare the actual fatigue life values (logN) against the predicted values (logN). The first data file uses the entire original dataset, as treated by Zhang et al. [4]. However, unlike Zhang et al. [4], our research adds a second data file with grouped data and a third data file that excludes outliers. In this work, Chauvenet's criterion, Peirce's criterion, the Thompson–Tau criterion, and the IQR method were adopted to remove outliers. Finally, a permutation feature importance (PFI) analysis was carried out to determine which input variables are the most critical or minor in the fatigue life model. Our novel approach allows better fatigue life prediction than the approach of Zhang et al. [4].
2. Input and Output DATA (Independent and Dependent Variables)
Six basic input features (variables) that influence the fatigue life of plain concrete under uniaxial compressive fatigue testing were chosen, as shown in Table 1. The single output variable is the logarithm of the maximum number of cycles at failure, representing the fatigue life in the test. The first group of key input variables, related to the material and dimensional properties of the test specimens, includes the compressive strength of concrete (f′c), the height-to-width ratio (h/w), and the shape of the test specimens. The other three variables, reflecting the loading conditions of the fatigue test specimens, are the maximum stress level (Smax), the minimum stress to maximum stress ratio (R), and the loading frequency (f).
This study covers low-strength hydraulic concrete (10~30 MPa), ordinary concrete (30~60 MPa), and high-strength concrete (60~120 MPa). The h/w of the test specimens ranged from 1.0 to 3.0, and the specimen shapes include the cube, prism, and cylinder. The loading conditions were also highly diverse, with Smax ranging from 0.457 to 0.95, R covering 0 to about 0.67, and the loading frequency ranging from 0.0625 to 150 Hz. The dataset used in this study is summarized below.
f′c: compressive strength of concrete, in MPa;
h/w: height-to-width ratio of the tested specimens;
Shape: shape of the test specimens;
Smax: maximum stress level;
R: minimum stress to maximum stress ratio;
f: loading frequency, in Hz;
LogN: logarithm of the number of cycles to failure of the specimen.
3. DATA Preparation for the Developed Model
Three data files were generated and used to develop the final ML model. Each data file is described below.
ORIGINAL DATA. These are the data used in Zhang's paper, directly collected by the authors from papers [5–33]. The full-data spreadsheet is available in the Supplementary Materials. These serve as the reference data for this study. A total of 1298 data points were collected, and statistical features such as the mean, median, dispersion, minimum, and maximum values of the independent and dependent variables are summarized in Table 1. The ORIGINAL DATA were grouped by identical input variable values.
DATA Excluding OUTLIERS. These are the data created after removing any outliers identified within each group. They serve as the basis for determining the average values after outlier removal. A total of 1252 data points remained. Statistical features such as the mean, median, dispersion, minimum, and maximum values of the independent and dependent variables are summarized in Table 2.
AVERAGE DATA Excluding OUTLIERS. These are the data created by averaging each group's output values after excluding outliers from the grouped data. In this process, the total number of data points was reduced to 310. Statistical features such as the mean, median, dispersion, minimum, and maximum values of the independent and dependent variables are summarized in Table 3.
Table 1, Table 2 and Table 3 illustrate the statistical analysis of the variables, showing numerous mathematical descriptions of the input and output values for each data set. Table 4, Table 5 and Table 6 describe the data process, using part of the data from reference [5] as an example to illustrate the process more clearly.
Table 4 represents a part of the grouped data, in which data sets with the same input variable values but different output variable values are grouped together. Table 4 consists of two groups. Group 1 is a data set with an f′c value of 56 MPa, an h/w value of 1, a shape value of 1, an Smax value of 0.85, an R value of 0.3, and an f value of 4 Hz, but with different output values N. Group 2 is a data set with an f′c value of 56 MPa, an h/w value of 1, a shape value of 1, an Smax value of 0.85, an R value of 0.3, and an f value of 1 Hz, but with different output values N.
To determine whether outlier data exist in each group, four commonly used outlier detection methods [70,71] were performed. If a value was flagged as an outlier by three or more of them, it was excluded from the data. The four methodologies are as follows:
1. Outlier detection using Chauvenet's criterion;
2. Outlier detection using Peirce's criterion;
3. Outlier detection using the Thompson–Tau criterion;
4. Outlier detection using the IQR (interquartile range) criterion.
We applied all four of these methodologies to each group of data to determine which values were detected as outliers. All four methods detected the N value of 22,570 (see Table 4) as an outlier for the Group 1 data. On the other hand, for the data in Group 2, the N value of 1571 (see Table 4) was detected as an outlier only by the Thompson–Tau methodology, and not by the other three.
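Of the four criteria, the IQR test is the simplest to reproduce. The sketch below is a minimal Python illustration of it (not the authors' actual code); apart from the outlier value 22,570 cited from Table 4, the group values are hypothetical:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Group 1 fatigue lives N; only 22,570 comes from Table 4 in the text,
# the remaining values are made up for illustration.
group1 = [310, 450, 520, 640, 780, 22570]
print(iqr_outliers(group1))  # → [22570]
```

In the study itself a value was removed only when at least three of the four criteria flagged it; the IQR test shown here is just one of those four votes.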
Table 5 represents the grouped data after the data point with an N value of 22,570 is removed from Group 1. Even after removing outliers, different output values are recorded as experimental results for the same input variable values. With such data, it is difficult to build an accurate prediction model as long as the current input variables are maintained. Suppose, for example, that one wants a model to predict the function y = sin(x). If several different experimental y values are recorded for the same input x = 30, it will be difficult to create an ML model that reproduces the sin(x) function. Therefore, for grouped data having the same input variable values but different output variable values, the average of all the output values is computed, and this single average value is used as the output for that specific input combination. This should provide more reasonable data for creating predictive ML models. Table 6 represents the averaged grouped data of Table 5.
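The averaging step described above can be sketched with a pandas group-by, shown here on a hypothetical group patterned after Table 4 (the logN values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical grouped records: identical inputs, differing measured lives
df = pd.DataFrame({
    "fc":    [56, 56, 56],        # MPa
    "h_w":   [1, 1, 1],
    "shape": [1, 1, 1],           # 1 = cube
    "Smax":  [0.85, 0.85, 0.85],
    "R":     [0.3, 0.3, 0.3],
    "f":     [4, 4, 4],           # Hz
    "logN":  [2.5, 2.7, 2.9],     # differing outputs for identical inputs
})

inputs = ["fc", "h_w", "shape", "Smax", "R", "f"]
# One row per unique input combination, with the mean logN as the output
averaged = df.groupby(inputs, as_index=False)["logN"].mean()
print(averaged)  # single row, logN ≈ 2.7
```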
Figure 1 depicts the relative frequency distributions of the six input variables and the one output variable. The shape variable is not merely a numerical variable but a categorical one: in the model, shape = 1 represents a cube, shape = 2 a prism, and shape = 3 a cylinder. Since the numbers only encode categories, panel (d) in Figure 1 can be changed to panel (e), which is more suitable for a normal distribution. The f variable appears unsuitable for a normal distribution, since some high-frequency values of 10 Hz exist in the data. If these high-frequency data are removed, the rest of the data fit a normal distribution much better, as shown in Figure 1i.
The relationships between the various independent variables and logN are plotted in Figure 2. Although not strong, one linear relationship is identified in Figure 2a (logN vs. Smax). All other plots show non-linear behavior.
The most commonly used methods in correlation analysis are the Pearson correlation analysis and Spearman correlation analysis. Pearson correlation evaluates the linear relationship and direction between two variables using the values of the variables. Spearman correlation evaluates a monotonic relationship between two variables. In a monotonic relationship, the two variables tend to change together, but do not necessarily change at a constant rate. The Spearman correlation coefficient is based on ranked values for each variable, not on raw data.
Table 7 summarizes the Pearson and Spearman correlation coefficients of the data used for our ML model. According to the Pearson correlation coefficient, Smax and logN have a strong negative linear relationship, while f has a moderate positive and R a moderate negative linear relationship with logN; f′c, shape, and h/w have non-significant linear relationships with logN. According to the Spearman correlation coefficient, Smax has a significant negative and f a significant positive monotonic relationship with logN; f′c has a moderate negative one; and R, shape, and h/w have negligible monotonic relationships with logN.
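As a minimal illustration of the difference between the two coefficients, the SciPy sketch below computes both on synthetic data with a noisy, roughly linear negative trend (mimicking the reported Smax-logN relationship; the data are not from the study):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
smax = rng.uniform(0.5, 0.95, 200)
# Illustrative fatigue lives: logN decreases roughly linearly with Smax
logN = 10 - 8 * smax + rng.normal(0, 0.5, 200)

r_pearson, _ = pearsonr(smax, logN)    # strength of *linear* association
r_spearman, _ = spearmanr(smax, logN)  # strength of *monotonic* (rank) association
print(round(r_pearson, 2), round(r_spearman, 2))  # both strongly negative
```

For a purely monotonic but non-linear trend, the Spearman coefficient would stay near ±1 while the Pearson coefficient would shrink, which is why both are reported in Table 7.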
Therefore, a complex relationship rather than a linear mapping is critical for capturing the variation and interaction among these variables. This is why it is necessary to create predictive systems using ML methods.
5. Model Development
The models for fatigue prediction were developed using Orange, a popular open-source machine learning platform for statistical computing and data mining [78,79]. All data analysis in this research was carried out using Orange (version 3.32.0, developed at the Bioinformatics Laboratory, Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia, together with the open-source community), which provides the most prevalent supervised ML algorithms. These algorithms were used to develop our novel ML model. Information regarding the input parameters and implementation of each machine learning algorithm is summarized in the documentation (https://orangedatamining.com/widget-catalog/, accessed on 4 April 2022). Orange provides a platform for developing predictive models with big data. The schematic model developed using Orange is presented in Figure 7, and the specific parameters of each proposed model are shown in Figure 8, Figure 9, Figure 10 and Figure 11. Unfortunately, the Orange 3 software used for this study does not have an optimizer function that automatically finds the hyper-parameters of a model. Thus, starting from the default parameters provided by Orange 3, the authors manually adjusted the parameters to generate feasible output for each ML model.
In order to develop the ANN model, the user has to set several important parameters, which are as follows. The number of hidden layers is set to two, with seven and eight neurons in the respective hidden layers, as shown in Figure 8. The rectified linear unit (ReLU) function is selected as the activation function for the hidden layers. As the solver for weight optimization, the stochastic gradient-based optimizer Adam is used. The regularization parameter, commonly called alpha, is set to 0.0004. Replicable training is enabled.
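Orange's widgets wrap scikit-learn estimators, so a rough scikit-learn equivalent of these settings can be sketched as follows; the synthetic dataset and the printed score are placeholders for illustration, not results from the study:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic placeholder data: 6 inputs spanning roughly the ranges in Table 1
X = rng.uniform([10, 1, 1, 0.45, 0, 0.06], [120, 3, 3, 0.95, 0.67, 150], (300, 6))
y = 10 - 8 * X[:, 3] + rng.normal(0, 0.3, 300)  # illustrative logN

ann = make_pipeline(
    StandardScaler(),                        # scale inputs before the network
    MLPRegressor(hidden_layer_sizes=(7, 8),  # two hidden layers: 7 and 8 neurons
                 activation="relu",          # rectified linear unit
                 solver="adam",              # stochastic gradient-based optimizer
                 alpha=0.0004,               # L2 regularization parameter
                 max_iter=2000,
                 random_state=3),            # fixed seed ~ replicable training
)
ann.fit(X, y)
print(round(ann.score(X, y), 2))  # training R^2 on the synthetic data
```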
In order to develop the random forest model, the user has to set several important parameters, which are as follows. As shown in Figure 9, 50 decision trees are included in the forest. Four attributes are drawn at random for consideration at each node. Replicable training was enabled, while balancing the class distribution was not. The depth of individual trees is not limited. The smallest subset that can be split is set to five.
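These random forest settings map onto scikit-learn roughly as shown below; again, the data and score are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
# Synthetic placeholder data: 6 inputs spanning roughly the ranges in Table 1
X = rng.uniform([10, 1, 1, 0.45, 0, 0.06], [120, 3, 3, 0.95, 0.67, 150], (300, 6))
y = 10 - 8 * X[:, 3] + rng.normal(0, 0.3, 300)  # illustrative logN

rf = RandomForestRegressor(
    n_estimators=50,      # 50 decision trees in the forest
    max_features=4,       # 4 attributes drawn at random at each node
    min_samples_split=5,  # smallest subset that can be split
    max_depth=None,       # tree depth not limited
    random_state=3,       # replicable training
)
rf.fit(X, y)
print(round(rf.score(X, y), 2))  # training R^2 on the synthetic data
```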
In order to develop the gradient boosting model, the user has to set several important parameters, which are as follows. As shown in Figure 10, 150 gradient-boosted trees are specified; a larger number usually results in better performance. The boosting (learning) rate is set to 0.2. Replicable training is enabled. The maximum depth of an individual tree is set to 4, and the smallest subset that can be split is set to three. The fraction of training instances used for fitting each individual tree is set to 1.0.
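The gradient boosting settings translate to scikit-learn roughly as follows, again on synthetic placeholder data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
# Synthetic placeholder data: 6 inputs spanning roughly the ranges in Table 1
X = rng.uniform([10, 1, 1, 0.45, 0, 0.06], [120, 3, 3, 0.95, 0.67, 150], (300, 6))
y = 10 - 8 * X[:, 3] + rng.normal(0, 0.3, 300)  # illustrative logN

gb = GradientBoostingRegressor(
    n_estimators=150,     # number of gradient-boosted trees
    learning_rate=0.2,    # boosting (learning) rate
    max_depth=4,          # maximum depth of an individual tree
    min_samples_split=3,  # smallest subset that can be split
    subsample=1.0,        # fraction of training instances per tree
    random_state=3,       # replicable training
)
gb.fit(X, y)
print(round(gb.score(X, y), 2))  # training R^2 on the synthetic data
```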
In order to develop the AdaBoost model, the user has to set several important parameters, which are as follows. The number of estimators is set to 50, as shown in Figure 11. The learning rate is set to 1; it determines to what extent newly acquired information overrides the old, and a value of 1 means that only the most recent information is considered. A fixed seed of 3 is set to enable reproduction of the results. SAMME, which updates the base estimator's weights using classification results, is chosen as the classification algorithm. Among the regression loss function options, the linear option is selected.
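A rough scikit-learn equivalent of the regression side of these settings is sketched below on synthetic placeholder data. Note that the SAMME algorithm applies only to the classifier variant of AdaBoost; scikit-learn's `AdaBoostRegressor` exposes just the estimator count, learning rate, and loss:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(42)
# Synthetic placeholder data: 6 inputs spanning roughly the ranges in Table 1
X = rng.uniform([10, 1, 1, 0.45, 0, 0.06], [120, 3, 3, 0.95, 0.67, 150], (300, 6))
y = 10 - 8 * X[:, 3] + rng.normal(0, 0.3, 300)  # illustrative logN

ada = AdaBoostRegressor(
    n_estimators=50,    # number of boosted estimators
    learning_rate=1.0,  # newly acquired information fully overrides the old
    loss="linear",      # linear regression loss option
    random_state=3,     # fixed seed for reproducible results
)
ada.fit(X, y)
print(round(ada.score(X, y), 2))  # training R^2 on the synthetic data
```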