1. Introduction
Scientists use various models when studying different environmental phenomena. Mathematical models make it possible to derive equations and dependencies that relate the parameters of various objects and processes. Such models are built for several reasons: to gain a better understanding of the objects under study, to enable mathematical analysis, and to allow experimentation with the model when the experiment is difficult to repeat on the objects themselves [1].
The process of mathematical model building contains several steps:
- (1) Experimental study and the measuring of the parameters of real-world systems and phenomena;
- (2) Collecting initial data for the model;
- (3) Mathematical formulations and fitting one or more models;
- (4) The statistical simulation of the model to validate it [2].
There are general rules for building mathematical models. These rules include the following: (1) collecting background information for the phenomenon under study, (2) using simple models at the first stage, (3) determining all parameters and the quantities and correlations between them based on data analysis, (4) complicating the model based on the nature of the phenomenon under study, and (5) estimating the efficiency of the model, among others [3]. The efficiency analysis involves choosing the optimal mathematical model for the problem considered.
There are various efficiency measures for mathematical models. Generally, researchers use the following parameters:
- (1) Accuracy: for the coincidence analysis of the output of a mathematical model with observed data;
- (2) Reliability: for the analysis of the precision of a mathematical model;
- (3) Transparency: for the analysis of choices and assumptions of the output expectations [4,5].
To analyze mathematical models, researchers can use additional criteria, such as model simplicity, calculation time, costs, depth level, and others.
The main parameters for the efficiency level of mathematical models in terms of accuracy analysis are standard deviation [6,7], the sum of absolute deviations between the model output and the observed data [8], a weighted sum of squared deviations [9,10], and the maximal deviation [11]. The criterion for these parameters is the minimum value of the estimated parameter [12,13].
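The four accuracy measures listed above can be illustrated with a short Python sketch. The function name `accuracy_measures`, the default of equal weights, and the default divisor (the sample size) are assumptions made here for illustration; they are not taken from the cited works.

```python
import math

def accuracy_measures(observed, predicted, weights=None, dof=None):
    """Return four deviation-based accuracy measures for a fitted model."""
    residuals = [y - f for y, f in zip(observed, predicted)]
    n = len(residuals)
    if weights is None:
        weights = [1.0] * n      # assumption: equal weights unless specified
    if dof is None:
        dof = n                  # assumption: divide by n unless specified
    sum_sq = sum(r * r for r in residuals)
    return {
        "standard_deviation": math.sqrt(sum_sq / dof),
        "sum_abs_deviation": sum(abs(r) for r in residuals),
        "weighted_sum_sq": sum(w * r * r for w, r in zip(weights, residuals)),
        "max_deviation": max(abs(r) for r in residuals),
    }

# Toy data: residuals are 0, -1, and 2
measures = accuracy_measures([1.0, 2.0, 4.0], [1.0, 3.0, 2.0])
print(measures)
```

Whichever measure is chosen, the preferred model is the one with the smallest value.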
This article contains seven sections. The first section discusses the background information for the problems of mathematical model building. The second section presents a literature review regarding the topic of research and presents the statement of the problem. The third section deals with the description of mathematical tools for segmented regression building while using ordinary least squares. The fourth section proposes the step-by-step procedure for accuracy increment during segmented regression usage. The fifth section concentrates on the analysis of the proposed method based on statistical simulations. The sixth section discusses the implementation of the proposed method in real data examples, and the seventh section presents the conclusions.
2. Literature Review and Statement of the Problem
Mathematical model building aims at decreasing the uncertainty level for the objects being studied [14,15]. The analysis of the level, location, and nature of uncertainty helps to obtain more reliable information and adequate knowledge [16,17].
To build mathematical models, researchers use methods from different sciences, such as mathematical analysis, probability theory, data science, regression analysis, mathematical statistics, recognition theory, applied geometry, and others [18].
This article concentrates on the techniques of regression analysis for mathematical model building, so corresponding methods are considered in detail. Regression analysis is used to determine the relationship between two or more variables [19] and is widely used to fit mathematical models to statistical data [20].
Regression analysis is frequently used in various applications due to its relative ease of calculation, high accuracy, and good predictive properties, which depend on the type of approximating function used. Regression analysis is applied to different fields in different capacities, for example, in:
- (1) Medicine: to detect Parkinson’s disease based on the analysis of finger-tapping data [21], to forecast the uptake of oxygen based on genes evaluation and to predict data on patient admission [22], and others;
- (2) Econometrics: to predict the audit opinion using six financial indicators [23], to determine the dependence of economic growth on the level of environmental pollution [24], to describe the trends of economical parameters in correlation with various factors [25,26], and others;
- (3) Transport systems: to determine the optimal periodicity of the implementation of operation processes [27,28], and to analyze possible routes and traffic intensity [29,30,31];
- (4) Aviation: to identify flight conditions and situations based on diagnostic parameter monitoring [32,33], and to predict the human state and decision making depending on various environmental factors [34,35];
- (5) Radar systems: to estimate the efficiency of signal detection [36], to determine the dependence of weather parameters on radar-received signals [37,38,39], and others;
- (6) Navigation systems: to build a mathematical model for the optimal selection of the navigation equipment [40,41,42], to establish the correlation between navigation equipment failures [43], to approximate operational data trends for the prediction of possible aviation events [44], and others;
- (7) Cybersecurity: to evaluate the efficiency of information web-resources functioning [45], to synthesize data-processing algorithms while detecting cyberattacks [46,47,48], to ensure high-level security against cyberattacks [49,50], and others;
- (8) Engineering and control: to describe nonlinear dynamic object behavior [51,52], to build the mathematical model for statistical parameters while designing control systems [53], to make decisions based on statistical information processing [54,55], and others;
- (9) Equipment maintenance: to build the mathematical model for diagnostic variable trends [56], and to determine the uncertainty level while conducting condition monitoring and maintenance preference analysis [57,58];
- (10) Reliability analysis: to describe the behavior of reliability parameters [59,60], to simulate statistically nonstationary random processes of failures occurrence [61,62], to describe the processes of technical condition deterioration in the trend of failure rate [63,64], and others.
Regression analysis usually starts with checking whether a linear regression model can be used. If its accuracy is unsatisfactory, more complicated models are used [65]. These models are nonlinear regression models [66]. Nonlinear regression models use parabolic, hyperbolic, exponential, segmented, and other approximating functions [65,67]. Because nonlinear regression models require complicated calculations, software tools are commonly used to fit them [68].
There are various methods for increasing the accuracy and predictive properties of mathematical models. One approach is to use segmented regression [69,70]. In this case, it is necessary to determine the coordinates of the breakpoint between adjacent segments. This problem can be solved using various algorithms [69,70,71,72,73,74,75]. These algorithms use the maximum likelihood estimator [69,70], Bayesian changepoint models [71,72], the inverted F test [73], the random search method, the method of cumulative sums [74,75], and others. A comparative analysis showed some flaws in the algorithms for determining breakpoint coordinates. These flaws are related to the need for prior constraints, as well as to the quality of the obtained estimate in terms of robustness and bias. Additionally, the discussed algorithms do not yield a single mathematical formula for the breakpoint coordinates and require the iterative numerical method described in [76].
The literature review motivates the authors to synthesize a new approach for calculating the optimal coordinates of breakpoints when using segmented regression to analyze time series with nonstationary behavior. Building a mathematical model based on segmented regression is of considerable importance because:
- (1) Segmented regression makes it possible to obtain a model with greater accuracy.
- (2) Segmented regression more correctly describes the geometrical structure of time series.
- (3) The obtained segmented models have effective predictive properties.
The research gap in the field of mathematical model building is associated with the absence of a step-by-step procedure for determining the optimal segmented regression model in the case of multiple breakpoints in the dataset structure. In practice, such problems are often solved by simple enumeration of the possible options. However, such an approach does not provide mathematical formulations and requires a long computing time.
Therefore, the goal of this article is: (1) to describe the technique of segmented regression building and (2) to obtain mathematical equations for a step-by-step procedure of accuracy increment based on calculating the optimal breakpoint abscissas.
Let us state the research problem mathematically. Let the statistical dataset be given as two arrays $x$ and $y$, each with sample size $n$. Here, $y$ is the dependent or response variable, while $x$ is the independent or predictor variable. The relationship between the variables is determined by the function set $f_j(x, \vec{\theta}_j)$, $j = 1, \dots, m$, where $m$ describes the quantity of the models being fitted to the dataset and $\vec{\theta}_j$ is a vector of $k_j$ parameters for the $j$-th regression model. In this case, the regression model is determined by the equation [65]
$$y_i = f_j(x_i, \vec{\theta}_j) + \varepsilon_i,$$
where $\varepsilon_i$ is an error, which can be described by a normal probability density function. Such an assumption allows the use of ordinary least squares (OLS). For example, in the case of linear regression, $f(x, \vec{\theta}) = a_0 + a_1 x$, where $a_0$ and $a_1$ are coefficients to be estimated.
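As a minimal illustration of OLS for the linear case, the sketch below estimates an intercept and a slope with NumPy. The function name `fit_linear_ols` and the toy data are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def fit_linear_ols(x, y):
    """Estimate intercept and slope of y = a0 + a1 * x by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Design matrix with a column of ones (intercept) and the predictor itself
    design = np.column_stack([np.ones_like(x), x])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs

# Toy data lying exactly on y = 1 + 2x
a0, a1 = fit_linear_ols([0, 1, 2, 3], [1, 3, 5, 7])
print(a0, a1)
```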
This paper focuses on increasing the accuracy of mathematical models based on segmented regression. In this case, the function set depends on the abscissas $x_{br}$ of the breakpoints, where $q$ is the quantity of breakpoints. The accuracy of a model fitted by OLS is usually estimated by the standard deviation σ between the model output and the observed data. The standard deviation depends on the values of the abscissas $x_{br}$ of the breakpoints. Thus, this paper aims to solve the minimization problem that can be formulated as follows:
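In symbols, and assuming the $q$ breakpoint abscissas are denoted $x_{br,1}, \dots, x_{br,q}$ (the subscripts and the hat notation are assumptions made here for illustration), the problem can be sketched as:

```latex
\left(\hat{x}_{br,1}, \dots, \hat{x}_{br,q}\right)
  = \arg\min_{x_{br,1}, \dots, x_{br,q}}
    \sigma\left(x_{br,1}, \dots, x_{br,q}\right)
```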
3. Segmented Regression Models
This section presents the basic mathematical equations for different segmented regression models. Authors mostly employ piecewise linear, linear-quadratic, and quadratic models.
- 1.
Segmented linear regression (SLR)
This regression type is a sequential connection of
straight-line segments without discontinuities. The mathematical model of SLR is given as
where
is the Heaviside function. This function helps to obtain the single mathematical equation for the segmented model.
An example of a mathematical model of three-segmented linear regression has the form
This model has two breakpoints,
and
, and it requires the computation of four unknown coefficients:
,
,
, and
. These coefficients are estimated based on the OLS. The computation result can be presented in the form of matrix equations
where
corresponds to all
greater than
.
- 2.
Segmented quadratic regression (SQR)
This regression type is a sequential connection of
quadratic parabola segments without discontinuities. The mathematical model of SQR is given as
An example of a mathematical model of two-segmented quadratic regression has the form
This model has one breakpoint,
, and it requires the computation of four unknown coefficients:
,
,
, and
. These coefficients are estimated based on the OLS. The computation result can be presented in the form of matrix equations
- 3.
Segmented linear-quadratic regression (SLQR)
This regression type is a sequential connection of
straight lines and quadratic parabola segments without discontinuities. The mathematical model of SLQR is given as
where
is an indicator function. If the segment is a straight line,
. If the segment is a quadratic parabola,
.
An example of a mathematical model of two-segmented linear-quadratic regression has the form
This model has one breakpoint,
, and it requires the computation of three unknown coefficients:
,
, and
. The feature of this model is the equality of adjacent coefficients for the transition between the quadratic parabola segment and the straight-line segment. Thus,
. The coefficients are estimated based on the OLS. The computation result can be presented in the form of matrix equations
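The three model types can be sketched in Python as design matrices built from Heaviside terms and fitted by OLS. The parametrization below (one extra term of the form $(x - x_{br})^p\,h(x - x_{br})$ per breakpoint, and a linear first segment followed by a quadratic one for SLQR) is one common choice that matches the coefficient counts stated above; it is an assumption for illustration, and the names `design_matrix` and `fit_segmented` are hypothetical.

```python
import numpy as np

def heaviside(z):
    """Unit step: 1 where z > 0, otherwise 0."""
    return (np.asarray(z, dtype=float) > 0).astype(float)

def design_matrix(x, breakpoints, model="SLR"):
    """Columns of a continuous segmented model (assumed parametrization)."""
    x = np.asarray(x, dtype=float)
    if model == "SLR":        # straight-line segments
        columns, power = [np.ones_like(x), x], 1
    elif model == "SQR":      # quadratic parabola segments
        columns, power = [np.ones_like(x), x, x**2], 2
    elif model == "SLQR":     # assumed: linear first segment, quadratic afterwards
        columns, power = [np.ones_like(x), x], 2
    else:
        raise ValueError(f"unknown model: {model}")
    for xb in breakpoints:
        # Each term vanishes at the breakpoint, so the curve stays continuous
        columns.append((x - xb) ** power * heaviside(x - xb))
    return np.column_stack(columns)

def fit_segmented(x, y, breakpoints, model="SLR"):
    """OLS estimate of the coefficients for fixed breakpoints."""
    X = design_matrix(x, breakpoints, model)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(y, dtype=float), rcond=None)
    return coeffs, X @ coeffs

# Toy data: slope 2 up to x = 4, slope -1 afterwards
x_demo = np.arange(10.0)
y_demo = 1.0 + 2.0 * x_demo - 3.0 * np.clip(x_demo - 4.0, 0.0, None)
coeffs_demo, fitted_demo = fit_segmented(x_demo, y_demo, [4.0], model="SLR")
print(coeffs_demo)
```

With this parametrization, a three-segment SLR has four coefficients, a two-segment SQR has four, and a two-segment SLQR has three, in agreement with the counts given for the examples in this section.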
4. Step-by-Step Procedure for Accuracy Increment during Segmented Regression Usage
The method of accuracy increment during segmented regression usage is associated with the estimation of breakpoint abscissas. The breakpoint is the point of connection between two neighboring segments.
The step-by-step procedure contains the following operations:
- 1.
Choosing the regression model and the quantity of segments. At this stage, the researcher analyzes the geometrical structure of the observed data presented graphically in the form of the dependence of $y$ on $x$. After that, based on their experience, the researcher must choose one of the models: SLR, SQR, or SLQR. To substantiate the decision to use segmented regression, the researcher can test the initial data for nonlinearity. The geometrical structure of the observed data also makes it possible to choose the quantity of breakpoints $q$.
- 2.
Determining the possible range of values of the breakpoint abscissas. At this stage, the researcher subjectively chooses the discrete range for all breakpoints. The quantity of discrete values for each breakpoint should be at least five. The result of this step is a two-dimensional array $x_{br}$ holding the discrete candidate values for each of the $q$ breakpoints.
- 3.
Building a regression model. At this stage, based on the matrix equations presented in the previous section, the researcher calculates the unknown coefficients for the chosen regression model and all possible values in the array $x_{br}$.
- 4.
Calculating the standard deviations. In the case of OLS usage, the accuracy of the model is determined by the standard deviation between the model output and the observed data, which can be presented as follows:
$$\sigma = \sqrt{\frac{1}{\nu}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},$$
where $n$ is the sample size, $\hat{y}_i$ is the model output for the $i$-th observation, and $\nu$ is the degree of freedom for the chosen regression model.
At this stage, it is necessary to determine the discrete multidimensional dependence of σ on the breakpoint abscissas for all possible values in the array $x_{br}$.
Note that in the case of an alternative regression method (for example, least absolute deviations regression), similar calculations for corresponding accuracy measures should be completed.
- 5.
Approximating the standard deviation dependence on the breakpoint abscissas by multidimensional paraboloid using OLS. The dimension of the paraboloid corresponds to the quantity of breakpoints. It is possible to use one of two types of paraboloid:
- (a)
General:
- (b)
Simplified:
where
,
, and
are approximation coefficients. The simplified paraboloid (5) can be used in case of assumptions about
for the general paraboloid (4).
The coefficients of Equations (4) and (5) are estimated based on OLS. Such a calculation is possible because all of the possible breakpoint values in the two-dimensional array $x_{br}$ are known, and the function values are the standard deviations obtained at the previous step.
Consider the case of a simplified paraboloid. According to OLS, it is necessary to solve the system of equations
Let us simplify the first equation in the system. After derivative calculation, it can be presented as follows:
or
Simplifying the left-hand side of the equation, we get
Taking into account that
the first equation can be presented as follows:
Similar simplifications can be made for other equations in the system. Therefore, the computation result for paraboloid (5) can be presented in the form of matrix equations
- 6.
Calculating the coordinates of the paraboloid optimum. To obtain the minimum standard deviation, it is necessary to determine the coordinates of the minimum of the multidimensional paraboloid. To do this, the partial derivatives are calculated and equated to zero [77]:
For the general paraboloid (4), this system can be presented in the form of a system of $q$ linear equations. For paraboloid (5), the solution of the system is given as
- 7.
Calculating the coefficients of the model for the optimal case. The coefficients of SLR, SQR, or SLQR are computed for the optimal location of the breakpoints using OLS. The final model can be used for the explanation and prediction of the response variable.
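Steps 2 to 7 can be sketched end to end in Python for the SLR model and the simplified paraboloid. This is a sketch under stated assumptions rather than a definitive implementation: the simplified paraboloid is taken to have one linear and one quadratic term per breakpoint and no cross terms, the degree of freedom is taken as the sample size minus the number of coefficients, and the function names are hypothetical.

```python
import itertools
import numpy as np

def slr_sigma(x, y, breakpoints):
    """Fit a continuous piecewise-linear model by OLS; return (coeffs, sigma)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # (x - xb) * h(x - xb) equals max(x - xb, 0)
    hinge = [np.clip(x - xb, 0.0, None) for xb in breakpoints]
    X = np.column_stack([np.ones_like(x), x] + hinge)
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coeffs
    dof = len(x) - X.shape[1]   # assumption: n minus number of coefficients
    return coeffs, float(np.sqrt(resid @ resid / dof))

def optimal_breakpoints(x, y, candidate_ranges):
    """Grid of candidates -> sigma values -> simplified paraboloid -> optimum."""
    # Step 2: every combination of candidate abscissas
    grid = np.array(list(itertools.product(*candidate_ranges)), dtype=float)
    # Steps 3-4: one OLS fit and one standard deviation per combination
    sigmas = np.array([slr_sigma(x, y, bp)[1] for bp in grid])
    # Step 5: fit sigma ~ c0 + sum(b_i * x_i) + sum(a_i * x_i**2) by OLS
    q = grid.shape[1]
    P = np.column_stack([np.ones(len(grid)), grid, grid**2])
    c, *_ = np.linalg.lstsq(P, sigmas, rcond=None)
    linear, quadratic = c[1:1 + q], c[1 + q:]
    # Step 6: stationary point; it is a minimum only if every a_i is positive
    x_opt = -linear / (2.0 * quadratic)
    # Step 7: refit the model at the optimal breakpoints
    coeffs, sigma = slr_sigma(x, y, x_opt)
    return x_opt, coeffs, sigma

# Noise-free toy data with a true breakpoint at x = 10
x_demo = np.arange(0.0, 21.0)
y_demo = np.abs(x_demo - 10.0)
x_opt, coeffs_opt, sigma_opt = optimal_breakpoints(
    x_demo, y_demo, [[6, 8, 10, 12, 14]]
)
print(x_opt, sigma_opt)
```

The same function accepts several candidate ranges, one per breakpoint; the number of OLS fits then grows as the product of the range lengths. If a fitted quadratic coefficient is not positive, the stationary point is not a minimum and the candidate ranges should be reconsidered.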
Consider a simple example of the proposed method. Let us use the small dataset presented in [6]. These data describe the relationship between production lot size $x$ and the average production cost per unit $y$ (in dollars) and are given in Table 1.
Let us apply the step-by-step procedure to these data.
- 1.
To describe the presented data, the SLR model with one breakpoint is chosen.
- 2.
The possible breakpoint abscissas values are . Therefore, in this case $x_{br}$ is a two-dimensional array with size .
- 3.
There are five alternative SLR models for all possible values in the array $x_{br}$:
- 4.
The standard deviations for the obtained SLR models are .
- 5.
Because there is only one breakpoint, the multidimensional paraboloid reduces to a simple parabola. The result of the calculation is
- 6.
The optimal value of the breakpoint abscissa is
- 7.
The optimal SLR model is calculated for the obtained breakpoint abscissa. The final equation is
The standard deviation for the optimal SLR model is 0.313. The result of the model building using the SLR model is shown in Figure 1.
5. Analysis of Proposed Method Based on Statistical Simulation
The analysis of the proposed method is performed using statistical simulation and real data examples. This section presents the statistical simulation results. During the simulation, a dataset with two breakpoints is generated using built-in software functions. The dataset is an additive mixture of a deterministic component and random noise.
Assume that the deterministic component corresponds to an SLR model
The random noise is distributed according to the Gaussian probability density function.
The initial data for the simulation are as follows:
- (1)
Sample size ;
- (2)
Sampling time (for discrete representation of the deterministic component);
- (3)
Predetermined parameters of the SLR model:
,
,
,
,
, and
(such parameters correspond, for example, to the real process of deterioration occurrence when monitoring the values of voltage for the supply of electronic devices [
63]);
- (4)
Predetermined parameters of Gaussian noise: the expected value is equal to zero and the standard deviation is equal to 20 (additionally, it is assumed that the noise values are independent random variables at every sampling instant);
- (5)
The quantity of simulation repetitions .
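A dataset of this kind can be generated with a few lines of Python. In the sketch below, only the zero-mean Gaussian noise with a standard deviation of 20 and the presence of two breakpoints follow the description above; the sample size, sampling time, coefficients, and breakpoint positions are hypothetical placeholders, as is the function name `simulate_slr`.

```python
import numpy as np

def simulate_slr(n, dt, a0, a1, slope_changes, breakpoints, noise_std, rng):
    """Deterministic piecewise-linear trend plus independent Gaussian noise."""
    t = np.arange(n) * dt
    signal = a0 + a1 * t
    for a, tb in zip(slope_changes, breakpoints):
        # Each breakpoint changes the slope by `a` from time `tb` onwards
        signal = signal + a * np.clip(t - tb, 0.0, None)
    noise = rng.normal(0.0, noise_std, size=n)
    return t, signal + noise

rng = np.random.default_rng(seed=0)
# Hypothetical parameter values, chosen only to make the sketch runnable
params = dict(n=100, dt=1.0, a0=100.0, a1=0.0,
              slope_changes=[2.0, 3.0], breakpoints=[30.0, 70.0])
t, y = simulate_slr(noise_std=20.0, rng=rng, **params)
# The same trend without noise, useful for checking the deterministic part
_, y_clean = simulate_slr(noise_std=0.0, rng=rng, **params)
```

Repeating the call with the same generator yields independent realizations, which is what the repeated simulation requires.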
Consider the calculation procedure of the proposed method for one of the generated datasets, which is shown in Table 2. Figure 2 presents three realizations of the generated datasets, marked by circles, triangles, and diamonds, respectively (the circles correspond to the data in Table 2).
To describe the obtained dataset, we choose the SLR model with three segments and two breakpoints. To simplify the calculations, we choose the quantity of discrete values within the range of possible breakpoints to be five. According to the geometrical structure of the observed dataset (Figure 2), the ranges for the two breakpoints are as follows:
The next step is to evaluate the four unknown coefficients of the SLR model for all possible values of the first and second breakpoints using OLS. As a result, 25 alternative SLR models are obtained.
After that, the standard deviations between the model output and the observed data for these SLR models are determined.
Table 3 shows the computation results.
Even visual analysis of the data on the standard deviation (
Table 3) indicates that the minimal standard deviation is located approximately near
and
. To estimate the exact values of breakpoint abscissas, paraboloids (4) and (5) are built using OLS.
After the calculations, the following mathematical equations were obtained:
Figure 3 and Figure 4 show the visual presentation of paraboloids (4) and (5) for this numerical example, respectively.
To determine the optimum coordinates for three-dimensional general paraboloid (4), it is necessary to solve the following system of two linear equations:
In this case, the calculation gives the following solution:
The general paraboloid (4) has a minimum standard deviation at the coordinates
The simplified paraboloid (5) has a minimum standard deviation at the coordinates
The results of the calculation for paraboloids (4) and (5) almost coincide. The relative error for the first and second breakpoint abscissa is equal to 5.558% and 0.5014%, respectively.
After the calculation of the model’s coefficients for the optimal case, the optimal SLR models for paraboloids (4) and (5) are obtained:
The obtained SLR models give almost the same standard deviations equal to 18.429 and 18.424, respectively.
Figure 5 shows the generated dataset and final optimal SLR models. Visual analysis shows the coincidence of both SLR models.
We consider the general simulation results for all iterations. Repeating the simulation provides an opportunity to perform a complete statistical analysis of the breakpoint estimation during mathematical model building. An analysis was performed by plotting histograms and evaluating the numerical characteristics of the random variables.
Figure 6 shows the histograms for the estimates of the two breakpoint abscissas for the different optimization options (general and simplified paraboloids). The parameter λ in Figure 6 is the quantity of breakpoint abscissa estimates that fall within the corresponding grouping interval of the histogram.
Table 4 shows the numerical characteristics of the breakpoint abscissas estimates (mathematical expectation, standard deviation, range of change, and skewness).
To describe the obtained estimates of the breakpoint abscissas completely, it is necessary to fit the histogram with a theoretical probability density function. Approximate assumptions can be made based on the graphical view of the histograms in Figure 6. The shape of the histogram can correspond to the Gaussian probability density function. Such an assumption can be verified using the chi-squared test at a high confidence level.
The breakpoint estimation bias is smaller when the general paraboloid method is used. However, the benefit is negligible, averaging 0.337% compared with the simplified paraboloid method. The highest percentage of estimate bias (in relative values) is 3.012%. In the case of a long-term breakpoint, the simplified paraboloid method has, on average, a narrower range of change of breakpoint estimates.
Let us analyze the proposed method in comparison with the method of simple enumeration. To obtain a breakpoint abscissa estimate bias of approximately 3%, the method of simple enumeration requires at least 33 possible values for each breakpoint. Therefore, in the case of two breakpoints, the computations must be repeated at least 1089 times. At the same time, the proposed method requires 25 iterations and additional calculations of the paraboloid optimum. Therefore, the proposed method reduces the computing time by a factor of at least 30 compared to the method of simple enumeration.
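The count of model fits behind this comparison can be checked with simple arithmetic. The sketch below uses the figures from the text (33 and 5 candidate values per breakpoint, two breakpoints); the variable names are illustrative.

```python
values_enumeration = 33   # candidate values per breakpoint, simple enumeration
values_proposed = 5       # candidate values per breakpoint, proposed method
q = 2                     # quantity of breakpoints

fits_enumeration = values_enumeration ** q
fits_proposed = values_proposed ** q
ratio = fits_enumeration / fits_proposed
print(fits_enumeration, fits_proposed, round(ratio, 2))  # 1089 25 43.56
```

The raw ratio of OLS fits is about 43.6. The more conservative factor of 30 quoted above presumably leaves room for the additional paraboloid calculations, which this count does not include.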
A comparison of the simulation results for a range of initial data shows that SLR models based on the general and simplified paraboloids have approximately the same accuracy characteristics. Therefore, in practical cases, the simplified paraboloid method is more advantageous when creating a segmented regression model because it reduces the amount of computation and the calculation time.
6. Real Data Example
Consider the example of real data on the number of earthquakes with a magnitude of 7 or higher by year, according to the United States Geological Survey [78]. Table 5 presents the corresponding data observed from 1922 to 2021, where $i$ is the observation number, $x$ is the year, and $y$ is the quantity of earthquakes. Figure 7 shows the graphical view of the dataset.
To simplify the presentation and calculations, the first year of observation (1922) is taken as the zero point of the abscissa axis in the subsequent computations. Thus, to return to the original years, it is necessary to add 1922 to the shifted abscissa values.
According to the visual analysis of the dataset, let us assume that there are five breakpoints in this realization. The following are the ranges for these breakpoints:
With five candidate values for each of the five breakpoints, $5^5 = 3125$ different SLR model options are available. The standard deviation is calculated for each case. As a result, a six-dimensional array (five breakpoint abscissas plus the standard deviation) is generated. To approximate the obtained data, OLS is used on a six-dimensional optimization paraboloid. For simplicity, we used a simplified paraboloid as follows:
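In generic form, a simplified paraboloid for five breakpoints has one linear and one quadratic term per breakpoint and no cross terms. The coefficient names $c_0$, $b_i$, and $a_i$ and the subscripted abscissas $x_{br,i}$ below are notation assumed for illustration:

```latex
\sigma\left(x_{br,1}, \dots, x_{br,5}\right) \approx
  c_0 + \sum_{i=1}^{5} \left( b_i\, x_{br,i} + a_i\, x_{br,i}^{2} \right)
```

Under this form, 11 coefficients are estimated from the 3125 computed standard deviations.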
This simplified paraboloid has optimum standard deviation at the coordinates
After the calculation of the model’s coefficients for the optimal case of breakpoint locations using OLS, the final SLR model is obtained:
The standard deviation for the obtained SLR model is equal to 3.799.
Figure 8 shows the observed dataset and the final optimal SLR model.
The method of simple enumeration for a given dataset gives approximately the same result as that shown in Figure 8. However, this method approximately doubles the computing time. Polynomial regression using a seventh-order polynomial is characterized by a faster computation time; however, it gives unacceptable predictive properties.
The results of the mathematical model building can be used for solving prediction problems. Consider this problem for the observed dataset based on generally known results that have been extensively described in the literature (for example, [76,77]) and innovative methods that may be used in accordance with the properties of segmented regression models.
To predict the future trend, let us determine the range of the SLR model change. For this purpose, we used a straight line and OLS to approximate the upper and lower ordinates of the breakpoints. The lower line contains the zero point, and the second and fourth breakpoints. The upper line contains the first, third, and fifth breakpoints. The numerical values of the calculated equations are
The last segment of the SLR model is continued to the intersection point with the lower straight line.
Figure 9 shows the visual representation of the trend prediction.
This method of prediction and the obtained SLR model allow us to anticipate that, through 2042, the average annual number of earthquakes with a magnitude of 7 or higher would decrease.
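The geometric part of this prediction (an OLS straight line through the lower breakpoints and its intersection with the extended last segment) can be sketched as follows. All numerical values below are hypothetical and are not the values obtained for the earthquake data; the function names are also illustrative.

```python
import numpy as np

def fit_line(points):
    """OLS straight line through (x, y) points; returns (slope, intercept)."""
    xs, ys = zip(*points)
    slope, intercept = np.polyfit(xs, ys, 1)
    return slope, intercept

def intersect(line_a, line_b):
    """Intersection of two lines given as (slope, intercept) pairs."""
    (k1, b1), (k2, b2) = line_a, line_b
    if k1 == k2:
        raise ValueError("parallel lines do not intersect")
    x = (b2 - b1) / (k1 - k2)
    return x, k1 * x + b1

# Hypothetical lower breakpoints (shifted year, model ordinate)
lower_line = fit_line([(0.0, 10.0), (40.0, 8.0), (80.0, 6.0)])
# Hypothetical last segment of the SLR model: y = 60 - 0.5 * x
last_segment = (-0.5, 60.0)
x_cross, y_cross = intersect(lower_line, last_segment)
print(x_cross, y_cross)
```

The abscissa of the intersection marks how far the last segment can be extended before it leaves the band between the upper and lower lines.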
In general, the proposed method can be applied to different datasets to determine breakpoints by means of multidimensional optimization.
7. Conclusions
This article presents a method of accuracy increment when segmented regression is used. The main problem for segmented regression model building is the estimation of the coordinates of the breakpoint between adjacent segments. To solve this problem, two types of multidimensional optimization paraboloids are used. The paraboloids contain information on standard deviations between the model output and the observed data for different sets of possible values of breakpoint abscissas. The minimum standard deviation of each paraboloid coincides with the optimal position of the breakpoints.
A step-by-step procedure for the proposed method was described by examples based on statistical simulation and real data observation.
Generally, the use of SLR, SQR, and SLQR models provides a mathematical model with high accuracy, more accurately describes the geometrical structure of the analyzed dataset, and has good predictive properties.
The results of this research can be used during mathematical model building for statistical data obtained in various branches of human activity.