Article

Relaxation Subgradient Algorithms with Machine Learning Procedures †

1
Department of Applied Mathematics, Kemerovo State University, Krasnaya Street 6, Kemerovo 650043, Russia
2
Institute of Informatics and Telecommunications, Reshetnev Siberian State University of Science and Technology, Prosp. Krasnoyarskiy Rabochiy 31, Krasnoyarsk 660031, Russia
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in Mathematical Optimization Theory and Operations Research MOTOR 2021 Conference, Irkutsk, Russia, 5–10 July 2021; pp. 477–492.
Mathematics 2022, 10(21), 3959; https://doi.org/10.3390/math10213959
Submission received: 27 September 2022 / Revised: 19 October 2022 / Accepted: 19 October 2022 / Published: 25 October 2022
(This article belongs to the Special Issue Applied and Computational Mathematics for Digital Environments)

Abstract:
In the modern digital economy, optimal decision support systems, as well as machine learning systems, are becoming an integral part of production processes. Artificial neural network training as well as other engineering problems generate such problems of high dimension that are difficult to solve with traditional gradient or conjugate gradient methods. Relaxation subgradient minimization methods (RSMMs) construct a descent direction that forms an obtuse angle with all subgradients of the current minimum neighborhood, which reduces to the problem of solving systems of inequalities. Having formalized the model and taking into account the specific features of subgradient sets, we reduced the problem of solving a system of inequalities to an approximation problem and obtained an efficient rapidly converging iterative learning algorithm for finding the direction of descent, conceptually similar to the iterative least squares method. The new algorithm is theoretically substantiated, and an estimate of its convergence rate is obtained depending on the parameters of the subgradient set. On this basis, we have developed and substantiated a new RSMM, which has the properties of the conjugate gradient method on quadratic functions. We have developed a practically realizable version of the minimization algorithm that uses a rough one-dimensional search. A computational experiment on complex functions in a space of high dimension confirms the effectiveness of the proposed algorithm. In the problems of training neural network models, where it is required to remove insignificant variables or neurons using methods such as the Tibshirani LASSO, our new algorithm outperforms known methods.

1. Introduction

In this study, which is an extension of previous work [1], the problem of minimizing a convex, not necessarily differentiable function f(x), x ∈ R^n (where R^n is a finite-dimensional Euclidean space) is considered. Such a problem is quite common in the field of machine learning (ML), where optimization methods, in particular gradient descent, are widely used to minimize the loss function during the training stage. In the era of the digital economy, such functions arise in many engineering applications. First of all, training and regularizing artificial neural networks of a simple structure (e.g., radial or sigmoidal) may lead to the minimization of a loss function in a high-dimensional space, which is often non-smooth. When working with more complex networks, such functions can also be non-convex.
While a number of efficient machine learning tools exist to learn smooth functions with high accuracy from a finite data sample, the accuracy of these approaches becomes less satisfactory for nonsmooth objective functions [2]. In a machine learning context, it is quite common to have an objective function with a penalty term that is non-smooth such as the Lasso [3] or Elastic Net [4] linear regression models. Common loss functions, such as the hinge loss [5] for binary classification, or more advanced loss functions, such as the one arising in classification with a reject option, are also nonsmooth [6], as well as some widely used activation functions (ReLU) [7] in the field of deep learning. Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex. However, the absence of differentiability creates serious theoretical difficulties on different levels (optimality conditions, definition and computation of search directions, etc.) [8].
Modern literature offers two main approaches to building nonsmooth optimization methods. The first is based on creating smooth approximations for nonsmooth functions [9,10,11,12,13]. On this basis, various methods intended for solving convex optimization problems, problems of composite and stochastic composite optimization [9,10,11,14] were theoretically substantiated.
The second approach is based on subgradient methods that have their origins in the works of N. Z. Shor [15] and B.T. Polyak [16], the results of which can be found in [17]. Initially, relaxation subgradient minimization methods (RSMMs) were considered in [18,19,20]. They were later developed into a number of effective approaches such as the subgradient method with space dilation in the subgradient direction [21,22] that involve relaxation by distance to the extremum [17,23,24]. The idea of space dilation is to change the metric of the space at each iteration with a linear transformation and to use the direction opposite to that of the subgradient in the space with the transformed metric.
Embedding the ideas of machine learning theory [25] into such optimization methods made it possible to identify the principles of organizing RSMM with space dilation [26,27,28,29]. The problem of finding the descent direction in the RSMM can be reduced to the problem of solving a system of inequalities on subgradient sets, mathematically formulated as a problem of minimizing a quality functional. This means that a new learning algorithm is embedded into the basis of some new RSMM algorithm. Thus, the convergence rate of the minimization method is determined by the properties of the learning algorithm.
Rigorous investigation of the approximation capabilities of various neural networks has received much research interest [30,31,32,33,34,35,36] and is widely applied to problems of system identification, signal processing, control, pattern recognition and many others [37]. Due to the universal approximation theorem [30], a feedforward network with a single hidden layer and a sigmoidal activation function can approximate arbitrarily well any continuous function on a compact set [38]. Studies of learning theory on unbounded sets can be found in [39,40,41].
The stability of the neural network solutions can be improved by introducing a regularizing term in the minimized functional [42], which stabilizes the solution using some auxiliary non-negative function carrying information on the solution obtained earlier (a priori information). The most common form of a priori information is the assumption of the function smoothness in the sense that the same input signal corresponds to the same output. Commonly used regularization types include:
1. Quadratic Tikhonov regularization (or ridge regression, R_2). In the case of approximation by a linear model, the Tikhonov regularizer [42] is used:
R_2(U) = \sum_{i=1}^{k} u_i^2,
where u_i are the parameters of the linear part of the model and k is the number of components of the vector U. The regularizer R_2 is mainly used to suppress large components of the vector U to prevent overfitting of the model.
2. Modular linear regularization (Tibshirani LASSO, R_1). The regularizer proposed in [3] is mainly used to suppress large and small components of the vector U:
R_1(U) = \sum_{i=1}^{k} |u_i|.
3. Non-smooth regularization (R_γ) [3,43]:
R_\gamma(U) = \sum_{i=1}^{k} (|u_i| + \varepsilon)^{\gamma}, \quad \varepsilon = 10^{-6}, \quad \gamma = 0.7.
The use of R_γ leads to the suppression of small (“weak”) components of the vector U. This property of R_γ enables us to reduce to zero weak components that are not essential for the description of the data.
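For illustration, the three regularizers can be evaluated directly from the parameter vector; a minimal NumPy sketch (the vector u and the constants below are placeholders, not values from this work):

```python
import numpy as np

def r2(u):
    """Quadratic (ridge) regularizer R_2(U) = sum_i u_i^2."""
    return np.sum(u ** 2)

def r1(u):
    """Modular (LASSO) regularizer R_1(U) = sum_i |u_i|."""
    return np.sum(np.abs(u))

def r_gamma(u, eps=1e-6, gamma=0.7):
    """Non-smooth regularizer R_gamma(U) = sum_i (|u_i| + eps)^gamma."""
    return np.sum((np.abs(u) + eps) ** gamma)

u = np.array([2.0, -0.5, 1e-4, 0.0])   # illustrative parameter vector
print(r2(u), r1(u), r_gamma(u))
```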
The aim of our work is to outline an approach to accelerating the convergence of learning algorithms in RSMM with space dilation and to give an example of the implementation of such an algorithm, confirming the effectiveness of theoretical constructions.
In the RSMM, successive approximations [18,19,20,26,28,44] are:
x_{k+1} = x_k - \gamma_k s_{k+1}, \quad \gamma_k = \arg\min_{\gamma} f(x_k - \gamma s_{k+1}),
where k is the iteration number, γ_k is the stepsize, x_0 is a given starting point, and the descent direction s_{k+1} is a solution of a system of inequalities on s ∈ R^n [20]:
(s, g) > 0, \quad \forall g \in G.
Hereinafter, (s, g) is the dot product of vectors, and G is a set of subgradients calculated on the descent trajectory of the algorithm at a point x_k.
Denote by S(G) the set of solutions to inequality (5) and by ∂f(x) the subgradient set at point x. If the function is convex, G = ∂_ε f(x_k) is the ε-subgradient set at point x_k, and s_{k+1} is an arbitrary solution of system (5), then the function will be reduced by at least ε after iteration (4) [20].
Since there is no explicit specification of ε-subgradient sets, the subgradients g_k ∈ ∂f(x_k) calculated on the descent trajectory of the minimization algorithm are used as elements of the set G. These vectors must satisfy the condition:
(s_k, g_k) \leq 0.
Inequality (6) means that for the vectors used, condition (5) is not satisfied. The choice of learning vectors is made according to this principle in the perceptron method [25,45], for instance.
A sequence of vectors g_k ∈ ∂f(x_k), k = 0, 1, …, is not predetermined but is determined during minimization (4), with the vector s_{k+1} found at each iteration of the minimization by a built-in ML algorithm.
Let vector s k + 1 be a solution of the system of inequalities (5) for the subgradient set of some neighborhood of the current minimum x k . Then, as a result of iteration (4), we go beyond this neighborhood with a simultaneous function decrease, since vector s k + 1 forms an acute angle with each of the subgradients of the set.
In this work, we present a formalized model of subgradient sets, which enables us, taking into account their specificity, to formulate stronger learning relations and quality functionals, which leads to acceleration in the convergence rate of learning algorithms designed to form the direction of descent in the RSMM.
As a result of theoretical analysis, an effective learning algorithm has been developed. For the proposed ML algorithm, the convergence in a finite number of iterations is proved when solving problem (5) on separable sets. Based on the learning algorithm, we proposed a method for minimizing nonsmooth functions. Its convergence on convex functions was substantiated. It is shown that on quadratic functions, the minimization method generates a sequence of minimum approximations identical to the sequence of the conjugate gradient method. We also proposed a minimization algorithm with a specific one-dimensional search method. A computational experiment confirmed its effectiveness for problems of neural network approximations, where the technology of uninformative model component suppression with regularizers similar to the Tibshirani LASSO was used [46]. The result of our work is an optimization algorithm applicable for solving neuron network regularization and other machine learning problems which, in turn, contains an embedded machine learning algorithm for finding the most promising descent direction. In the problems used for comparison, the objective function forms an elongated multidimensional ravine. As with other subgradient algorithms, our new algorithm proved to be efficient for such problems. Moreover, the new algorithm outperforms known relaxation subgradient and quasi-Newtonian algorithms.
The rest of this article is organized as follows. In Section 2, we consider the problem of accelerating the convergence of learning algorithms in relaxation subgradient methods. In Section 3, we make assumptions regarding the parameters of subgradient sets that affect the convergence rate of the proposed learning algorithm. In Section 4, we formulate and justify a machine learning algorithm for finding the descent direction in the subgradient method. In Section 5, we give a description of the minimization method. In Section 6, we establish the identity of the sequences generated by the conjugate gradient method and the relaxation subgradient method with space dilation on the minimization of quadratic functions. In Section 7 and Section 8, we consider a one-dimensional minimization algorithm and its implementation, respectively. In Section 9 and Section 10, we present experimental results for the considered algorithms. In Section 9, we show a computational experiment on complex large-sized function minimization. In Section 10, we consider experiments with training neural network models, where it is required to remove insignificant neurons. Section 11 contains a summary of the work.

2. Acceleration of the Learning Algorithm’s Convergence in Relaxation Subgradient Methods

To use more efficient learning algorithms, a relation stronger than (5) can be written for the descent direction. We make an additional assumption about the properties of the set G.
Assumption 1. 
Let a convex set G ⊂ R^n belong to a hyperplane; its minimal length vector η is also the minimal length vector of this hyperplane. Then, a solution of the system (s, g) = 1, ∀g ∈ G, is also a solution of (5) [26]. It can be found as a solution to a system of equations using a sequence of vectors from G [26]:
(s, g_i) = q_i = 1, \quad g_i \in G, \quad i = 0, 1, \ldots, k.
It is easy to see that the solution to system (7) in s is the vector s = η/‖η‖². Figure 1 shows the subgradient set in the form of a segment lying on a straight line. Equalities (7) can be solved by a least squares method. For example, using the quadratic quality functional
Q_k(s) = \frac{1}{2}(1 - (s, g_k))^2,
it is possible to implement a gradient machine learning algorithm,
s_{k+1} = s_k - \beta_k \nabla Q_k(s_k),
where β k is a gradient method step. Hence, when choosing β k = 1 / ( g k , g k ) , we obtain the well-known Kaczmarz algorithm [47],
s_{k+1} = s_k + \frac{1 - (s_k, g_k)}{(g_k, g_k)} g_k.
The found direction s_{k+1} satisfies the learning equality (s_{k+1}, g_k) = 1. To ensure the possibility of decreasing the function as a result of iteration (4), the new descent direction in the minimization method must be consistent with the current subgradient, i.e., it must satisfy the inequality (s_{k+1}, g_k) > 0. Process (8) corresponds to this condition.
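A minimal NumPy sketch of learning step (8), applied to an illustrative two-vector set G until condition (5) holds (the vectors are placeholders):

```python
import numpy as np

def kaczmarz_step(s, g):
    """One learning step (8): after the update, (s_new, g) = 1."""
    return s + g * (1.0 - s @ g) / (g @ g)

# illustrative subgradient set G (rows); the goal is (s, g) > 0 for all g in G
G = np.array([[1.0, 3.0], [1.0, -3.0]])
s = np.zeros(2)
for _ in range(50):
    viol = [g for g in G if s @ g <= 0]          # condition (6)
    if not viol:
        break
    s = kaczmarz_step(s, viol[0])
print(s, G @ s)   # all entries of G @ s are now positive
```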
To propose a faster algorithm, consider the interpretation of process (8). Let Assumption 1 be fulfilled. The step of process (8) is equivalent to the step of one-dimensional minimization of the function
E(s) = (s^* - s, s^* - s)/2
from point s_k in the direction g_k. Let the current approximation s_k be obtained using a vector g_{k−1} and satisfy the condition (s_k, g_{k−1}) = 1. Figure 2 shows the projections of the current approximation s_k and the required vector s* on the plane of the vectors g_{k−1}, g_k. Straight lines W_1 and Z_1 are the projections of the hyperplanes for vectors s given by the equalities (s, g_{k−1}) = 1 and (s, g_k) = 1. Vector s_{k+1}^1 is the projection of the approximation obtained from iteration (8).
If (g_k, g_{k−1}) ≤ 0, then the angle between the subgradients is obtuse, and the angle φ is acute (Figure 2). In this case, it is possible to completely extinguish the projection of the residual between s_k and s* in the plane of the vectors g_{k−1}, g_k, passing from point A to point C along the vector AC, perpendicular to the vector g_{k−1}, i.e., along the vector p_k:
p_k = g_k - g_{k-1} \frac{(g_k, g_{k-1})}{(g_{k-1}, g_{k-1})}.
In this case, the iteration has the form
s_{k+1} = s_k + p_k \frac{1 - (s_k, g_k)}{(g_k, p_k)}.
In Figure 2, this vector is denoted as s_{k+1}^2. The vector s_{k+1} obtained by formula (10) satisfies the equalities (s_{k+1}, g_{k−1}) = 1, (s_{k+1}, g_k) = 1 and coincides with the projection of the optimum s* of the function E(s). At small angles φ between the straight lines W_1 and Z_1, the acceleration of convergence due to process (10) becomes essential. In this work, process (10) will be used to accelerate convergence in the metric of the iterative least squares method.
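A minimal sketch of correction (9), (10) on an illustrative pair of subgradients; after the step, both learning equalities are satisfied:

```python
import numpy as np

def pair_correction(s, g_prev, g_cur):
    """Correction (9), (10): move orthogonally to g_prev so that
    (s_new, g_prev) stays equal to 1 and (s_new, g_cur) becomes 1."""
    p = g_cur - g_prev * (g_cur @ g_prev) / (g_prev @ g_prev)   # (9)
    return s + p * (1.0 - s @ g_cur) / (g_cur @ p)              # (10)

g_prev = np.array([1.0, 3.0])
g_cur  = np.array([1.0, -3.0])                  # (g_cur, g_prev) <= 0
s = g_prev / (g_prev @ g_prev)                  # satisfies (s, g_prev) = 1
s_new = pair_correction(s, g_prev, g_cur)
print(s_new @ g_prev, s_new @ g_cur)            # both equal to 1
```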
Using learning information (7), one of the possible solutions to the system of inequalities (5) can be found in the form s_{k+1} = arg min_s F_k(s), where
F_k(s) = \sum_{i=0}^{k} w_i Q_i(s) + \frac{1}{2} \sum_{i=1}^{n} s_i^2, \quad Q_i(s) = \frac{1}{2}(q_i - (s, g_i))^2.
Such a solution can be obtained by the iterative least squares method (ILS). With weight factors w i = 1 , after the arrival of new data q k , g k , the transition from the previously found solution s k to a new solution s k + 1 in ILS is made as follows:
s_{k+1} = s_k + \frac{H_k g_k (q_k - (s_k, g_k))}{1 + (H_k g_k, g_k)}, \quad s_0 = 0,
H_{k+1} = H_k - \frac{H_k g_k g_k^T H_k^T}{1 + (g_k, H_k g_k)}, \quad H_0 = I.
Note that, in contrast to the usual ILS, there is a regularizing component \frac{1}{2}\sum_{i=1}^{n} s_i^2 in F_k(s), which allows us to use transformations (11) and (12) from the initial iteration, setting s_0 = 0 and H_0 = I.
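A minimal sketch of the regularized ILS recursion (11), (12) with unit weights, checked against the direct minimizer of F_k(s) on random illustrative data:

```python
import numpy as np

def ils_step(s, H, g, q):
    """One step of (11), (12) with weight 1."""
    Hg = H @ g
    s_new = s + Hg * (q - s @ g) / (1.0 + g @ Hg)
    H_new = H - np.outer(Hg, Hg) / (1.0 + g @ Hg)
    return s_new, H_new

rng = np.random.default_rng(0)
n, k = 3, 5
G = rng.normal(size=(k, n))          # vectors g_i
q = np.ones(k)                       # right-hand sides q_i = 1
s, H = np.zeros(n), np.eye(n)
for gi, qi in zip(G, q):
    s, H = ils_step(s, H, gi, qi)

# direct minimizer of F_k(s) = sum_i (q_i - (s, g_i))^2 / 2 + ||s||^2 / 2
s_direct = np.linalg.solve(np.eye(n) + G.T @ G, G.T @ q)
print(np.allclose(s, s_direct))      # True
```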
In [26], based on ILS (11) and (12), an iterative process is proposed for solving the system of inequalities (5) using learning information (7):
s_{k+1} = s_k + \frac{H_k g_k [1 - (s_k, g_k)]}{(g_k, H_k g_k)}, \quad s_0 = 0,
H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right) \frac{H_k g_k g_k^T H_k^T}{(g_k, H_k g_k)}, \quad H_0 = I.
Here, α k > 1 is a space dilation parameter.
Consider the rationale for the method of obtaining formulas (13) and (14). Using processes (11) and (12) for scaled data, we obtain
\hat{g}_k = g_k [q (g_k, H_k g_k)]^{-0.5}, \quad \hat{q}_k = q_k [q (g_k, H_k g_k)]^{-0.5},
where the scaling factor q > 0. The latter is equivalent to introducing the weight factors w_k = 1/[q (g_k, H_k g_k)] in F(s). Then, after returning to the original data g_k, q_k, we obtain the expressions:
s_{k+1} = s_k + \frac{H_k g_k [q_k - (s_k, g_k)]}{(1 + q)(g_k, H_k g_k)}, \quad s_0 = 0,
H_{k+1} = H_k - \frac{H_k g_k g_k^T H_k^T}{(1 + q)(g_k, H_k g_k)}, \quad H_0 = I.
The transformation of matrices (16) is practically equivalent to (14) after an appropriate choice of the parameter q. For transformation (15), the condition (s_{k+1}, g_k) > 0, which ensures the possibility of decreasing the function in the course of iteration (4) along the direction s_{k+1}, may not be satisfied. Therefore, transformation (13) is used with q_k = 1, which ensures the equality (s_{k+1}, g_k) = 1 > 0. Transformation (13) can be interpreted as the Kaczmarz algorithm in the corresponding metric. As a result, we obtain processes (13) and (14). Methods [18,19,20] of the class under consideration possess the properties of the conjugate gradient method. The noted properties expand the area of effective application of such methods.
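A minimal sketch of learning process (13), (14) with a fixed dilation parameter α on an illustrative subgradient set:

```python
import numpy as np

def learn_step(s, H, g, alpha):
    """Kaczmarz step in the metric H, i.e., (13), followed by space dilation (14)."""
    Hg = H @ g
    s_new = s + Hg * (1.0 - s @ g) / (g @ Hg)
    H_new = H - (1.0 - 1.0 / alpha ** 2) * np.outer(Hg, Hg) / (g @ Hg)
    return s_new, H_new

G = np.array([[1.0, 3.0], [1.0, -3.0]])     # illustrative subgradient set
s, H, alpha = np.zeros(2), np.eye(2), 2.0
for _ in range(20):
    viol = [g for g in G if s @ g <= 0]      # condition (6)
    if not viol:
        break
    s, H = learn_step(s, H, viol[0], alpha)
print(s, G @ s)    # a solution of system (5): all products positive
```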
The higher the convergence rate of processes (13) and (14), the greater the admissible value of α_k in (14) [26], which depends on the characteristics of the set G. In algorithms (13) and (14), we distinguish two stages: the correction stage (13), reducing the residual between the optimal solution s* and the current approximation s_k, and the space dilation stage (14), resulting in an increase in the residual in the transformed space without exceeding its initial value, which limits the magnitude of the space dilation parameter. To create more efficient algorithms for solving systems of inequalities, we have to choose the direction of correction in such a way that the reduction of the residual is greater than that of process (13). The direction of space dilation should be chosen so that this choice makes it possible to increase the value of the space dilation parameter.
This paper presents one of the special cases of the implementation of the correction stage and space dilation stage. It was proposed to use linear combinations of vectors g k 1 , g k in transformation (13) instead of a vector g k when it is appropriate:
p_k = g_k - g_{k-1} \frac{(g_k, H_k g_{k-1})}{(g_{k-1}, H_k g_{k-1})},
s_{k+1} = s_k + H_k p_k \frac{1 - (s_k, g_k)}{(g_k, H_k p_k)}.
Transformations (17) and (18) are similar to the previously discussed transformations (9) and (10) carried out in the transformed space.
In the matrix transformation (14), instead of the vector g_k, we use the vector y_k = g_k − g_{k−1}, so that
H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right) \frac{H_k y_k y_k^T H_k^T}{(y_k, H_k y_k)}.
As shown below, the discrepancy between the optimal solution s* and the current approximation s_{k+1} along the vector y_k is small, which makes it possible to use large values of the space dilation parameter α_k in (19). Iterations (18) and (19) are conducted under the condition:
(g_k, H_k g_{k-1}) \leq 0.
In the next section, assumptions will be made regarding the parameters of subgradient sets that affect the convergence rate of the proposed learning algorithm and determine the permissible parameters of space dilation. This allows us to formulate and justify a machine learning algorithm for finding the descent direction in the subgradient method. Note that the described parameterization of the sets does not impose any restrictions on the subgradient sets but is used only for the purpose of developing constraints on the learning algorithm parameters.

3. Formalization of the Separable Sets Model

In this section, we will use the following notation (as earlier, vector η is the shortest vector from G):
(1)
ρ = ‖η‖ is the length of the minimal vector of the set;
(2)
R = max_{g∈G} ‖g‖ is the length of the maximal vector of the set;
(3)
μ = η/ρ is the normalized vector η;
(4)
s* = μ/ρ is a vector associated with the sought solution of systems (5) and (7) when analyzing the ML algorithm;
(5)
R_s = max_{g∈G} (μ, g) is an upper-bound value of the set G in the direction μ;
(6)
M = R_s/ρ is the ratio of the upper and lower bounds of the set along μ, r = ρ/R_s = M^{-1};
(7)
V = ρ/R is the ratio of the minimal and maximal vectors of the set.
For some set Q, we will use the noted characteristics indicating the set as an argument; for example, η ( Q ) , r ( Q ) .
We assume that set G satisfies the assumption:
Assumption 2 
([1]). Set G is convex, closed, bounded (R < ∞) and satisfies the separability condition, i.e., ρ > 0.
Vector s* is a solution to the system of inequalities (5), and ρ and R_s describe the thickness of the set G in the direction μ:
\rho \leq (\mu, g) \leq R_s, \quad \forall g \in G.
Due to (21) and the form of s*:
1 \leq (s^*, g) \leq R_s/\rho = M, \quad \forall g \in G.
R s , according to its definition, satisfies the constraints:
\rho \leq R_s \leq \|\mu\| \max_{g \in G} \|g\| \leq R.
Figure 3 shows a separable set G and its characteristics with boundaries (22). The characteristic R_s determines the thickness of the set G and significantly affects the convergence rate of learning algorithms with space dilation. When the thickness of the set is equal to zero, R_s = ρ and the set is flat (Figure 1).
For example, consider the function
f(x) = \sum_{i=1}^{n} |x_i| a_i, \quad a_i \geq 0, \; i = 1, \ldots, n, \quad x_0 = (b_1, \ldots, b_m, 0, \ldots, 0).
The point x_0 is located at the bottom of a multidimensional ravine. Let us study the subgradient set of the function at the point x_0. As earlier, g(x_0) ∈ ∂f(x_0) denotes a subgradient of the function. The components of the subgradient vectors at non-zero values x_i are g_i(x) = sign(x_i) a_i. For zero x_i, the components of the subgradient vectors belong to the set g_i(x) ∈ [−a_i, a_i], which contains the zero component g_i(x) = 0. Hence, it follows that the subgradient of minimal length of function (23) at the point x_0 has the form
\eta = g_{\min}^0 = (\mathrm{sign}(b_1) a_1, \ldots, \mathrm{sign}(b_m) a_m, 0, \ldots, 0).
Maximum length subgradients are specified by the set of subgradients
g_{\max}^0 \in G = \{(\mathrm{sign}(b_1) a_1, \ldots, \mathrm{sign}(b_m) a_m, \pm a_{m+1}, \ldots, \pm a_n)\}.
It is easy to verify that the projections of arbitrary subgradients at the point x_0 onto the vector g_{\min}^0 are the same. Therefore, the thickness of the subgradient set is zero. In a sufficiently small neighborhood of the point x_0, the union of the subgradient sets of function (23) coincides with the subgradient set at the point x_0. Consequently, the descent direction in the form of a solution to the system of inequalities (7) enables us to go beyond this neighborhood.
Figure 4 shows a representation of the subgradient set of a two-dimensional piecewise linear function
f(x) = |x_1| + |x_2| a, \quad a > 1, \quad x_0 = (b_1, 0).
In the quadrants, the function has the form
L_1(x) = -x_1 + x_2 a, \quad L_2(x) = -x_1 - x_2 a, \quad L_3(x) = x_1 - x_2 a, \quad L_4(x) = x_1 + x_2 a.
Its subgradients at the point x_0 are given as follows:
g_1(x_0) = \nabla L_3(x_0) = (1, -a)^T, \quad g_2(x_0) = \nabla L_4(x_0) = (1, a)^T.
The minimum length vector is η = (1, 0)^T.
For large values of the parameter a in (24), the complexity of solving the minimization problem increases significantly.

4. Machine Learning Algorithm with Space Dilation for Solving Systems of Inequalities on Separable Sets

In this section, we briefly introduce an algorithm for solving systems of inequalities from [1] and theoretically justify iterations (18) and (19). Specific operators will be used for transformations (13), (14) and (17)–(19). Denote by S(s, g, p, H) the operator of transformation (18), whose arguments correspond to the eponymous components. Then, for example, Formula (13) can be represented as s_{k+1} = S(s_k, g_k, g_k, H_k). Similarly, for (14) and (19), we introduce the operator H(H, α, g); Formula (19) can be represented as H_{k+1} = H(H_k, α_k, y_k).
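As a reference for the notation used below, the two operators can be written as short functions; a hedged NumPy sketch:

```python
import numpy as np

def S_op(s, g, p, H):
    """Operator of transformation (18): correction of s along H p."""
    Hp = H @ p
    return s + Hp * (1.0 - s @ g) / (g @ Hp)

def H_op(H, alpha, z):
    """Operator of transformations (14)/(19): dilation along z (z = g_k or y_k)."""
    Hz = H @ z
    return H - (1.0 - 1.0 / alpha ** 2) * np.outer(Hz, Hz) / (z @ Hz)

# Formula (13) is then the special case s_{k+1} = S_op(s_k, g_k, g_k, H_k),
# and (19) is H_{k+1} = H_op(H_k, alpha_k, y_k) with y_k = g_k - g_{k-1}.
```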
For a chain of approximations s_k, we form the residual vector Δ_k = s* − s_k. As long as vector s_k is not a solution to (5), for the vectors g_k selected at step 2 of Algorithm 1, from (6) and (22), the following inequality holds:
(\Delta_k, g_k) = (s^* - s_k, g_k) = (s^*, g_k) - (s_k, g_k) \geq 1 - (s_k, g_k) \geq 1.
The transformation equation for the matrix A_k is as follows [1]:
A_{k+1} = A_k + (\alpha_k^2 - 1) \frac{g_k g_k^T}{(g_k, H_k g_k)},
A_{k+1} = A_k + (\alpha_k^2 - 1) \frac{y_k y_k^T}{(y_k, H_k y_k)},
where A_k = H_k^{-1}. For the vectors s_k and g_k of Algorithm 1:
1 \leq (\Delta_k, g_k)^2 = (\Delta_k, A_k^{1/2} H_k^{1/2} g_k)^2 \leq (\Delta_k, A_k \Delta_k)(g_k, H_k g_k),
where A^{1/2} A^{1/2} = A, and A > 0 is a symmetric, positive definite matrix.
Algorithm 1 Method for solving systems of inequalities.
1:
Set k = 0, s_0 = 0, g_{−1} = 0, H_0 = I. Set α > 1 as the limit for choosing the admissible value of the parameter α_k for transformations (13) and (14)
2:
Find g_k ∈ G satisfying the condition (6): (s_k, g_k) ≤ 0
3:
If such a vector does not exist, then
  • solution s_k ∈ S(G) is found; stop the algorithm.
 end if
4:
If k = 0 or condition (20), (g_k, H_k g_{k−1}) ≤ 0, is not satisfied, then
  • go to step 7
 end if
5:
Compute the vector p_k = g_k − g_{k−1}(g_k, H_k g_{k−1})/(g_{k−1}, H_k g_{k−1}) and perform transformation (18): s_{k+1} = S(s_k, g_k, p_k, H_k). Compute the limit α_yk of the admissible values of the space dilation parameter for the combination of transformations (18) and (19)
6:
If α_yk² ≥ α², then
  • set α_k satisfying the inequalities α² ≤ α_k² ≤ α_yk² and perform transformation (19)
  • H k + 1 = H ( H k , α k , y k ) ,
  •  else
    • compute the limit α_gk of the admissible values of the space dilation parameter for the combination of transformations (18), (14); set α_k satisfying the inequalities α² ≤ α_k² ≤ α_gk² and perform transformation (14): H_{k+1} = H(H_k, α_k, g_k). Go to step 8
  •  end if
7:
Set α k 2 = α 2 and perform transformations (13), (14) s k + 1 = S ( s k , g k , g k , H k ) , H k + 1 = H ( H k , α k , g k )
8:
Increase k by one and go to step 2
Inequality (28) is essential in justifying the convergence rate of the methods we study for solving systems of inequalities (5). The main idea behind the construction of the algorithm is that the values of (Δ_k, A_k Δ_k) do not increase, while the values of (g_k, H_k g_k) decrease at the rate of a geometric progression. In such a case, after a finite number of iterations, the right-hand side of (28) becomes less than one. The resulting contradiction means that problem (5) is solved, and it is no longer possible to find a vector g_k satisfying condition (6).
For the decrease rate of the sequence {τ_k}, τ_k = min_{0≤j≤k−1} [(g_j, H_j g_j)/(g_j, g_j)], the following theorem is known [26].
Theorem 1. 
Let a sequence {H_k} be the result of transformation (14) with H_0 = I, α_k = α > 1 and arbitrary g_k ∈ R^n, g_k ≠ 0, k = 0, 1, 2, …. Then
\tau_k \leq \frac{k(\alpha^2 - 1)}{n(\alpha^{2k/n} - 1)}, \quad k \geq 1.
This theorem does not impose restrictions on the choice of vectors g k . Therefore, regardless of which of equations (14) or (19) is used to transform the matrices, the result (29) is valid for a sequence of vectors composed of g k or y k , depending on which one of them is used to transform the matrices. Let us show that for Algorithm 1 with fixed values of parameter α k , estimates similar to (29) are valid, and we obtain expressions for the admissible parameters α k in (14), (19), at which the values ( Δ k , A k Δ k ) do not increase.
To obtain a visual interpretation of the algorithm's operation, similar to the analysis of the iterations of process (9), (10) carried out on the basis of Figure 2, we pass to the coordinate system ŝ = A_k^{1/2} s. In this new coordinate system, the corresponding vectors and matrices of iterations (13), (14) and (18), (19) are transformed as follows [1]:
\hat{s} = A_k^{1/2} s, \quad \hat{g} = H_k^{1/2} g, \quad \hat{A}_k = H_k^{1/2} A_k H_k^{1/2} = I, \quad \hat{H}_k = A_k^{1/2} H_k A_k^{1/2} = I.
\hat{s}_{k+1} = \hat{s}_k + \frac{\hat{g}_k [1 - (\hat{s}_k, \hat{g}_k)]}{(\hat{g}_k, \hat{g}_k)},
\hat{H}_{k+1} = I - \left(1 - \frac{1}{\alpha_k^2}\right) \frac{\hat{g}_k \hat{g}_k^T}{(\hat{g}_k, \hat{g}_k)},
\hat{A}_{k+1} = I + (\alpha_k^2 - 1) \frac{\hat{g}_k \hat{g}_k^T}{(\hat{g}_k, \hat{g}_k)}.
For expressions (17), (18) and (27):
\hat{s}_{k+1} = \hat{s}_k + \frac{\hat{p}_k [1 - (\hat{s}_k, \hat{g}_k)]}{(\hat{g}_k, \hat{p}_k)},
\hat{H}_{k+1} = I - \left(1 - \frac{1}{\alpha_k^2}\right) \frac{\hat{y}_k \hat{y}_k^T}{(\hat{y}_k, \hat{y}_k)},
\hat{y}_k = \hat{g}_k - \hat{g}_{k-1}, \quad \hat{p}_k = \hat{g}_k - \hat{g}_{k-1} \frac{(\hat{g}_k, \hat{g}_{k-1})}{(\hat{g}_{k-1}, \hat{g}_{k-1})},
\hat{A}_{k+1} = I + (\alpha_k^2 - 1) \frac{\hat{y}_k \hat{y}_k^T}{(\hat{y}_k, \hat{y}_k)}.
Inequality (22) for the new variables is:
1 \leq (\hat{s}^*, \hat{g}) \leq R_s/\rho = M, \quad \forall \hat{g} \in \hat{G}.
In Figure 5, the characteristics of the set Ĝ in the plane Z formed by the vectors g̃_k, g̃_{k−1} are shown. Straight lines W_1, W_M are the projections of hyperplanes, i.e., the boundaries of the corresponding inequality (37) for the possible positions of the projection of the vector ŝ*, defined by the normal g̃_{k−1}. Straight lines Z_1, Z_M are the boundaries of inequality (37) for the possible positions of the projection of the vector ŝ*, defined by the normal g̃_k.
Let ψ be the angle between the vectors g̃_k and g̃_{k−1}. Since this angle is obtuse in Figure 5, condition (20) holds:
(g_k, H_k g_{k-1}) = (\tilde{g}_k, \tilde{g}_{k-1}) \leq 0.
Consequently, angle φ in Figure 5 is acute. Due to the fact that vectors g ˜ k , g ˜ k 1 are normals for the straight lines W 1 and Z 1 , we obtain the relations [1]:
\sin^2 \varphi = \sin^2(\pi - \psi) = \sin^2 \psi = 1 - \cos^2 \psi, \quad \cos^2 \varphi = \cos^2 \psi,
\cos^2 \varphi = \cos^2 \psi = \frac{(\tilde{g}_k, \tilde{g}_{k-1})^2}{(\tilde{g}_k, \tilde{g}_k)(\tilde{g}_{k-1}, \tilde{g}_{k-1})} = \frac{(g_k, H_k g_{k-1})^2}{(g_k, H_k g_k)(g_{k-1}, H_k g_{k-1})}.
The following lemmas [1] allow us to estimate the admissible values of space dilation parameters.
Lemma 1. 
Let the values a, b, c, β satisfy the constraints a ≥ a_m ≥ 0, b > 0, c > 0 and 0 < β ≤ 1; then:
\min_{a, \beta} \frac{(a + \beta b)^2 - \beta^2 b^2}{\beta^2 c^2} = \frac{a_m^2 + 2 a_m b}{c^2} = \frac{(a_m + b)^2 - b^2}{c^2}.
The proofs of Lemmas 1–6, as well as the proofs of Theorems 2–5, can be found in [1].
Lemma 2. 
Let vectors p_1, p_2 and g be linked by the equalities (p_1, g) = a, (p_2, g) = b. Let the difference of the vectors p_2 − p_1 be collinear to the vector p, and let ξ be the angle between the vectors p and g; then:
\|p_1 - p_2\|^2 = \frac{(a - b)^2}{(g, p)^2} \|p\|^2 = \frac{(a - b)^2}{\|g\|^2 \cos^2 \xi}.
Lemma 3. 
As a result of transformation (13) at step 7 of Algorithm 1, the following equality holds:
(s_{k+1}, g_k) = (\hat{s}_{k+1}, \hat{g}_k) = 1,
and as a result of (18) at step 5, equality (42) holds, together with the equality:
(s_{k+1}, g_{k-1}) = (\hat{s}_{k+1}, \hat{g}_{k-1}) = 1.
Lemma 4. 
Let set G satisfy Assumption 2. Then, the limit α of the admissible parameter values α_k ≤ α in Algorithm 1 providing the inequality (Δ_{k+1}, A_{k+1}Δ_{k+1}) ≤ (Δ_k, A_kΔ_k) in the case of transformations (13) and (14) is
\alpha^2 = \frac{M^2}{(M - 1)^2} = \frac{1}{(1 - r)^2}.
Lemma 5. 
Let set G satisfy Assumption 2. Then, the limit α_gk of the admissible parameter value α_k at step 5 of Algorithm 1, providing the inequality (Δ_{k+1}, A_{k+1}Δ_{k+1}) ≤ (Δ_k, A_kΔ_k) in the case of transformations (18), (14), is given by the equation:
\alpha_{gk}^2 = 1 + \frac{M^2 - (M - 1)^2}{(M - 1)^2 \sin^2 \varphi} = 1 + \frac{2M - 1}{(M - 1)^2 \sin^2 \varphi},
where
\sin^2 \varphi = 1 - \frac{(g_k, H_k g_{k-1})^2}{(g_k, H_k g_k)(g_{k-1}, H_k g_{k-1})}.
Lemma 6. 
Let set G satisfy Assumption 2. Then, the limit α_yk for the admissible value of the parameter α_k at step 5 of Algorithm 1 providing the inequality (Δ_{k+1}, A_{k+1}Δ_{k+1}) ≤ (Δ_k, A_kΔ_k) in the case of transformations (18) and (19) is given as
\alpha_{yk}^2 = \min\{\alpha_{Ek}^2, \alpha_{Jk}^2\},
where
\alpha_{Ek}^2 = 1 + \frac{(2M - 1)(y_k, H_k y_k)}{(M - 1)^2 (g_k, H_k g_k) \sin^2 \varphi},
\alpha_{Jk}^2 = 1 + \frac{(y_k, H_k y_k)}{(M - 1)^2 (g_k, H_k g_k) \sin^2 \varphi} \left( 1 + \frac{2(M - 1)(g_k, H_k g_k)^{1/2} \cos \varphi}{(g_{k-1}, H_k g_{k-1})^{1/2}} \right).
The value cos 2 φ is defined in (39), and sin 2 φ = 1 cos 2 φ .
In matrix transformations (14) and (19) of Algorithm 1, vectors g k and y k are used, which does not allow for directly using estimate (29) of Theorem 1 in the case when expression τ k involves some vector y m , m < k . In the next theorem, an estimate similar to (29) is obtained directly for subgradients g k generated by Algorithm 1.
Theorem 2. 
Let set G satisfy Assumption 1 and let the sequence {π_k = min_{0≤j≤k−1} (g_j, H_j g_j) = (g_m, H_m g_m)} be calculated based on the characteristics of Algorithm 1 for fixed values of the space dilation parameters α_k² = α² specified at steps 5 and 6, where the parameter α is specified according to (44). Then:
\pi_k = (g_m, H_m g_m) \leq \frac{4 R^2 k (\alpha^2 - 1)}{n [\alpha^{2k/n} - 1]}, \quad k \geq 1,
where m = \arg\min_{0 \leq j \leq k-1} (g_j, H_j g_j).
Theorem 3. 
Let set G satisfy Assumption 1 and let the sequence {(Δ_k, A_k Δ_k)} be calculated based on the characteristics of Algorithm 1. Let the dilation parameter α satisfy constraint (44) and let the admissible value α_yk² be given by (47). Then:
(\Delta_{k+1}, A_{k+1} \Delta_{k+1}) \leq (\Delta_k, A_k \Delta_k) \leq (\Delta_0, \Delta_0) = \rho^{-2}, \quad k = 0, 1, 2, \ldots
For fixed values of the space dilation parameter with respect to the convergence of Algorithm 1, the following theorem holds.
Theorem 4. 
Let set G satisfy Assumption 2. Let the values of the space dilation parameters in Algorithm 1 specified at steps 5 and 6 be fixed as α k 2 = α 2 , and let parameter α be given according to constraint (44). Then, the solution to system (5) will be found by Algorithm 1 in a finite number of iterations, which does not exceed K 0 , the minimum integer number k satisfying the inequality
\frac{4 k R^2 (\alpha^2 - 1)}{n \rho^2 [\alpha^{2k/n} - 1]} = \frac{4 k (\alpha^2 - 1)}{n V^2 [\alpha^{2k/n} - 1]} < 1.
Herewith, until a solution s_k ∈ S(G) is found, the following inequalities hold:
(g_k, H_k g_k) \geq \rho^2,
\frac{(g_k, H_k g_k)}{(g_k, g_k)} \geq \frac{\rho^2}{R^2} = V^2.
According to (21), the parameters ρ and R_s characterize the spread of the vectors g ∈ G along the vector μ. If ρ = R_s, the set G lies in a hyperplane with normal μ. Such a structure of the set G allows one to specify large values of the parameter α in (44) in Algorithm 1 and, according to (52), to obtain a solution in a small number of iterations.
In the minimization algorithm (4) on the descent trajectory, due to the exact one-dimensional search, it is always possible to choose a subgradient from the subgradient set satisfying condition (6), including at the minimum point. Therefore, we will impose constraints on the subgradient sets of functions to be minimized, similar to those for the set G. Due to the biases in the minimization algorithm, we need to define the constraints, taking into account the union of subgradient sets in the neighborhood of the current minimum point x k , and use these characteristics based on Theorem 4 results to develop a stopping criterion for the algorithm for solving systems of inequalities.

5. Minimization Algorithm

Since in the subgradient set at point x k + 1 an exact one-dimensional search is performed, there is always a subgradient satisfying condition (6): ( s k + 1 , g k + 1 ) 0 . For example, for smooth functions, the equality ( s k + 1 , g k + 1 ) = 0 holds. Therefore, vector g k + 1 can be used in Algorithm 1 to find a new descent vector approximation. In the built-in algorithm for solving systems of inequalities, the dilation parameter is chosen to solve the system of inequalities for the union of subgradient sets in some neighborhood of current approximation x k . This allows the minimization algorithm to leave the neighborhood after a finite number of iterations.
Due to possible significant biases during the operation of the minimization algorithm, the shell of the subgradient set involved in the learning may contain a zero vector. To avoid situations when there is no solution similar to (5) for the subgradient set from the operational area of the algorithm, we introduce an update to the algorithm of solving systems of inequalities. To track the updates, we used a stopping criterion, formulated based on Theorem 4’s results.
To accurately determine the parameters of the algorithm involved in the calculation of the dilation parameters α y k and α g k , we define their calculation in the form of operators. Denote by A L g 2 ( M , H , g 1 , g ) the operator of calculation α g k 2 according to (45) and (46) in Lemma 5, which is α g k 2 = A L y g 2 ( M , H k , g k 1 , g k ) . For α y k 2 ’s calculation according to expressions (47)–(49) in Lemma 6, we introduce operator A L y g 2 ( M , H , g 1 , g ) , where parameters H , g 1 , g correspond to set H k , g k 1 , g k .
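A hedged NumPy sketch of the operators AL_g² and AL_y², following formulas (45)–(49) as given in Lemmas 5 and 6; this is an illustration of the operators introduced above, not the authors' implementation, and it assumes condition (20) holds when the operators are invoked:

```python
import numpy as np

def sin2_cos2(H, g_prev, g):
    """cos^2 and sin^2 of the angle phi, as in (39) and (46)."""
    c2 = (g @ H @ g_prev) ** 2 / ((g @ H @ g) * (g_prev @ H @ g_prev))
    return 1.0 - c2, c2

def AL_g2(M, H, g_prev, g):
    """Limit alpha_gk^2 from (45)."""
    s2, _ = sin2_cos2(H, g_prev, g)
    return 1.0 + (2.0 * M - 1.0) / ((M - 1.0) ** 2 * s2)

def AL_y2(M, H, g_prev, g):
    """Limit alpha_yk^2 = min(alpha_Ek^2, alpha_Jk^2) from (47)-(49)."""
    s2, c2 = sin2_cos2(H, g_prev, g)
    y = g - g_prev
    ratio = (y @ H @ y) / ((M - 1.0) ** 2 * (g @ H @ g) * s2)
    a_E2 = 1.0 + (2.0 * M - 1.0) * ratio
    # under condition (20) the angle phi is acute, so cos(phi) = sqrt(cos^2 phi)
    a_J2 = 1.0 + ratio * (1.0 + 2.0 * (M - 1.0)
                          * np.sqrt((g @ H @ g) / (g_prev @ H @ g_prev))
                          * np.sqrt(c2))
    return min(a_E2, a_J2)
```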
A description of the minimization method is given in Algorithm 2.
Algorithm 2 R A ( α ) .
1:
Set x_0 ∈ R^n, w_0 = x_0, k = q = l = 0, s_0 = 0, H_0 = I. Set σ > 0, the parameters M > 0, r = 1/M and the limit α for the dilation parameter according to equality (44). Compute g_0 ∈ ∂f(x_0).
2:
If g k = 0 then
  • stop the algorithm
 end if
3:
If ( g k , H k g k ) / ( g k , g k ) < σ then
  • update q = q + 1 , w q = x k , l = 0 , H k = I , s k = 0
 end if
4:
If l = 0 or ( g k , H k g k 1 ) > 0  then
  • go to step 7
 end if
5:
Compute the vector p_k = g_k − g_{k−1}(g_k, H_k g_{k−1})/(g_{k−1}, H_k g_{k−1}) and perform transformation (18): s_{k+1} = S(s_k, g_k, p_k, H_k). Compute the limit of the admissible value of the dilation parameter α_yk² = AL_y²(M, H_k, g_{k−1}, g_k) for the combination of transformations (18) and (19)
6:
If α_yk² ≥ α² then
  • set α_k satisfying the inequalities α² ≤ α_k² ≤ α_yk² and perform transformation (19)
  • H k + 1 = H ( H k , α k , y k )
  •  else
    • compute the limit of the admissible value of the dilation parameter
    • α g k 2 = A L g 2 ( M , H k , g k 1 , g k ) for the combination of transformations (18) and (14), set
    • α_k satisfying the inequalities α² ≤ α_k² ≤ α_gk² and perform transformation (14)
    • H k + 1 = H ( H k , α k , g k ) . Go to step 8
  •  end if
7:
Set α k 2 = α 2 and perform transformations (13), (14) s k + 1 = S ( s k , g k , g k , H k ) , H k + 1 = H ( H k , α k , g k )
8:
Find a new approximation of the minimum point x_{k+1} = x_k − γ_k s_{k+1}, γ_k = arg min_γ f(x_k − γ s_{k+1})
9:
Compute the subgradient g_{k+1} ∈ ∂f(x_{k+1}) satisfying the condition (g_{k+1}, s_{k+1}) ≤ 0.
10:
Increase k and l by one and go to step 2
At step 9, due to the condition of the exact one-dimensional descent at step 8, the sought subgradient always exists. This follows from the condition for the extremum of the one-dimensional function. For the sequence of approximations of the algorithm, due to the exact one-dimensional descent at step 8, the following lemma holds [20].
Lemma 7. 
Let function f(x) be strictly convex on R^n, let set D(x_0) be bounded, and let the sequence {x_k}_{k=0}^∞ be such that f(x_{k+1}) = min_{γ∈[0,1]} f(x_k + γ(x_{k+1} − x_k)). Then, lim_{k→∞} ‖x_{k+1} − x_k‖ = 0.
Denote D(z) = {x ∈ R^n | f(x) ≤ f(z)}. Let x* be a minimum point of the function and consider the limit points of the sequence {w_q} generated by Algorithm 2 (RA(α)). The existence of limit points of the sequence {w_q} when the set D(x_0) is bounded follows from w_q ∈ D(x_0). Concerning the convergence of the algorithm, we formulate the following theorem [1].
Theorem 5. 
Let function f(x) be strictly convex on R^n; let set D(x_0) be bounded, and, for x ≠ x*,
r(\partial f(x)) \geq r_0 > 0,
V(\partial f(x)) \geq V_0 > 0,
where the parameters M, r and α of Algorithm 2 are given according to the equalities
M = \frac{4}{3 r_0}, \quad r = \frac{3 r_0}{4}, \quad \alpha = \frac{1}{1 - 3 r_0 / 4},
and parameters α k , set at steps 5 and 7, are fixed as α k = α . In this case, if σ = ( 3 V 0 / 4 ) 2 , then any limit point of the sequence { w q } generated by Algorithm 2 ( R A ( α ) ) is a minimum point on R n .

6. Relationship between the Relaxation Subgradient Method and the Conjugate Gradient Method

Let us establish the identity of the sequences generated by the conjugate gradient method and the relaxation subgradient method with space dilation on the minimization of quadratic functions:
f(x) = \frac{1}{2}(x, A x) + (b, x) + c, \quad A \in R^{n \times n}, \; A > 0, \; b \in R^n.
Suppose that the number of iterations without updating of the relaxation subgradient method and of the conjugate gradient method is the same and equal to m ≤ n. Denote by ∇f(x) the gradient of the function at point x, by g_k = ∇f(x_k) the gradients at the points of the current minimum approximation x_k, and by x_0 ∈ R^n the initial point. Iterations of the subgradient method, for the purposes of its comparison with the conjugate gradient method, are briefly represented in Algorithm 3 (SSG).
As we will see below, metric transformations for establishing the identity of the approximation sequences of the subgradient method and the method of conjugate gradients on quadratic functions are not essential. Therefore, in this scheme, there are no details of the choice of space dilation transformations and their parameters.
Algorithm 3 S S G
1:
Set the initial point x_0 ∈ R^n, s_0 = 0, H_0 = I.
2:
For k = 0 , 1 , , m < n do
  • 2.1 Compute s k + 1 as:
    s_{k+1} = s_k + H_k p_k \frac{1 - (s_k, g_k)}{(g_k, H_k p_k)},
      where
    p_k = \begin{cases} g_k, & \text{if } (k \geq 1 \text{ and } (g_k, H_k g_{k-1}) > 0) \text{ or } k = 0, \\ g_k - g_{k-1} \frac{(g_k, H_k g_{k-1})}{(g_{k-1}, H_k g_{k-1})}, & \text{otherwise}. \end{cases}
  • 2.2 Perform metric transformation according to one of the formulas (14) or (19) according to Algorithm 2 ( R A ( α ) ) :
    H_{k+1} = H_k - \left(1 - \frac{1}{\alpha_k^2}\right) \frac{H_k z_k z_k^T H_k^T}{(z_k, H_k z_k)},
    where z_k = g_k in the case of transformation (14) and z_k = g_k − g_{k−1} in the case of transformation (19)
  • 2.3 Execute the descent step: x_{k+1} = x_k − γ_k s_{k+1}, γ_k = arg min_γ f(x_k − γ s_{k+1})
endfor
Iterations of the conjugate gradient method are carried out according to the Algorithm 4 scheme (see, for instance, [17]):
Algorithm 4 S G
1:
Set the initial point x̄_0 ∈ R^n
2:
For k = 0 , 1 , , m < n do
  • 2.1 Compute x̄_{k+1} = x̄_k − γ̄_k s̄_{k+1}, γ̄_k = arg min_{γ̄} f(x̄_k − γ̄ s̄_{k+1})
    • where
      \bar{s}_k = \bar{g}_k + \beta_k \bar{s}_{k-1}, \quad \bar{g}_k = \nabla f(\bar{x}_k),
      \beta_k = \begin{cases} 0, & \text{if } k = 0, \\ \frac{(g_k, g_k)}{(g_{k-1}, g_{k-1})}, & \text{otherwise}. \end{cases}
endfor
The following theorem gives conditions for the equivalence of methods for minimizing quadratic functions:
Theorem 6. 
Let function f(x) be quadratic, and for the SSG process let the matrix H_0 = I and α_k ≥ 1. Then, if the initial points of the SSG and SG processes coincide, x_0 = x̄_0, then for any m > k ≥ 0, until a solution is found, the following sequences coincide:
x_k = \bar{x}_k,
s_k = \frac{\bar{s}_k}{(g_k, g_k)},
i.e., both processes generate coincident successive approximations.
Proof. 
We carry out the proof by induction. For k = 0, equalities (62) and (63) hold since, according to (59), p_0 = g_0; taking into account s_0 = 0, we obtain:
s_1 = s_0 + I g_0 \frac{1 - (s_0, g_0)}{(g_0, I g_0)} = \frac{g_0}{(g_0, g_0)}.
According to (61) with β 0 = 0 , we have s ¯ 0 = g ¯ 0 = g 0 .
Assume that equalities (62) and (63) are satisfied for k = 0 , 1 , , l , where l > 0 . Let us show that they are satisfied for k = l + 1 .
Since x l = x ¯ l , vectors s l and s ¯ l are collinear, then by virtue of the condition of exact one-dimensional descent x ¯ l + 1 = x l + 1 . For the conjugate gradient method, the gradient vectors calculated during the operation of the algorithm are mutually orthogonal [17]. Therefore, by virtue of (62), all of the vectors g 0 , g 1 , , g l will be mutually orthogonal. Using the recursive transformation of inverse matrices for (60),
A_{k+1} = A_k + (\alpha_k^2 - 1) \frac{z_k z_k^T}{(z_k, H_k z_k)},
we obtain an expression for the matrix A l ,
A_l = I + \sum_{k=0}^{l-1} (\alpha_k^2 - 1) \frac{z_k z_k^T}{(z_k, H_k z_k)}.
Since in (60) the vectors z_k = g_k or z_k = g_k − g_{k−1}, all vectors z_k, k = 0, 1, …, l − 1, participating in the formation of the matrix A_l are orthogonal to the vector g_l. Therefore, A_l g_l = g_l, which implies the equality H_l g_l = g_l. Due to the orthogonality of the vectors g_l, g_{l−1}, according to (59), the equality p_l = g_l holds. By virtue of the condition of exact one-dimensional descent, (s_l, g_l) = 0. In view of the above, from (58), we obtain:
s_{l+1} = s_l + \frac{H_l g_l}{(g_l, H_l g_l)} = \frac{\bar{s}_l}{(g_{l-1}, g_{l-1})} + \frac{g_l}{(g_l, g_l)}.
Multiplying the last equality by ( g l , g l ) , we obtain a proof of equality (63) for k = l + 1 :
s_{l+1} (g_l, g_l) = \frac{\bar{s}_l (g_l, g_l)}{(g_{l-1}, g_{l-1})} + g_l = \bar{s}_{l+1}.
 □
Note. For an arbitrary initial matrix H 0 > 0 , one should pass to the coordinate system, where the initial matrix H ˜ 0 = I , and in the new coordinate system, one should use the results of Theorem 6.
The presented RSMM with space dilation in an exact one-dimensional search has the properties of the conjugate gradient method. On a quadratic function, it is equivalent to the conjugate gradient method.
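The statement of Theorem 6 can also be checked numerically. Below is a minimal sketch (with H_0 = I and α_k = 1, so that the metric remains Euclidean, on a small random quadratic; all data are illustrative), in which the SSG and CG trajectories coincide up to rounding:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.normal(size=(n, n))
A = B @ B.T + n * np.eye(n)          # A > 0
b = rng.normal(size=n)
grad = lambda x: A @ x + b           # gradient of f(x) = (x, Ax)/2 + (b, x)

def exact_step(x, d):
    """Exact one-dimensional minimization of f(x - gamma*d) for the quadratic."""
    return (grad(x) @ d) / (d @ A @ d)

x0 = rng.normal(size=n)

# SSG (Algorithm 3) with H = I and alpha_k = 1
x, s, g_prev = x0.copy(), np.zeros(n), None
ssg_traj = [x.copy()]
for k in range(n):
    g = grad(x)
    p = g if (k == 0 or g @ g_prev > 0) else g - g_prev * (g @ g_prev) / (g_prev @ g_prev)
    s = s + p * (1.0 - s @ g) / (g @ p)
    x = x - exact_step(x, s) * s
    g_prev = g
    ssg_traj.append(x.copy())

# conjugate gradient method (Algorithm 4)
x, d, g_prev = x0.copy(), None, None
cg_traj = [x.copy()]
for k in range(n):
    g = grad(x)
    d = g if k == 0 else g + (g @ g) / (g_prev @ g_prev) * d
    x = x - exact_step(x, d) * d
    g_prev = g
    cg_traj.append(x.copy())

print(max(np.linalg.norm(a - c) for a, c in zip(ssg_traj, cg_traj)))  # ~ 0
```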

7. One-Dimensional Search Algorithm

Consider a one-dimensional minimization algorithm for process (4). Computational experience shows that the one-dimensional search algorithm in relaxation subgradient methods should have the following properties:
  • An overly accurate one-dimensional search in subgradient methods leads to poor convergence. The search should be adequately rough.
  • At the iteration of the method, the search step should be large enough to leave a sufficiently large neighborhood of the current minimum.
  • To ensure the property of the previous item, it should be possible to increase the search step faster than to decrease it.
  • In relaxation subgradient methods, a one-dimensional search should provide over-relaxation, that is, overshoot the point of the one-dimensional minimum along the direction of descent when implementing iteration (4). This provides condition (6), which is necessary for the learning process.
  • When implementing one-dimensional descent along the direction at iteration (4), the point with the smallest value of the function can be taken as a new approximation of the minimum, and the point that ensures condition (6) can be taken for training.
We use the implementation of the one-dimensional minimization procedure proposed in [26]. The set of input parameters is {x, s, g_x, f_x, h_0}, where x is the current minimum approximation point, s is the descent direction, h_0 is the initial search step, f_x = f(x) and g_x ∈ ∂f(x); moreover, the necessary condition for the possibility of decreasing the function along the direction, (g_x, s) > 0, must be satisfied. Its output parameters are {γ_m, f_m, g_m, γ_1, g_1, h_1}. Here, γ_m is the step to the new minimum approximation point
x_+ = x - \gamma_m s, \quad f_m = f(x_+), \quad g_m \in \partial f(x_+),
where γ_1 is a step along s such that, at the point z_+ = x − γ_1 s, for the subgradient g_0 ∈ ∂f(z_+) the inequality (g_0, s) ≤ 0 holds. This subgradient is used in the learning algorithm. The output parameter h_1 is the initial descent step for the next iteration. Step h_1 is adjusted in order to reduce the number of calls to the procedure for calculating the function and the subgradient.
In the minimization algorithm, the vector g_0 ∈ ∂f(z_+) is used to solve the system of inequalities, and the point x_+ = x − γ_m s is a new minimum approximation point.
Denote the call to the procedure of one-dimensional minimization as OM({x, s, g_x, f_x, h_0}; {γ_m, f_m, g_m, γ_0, g_0, h_1}). Here is a brief description of it. We introduce the one-dimensional function φ(β) = f(x − βs). To localize its minimum, we take an increasing sequence β_0 = 0 and β_i = h_0 q_M^{i−1} for i ≥ 1. Here, q_M > 1 is the step increase parameter. In most cases, q_M = 3 is set. Denote z_i = x − β_i s, r_i ∈ ∂f(z_i), i = 0, 1, 2, …; l is the first number i at which the relation (r_i, s) ≤ 0 holds. Let us determine the parameters of the localization segment [γ_0, γ_1] of the one-dimensional minimum: γ_0 = β_{l−1}, f_0 = f(z_{l−1}), g_0 = r_{l−1}, γ_1 = β_l, f_1 = f(z_l), g_1 = r_l, and find the minimum point γ using a cubic approximation of the function [48] on the localization segment, using the values of the one-dimensional function and its derivative. Compute
\gamma_m = \begin{cases} 0.1 \gamma_1, & \text{if } l = 1 \text{ and } \gamma \leq 0.1 \gamma_1, \\ \gamma_1, & \text{if } \gamma_1 - \gamma \leq 0.2 (\gamma_1 - \gamma_0), \\ \gamma_0, & \text{if } l > 1 \text{ and } \gamma - \gamma_0 \leq 0.2 (\gamma_1 - \gamma_0), \\ \gamma, & \text{otherwise}. \end{cases}
The initial descent step for the next iteration is defined by the rule h_1 = q_m h_0 (γ_1/h_0)^{1/2}. Here, q_m < 1 is the parameter of the descent step decrease, which, in most cases, is set to q_m = 0.8. In the overwhelming majority of applied problems, the set of parameters {q_M = 3, q_m = 0.8} is satisfactory. When solving complex problems with a high degree of level surface elongation, the parameter q_m should be increased (toward 1).
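A condensed sketch of the described scheme, i.e., bracketing with increasing steps followed by a cubic fit on the localization segment; this is an illustration of the general idea, not the exact implementation from [26]:

```python
import numpy as np

def om_search(f, subgrad, x, s, h0, qM=3.0, qm=0.8):
    """Rough one-dimensional minimization of phi(beta) = f(x - beta*s)."""
    beta_prev, beta = 0.0, h0
    while subgrad(x - beta * s) @ s > 0:        # expand until (r_i, s) <= 0
        beta_prev, beta = beta, beta * qM
    g0, g1 = beta_prev, beta                    # localization segment of the minimum
    f0, f1 = f(x - g0 * s), f(x - g1 * s)
    d0, d1 = -(subgrad(x - g0 * s) @ s), -(subgrad(x - g1 * s) @ s)
    # minimum of the cubic fitted to (f0, d0) and (f1, d1) on [g0, g1]
    z = 3.0 * (f0 - f1) / (g1 - g0) + d0 + d1
    w = np.sqrt(max(z * z - d0 * d1, 0.0))
    gamma = g1 - (g1 - g0) * (d1 + w - z) / (d1 - d0 + 2.0 * w)
    gamma = float(np.clip(gamma, g0, g1))
    h1 = qm * h0 * np.sqrt(g1 / h0)             # initial step for the next iteration
    return gamma, h1

# illustrative use on a simple smooth function
f = lambda x: 0.5 * x @ x
subgrad = lambda x: x
print(om_search(f, subgrad, np.array([4.0, 2.0]), np.array([2.0, 1.0]), h0=0.5))
```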

8. Implementation of the Minimization Algorithm

Algorithm 2 (RA(α)), as a result of the updates at step 3, loses information about the space metric. In the proposed algorithm, the matrix update is replaced by a correction of the diagonal elements, and the exact one-dimensional descent by an approximate one. Denote by Sp(H) the trace of the matrix H and by ε_H the limit of the admissible decrease in the trace of H. The algorithm sets the initial metric matrix H_0 = I. Since, as a result of the transformations, the elements of the matrix decrease, then, when the trace of the matrix becomes small, Sp(H) ≤ ε_H, it is corrected using the transformation H_+ = nH/Sp(H), where ε_H is a lower bound for the trace reduction and n is the space dimension. As an indicator of matrix degeneracy, we use the cosine of the angle between the vectors g and Hg. When it decreases to a certain value ε_λ, which can be detected by checking (g, Hg) ≤ ε_λ ‖g‖ ‖Hg‖, the transformation H_+ = H + 10 ε_λ I is performed. Here, I is the identity matrix and ε_λ is the cosine angle limit. To describe the algorithm, we will use the previously introduced operators.
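The two corrections of the metric matrix can be sketched as follows (ε_H and ε_λ are small tuning constants; the pairing follows the description in this paragraph):

```python
import numpy as np

def correct_metric(H, g, eps_H=1e-12, eps_lam=1e-8):
    """Trace rescaling and degeneracy correction of the metric matrix H."""
    n = H.shape[0]
    if np.trace(H) <= eps_H:                       # trace has become too small
        H = n * H / np.trace(H)                    # rescale: Sp(H) becomes n
    Hg = H @ g
    if g @ Hg <= eps_lam * np.linalg.norm(g) * np.linalg.norm(Hg):
        H = H + 10.0 * eps_lam * np.eye(n)         # fix near-degeneracy of H
    return H
```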
Let us explain the actions of the algorithm. Since s_0 = 0, at k = 0 condition (6) is satisfied, (s_k, g_k) ≤ 0, and g_0^O = g_0. Therefore, at step 8, learning iterations (13) and (14) will be implemented. According to the algorithm of the OM procedure, the subgradient g_{k+1}^O obtained at step 10 of the algorithm satisfies condition (6): (g_{k+1}^O, s_{k+1}) ≤ 0. Therefore, at the next iteration, it is used for learning at steps 6–8.
At step 9, an additional correction of the descent direction is made in order to provide the necessary condition (g_k, s_{k+1}) > 0 for the possibility of descent in the direction opposite to s_{k+1}. From the point of view of solving the system of inequalities, this correction also improves the descent vector, which can be shown using Figure 5. Here, as under the conditions of Lemma 4, the movement is made in the direction AB, not from point A but from some point of the segment AB, where (s_{k+1}^{1/2}, g_k) < 1. Since the projection of the optimal solution is in the area between the straight lines Z_1, Z_M, the shift to point B, where (s_{k+1}, g_k) = 1, reduces the distance to the optimal vector.
An effective set of parameters for the O M in the minimization algorithm is { q m = 0.8 , q M = 3 } . The next section presents the results of numerical studies of the presented Algorithm 5.
Algorithm 5 R A O M ( α )
1:
Set x_0 ∈ R^n, w_0 = x_0, k = 0, s_0 = 0, H_0 = I. Set ε_H > 0, ε_λ > 0, M > 0, r = 1/M and the limit α for the dilation parameter according to equality (44): α² = M²/(M − 1)². Compute f_0 = f(x_0) and g_0 ∈ ∂f(x_0). Set g_0^O = g_0, f_0^min = f_0, x_0^min = x_0.
2:
If g k = 0 then
  • stop the algorithm
3:
If (g_k, H_k g_k) ≤ ε_λ ‖g_k‖ ‖H_k g_k‖ then
  • H k = n H k / S p ( H k )
4:
If Sp(H_k) ≤ ε_H then
  • H k = H k + 10 ε λ I
5:
If k = 0 or ( g k , H k g k 1 ) > 0  then
  • go to step 8
6:
Perform transformations (17) and (18) for the training system of subgradients: p_k = g_k^O − g_{k−1}(g_k^O, H_k g_{k−1})/(g_{k−1}, H_k g_{k−1}), s_{k+1}^{1/2} = S(s_k, g_k^O, p_k, H_k). Compute the limit of the dilation parameter α_yk² = AL_y²(M, H_k, g_{k−1}, g_k^O) for the combination of transformations (18) and (19)
7:
If α_yk² ≥ α² then
  • set α_k satisfying the inequalities α² ≤ α_k² ≤ α_yk² and perform transformation (19)
  • H_{k+1} = H(H_k, α_k, y_k) with y_k = g_k^O − g_{k−1}
  •  else
    • compute the limit of the dilation parameter α_gk² = AL_g²(M, H_k, g_{k−1}, g_k^O)
    • for the combination of transformations (18), (14), set α_k satisfying the inequalities
    • α² ≤ α_k² ≤ α_gk² and perform transformation (14) H_{k+1} = H(H_k, α_k, g_k^O).
    • Go to step 9
8:
Set α_k² = α² and perform transformations (13), (14): s_{k+1}^{1/2} = S(s_k, g_k^O, g_k^O, H_k), H_{k+1} = H(H_k, α_k, g_k^O)
9:
If (s_{k+1}^{1/2}, g_k) < 1 then
  • perform transformation s_{k+1} = S(s_{k+1}^{1/2}, g_k, g_k, H_{k+1})
  •  else
    • s_{k+1} = s_{k+1}^{1/2}
10:
Perform one-dimensional minimization OM({x_k, s_{k+1}, g_k, f_k, h_k}; {γ_{k+1}, f_{k+1}, g_{k+1}, g_{k+1}^O, h_{k+1}}) and compute a new approximation of the minimum point x_{k+1} = x_k − γ_{k+1} s_{k+1}
11:
If f_k^min > f_{k+1} then
  • set f_{k+1}^min = f_{k+1}, x_{k+1}^min = x_{k+1}
  •  else
    • set f_{k+1}^min = f_k^min, x_{k+1}^min = x_k^min
  • Here, the subgradient g_{k+1} ∈ ∂f(x_{k+1}) is obtained in the OM procedure at the current approximation of the minimum. Subgradient g_{k+1}^O is also obtained in the OM procedure; it satisfies condition (6), (g_{k+1}^O, s_{k+1}) ≤ 0, and is further used in training
12:
If ‖x_{k+1} − x_k‖ ≤ ε_x then
  • stop the algorithm
  •  else
    • increase k by one and go to step 2

9. Computational Experiment Results

In this section, we conduct a computational experiment on minimizing test functions using the following methods: (1) the relaxation method with space dilation in the direction of the subgradient (RSD) [26]; (2) the r-algorithm (rOM(α)) [22,26]; (3) the quasi-Newtonian method (QN) implemented with the BFGS matrix transformation formula; (4) the algorithm RA(α) with a fixed parameter (RA(α = const)), where α² = 6; (5) the algorithm with a dynamic way of selecting the space dilation parameter (RA(α_k)), where α² = 6.
As test functions, we took functions with a high degree of level surface elongation, which increases with the dimension:
(1)
f_1(x) = ∑_{i=1}^{n} x_i^2 i^6, x_0 = (10/1, 10/2, …, 10/n), ε = 10^{−10};
(2)
f_2(x) = ∑_{i=1}^{n} x_i^2 (n/i)^6, x_0 = (10/1, 10/2, …, 10/n), ε = 10^{−10};
(3)
f_3(x) = (∑_{i=1}^{n} x_i^2 i)^r, x_0 = (1, 1, …, 1), r = 2, ε = 10^{−10};
(4)
f_4(x) = ∑_{i=1}^{n} |x_i| i^3, x_0 = (10/1, 10/2, …, 10/n), ε = 10^{−4};
(5)
f_5(x) = max_{1≤i≤n} (|x_i| i^3), x_0 = (10/1, 10/2, …, 10/n), ε = 10^{−4}.
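For reference, these test functions translate directly into code. The sketch below (our naming) shows f_1, f_4 and f_5 together with one valid subgradient for each; the remaining functions are coded analogously.

```python
import numpy as np

def f1(x):
    i = np.arange(1, x.size + 1)
    return np.sum(x**2 * i**6)

def g1(x):                       # gradient of the smooth quadratic f1
    i = np.arange(1, x.size + 1)
    return 2.0 * x * i**6

def f4(x):
    i = np.arange(1, x.size + 1)
    return np.sum(np.abs(x) * i**3)

def g4(x):                       # one subgradient of f4 (sign(0) = 0 is admissible)
    i = np.arange(1, x.size + 1)
    return np.sign(x) * i**3

def f5(x):
    i = np.arange(1, x.size + 1)
    return np.max(np.abs(x) * i**3)

def g5(x):                       # subgradient of the max-function f5: only the active term contributes
    i = np.arange(1, x.size + 1)
    j = int(np.argmax(np.abs(x) * i**3))
    g = np.zeros_like(x)
    g[j] = np.sign(x[j]) * (j + 1)**3
    return g

n = 1000
x0 = 10.0 / np.arange(1, n + 1)  # starting point for functions 1, 2, 4 and 5
```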
When testing the methods, the values of the function and the subgradient were computed simultaneously. The parameter ε for quadratic functions was chosen as a sufficiently small value (10^{−10}); for non-smooth functions, it was chosen so that the accuracies in terms of the variables are approximately the same for the different types of functions. Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6 show the number of computations of the function and subgradient values necessary for achieving the required accuracy with respect to the function value, f(x_k) − f* ≤ ε. The initial point of minimization x_0 and the value ε are given in the description of each function.
The test set contains quadratic and piecewise linear functions. Due to their simplicity and unambiguity, the degree of level surface elongation is easy to analyze. This choice of functions is motivated by the fact that, during minimization, the objective in the current minimization area is, as a rule, locally either approximately quadratic or piecewise linear.
Functions 1 and 2 are quadratic; the ratio of their minimum to maximum eigenvalue is 1/n^6, and the ratio of the minimum to maximum extent of the level surface along the coordinate axes is 1/n^3. Function 2, in comparison with function 1, has a higher density of eigenvalues in the region of small values. Function 3 is smooth, with a low degree of level surface elongation; its complexity is due to its degree being above quadratic. Functions 4 and 5 are piecewise linear. For these functions, the ratio of the minimum to maximum extent of the level surface along the coordinate axes is 1/n^3, the same as for the quadratic functions 1 and 2. It is of interest to compare the complexity of minimizing smooth and nonsmooth functions by nonsmooth optimization methods provided that these ratios are identical.
None of problems 1, 2, 4 and 5 can be solved by the multistep minimization method [24] for n ≥ 100, which emphasizes the relevance of methods with a change in the space metric, in particular, space dilation minimization algorithms capable of solving nonsmooth minimization problems with a high degree of level surface elongation.
In order to identify the least costly one-dimensional search for the quasi-Newtonian method, it was implemented in several variants with a unit initial step. Because of the high condition numbers of functions 1 and 2, even for the best variant, localizing the one-dimensional minimum when minimizing function 1 takes about 2–4 steps; together with the final approximation iteration and finding the minimum on the localized segment, this adds up to 3–5 computations of the function and the gradient per iteration. For function 2, the total costs per iteration are 5–10 computations of the function and the gradient. For the QN method, Table 1, Table 2 and Table 3 therefore show only the number of iterations required to solve the problem.
According to the results of Table 1 and Table 2, the RA(α = const) algorithm outperforms the RSD and rOM(α) methods on smooth functions. Therefore, the changes in the directions of correction and space dilation have a positive effect on the convergence rate of the new algorithm. In the RA(α_k) algorithm, compared to RA(α = const), an additional acceleration factor is involved due to an increase in the space dilation parameter, which, according to the results of Table 1 and Table 2, leads to a convergence rate higher than that of the RA(α = const) algorithm.
For function 2, the eigenvalues of the Hessian are shifted toward the region of small values, which has a positive effect on the convergence rate of subgradient methods. Here, the quasi-Newtonian method QN, taking into account the costs of localizing the minimum in the one-dimensional search, required a larger number of computations of the function and the gradient.
For function 3, the number of iterations of the quasi-Newtonian method turned out to be higher than the number of computations of the function and the gradient required by the subgradient methods. Based on the results of minimizing functions 1–3, we can conclude that subgradient methods with space dilation can also be useful in minimizing smooth functions. The new methods RA(α_k) and RA(α = const) show better results here than the other algorithms with space dilation.
According to the ratio of the level surface range along the coordinate axes, functions 1, 2, and 4 are similar. Function 4 is difficult to minimize by subgradient methods. Comparing the results of Table 1 and Table 4, we can note insignificant differences in the convergence rate of subgradient methods on these functions, which is additional evidence of the method’s effectiveness in solving nonsmooth optimization problems.
To simulate a subgradient set of nonzero thickness when minimizing function 4, the subgradients used in the minimization process were distorted by multiplicative noise: instead of a subgradient g(x) ∈ ∂f(x), the vector (1 + ξ)g(x) was used, where ξ ∈ [0, 1] is a uniformly distributed random number. The interference negatively affects both the quality of the one-dimensional search and the quality of the descent direction. The results are shown in Table 5. Here, the maximum possible value of the characteristic of the subgradient set is M = R_s/ρ = 2. Due to the random nature of ξ, the value of M for the set of subgradients collected over a certain interval of the minimization may be smaller. According to Lemma 4, the admissible value is α^2 = M^2/(M − 1)^2 = 4. The calculations were carried out at a larger value of the space dilation parameter, α^2 = 6. The proposed methods also show significantly better results here.
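A minimal sketch of this distortion scheme (the function name is ours), which is also used for function 5 below: each subgradient passed to the method is multiplied by a random factor from [1, 2], so the ratio of the largest to the smallest possible vector length is at most M = R_s/ρ = 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def distorted_subgradient(g):
    """Multiplicative distortion of a subgradient g: the minimization method
    receives (1 + xi) * g with xi uniform on [0, 1]."""
    xi = rng.uniform(0.0, 1.0)
    return (1.0 + xi) * g
```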
The ratios of the level surface range along the coordinate axes for functions 5 and 1 are similar. The results for function 5 are shown in Table 6. Here, the RSD method has an advantage due to the fact that the function is separable and all of its subgradients calculated in the minimization procedure are directed along the coordinate axes. Space dilations occur along the coordinate axes, which does not change the eigenvectors of the metric matrix directed along the coordinate axes.
To simulate a subgradient set of nonzero thickness when minimizing function 5, the subgradients were distorted in the same way: instead of g(x) ∈ ∂f(x), the vector (1 + ξ)g(x) was used, where ξ ∈ [0, 1] is a uniformly distributed random number. The results for function 5 with subgradient distortion are shown in Table 7. Here, the maximum possible value of the characteristic of the subgradient set is M = R_s/ρ = 2. The proposed methods show better results here than the RSD and rOM(α) methods.
A number of conclusions can be drawn regarding the convergence rate of the presented methods:
  • Functions 1, 2, 4 and 5 have a significant degree of level surface elongation. The problems of minimizing these functions could not be solved by the multistep minimization methods investigated in [24], which emphasizes the relevance of developing methods with a change in the space metric, in particular, space dilation minimization algorithms capable of solving nonsmooth minimization problems with a high degree of level surface elongation.
  • Based on the results of minimizing the smooth functions 1–3, we can conclude that subgradient methods with space dilation can also be useful in minimizing smooth functions. At the same time, the new algorithms RA(α_k) and RA(α = const) show significantly better results on smooth functions than the other subgradient methods RSD and rOM(α).
  • The new methods RA(α_k) and RA(α = const) significantly outperform the RSD and rOM(α) methods when minimizing nonsmooth functions. In the RA(α_k) algorithm, in comparison with the RA(α = const) algorithm, an additional acceleration factor is involved due to an increase in the space dilation parameter, which leads to a further significant increase in the convergence rate.

10. Computational Experiment Results in Approximation by Neural Networks

The purpose of this section is to demonstrate the usefulness of nonsmooth regularization methods (for example, the "Tibshirani lasso" [3]) for eliminating uninformative variables when constructing mathematical models, where rapidly converging nonsmooth optimization methods applicable to nonsmooth nonconvex functions are a necessary element of the technology. We give several examples of approximation by artificial neural networks (ANNs) using nonsmooth regularization to remove uninformative neurons. To assess the quality of this approximation technology, the obtained results are compared with previously known results. In each of the examples, the effectiveness of the presented nonsmooth optimization methods is studied.
Consider the approximation problem
w* = arg min_w E(α, w, D),
E(α, w, D) = ∑_{(x,y)∈D} (y − f(x, w))^2 + α R_i(w),
where D = {(x_i, y_i) | x_i ∈ R^p, y_i ∈ R^1}, i = 1, …, N, are observational data, R_i(w) are different kinds of regularizers, α is the regularization parameter, f(x, w) is an approximating function, x ∈ R^p is a data vector, w ∈ R^n is the vector of tunable parameters, and p and n are their dimensions. Formulas (1)–(3) can be used as regularizers.
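As an illustration, the sketch below codes the criterion E(α, w, D) with the nonsmooth L1 regularizer R_1(w) = ∑|w_i| (the lasso case). The helpers `predict` and `jacobian` are our placeholders for an arbitrary approximating function f(x, w) and its derivative with respect to w; the paper's regularizers (1)–(3) would be substituted for the last term.

```python
import numpy as np

def regularized_criterion(w, X, y, predict, alpha):
    """E(alpha, w, D) = sum of squared residuals + alpha * R_1(w),
    with the nonsmooth lasso regularizer R_1(w) = sum(|w_i|)."""
    residuals = y - np.array([predict(x, w) for x in X])
    return np.sum(residuals**2) + alpha * np.sum(np.abs(w))

def criterion_subgradient(w, X, y, predict, jacobian, alpha):
    """One subgradient of the criterion; `jacobian(x, w)` (our placeholder)
    returns d f(x, w) / d w, and sign(0) = 0 is a valid subgradient of |.|."""
    g = np.zeros_like(w)
    for x, t in zip(X, y):
        g += -2.0 * (t - predict(x, w)) * jacobian(x, w)
    return g + alpha * np.sign(w)
```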
Suppose that in the problem of approximation by a feedforward network, it is required to train a two-layer sigmoidal neural network of the following form using data D (i.e., evaluate its unknown parameters w)
f(x, w) = w_0^{(2)} + ∑_{i=1}^{m} w_i^{(2)} φ(s_i),   φ(s) = 1/(1 + e^{−s}).
For the sigmoidal network
s_i = w_{i0}^{(1)} + ∑_{j=1}^{p} x_j w_{ij}^{(1)},   i = 1, 2, …, m,
where x_j are the components of vector x ∈ R^p, w = ((w_i^{(2)}, i = 0, …, m), (w_{ij}^{(1)}, j = 0, …, p, i = 1, …, m)) is the set of parameters, the total number of which is denoted by n, φ(s) is the neuron activation function, and m is the number of neurons. The unknown parameters w must be estimated by the least squares method (64) using one of the regularizers R_i(w). To solve problem (64), we use subgradient methods.
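A minimal sketch of the forward pass of the network (65), (66); the array shapes and names are our choice.

```python
import numpy as np

def sigmoidal_net(x, W1, b1, w2, b2):
    """Two-layer sigmoidal network (65)-(66):
    s_i = w_i0 + sum_j x_j * w_ij,  f(x, w) = w_0 + sum_i w_i * phi(s_i),
    with phi(s) = 1 / (1 + exp(-s)).
    W1: (m, p) first-layer weights, b1: (m,) free terms w_i0,
    w2: (m,) output weights, b2: scalar output free term."""
    s = b1 + W1 @ x
    phi = 1.0 / (1.0 + np.exp(-s))
    return b2 + w2 @ phi
```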
In a radial basis function (RBF) network, we will use the following representation of a neuron
s_i = ∑_{j=1}^{p} (w_{ij}^{(1)} (x_j − c_{ij}))^2,   i = 1, 2, …, m,
where x_j are the components of vector x ∈ R^p, and the network parameters are as follows:
w = ((w_i^{(2)}, i = 0, …, m), (w_{ij}^{(1)}, j = 0, …, p, i = 1, …, m), (w_{ij}^{(0)} = c_{ij}, j = 0, …, p, i = 1, …, m)).
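The RBF variant differs from the sigmoidal one only in how the pre-activation s_i is formed; a sketch under the same shape conventions as above, with the output formed as in (65), is shown below.

```python
import numpy as np

def rbf_net(x, W1, C, w2, b2):
    """Network in the forms (65), (67): s_i = sum_j (w_ij * (x_j - c_ij))**2,
    output b2 + sum_i w2_i * phi(s_i) with phi(s) = 1 / (1 + exp(-s)) as in (65).
    W1, C: (m, p) width parameters and centers; w2: (m,) output weights; b2: scalar."""
    s = np.sum((W1 * (x - C))**2, axis=1)
    return b2 + w2 @ (1.0 / (1.0 + np.exp(-s)))
```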
One of the goals of our study is to compare the effectiveness of subgradient methods in solving the problem of approximating a two-layer sigmoidal ANN under conditions of reducing the number of excess neurons using various regularization functionals. To assess the quality of the solution, we will use the value of the root-mean-square error:
S(D, f) = (∑_{(x,y)∈D} (y − f(x, w))^2 / N)^{1/2}
on a test sample of data D = D_{T10.000} uniformly distributed in Ω.
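Computed on a held-out sample, this index is a plain root-mean-square error; a sketch (names are ours) follows.

```python
import numpy as np

def rms_index(X_test, y_test, predict, w):
    """Root-mean-square error S(D, f) over a test sample D = {(x_i, y_i)}."""
    r = y_test - np.array([predict(x, w) for x in X_test])
    return np.sqrt(np.mean(r**2))
```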
In the algorithm we use (Algorithm 6), at the initial stage, an approximation of the ANN is found with a fixed position of the neurons' working areas, using specified centers c_i ∈ R^p, i = 1, 2, …, m, in the approximation area defined by the data. By a neuron's working area, we mean the area of significant change of the neuron activation function. The need for fixation arises due to the possible displacement of the working areas of neurons outside the data area, in which case the neuron degenerates into a constant over the data area. For the RBF network in the forms (65) and (67), fixing the centers is easy, since the center parameters are present in expression (67). For the sigmoidal network in the forms (65) and (66), instead of (66), the following expression is used:
s_i = ∑_{j=1}^{p} (x_j − c_{ij}) w_{ij}^{(1)},   i = 1, 2, …, m,
where the vector w of coefficients does not contain free (bias) terms. In this case, each center c_i lies on the central hyperplane of the working band of the corresponding sigmoidal neuron. Centers c_i inside the data area can be found by some clustering algorithm applied to the data x_i ∈ R^p, i = 1, …, N, which ensures that the neurons are located in areas with high data density. We use the maximin algorithm [45], in which the two data points that are maximally distant from each other are selected as the first two centers, and each new center is the data point whose distance to the nearest already chosen center is maximal. The resulting centers are mainly located on the edges of the data area. Computational experience shows that the use of the k-means method turns out to be ineffective, or effective only with a small number of iterations.
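A compact sketch of the maximin center selection described above (the function name is ours; it assumes 2 ≤ m ≤ N).

```python
import numpy as np

def maximin_centers(X, m):
    """Select m centers from the data X of shape (N, p): the first two centers
    are the pair of points at maximum mutual distance; every further center is
    the data point whose distance to its nearest already chosen center is maximal."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    chosen = [int(i), int(j)]
    while len(chosen) < m:
        nearest = d[:, chosen].min(axis=1)      # distance of every point to its nearest center
        chosen.append(int(np.argmax(nearest)))
    return X[chosen]
```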
Algorithm 6 Training Algorithm
1: On the data D, using the maximin algorithm, form the centers c_i ∈ R^p, i = 1, 2, …, m, where m is the initial number of neurons. Set the regularization parameter α and the type of regularizer R_i(w). Using a generator of uniformly distributed random numbers, determine the initial parameters of the ANN for each neuron.
2: Solve the problem (64) of estimating the parameters W of the neural network with the regularizer R_i(w) at fixed centers c_i ∈ R^p, i = 1, 2, …, m. Use the result to create an initial set of parameters for solving the estimation problem (64) without fixing the centers of the neurons. Denote the resulting set of parameters by W_0.
3: For k = 0, 1, … do
  3.1 Set S_0 = S(D, f_k), where f_k is the neural network obtained by solving problem (64) at the current iteration. Sequentially remove all neurons for which, after removal, the inequality S(D, f̃_k) ≤ (1 + ε_ps) S_0 holds, where ε_ps = 0.1 and f̃_k is the neural network with the neurons removed. If none of the neurons can be removed this way, remove the single neuron whose removal leads to the smallest increase in S(D, f̃_k).
  3.2 If the number of neurons is less than three, then stop the algorithm.
  3.3 Using the neural network parameters of the remaining neurons as initial values, obtain a new approximation W_{k+1} by solving problem (64) for the ANN with the regularizer R_i(w).
  endfor
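The neuron-removal rule of step 3.1 can be sketched as follows; `error` and `drop` are hypothetical helpers standing in for evaluating S(D, f) and for deleting one neuron from the network, and are not part of the paper's notation.

```python
def prune_neurons(net, D, eps_ps=0.1):
    """One pass of step 3.1: remove every neuron whose removal keeps
    S(D, f) within (1 + eps_ps) * S0; if no neuron qualifies, remove the
    single neuron whose removal increases S(D, f) the least."""
    S0 = net.error(D)                        # hypothetical helper: S(D, f_k)
    removed_any = False
    i = net.n_neurons - 1
    while i >= 0:
        candidate = net.drop(i)              # hypothetical helper: network without neuron i
        if candidate.error(D) <= (1.0 + eps_ps) * S0:
            net = candidate
            removed_any = True
        i -= 1
    if not removed_any:
        best = min(range(net.n_neurons), key=lambda i: net.drop(i).error(D))
        net = net.drop(best)
    return net
```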
Initially, problem (64) is solved with an excess number of neurons at fixed centers c_i ∈ R^p, i = 1, …, m, for the RBF network in the forms (65) and (67) or for the sigmoidal network in the forms (65) and (68). Even with a number of parameters that is excessive in comparison with the amount of data, regularization makes it possible to obtain an acceptable solution at this stage.
After solving problem (64) with fixed centers, it is necessary to return to the original description of the sigmoidal network in the forms (65) and (66). This can be done by forming the free term of each neuron
w_{i0}^{(1)} = −∑_{j=1}^{p} c_{ij} w_{ij}^{(1)},   i = 1, 2, …, m,
while leaving the other parameters unchanged. Such an algorithm for finding the initial approximation of the sigmoidal ANN guarantees that the data area will be covered by the working areas of neurons.
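Under the reconstruction of the formula above (the minus sign follows from expanding (68) into the form (66)), the transition back to the original parameterization is a one-line computation; the helper name is ours.

```python
import numpy as np

def form_free_terms(W1, C):
    """Free terms of the sigmoidal neurons after unfixing the centers:
    w_i0 = -sum_j c_ij * w_ij, with all other parameters left unchanged.
    W1, C: (m, p) first-layer weights and the fixed centers."""
    return -np.sum(C * W1, axis=1)
```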
Here is a brief description of the algorithm for constructing an ANN. The algorithm first finds the initial approximation for fixed working areas of neurons and then iterates the removal of insignificant neurons, which is followed by training the trimmed network.
With a limited number of data, the ANN f(x, W_k) with a number of parameters n not exceeding N and the smallest value of S_k = S(D, f_k) is selected as the final approximation model.
Consider examples of solving approximation problems. Table 8, Table 9 and Table 10 show the value of S(D_{T10.000}, f) computed during the operation of the network learning algorithm with the sequential removal of insignificant neurons after network training at step 3.3. The header of each table states the function f_i(x) to be approximated, the initial number of neurons m_0, the number of training data N_0, the type of regularizer, the regularization parameter α, and the index reported in the rows. The first two columns give the number of neurons and the number of ANN parameters; the remaining columns show the values of the index for the tested methods. The values of the index with the limiting order of accuracy are highlighted in bold, which allows one to see the segment over which a high quality of the network is maintained as the number of neurons decreases for each of the presented minimization algorithms. For some of the tables, the last row contains the minimum value of the maximum network deviation on the test data, Δ_min, for the constructed models. The dimensions of problems where the number of model variables exceeds the number of data are underlined. Standard least squares practice recommends having more data than parameters to be estimated; good values of the index in this case emphasize the role of regularization in approximation by neural networks.
In [49], in the domain Ω = [−1, 1] × [−1, 1], the function
f_6(x) = sin(π x_1^2) sin(2π x_2)/2
was approximated by the cascade correlation method using N = 500 uniformly distributed data points. The maximum deviation of the ANN obtained in [49] with m = 41 neurons on a test sample of 1000 data points was Δ ≈ 0.15. Such a result is practically very difficult to obtain using the standard ANN learning apparatus; the authors in [49] did not succeed in obtaining it without first fixing the positions of the working areas of the neurons at the initial stage of the approximation algorithm. In our work, we obtained a series of ANN models with a smaller number of network training data, N = 150, and with an assessment of the results on a test sample D_{T10.000} consisting of 10,000 data points, with good approximation quality (Δ < 0.09 for the selected index values). For example, for some of the constructed models, Δ ≈ 0.02, which is almost an order of magnitude less than the result from [49].
Table 8 shows the results of the approximation of the function f_6(x) using the smooth regularizer R_2(w). The quasi-Newtonian method (QN) was also used here. Using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons. The QN method is inferior in approximation quality to the subgradient methods. Note that in some cases the number of network parameters exceeds the number of data; at the same time, the network quality index is not worse than in the region with m < 38. For the methods rOM(α) and QN, the best indexes are in the region m > 37.
Table 9 shows the results of approximating function f_6(x) by the sigmoidal ANN using the nonsmooth regularizer R_1(w) (the "Tibshirani lasso" technique [3]). Here, the trend in the relative efficiency of the methods continues: using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons.
Table 10 shows the results of approximating function f_6(x) by the sigmoidal ANN using the nonsmooth regularizer R_γ(w). Using the RA(α_k), RA(α = const) and RSD methods, it is possible to obtain a better-quality approximation with a smaller number of neurons; with a nonsmooth regularizer, an ANN with good approximation quality characteristics can be obtained with fewer neurons.
Based on the results of Table 8, Table 9 and Table 10, it can be concluded that the use of regularizers makes it possible to obtain a qualitative approximation even in the case when the number of parameters of the neural network function exceeds the number of data.
In [49], on N = 625 data points generated by a uniform random number generator in the domain Ω = [−3, 3] × [−3, 3], the following function was approximated:
f_7(x_1, x_2) = 3(1 − x_1)^2 e^{−x_1^2 − (x_2 + 1)^2} − 10(x_1/5 − x_1^3 − x_2^5) e^{−x_1^2 − x_2^2} − (1/3) e^{−(x_1 + 1)^2 − x_2^2}.
The maximum deviation of the RBF-based ANN constructed in [49] on a test sample of 1000 data points was Δ_{1000} = 0.06. Function f_7 is a typical example of a function that is convenient to approximate by an RBF network. In this work, we obtained a series of RBF-based ANN models with a smaller number of network training data, N = 150, and with an assessment of the results on the test sample D_{T10.000} consisting of 10,000 data points, with good quality of approximation. For example, several of the constructed models give a value that is an order of magnitude smaller: Δ_{10.000} = 0.0024 (see Table 11).
Table 11 shows the value of the index S(D_{T10.000}, f) computed during the operation of the network learning algorithm with the sequential removal of insignificant neurons after training the network at step 3.3. The initial number of neurons is 36. The first two columns indicate the number of neurons and the number of ANN parameters. The last row of the table shows the minimum value of the maximum network deviation on the test data, Δ_min, for the constructed models. On this function, the methods RA(α_k), RA(α = const) and RSD turned out to be equivalent in quality of approximation.
In [50], for testing purposes, an RBF ANN was built on data uniformly distributed in the domain Ω = [−3, 3] × [−3, 3] for the function
f_8(x_1, x_2) = x_1^2 + x_2^2
with the number of data N = 100. In this case, the achieved value of the root-mean-square error on the training sample was S(D_100, f) = 10^{−6} [50]. We have built a number of sigmoidal ANNs with a value of the quality index on the test sample that is several orders of magnitude lower. The values of the index S(D_{T10.000}, f) on the test sample, depending on the number of neurons in the ANN, are given in Table 12. Here, as earlier, the algorithms RA(α_k) and RA(α = const) manage to obtain a longer series of ANN models with good quality of approximation.
In this section, an ANN training technology was presented in which nonsmooth optimization methods are an integral component. A specific feature of ANN approximation problems is the lack of convexity of the minimized functions. The fastest methods, RA(α_k) and RA(α = const), turn out to be more effective in solving ANN approximation problems and make it possible to obtain models with a smaller number of neurons. Nevertheless, the experience of solving similar approximation problems suggests that, when solving an applied problem, it is better to have several alternative minimization methods to choose from.

11. Conclusions

The statement of the problem consisted in constructing a rapidly converging algorithm for finding the descent direction in the minimization method; this direction must form an obtuse angle with all subgradients of some neighborhood of the current minimum approximation, which form a separable set. Minimization along such a direction allows the algorithm to go beyond this neighborhood. This is the problem of constructing a hyperplane separating the origin from a separable set, whose normal is the desired direction. As a result, we obtain the problem of solving a system of inequalities, to which learning algorithms, for example, the perceptron learning algorithm [45], can be applied.
Formalization of the subgradient sets model made it possible to reduce the problem of solving a system of inequalities to an approximation problem, for the solution of which an algorithm with space dilation was proposed, which is ideologically close to the iterative least squares method. Taking into account the peculiarities of subgradient sets makes it possible to improve the standard ILS scheme and obtain an effective rapidly converging iterative method for finding the descent direction in the minimization method based on subgradients obtained in the process of the one-dimensional search.
The new algorithm for solving inequalities was theoretically substantiated, and an estimate of its convergence rate was obtained depending on the parameters of the subgradient set. On this basis, a new subgradient minimization method was developed and justified. On quadratic functions, the proposed method has the properties of the conjugate gradient method. The outlined approach to creating learning algorithms can be used to develop new learning algorithms with space dilation for relaxation subgradient minimization.
A practically implementable version of the minimization algorithm has been developed, which uses a rough one-dimensional search. The performed computational experiment on complex large-sized functions confirms the effectiveness of the proposed relaxation subgradient minimization method.
The possibility of using the relaxation subgradient minimization method for nonsmooth non-convex optimization problems allows it to be applied to neural network training problems where it is required to remove insignificant variables or neurons by methods similar to the Tibshirani lasso. Algorithms of this type are of great practical importance due to their high convergence rate and the possibility of using them to minimize non-convex functions, for example, when estimating the parameters of mathematical models under conditions of nonsmooth regularization, used for the purpose of model feature reduction [3,29,46]. The effectiveness of using the proposed relaxation subgradient minimization method in one such technology has been demonstrated in the present work.

Author Contributions

Conceptualization, V.K. and L.K.; methodology, V.K., S.G. and L.K.; software, S.G.; validation, E.T. and E.S.; formal analysis, E.S.; investigation, L.K. and E.S.; resources, L.K. and E.S.; data curation, V.K.; writing—original draft preparation, V.K., S.G., E.T. and L.K; writing—review and editing, E.T. and L.K.; visualization, V.K. and E.T.; supervision, E.S.; project administration, L.K.; funding acquisition, L.K. and E.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Ministry of Science and Higher Education of the Russian Federation (Project FEFE-2020-0013).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Krutikov, V.; Meshechkin, V.; Kagan, E.; Kazakovtsev, L. Machine Learning Algorithms of Relaxation Subgradient Method with Space Extension. In Mathematical Optimization Theory and Operations Research: MOTOR 2021; Lecture Notes in Computer Science; Pardalos, P., Khachay, M., Kazakov, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12755, pp. 477–492. [Google Scholar]
  2. Lauer, F.; Le, V.; Bloch, G. Learning smooth models of nonsmooth functions via convex optimization. In Proceedings of the 2012 IEEE International Workshop on Machine Learning for Signal Processing, Santander, Spain, 23–26 September 2012; Volume 1, pp. 1–6. [Google Scholar]
  3. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  4. Friedman, J.; Hastie, T.; Tibshirani, R.J. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Chang, K.; Hsieh, C.; Lin, C. Coordinate descent method for largescale l2-loss linear support vector machines. J. Mach. Learn. Res. 2008, 9, 1369–1398. [Google Scholar]
  6. Pierucci, F. Nonsmooth Optimization for Statistical Learning with Structured Matrix Regularization. Ph.D Thesis, Université Grenoble Alpes, Grenoble, France, 2017. [Google Scholar]
  7. Hahnloser, R.; Sarpeshkar, R.; Mahowald, M.; Douglas, R.; Seung, H. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 2000, 405, 947–951. [Google Scholar] [CrossRef]
  8. Nesterov, Y. Subgradient Optimization; John Wiley and Sons, Inc.: Hoboken, NJ, USA, 2009. [Google Scholar]
  9. Golshtein, E.; Nemirovsky, A.; Nesterov, Y. Level method, its generalizations and applications. Econ. Math. Methods 1995, 31, 164–180. [Google Scholar]
  10. Nesterov, Y. Universal gradient methods for convex optimization problems. Math. Program. Ser. A 2015, 152, 381–404. [Google Scholar] [CrossRef] [Green Version]
  11. Gasnikov, A.; Nesterov, Y. Universal method for stochastic composite optimization problems. Comput. Math. Math. Phys. 2018, 58, 48–64. [Google Scholar] [CrossRef]
  12. Nesterov, Y. Smooth minimization of nonsmooth functions. Math. Program. 2005, 103, 127–152. [Google Scholar] [CrossRef]
  13. Ouyang, H.; Gray, A. Stochastic smoothing for nonsmooth minimizations: Accelerating SGD by exploiting structure. In Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, UK, 26 June–1 July 2012; Volume 1, pp. 33–40. [Google Scholar]
  14. Gasnikov, A.; Lagunovskaya, A.; Usmanova, I.; Fedorenko, F. Gradient-free proximal methods with inexact oracle for convex stochastic nonsmooth optimization problems on the simplex. Autom. Remote Control 2016, 77, 2018–2034. [Google Scholar] [CrossRef]
  15. Shor, N.Z. Applying the gradient descent method to solve transportation network problem. In Issues in Cybernetics and Operational Research; Scientific Council on Cybernetics AS UkrSSR: Kyiv, Ukraine, 1962; pp. 9–17. [Google Scholar]
  16. Polyak, B. A general method for solving extremum problems. Sov. Math. Dokl. 1967, 8, 593–597. [Google Scholar]
  17. Polyak, B. Introduction to Optimization; Optimization Software: New York, NY, USA, 1987. [Google Scholar]
  18. Wolfe, P. Note on a method of conjugate subgradients for minimizing nondifferentiable functions. Math. Program. 1974, 7, 380–383. [Google Scholar] [CrossRef]
  19. Lemarechal, C. An extension of Davidon methods to non-differentiable problems. Math. Program. Study 1975, 3, 95–109. [Google Scholar]
  20. Demyanov, V. Nonsmooth Optimization. In Nonlinear Optimization; Lecture Notes in Mathematics; Di Pillo, G., Schoen, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2010; Volume 1989, pp. 55–163. [Google Scholar]
  21. Nemirovsky, A.; Yudin, D. Problem Complexity and Method Efficiency in Optimization; Wiley: Chichester, UK, 1983. [Google Scholar]
  22. Shor, N. Minimization Methods for Nondifferentiable Functions; Springer: Berlin/Heidelberg, Germany, 1985. [Google Scholar]
  23. Polyak, B. Optimization of non-smooth composed functions. USSR Comput. Math. Math. Phys. 1969, 9, 507–521. [Google Scholar]
  24. Krutikov, V.; Samoilenko, N.; Meshechkin, V. On the properties of the method of minimization for convex functions with relaxation on the distance to extremum. Autom. Remote Control 2019, 80, 102–111. [Google Scholar] [CrossRef]
  25. Tsypkin, Y.Z. Foundations of the Theory of Learning Systems; Academic Press: New York, NY, USA, 1973. [Google Scholar]
  26. Krutikov, V.N.; Petrova, T. Relaxation method of minimization with space extension in the subgradient direction. Ekon. Mat. Met. 2003, 39, 106–119. [Google Scholar]
  27. Cao, H.; Song, Y.; Khan, K. Convergence of Subtangent-Based Relaxations of Nonlinear Programs. Processes 2019, 7, 221. [Google Scholar] [CrossRef]
  28. Krutikov, V.N.; Gorskaya, T. A family of subgradient relaxation methods with rank 2 correction of metric matrices. Ekon. Mat. Met. 2009, 45, 37–80. [Google Scholar]
  29. Krutikov, V.; Meshechkin, V.; Kagan, E.; Kazakovtsev, L. Approximation Capability to Compact Sets of Functions and Operators by Feedforward Neural Networks. In Mathematical Optimization Theory and Operations Research; Lecture Notes in Computer Science; Pardalos, P., Khachay, M., Kazakov, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2021; Volume 12755, pp. 477–493. [Google Scholar]
  30. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 1989, 2, 303–314. [Google Scholar] [CrossRef]
  31. Funahashi, K.I. On the approximate realization of continuous mappings by neural networks. Neural Netw. 1989, 2, 183–192. [Google Scholar] [CrossRef]
  32. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
  33. Guliyev, N.J.; Ismailov, V.E. Approximation capability of two hidden layer feedforward neural networks with fixed weights. Neurocomputing 2018, 316, 262–269. [Google Scholar] [CrossRef] [Green Version]
  34. Hanin, B.; Sellke, M. Approximating continuous functions by ReLU nets of minimal width. arXiv 2017, arXiv:1710.11278. [Google Scholar]
  35. Petersen, P.; Voigtlaender, F. Optimal approximation of piecewise smooth functions using deep ReLU neural networks. Neural Netw. 2018, 108, 296–330. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  36. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114. [Google Scholar] [CrossRef] [Green Version]
  37. Tsypkin, Y.Z.; Gupta, M.; Jin, L.; Homma, N. Static and Dynamic Neural Networks: From Fundamentals to Advanced Theory; John Wiley and Sons: Hoboken, NJ, USA, 2003. [Google Scholar]
  38. Wei, W.; Nan, D.; Li, Z.; Long, J.; Wang, J. Approximation Capability to Compact Sets of Functions and Operators by Feedforward Neural Networks. In Proceedings of the 2007 Second International Conference on Bio-Inspired Computing: Theories and Applications, Zhengzhou, China, 14–17 September 2007; pp. 82–86. [Google Scholar]
  39. Gribonval, R.; Kutyniok, G.; Nielsen, M.; Voigtlaender, F. Approximation spaces of deep neural networks. arXiv 2020, arXiv:1905.01208. [Google Scholar] [CrossRef]
  40. Liu, Z.; Tilman, H.; Masahito, U. Neural networks fail to learn periodic functions and how to fix it. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, BC, Canada, 6–12 December 2020; pp. 1583–1594. [Google Scholar]
  41. Wang, M.X.; Qu, W. Approximation capabilities of neural networks on unbounded domains. Neural Netw. 2022, 145, 56–67. [Google Scholar] [CrossRef]
  42. Tikhonov, A.; Arsenin, V. Solutions of Ill-Posed Problems; John Wiley and Sons: New York, NY, USA, 1977. [Google Scholar]
  43. Krutikov, V.; Samoilenko, N.; Nasyrov, I.; Kazakovtsev, L. On the applicability of non-smooth regularization in construction of radial artificial neural networks. Control Syst. Inf. Technol. 2018, 2, 70–75. [Google Scholar]
  44. Nurminskii, E.; Thien, D. Method of conjugate subgradients with constrained memory. Autom. Remote Control 2014, 75, 646–656. [Google Scholar] [CrossRef]
  45. Neimark, J. Perceptron and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  46. Krutikov, V.; Kazakovtsev, L.; Shkaberina, G.; Kazakovtsev, V. New method of training two-layer sigmoid neural networks using regularization. IOP Conf. Ser. Mater. Sci. Eng. 2019, 537, 042055. [Google Scholar] [CrossRef]
  47. Kaczmarz, S. Approximate solution of systems of linear equations. Int. J. Control 1993, 57, 1269–1271. [Google Scholar] [CrossRef]
  48. Lorentz, G. Approximation of Functions; American Mathematical Society: Providence, RI, USA, 2005. [Google Scholar]
  49. Osovski, S. Neural Networks for Information Processing; Hot Line-Telecom: Moscow, Russia, 2016. [Google Scholar]
  50. Filippov, V.; Elisov, L.; Gorbachenko, V. Radial basis function networks learning to solve approximation problems. Int. J. Civ. Eng. Technol. 2019, 10, 872–881. [Google Scholar]
Figure 1. Set G belongs to a hyperplane.
Figure 2. Projections of approximations s_{k+1} in the plane of vectors g_k, g_{k−1}.
Figure 3. The set G and its characteristics.
Figure 4. Level lines, subgradient set and minimum length vector.
Figure 5. Characteristics of the set G in the plane of vectors g̃_k, g̃_{k−1}.
Table 1. Function f_1(x) minimization results.
n | RA(α_k) | RA(α = const) | RSD | rOM(α) | QN
100 | 1494 | 1834 | 2127 | 2333 | 107
200 | 3474 | 3896 | 4585 | 5244 | 216
300 | 5507 | 6317 | 7117 | 8480 | 324
400 | 7690 | 8548 | 9791 | 11,773 | 432
500 | 9760 | 11,510 | 12,366 | 15,281 | 542
600 | 12,133 | 13,889 | 15,537 | 19,073 | 650
700 | 13,933 | 16,394 | 18,450 | 22,500 | 757
800 | 16,492 | 18,721 | 21,387 | 26,096 | 867
900 | 17,774 | 21,606 | 24,671 | 30,233 | 975
1000 | 20,324 | 24,206 | 27,447 | 34,702 | 1084
Table 2. Function f_2(x) minimization results.
n | RA(α_k) | RA(α = const) | RSD | rOM(α) | QN
100 | 304 | 381 | 480 | 482 | 124
200 | 525 | 621 | 788 | 852 | 239
300 | 715 | 842 | 1063 | 1223 | 336
400 | 869 | 1015 | 1307 | 1587 | 423
500 | 1065 | 1241 | 1497 | 1900 | 504
600 | 1217 | 1368 | 1742 | 2188 | 582
700 | 1366 | 1527 | 1898 | 2512 | 658
800 | 1465 | 1721 | 2095 | 2829 | 733
900 | 1602 | 1885 | 2293 | 3101 | 855
1000 | 1791 | 2019 | 2555 | 3300 | 1022
Table 3. Function f_3(x) minimization results.
n | RA(α_k) | RA(α = const) | RSD | rOM(α) | QN
200 | 159 | 148 | 365 | 295 | 440
400 | 221 | 200 | 395 | 505 | 638
600 | 258 | 248 | 409 | 702 | 833
800 | 295 | 280 | 421 | 900 | 1030
1000 | 336 | 317 | 433 | 1094 | 1205
Table 4. Function f_4(x) minimization results.
n | RA(α_k) | RA(α = const) | RSD | rOM(α)
100 | 2248 | 2714 | 4214 | 3505
200 | 4988 | 6010 | 9087 | 8826
300 | 7680 | 9301 | 11,144 | 14,018
400 | 10,625 | 12,808 | 23,687 | 19,549
500 | 13,490 | 16,656 | 28,037 | 24,865
600 | 16,466 | 20,207 | 39,703 | 31,502
700 | 20,122 | 22,850 | 44,573 | 38,796
800 | 23,016 | 27,653 | 52,380 | 44,200
900 | 25,913 | 31,982 | 61,631 | 43,502
1000 | 28,962 | 35,792 | 72,175 | 49,050
Table 5. Function f_4(x) minimization results with subgradient distortion.
n | RA(α_k) | RA(α = const) | RSD | rOM(α)
100 | 2505 | 3135 | 5739 | 4777
200 | 5538 | 7393 | 13,364 | 10,665
300 | 9033 | 11,646 | 20,589 | 16,889
400 | 12,886 | 18,705 | 30,132 | 23,397
500 | 18,490 | 22,059 | 35,544 | 30,015
600 | 20,742 | 29,976 | 47,664 | 36,749
700 | 25,524 | 39,258 | 54,768 | 43,589
800 | 30,462 | 43,961 | 68,944 | 50,737
900 | 33,570 | 44,089 | 78,697 | 57,817
1000 | 39,764 | 49,772 | 82,490 | 64,777
Table 6. Function f_5(x) minimization results.
n | RA(α_k) | RA(α = const) | RSD | rOM(α)
200 | 3401 | 3551 | 3151 | 6906
400 | 7483 | 7707 | 6431 | 14,596
600 | 11,678 | 11,851 | 10,280 | 22,853
800 | 15,868 | 16,088 | 14,020 | 31,259
1000 | 19,893 | 20,867 | 17,707 | 39,275
Table 7. Function f_5(x) minimization results with subgradient distortion.
n | RA(α_k) | RA(α = const) | RSD | rOM(α)
200 | 3773 | 3816 | 6805 | 7463
400 | 7976 | 8370 | 16,188 | 15,246
600 | 12,654 | 13,548 | 24,873 | 24,542
800 | 17,286 | 18,411 | 35,251 | 33,600
1000 | 22,344 | 23,559 | 46,873 | 41,773
Table 8. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_2(w), m_0 = 50, N_0 = 150, α = 10^{−7}; the reported index is S(D_{T10.000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
m | n | RA(α_k) | RA(α = const) | rOM(α) | RSD | QN
50 | 151 | 0.000271 | 0.000807 | 0.000446 | 0.000389 | 0.00126
49 | 197 | 0.0000228 | 0.0000699 | 0.0000153 | 0.0000601 | 0.000259
48 | 193 | 0.0000171 | 0.000057 | 0.0000106 | 0.0000242 | 0.00014
47 | 189 | 0.00002 | 0.0000188 | 0.0000148 | 0.0000274 | 0.000129
46 | 185 | 0.0000190 | 0.0000268 | 0.0000173 | 0.0000194 | 0.000693
45 | 181 | 0.0000166 | 0.0000253 | 0.0000477 | 0.0000158 | 0.000774
44 | 177 | 0.0000142 | 0.0000201 | 0.0000796 | 0.0000222 | 0.00442
43 | 173 | 0.0000158 | 0.0000213 | 0.0000556 | 0.0000275 | 0.0038
42 | 169 | 0.0000158 | 0.0000260 | 0.0000966 | 0.0000260 | 0.064
41 | 165 | 0.0000168 | 0.0000183 | 0.00022 | 0.0000384 | 0.0906
40 | 161 | 0.0000289 | 0.0000673 | 0.000107 | 0.00127 | 0.039
39 | 157 | 0.00006 | 0.0000507 | 0.000367 | 0.0000249 | 0.0597
38 | 153 | 0.000036 | 0.0000656 | 0.00114 | 0.0000461 | 0.0544
37 | 149 | 0.0000305 | 0.0000298 | 0.000261 | 0.0000298 | 0.437
36 | 145 | 0.000037 | 0.0000794 | 0.00203 | 0.0000323 | 0.16
35 | 141 | 0.0000437 | 0.0000294 | 0.00115 | 0.0000366 | 0.256
34 | 137 | 0.0000171 | 0.0000315 | 0.000965 | 0.00012 | 1.82
33 | 133 | 0.0000655 | 0.0000239 | 0.000547 | 0.0000886 | 0.459
32 | 129 | 0.000468 | 0.000014 | 0.00315 | 0.000882 | 2.91
31 | 125 | 0.000215 | 0.000153 | 0.0115 | 0.00051 | 2.22
30 | 121 | 0.000175 | 0.0000624 | 0.00139 | 0.0031 | 4.7
Δ_min |  | 0.0234 | 0.0236 | 0.0263 | 0.0287 | 0.9
Table 9. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_1(w), m_0 = 50, N_0 = 150, α = 10^{−7}; the reported index is S(D_{T10.000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
m | n | RA(α_k) | RA(α = const) | rOM(α) | RSD
50 | 151 | 0.000271 | 0.000807 | 0.000446 | 0.000389
49 | 197 | 0.0000214 | 0.0000158 | 0.0000182 | 0.0000315
48 | 193 | 0.0000339 | 0.0000057 | 0.0000307 | 0.0000211
47 | 189 | 0.0000287 | 0.00000842 | 0.0000547 | 0.0000368
46 | 185 | 0.0000139 | 0.0000106 | 0.0000341 | 0.0000443
45 | 181 | 0.0000161 | 0.0000372 | 0.000103 | 0.0000371
44 | 177 | 0.0000142 | 0.000008 | 0.000263 | 0.0000226
43 | 173 | 0.0000232 | 0.0000139 | 0.000455 | 0.0000389
42 | 169 | 0.0000317 | 0.0000348 | 0.00127 | 0.0000154
41 | 165 | 0.0000259 | 0.0000657 | 0.00127 | 0.000183
40 | 161 | 0.0000179 | 0.0000593 | 0.00357 | 0.0000189
39 | 157 | 0.0000209 | 0.000784 | 0.00163 | 0.00148
38 | 153 | 0.0000239 | 0.000977 | 0.000139 | 0.00537
37 | 149 | 0.0000245 | 0.000478 | 0.000404 | 0.000232
36 | 145 | 0.0000157 | 0.000605 | 0.00042 | 0.00189
35 | 141 | 0.0000148 | 0.000272 | 0.00415 | 0.000476
34 | 137 | 0.0000168 | 0.000353 | 0.00362 | 0.000549
33 | 133 | 0.0000162 | 0.00185 | 0.00246 | 0.00324
32 | 129 | 0.00024 | 0.000263 | 0.0019 | 0.00498
31 | 125 | 0.0000891 | 0.000844 | 0.000925 | 0.0134
Δ_min |  | 0.02 | 0.025 | 0.0373 | 0.03
Table 10. Results of function f_6(x) approximation by a sigmoidal ANN with a regularizer R_γ(w), m_0 = 50, N_0 = 150, α = 10^{−7}; the reported index is S(D_{T10.000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
m | n | RA(α_k) | RA(α = const) | rOM(α) | RSD
50 | 151 | 0.000271 | 0.000807 | 0.000446 | 0.0000389
49 | 197 | 0.0000195 | 0.0000263 | 0.0000327 | 0.0000108
48 | 193 | 0.0000292 | 0.0000272 | 0.0000207 | 0.000023
47 | 189 | 0.0000257 | 0.0000134 | 0.0000981 | 0.0000355
46 | 185 | 0.0000294 | 0.0000146 | 0.000906 | 0.0000502
45 | 181 | 0.0000264 | 0.0000227 | 0.00031 | 0.0000761
44 | 177 | 0.000032 | 0.0000331 | 0.0000578 | 0.0000773
43 | 173 | 0.0000332 | 0.0000335 | 0.000986 | 0.000079
42 | 169 | 0.0000307 | 0.000044 | 0.000887 | 0.000079
41 | 165 | 0.0000307 | 0.000044 | 0.00174 | 0.000079
40 | 161 | 0.000042 | 0.000044 | 0.00174 | 0.000079
39 | 157 | 0.0000406 | 0.000044 | 0.00126 | 0.000079
38 | 153 | 0.0000241 | 0.0000466 | 0.000242 | 0.0000778
37 | 149 | 0.0000477 | 0.0000407 | 0.000169 | 0.0000596
36 | 145 | 0.0000267 | 0.0000571 | 0.00115 | 0.0000455
35 | 141 | 0.0000297 | 0.0000286 | 0.000335 | 0.0000588
34 | 137 | 0.0000185 | 0.0000192 | 0.000689 | 0.000057
33 | 133 | 0.0000177 | 0.000018 | 0.000598 | 0.0000593
32 | 129 | 0.0000142 | 0.0000129 | 0.000795 | 0.0000464
31 | 125 | 0.0000171 | 0.0000319 | 0.000402 | 0.0000608
30 | 121 | 0.0000153 | 0.000168 | 0.000669 | 0.0000681
29 | 117 | 0.0000138 | 0.000312 | 0.00301 | 0.0000502
28 | 113 | 0.0000384 | 0.000106 | 0.00117 | 0.0000405
27 | 109 | 0.0000346 | 0.001 | 0.00255 | 0.0000519
26 | 105 | 0.0000288 | 0.000487 | 0.00374 | 0.0000674
Δ_min |  | 0.02 | 0.026 | 0.036 | 0.028
Table 11. Results of function f_7(x) approximation by an RBF ANN with a regularizer R_γ(w), m_0 = 36, N_0 = 150, α = 10^{−7}; the reported index is S(D_{T10.000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
m | n | RA(α_k) | RA(α = const) | rOM(α) | RSD
36 | 109 | 0.0706 | 0.037 | 0.244 | 0.0596
35 | 176 | 0.000196 | 0.00214 | 0.000304 | 0.000165
34 | 171 | 0.00000852 | 0.000000616 | 0.0000287 | 0.00000127
33 | 166 | 0.000000445 | 0.000000128 | 0.00000736 | 0.0000000735
32 | 161 | 0.000000601 | 0.000000169 | 0.00000929 | 0.000000192
31 | 156 | 0.000000229 | 0.00000017 | 0.00000491 | 0.000000193
30 | 151 | 0.000000323 | 0.000000168 | 0.00000208 | 0.000000191
29 | 146 | 0.000000304 | 0.000000167 | 0.00000219 | 0.00000019
28 | 136 | 0.000000273 | 0.000000168 | 0.00000102 | 0.000000192
27 | 136 | 0.000000273 | 0.000000168 | 0.000000523 | 0.000000193
26 | 131 | 0.000000558 | 0.000000168 | 0.000000518 | 0.000000291
25 | 126 | 0.000000187 | 0.000000166 | 0.00000046 | 0.000000291
24 | 121 | 0.000000523 | 0.000000324 | 0.000000502 | 0.000000293
23 | 116 | 0.000000397 | 0.000000323 | 0.00000052 | 0.000000288
22 | 116 | 0.000000397 | 0.000000319 | 0.00000052 | 0.000000214
21 | 116 | 0.000000397 | 0.000000325 | 0.00000035 | 0.000000205
20 | 116 | 0.000000397 | 0.000000192 | 0.000000496 | 0.000000203
19 | 96 | 0.000000521 | 0.000000195 | 0.000000529 | 0.000000161
18 | 91 | 0.000000613 | 0.0000000755 | 0.000000596 | 0.000000148
17 | 86 | 0.000000747 | 0.000000103 | 0.000000392 | 0.000000138
16 | 81 | 0.000000152 | 0.000000163 | 0.000181 | 0.00000014
15 | 76 | 0.000000124 | 0.00000017 | 0.0000167 | 0.000000144
Δ_min |  | 0.0024 | 0.0024 | 0.005 | 0.0025
Table 12. Results of function f_8(x) approximation by a sigmoidal ANN with a regularizer R_γ(w), m_0 = 30, N_0 = 100, α = 10^{−7}; the reported index is S(D_{T10.000}, f). The values with the limiting order of accuracy are given in bold. The dimensions of problems where the number of variables exceeds the number of data are underlined.
m | n | RA(α_k) | RA(α = const) | rOM(α) | RSD
30 | 91 | 0.000378 | 0.00135 | 0.000179 | 0.000464
29 | 117 | 0.000000041 | 0.000000437 | 0.00000121 | 0.0000112
28 | 113 | 0.000000041 | 0.00000000794 | 0.00000000707 | 0.000000866
27 | 109 | 0.000000000862 | 0.00000000343 | 0.00000000372 | 0.000000089
26 | 105 | 0.000000000711 | 0.00000000152 | 0.00000000105 | 0.00000018
25 | 101 | 0.000000000685 | 0.00000000115 | 0.00000000341 | 0.00000000607
24 | 97 | 0.00000000567 | 0.000000000798 | 0.00000000154 | 0.00000000091
23 | 93 | 0.00000000156 | 0.000000000717 | 0.00000000315 | 0.000000000333
22 | 89 | 0.00000000378 | 0.000000000572 | 0.00000000435 | 0.00000000128
21 | 85 | 0.00000000272 | 0.000000000291 | 0.00000000847 | 0.00000000127
20 | 81 | 0.0000000026 | 0.000000000246 | 0.000000108 | 0.00000000162
19 | 77 | 0.000000000888 | 0.000000000612 | 0.000000396 | 0.00000000775
18 | 73 | 0.0000000345 | 0.0000000326 | 0.0000000162 | 0.0000000136
17 | 69 | 0.00000000114 | 0.00000000296 | 0.000000105 | 0.0000000181
16 | 65 | 0.00000000413 | 0.00000000747 | 0.000000113 | 0.000000168
15 | 61 | 0.0000000395 | 0.0000000982 | 0.0000000495 | 0.000000054
14 | 57 | 0.00000000245 | 0.0000000221 | 0.000000328 | 0.0000000475
Δ_min |  | 0.000244 | 0.000293 | 0.000331 | 0.000301
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
