Article

Analytic Function Approximation by Path-Norm-Regularized Deep Neural Networks

by
Aleksandr Beknazaryan
Institute of Environmental and Agricultural Biology (X-BIO), University of Tyumen, Volodarskogo 6, 625003 Tyumen, Russia
Entropy 2022, 24(8), 1136; https://doi.org/10.3390/e24081136
Submission received: 6 July 2022 / Revised: 11 August 2022 / Accepted: 15 August 2022 / Published: 16 August 2022
(This article belongs to the Special Issue Entropy in Soft Computing and Machine Learning Algorithms II)

Abstract

We show that neural networks with an absolute value activation function and with network path norm, network sizes and network weights having logarithmic dependence on $1/\varepsilon$ can $\varepsilon$-approximate functions that are analytic on certain regions of $\mathbb{C}^d$.

1. Introduction

Deep neural networks have found broad applications in many areas and disciplines, such as computer vision, speech and audio recognition and natural language processing. Two of the main characteristics of a given class of neural networks are its complexity and its approximation capability. Once the activation function is selected, a class of networks is determined by the specification of the network architecture (namely, its depth and width) and the choice of network weights. Hence, the complexity of a given class is estimated by regularizing (one of) those parameters, and the approximation properties of the resulting regularized classes of networks are then investigated.
The capability of shallow networks of depth 1 to approximate continuous functions is shown in the universal approximation theorem ([1]), and approximations of integrable functions by networks with fixed width are presented in [2]. Network-architecture-constrained approximations of analytic functions are given in [3], where it is shown that ReLU networks with depth depending logarithmically on $1/\varepsilon$ and width $d+4$ can $\varepsilon$-approximate analytic functions on the closed subcubes of $(-1,1)^d$.
The weight regularization of networks is usually carried out by imposing an $\ell_p$-related constraint on the network weights, $p \ge 0$. The most popular types of such constraints include the $\ell_0$, $\ell_1$ and the path norm regularizations (see, respectively, [4,5,6] and references therein). Approximations of $\beta$-smooth functions on $[0,1]^d$ by $\ell_0$-regularized sparse ReLU networks are given in [5,7], and exponential rates of approximation of analytic functions by $\ell_0$-regularized networks are derived in [8].
Path-norm-regularized classes of deep ReLU networks are considered in [4], where, together with other characteristics, the Rademacher complexities of those classes are estimated. The network size independence of those estimates makes the path norm regularization particularly remarkable. As the estimation only uses the Lipschitz continuity (with Lipschitz constant 1), the idempotency and the non-negative homogeneity of the ReLU function, it can be extended to networks with the absolute value activation function. Network characteristics similar to the path norm are also considered in the works [9,10], where they are called, respectively, a variation and a basis-path norm, and statistical features of classes of networks are described in terms of those characteristics.
The objective of the present paper is the construction of path-norm-regularized networks that approximate analytic functions exponentially fast. Our goal is to achieve such convergence rates with activations that are idempotent, non-negative homogeneous and Lipschitz continuous with Lipschitz constant 1, so that the constructed path-norm-regularized networks fall within the scope of the network classes studied in [4]. It turns out that networks with an absolute value activation function may suit this goal better than networks with a ReLU activation function. More precisely, we show that analytic functions can be $\varepsilon$-approximated by networks with an absolute value activation function $a(x)$ and with the path norm, the depth, the width and the weights all depending logarithmically on $1/\varepsilon$. Such an approximation holds (i) on any subset $(0, 1-\delta]^d \subset (0,1)^d$ for functions that are analytic on $(0,1)^d$ with absolutely convergent power series; (ii) on the whole hypercube $[0,1]^d$ for functions that can be analytically continued to certain subsets of $\mathbb{C}^d$. Since the network weights, as well as the total number of weights, depend logarithmically on $1/\varepsilon$, the $\ell_1$ weight norms of the constructed approximating deep networks also depend logarithmically on $1/\varepsilon$.
Note that the absolute value activation function considered in this paper is among the common built-in activation functions of the neuroevolution framework NEAT-Python ([11]). Training algorithms for networks with an absolute value activation function are developed in [12,13]. In addition, the VC-dimensions and the structures of the loss surfaces of neural networks with piecewise linear activation functions, including the absolute value function, are described in [14,15].
Notation: For a matrix $W \in \mathbb{R}^{d_1\times d_2}$, we denote by $|W| \in \mathbb{R}^{d_1\times d_2}$ the matrix obtained by taking the absolute values of the entries of W: $|W|_{ij} = |W_{ij}|$. For brevity, we will say that the matrix $|W|$ is the absolute value of the matrix W (note that other definitions of the absolute value of a matrix also appear in the literature). The path norm of a neural network f is denoted by $\|f\|_\times$. For $x = (x_1,\ldots,x_d)\in\mathbb{R}^d$ and $\mathbf{k} = (k_1,\ldots,k_d)\in\mathbb{N}_0^d$, the degree of the monomial $x^{\mathbf{k}} = x_1^{k_1}\cdots x_d^{k_d}$ is defined to be $\|\mathbf{k}\|_1 = \sum_{i=1}^d k_i$. To ensure that the matrix–vector multiplications are well defined, vectors from $\mathbb{R}^d$ may, according to the context, be treated as matrices from either $\mathbb{R}^{d\times 1}$ or $\mathbb{R}^{1\times d}$.

2. The Class of Approximant Networks

Neural networks are composed of weight matrices, biases and nonlinear activation functions acting neuron-wise in the hidden layers. The biases, also called shift vectors, can be omitted by adding a fixed coordinate 1 to the input vector and correspondingly modifying the weight matrices. As the definition of the path norm of networks does not assume the presence of shift vectors, we will add a coordinate 1 to the input vector x and will consider classes of neural networks of the form
$$\mathcal{F}_\alpha(L, \mathbf{p}) = \Big\{ f: [0,1]^p \to \mathbb{R}^{p_{L+1}} \;\Big|\; f(x) = W_L\,\alpha\, W_{L-1}\,\alpha \cdots \alpha\, W_0\,(1, x)^\top \Big\},$$
where $W_i \in \mathbb{R}^{p_{i+1}\times p_i}$ are the weight matrices, $i = 0,\ldots,L$, and $\mathbf{p} = (p_0, p_1, \ldots, p_{L+1})$ is the width vector, with $p_0 = p+1$. The number of hidden layers L determines the depth of networks from $\mathcal{F}_\alpha(L, \mathbf{p})$ and, in each layer, the activation function $\alpha:\mathbb{R}\to\mathbb{R}$ acts element-wise on the input vector. For $f \in \mathcal{F}_\alpha(L,\mathbf{p})$ given by
$$f(x) = W_L\,\alpha\, W_{L-1}\,\alpha \cdots \alpha\, W_0\,(1, x)^\top,$$
let
$$\|f\|_\times := \bigg\| \prod_{i=0}^{L} |W_i| \bigg\|_1$$
be the path norm of f, where $\|\cdot\|_1$ denotes the $\ell_1$ norm of the $p_0\,(=p+1)$-dimensional vector $\prod_{i=0}^{L}|W_i|$ obtained as a product of the absolute values of the weight matrices of f. For $B > 0$, let
$$\mathcal{F}_\alpha(L, \mathbf{p}, B) = \big\{ f \in \mathcal{F}_\alpha(L,\mathbf{p}) :\ \|f\|_\times \le B \big\}$$
be a path-norm-regularized subclass of $\mathcal{F}_\alpha(L,\mathbf{p})$. As the results obtained in [4] indicate, the path norm regularizations are particularly well-suited for networks whose activation function α is
  • Lipschitz continuous with Lipschitz constant 1;
  • Idempotent, that is, $\alpha(\alpha(x)) = \alpha(x)$, $x\in\mathbb{R}$;
  • Non-negative homogeneous, that is, $\alpha(cx) = c\,\alpha(x)$ for $c \ge 0$, $x\in\mathbb{R}$.
We therefore aim to choose an activation α possessing those properties such that analytic functions can be approximated by networks from $\mathcal{F}_\alpha(L,\mathbf{p},B)$ with a small path norm constraint B. The most popular activation functions satisfying the above conditions are the ReLU function $\sigma(x) = \max\{0, x\}$ and the absolute value function $a(x) = |x|$. Below, we show that, with the absolute value activation function, the path norms of approximant networks may be significantly smaller than the path norms of the ReLU networks.
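As a concrete illustration of the quantity being regularized, the following minimal NumPy sketch (an illustration added here, not code from the paper) evaluates a small network of the displayed form with the absolute value activation and computes its path norm $\|f\|_\times = \big\|\prod_{i=0}^{L}|W_i|\big\|_1$.

```python
import numpy as np

def forward(weights, x, act=np.abs):
    """Evaluate f(x) = W_L a(W_{L-1} a(... a(W_0 (1, x)))) for a scalar input x."""
    v = np.array([1.0, x])              # input with the appended constant coordinate 1
    for W in weights[:-1]:
        v = act(W @ v)                  # hidden layers: affine map followed by the activation
    return weights[-1] @ v              # no activation after the last weight matrix

def path_norm(weights):
    """Path norm: l1 norm of the product of entrywise absolute values |W_L| ... |W_0|."""
    P = np.abs(weights[0])
    for W in weights[1:]:
        P = np.abs(W) @ P
    return np.abs(P).sum()

# toy two-hidden-layer network with the absolute value activation
W0 = np.array([[1.0, 0.0], [-0.5, 1.0]])   # maps (1, x) -> (1, x - 1/2)
W1 = np.array([[1.0, 0.0], [1.0, -2.0]])   # maps (1, |x - 1/2|) -> (1, 1 - 2|x - 1/2|)
W2 = np.array([[0.0, 1.0]])                # reads off the last coordinate
weights = [W0, W1, W2]

print(forward(weights, 0.3))   # [0.6] = 1 - 2|0.3 - 0.5|
print(path_norm(weights))      # 4.0
```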
The standard technique of neural network function approximation relies on approximating the product function $(x,y)\mapsto xy$, which then allows us to approximate monomials and polynomials of any desired degree. In [7], the approximation of the product $xy = \big((x+y)^2 - x^2 - y^2\big)/2$ is achieved by approximating the function $x\mapsto x^2$. The latter is based on the observation that, for the triangle wave
$$g_s(x) = \underbrace{g\circ g\circ\cdots\circ g}_{s\ \text{times}}(x), \tag{3}$$
where $g:[0,1]\to[0,1]$ is defined by
$$g(x) = \begin{cases} 2x, & 0\le x < 1/2, \\ 2(1-x), & 1/2 \le x \le 1, \end{cases}$$
and for any positive integer m,
$$|x^2 - f_m(x)| \le 2^{-2m-2},$$
where
$$f_m(x) := x - \sum_{s=1}^m \frac{g_s(x)}{2^{2s}}. \tag{4}$$
The approximation of $x^2$ by networks with the ReLU activation function $\sigma(x)$ then follows from the representation
$$g(x) = 2\sigma(x) - 4\sigma(x - 1/2). \tag{5}$$
Thus, in this case, we will obtain matrices containing the weights 2 and 4, which will make the path norm of the approximant networks large. Note that the same approach is also used in [3] for constructing ReLU network approximations of analytic functions. In [5], the approximation of the product
$$xy = h\Big(\frac{x - y + 1}{2}\Big) - h\Big(\frac{x+y}{2}\Big) + \frac{x+y}{2} - \frac{1}{4}$$
is achieved by approximating the function $h(x) := x(1-x)$, which, in turn, is based on the observation that, for the triangle wave
$$R_k = T_k\circ T_{k-1}\circ\cdots\circ T_1,$$
where $T_k:[0, 2^{2-2k}]\to[0, 2^{-2k}]$ is defined by
$$T_k(x) := \sigma(x/2) - \sigma\big(x - 2^{1-2k}\big), \tag{6}$$
and for any positive integer m,
$$\Big| h(x) - \sum_{k=1}^m R_k(x) \Big| \le 2^{-m}, \quad x\in[0,1].$$
Although in the representation (6) the coefficients (weights) are all in $[-1,1]$, the approximant $\sum_{k=1}^m R_k(x)$ in this case does not have the factors $2^{-2s}$ present in the approximant $f_m(x)$ in (4), which, again, will result in large values of the path norms. Therefore, in order to take advantage of the presence of those reducing weights, we would like to represent the function $g(x)$ in (5) by a linear combination of activation functions with smaller coefficients. This is possible if, instead of $\sigma(x)$, we deploy the absolute value activation function $a(x)$. Indeed, in this case, $g(x)$ can be represented on $[0,1]$ as
$$g(x) = 1 - 2\,a(x - 1/2). \tag{7}$$
In the next section, we use the above representation (7) to show that analytic functions can be $\varepsilon$-approximated by networks from $\mathcal{F}_a(L, \mathbf{p}, B)$ with each of L, $\mathbf{p}$ and B, as well as the network weights, having logarithmic dependence on $1/\varepsilon$. As all networks will have the same activation function $a(x) = |x|$, in the following, the subscript a will be omitted.
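The following short NumPy check (a sketch added for illustration; it is not part of the original paper) evaluates the approximant $f_m$ from (4) using the absolute value representation (7) of g and verifies the bound $|x^2 - f_m(x)| \le 2^{-2m-2}$ on a grid. Note that every coefficient involved lies in $[-2, 2]$, in contrast with the weights 2 and −4 appearing in (5).

```python
import numpy as np

def g(x):
    # triangle map on [0, 1] written with the absolute value activation, as in (7)
    return 1.0 - 2.0 * np.abs(x - 0.5)

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^{2s}, where g_s is the s-fold composition of g
    approx = np.array(x, dtype=float)
    gs = np.array(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)                      # g_s(x)
        approx = approx - gs / 4.0**s   # subtract g_s(x) / 2^{2s}
    return approx

x = np.linspace(0.0, 1.0, 10001)
for m in range(1, 8):
    err = np.max(np.abs(x**2 - f_m(x, m)))
    print(m, err, 2.0**(-2 * m - 2))    # observed error vs. the bound 2^{-2m-2}
```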

3. Results

We first construct a neural network with activation function $a(x)$ that, for given $\gamma, m\in\mathbb{N}$, simultaneously approximates all d-dimensional monomials of degree less than γ up to an error of $\gamma^2\, 4^{-m}$. The depth of this network has order $m\log_2\gamma$ and its width is of order $m\,\gamma^{d+1}$. Moreover, the entries of the product of the absolute values of the matrices of the network have an order of at most $\gamma^5$ (note the independence of m).
For $\gamma > 0$, let $C_{d,\gamma}$ denote the number of d-dimensional monomials $x^{\mathbf{k}}$ with degree $\|\mathbf{k}\|_1 < \gamma$. Then, $C_{d,\gamma} < (\gamma+1)^d$ and the following holds:
Lemma 1.
There is a neural network $\mathrm{Mon}^d_{m,\gamma}\in\mathcal{F}(L,\mathbf{p})$ with $L \le \lceil\log_2\gamma\rceil(2m+5)+2$, $p_0 = d+1$, $p_{L+1} = C_{d,\gamma}$ and $\|\mathbf{p}\|_\infty \le 6\gamma(m+2)\,C_{d,\gamma}$ such that
$$\Big\|\mathrm{Mon}^d_{m,\gamma}(x) - \big(x^{\mathbf{k}}\big)_{\|\mathbf{k}\|_1<\gamma}\Big\|_\infty \le \gamma^2\, 4^{-m}, \quad x\in[0,1]^d.$$
Moreover, the entries of the $C_{d,\gamma}\times(d+1)$-dimensional matrix obtained by multiplying the absolute values of the matrices presented in $\mathrm{Mon}^d_{m,\gamma}$ are all bounded by $144\,(\gamma+1)^5$.
Taking $\gamma, m = \lceil\log_2\frac{1}{\varepsilon}\rceil$ in the above lemma, we obtain a neural network from $\mathcal{F}(L,\mathbf{p})$, with L and $\mathbf{p}$ having logarithmic dependence on $1/\varepsilon$, which simultaneously approximates the monomials of degree less than γ with error ε (up to a logarithmic factor). Moreover, the entries of the product of the absolute values of the matrices of this network also have logarithmic dependence on $1/\varepsilon$. Below, we use this property to construct neural network approximations of analytic and analytically continuable functions with approximation error ε and with network parameters of logarithmic order.
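To make these scalings concrete, the sketch below (illustrative only) plugs a concrete ε and d into the depth, width and error bounds stated in Lemma 1, counting $C_{d,\gamma}$ by direct enumeration.

```python
import math
from itertools import product

def C_d_gamma(d, gamma):
    # number of d-dimensional monomials x^k with ||k||_1 < gamma; always < (gamma + 1)^d
    return sum(1 for k in product(range(gamma), repeat=d) if sum(k) < gamma)

d, eps = 2, 1e-3
gamma = m = math.ceil(math.log2(1.0 / eps))

depth = math.ceil(math.log2(gamma)) * (2 * m + 5) + 2    # depth bound from Lemma 1
width = 6 * gamma * (m + 2) * C_d_gamma(d, gamma)        # max width bound from Lemma 1
error = gamma**2 * 4.0**(-m)                             # uniform approximation error bound

print(C_d_gamma(d, gamma), (gamma + 1)**d)   # 55 < 121 for d = 2, gamma = 10
print(depth, width, error)                   # 102, 39600, ~9.5e-05
```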
Theorem 1.
Let $f(x) = \sum_{\mathbf{k}\in\mathbb{N}_0^d} a_{\mathbf{k}}\, x^{\mathbf{k}}$ be an analytic function on $(0,1)^d$ with $\sum_{\mathbf{k}\in\mathbb{N}_0^d}|a_{\mathbf{k}}| \le F$. Then, for any $\varepsilon, \delta\in(0,1)$, there is a constant $C = C(d, F)$ and a neural network $F_\varepsilon\in\mathcal{F}(L,\mathbf{p},B)$ with $L \le C\big(\log_2\frac{1}{\delta}\big)\big(\log_2^2\frac{1}{\varepsilon}\big)$, $\|\mathbf{p}\|_\infty \le C\,\delta^{-(d+1)}\big(\log_2\frac{1}{\varepsilon}\big)^{d+2}$ and
$$B \le 10^4\, d\, F\left(\frac{\log_2\big((2F+16)/\varepsilon\big)}{\delta}\right)^{5},$$
such that
$$|F_\varepsilon(x) - f(x)| \le \frac{\varepsilon}{\delta^2}, \quad \text{for all } x\in(0, 1-\delta]^d.$$
Note that an exponential convergence rate of deep ReLU network approximants on subintervals $(0, 1-\delta]^d$ is also given in [3]. In our case, however, not only the depth and the width but also the path norm $\|F_\varepsilon\|_\times$ of the constructed network $F_\varepsilon$ has logarithmic dependence on $1/\varepsilon$. Note that, in the above theorem, as δ approaches 0, both $\mathbf{p}$ and B, as well as the approximation error, grow polynomially in $1/\delta$. In the next theorem, we use the properties of Chebyshev series to derive an exponential convergence rate on the whole hypercube $[0,1]^d$. Recall that the Chebyshev polynomials are defined as $T_0(x) = 1$, $T_1(x) = x$ and
$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x).$$
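The geometric decay of Chebyshev coefficients for functions analytic in a Bernstein ellipse (recalled below) can also be observed numerically. The following sketch (an added illustration under these classical facts, not code from the paper) generates the $T_n$ by the recurrence and fits the Chebyshev coefficients of the analytic function $f(x) = 1/(2-x)$ on $[-1,1]$, for which the ellipse parameter is $\rho = 2+\sqrt{3}$.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P

# Chebyshev polynomials T_n in the monomial basis via T_{n+1} = 2x T_n - T_{n-1}
T = [np.array([1.0]), np.array([0.0, 1.0])]          # T_0 = 1, T_1 = x
for n in range(1, 9):
    T.append(P.polysub(P.polymulx(2.0 * T[n]), T[n - 1]))
print(T[5])                              # [0, 5, 0, -20, 0, 16]: coefficients of T_5

# Chebyshev coefficients of an analytic function decay geometrically (Bernstein)
f = lambda x: 1.0 / (2.0 - x)            # pole at x = 2, so rho = 2 + sqrt(3)
deg = 25
nodes = np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))   # Chebyshev nodes on [-1, 1]
a = C.chebfit(nodes, f(nodes), deg)
rho = 2.0 + np.sqrt(3.0)
print(np.abs(a[:8]))
print(rho ** -np.arange(8))              # the coefficients track rho^{-k} up to a constant
```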
Chebyshev polynomials play an important role in approximation theory ([16]) and, in particular, it is known ([17], Theorem 3.1) that if f is Lipschitz continuous on $[-1,1]$, then it has a unique representation as an absolutely and uniformly convergent Chebyshev series
$$f(x) = \sum_{k=0}^{\infty} a_k T_k(x).$$
Moreover, in case f can be analytically continued to an ellipse $E_\rho\subset\mathbb{C}$ with foci 1 and −1 and with the sum of the semimajor and semiminor axes equal to $\rho > 1$, the partial sums of the above Chebyshev series converge to f at a geometric rate and the coefficients $a_k$ also decay at a geometric rate. This result was first derived by Bernstein in [18], and its extension to the multivariate case was given in [19]. Note that the condition $z\in E_\rho$ implies that $z^2\in N_{1, h^2}$, where $h = (\rho - \rho^{-1})/2$ and, for $d, a > 0$, $N_{d,a}\subset\mathbb{C}$ denotes the open ellipse with foci 0 and d and leftmost point $-a$. For $F > 0$, $\rho > 1$ and $h = (\rho - \rho^{-1})/2$, let $\mathcal{A}_d(\rho, F)$ be the space of functions $f:[0,1]^d\to\mathbb{R}$ that can be analytically continued to the region $\{z\in\mathbb{C}^d : z_1^2 + \cdots + z_d^2 \in N_{d, h^2}\}$ and are bounded there by F. Using the extension of Bernstein’s theorem to the multivariate case, we obtain
Lemma 2.
Let $\rho \ge 2^{\sqrt{d}}$. For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ and a polynomial
$$p(x) = \sum_{\|\mathbf{k}\|_1 \le \gamma} b_{\mathbf{k}}\, x^{\mathbf{k}}, \quad x\in[0,1]^d,$$
with
$$|b_{\mathbf{k}}| \le C\,(\gamma+1)^d \tag{8}$$
and
$$|f(x) - p(x)| \le C\rho^{-\gamma/\sqrt{d}}, \quad \text{for all } x\in[0,1]^d.$$
Combining Lemma 1 and Lemma 2, we obtain the following.
Theorem 2.
Let $\varepsilon\in(0,1)$ and let $\rho \ge 2^{\sqrt{d}}$. For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ and a neural network $F_\varepsilon\in\mathcal{F}(L,\mathbf{p},B)$ with $L \le C\log_2^2\frac{1}{\varepsilon}$, $\|\mathbf{p}\|_\infty \le C\big(\log_2\frac{1}{\varepsilon}\big)^{d+2}$ and $B \le C\big(\log_2\frac{1}{\varepsilon}\big)^{2d+5}$ such that
$$|F_\varepsilon(x) - f(x)| \le \varepsilon, \quad \text{for all } x\in[0,1]^d.$$
We conclude this part by estimating the $\ell_1$ weight norm of the networks constructed in Theorem 2. First, the total number of weights in those networks is bounded by $(L+1)\|\mathbf{p}\|_\infty^2 = O\big(\log_2\frac{1}{\varepsilon}\big)^{2d+6}$. From (7), it follows that all of the weights of the network $\mathrm{Mon}^d_{m,\gamma}$ from Lemma 1 are in $[-2,2]$. In Theorem 2, the network $F_\varepsilon$ is obtained by adding to a network $\mathrm{Mon}^d_{m,\gamma}$, with $\gamma = m = O\big(\log_2\frac{1}{\varepsilon}\big)$, a layer containing the coefficients of the approximating polynomial of the target function. Thus, using (8), we obtain that the $\ell_1$ weight norm of the network $F_\varepsilon$ constructed in Theorem 2 has order $O\big(\log_2\frac{1}{\varepsilon}\big)^{4d+6}$.

4. Proofs

In the following proofs, $I_k$ denotes the identity matrix of size $k\times k$ and all of the networks have the activation $a(x) = |x|$. The proof of Lemma 1 is based on the following two lemmas.
Lemma 3.
For any positive integer m, there exists a neural network $\mathrm{Mult}_m\in\mathcal{F}(2m+3, \mathbf{p})$, with $p_0 = 3$, $p_{L+1} = 1$ and $\|\mathbf{p}\|_\infty = 3(m+2)$, such that
$$|\mathrm{Mult}_m(x, y) - xy| \le 3\cdot 2^{-2m-3}, \quad \text{for all } x, y\in[0,1], \tag{9}$$
and the product of the absolute values of the matrices presented in $\mathrm{Mult}_m$ is equal to
$$\left(3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}},\ \ 2 - 2^{-m},\ \ 2 - 2^{-m}\right).$$
Proof. 
For $k\ge 2$, let $R_k$ denote a row of length k with first entry equal to $-1/2$, last entry equal to 1 and all other entries equal to 0. Let $A_k$ be the matrix of size $(k+1)\times k$ obtained by appending the row $R_k$ to the identity matrix $I_k$ as its $(k+1)$-th row. That is,
$$A_k = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ -\frac{1}{2} & 0 & \cdots & 1 \end{pmatrix}.$$
In addition, let $B_k$ denote the matrix of size $k\times k$ that coincides with the identity matrix $I_k$ except for its last row, which equals $(1, 0, \ldots, 0, -2)$:
$$B_k = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 1 & 0 & \cdots & 0 & -2 \end{pmatrix}.$$
It then follows from (7) that
$$B_{m+2}\, a\, A_{m+1}\cdots B_3\, a\, A_2 \begin{pmatrix}1\\ x\end{pmatrix} = \begin{pmatrix}1\\ x\\ g_1(x)\\ g_2(x)\\ \vdots\\ g_m(x)\end{pmatrix},$$
where $g_s(x)$ is the function defined in (3), $s = 1,\ldots,m$. Thus, if $S_{m+2}$ is a row of length $m+2$ defined as
$$S_{m+2} = \Big(0,\ 1,\ -\frac{1}{2^{2\cdot 1}},\ -\frac{1}{2^{2\cdot 2}},\ \ldots,\ -\frac{1}{2^{2\cdot m}}\Big),$$
then
$$S_{m+2}\, a\, B_{m+2}\, a\, A_{m+1}\cdots a\, B_3\, a\, A_2 \begin{pmatrix}1\\ x\end{pmatrix} = f_m(x),$$
where $f_m$ is defined by (4). We have that
$$|S_{m+2}|\cdot|B_{m+2}|\cdot|A_{m+1}|\cdots|B_3|\cdot|A_2| = \left(\sum_{k=1}^m \frac{2^{k+1} - 2}{2^{2k}},\ \ 2 - 2^{-m}\right).$$
As $xy = \frac{1}{2}\big((x+y)^2 - x^2 - y^2\big)$, in the first layer of $\mathrm{Mult}_m$ we will obtain the vector
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{pmatrix}\begin{pmatrix}1\\ x\\ y\end{pmatrix} := C\begin{pmatrix}1\\ x\\ y\end{pmatrix} = \begin{pmatrix}1\\ x\\ 1\\ y\\ 1\\ x+y\end{pmatrix}$$
and will then apply, in parallel, the network from the first part of the proof to each of the pairs (1, x), (1, y) and (1, x + y). More precisely, for a given matrix M of size $p\times q$, let $\widetilde{M}$ be the matrix of size $3p\times 3q$ defined as
$$\widetilde{M} = \begin{pmatrix} M & 0 & 0 \\ 0 & M & 0 \\ 0 & 0 & M \end{pmatrix}.$$
Then, for the network
$$\mathrm{Mult}_m(x, y) = \Big({-\tfrac{1}{2}},\ -\tfrac{1}{2},\ \tfrac{1}{2}\Big)\, a\, \widetilde{S}_{m+2}\, a\, \widetilde{B}_{m+2}\, a\, \widetilde{A}_{m+1}\cdots \widetilde{B}_3\, a\, \widetilde{A}_2\, a\, C\begin{pmatrix}1\\ x\\ y\end{pmatrix},$$
we have that
$$\mathrm{Mult}_m(x, y) = \frac{1}{2}\big(f_m(x+y) - f_m(x) - f_m(y)\big),$$
which, together with $|f_m(x) - x^2| < 2^{-2m-2}$ and the triangle inequality, implies (9). It remains to be noted that the product of the absolute values of the matrices presented in $\mathrm{Mult}_m$ is equal to
$$\Big(\tfrac{1}{2},\ \tfrac{1}{2},\ \tfrac{1}{2}\Big)\cdot|\widetilde{S}_{m+2}|\cdot|\widetilde{B}_{m+2}|\cdot|\widetilde{A}_{m+1}|\cdots|\widetilde{B}_3|\cdot|\widetilde{A}_2|\cdot|C| = \left(3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}},\ \ 2 - 2^{-m},\ \ 2 - 2^{-m}\right),$$
which completes the proof of the lemma. □
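The scalar branch of this construction can be checked numerically. The sketch below (illustrative code written against the matrices $A_k$, $B_k$ and $S_{m+2}$ displayed above; it is not supplied with the paper) confirms that the chain evaluates $f_m$ on $[0,1]$, that the error against $x^2$ is within $2^{-2m-2}$, and that the product of absolute values matches the closed form stated in the proof.

```python
import numpy as np

def A(k):
    # (k+1) x k matrix: identity with the extra row R_k = (-1/2, 0, ..., 0, 1) appended
    M = np.vstack([np.eye(k), np.zeros(k)])
    M[k, 0], M[k, k - 1] = -0.5, 1.0
    return M

def B(k):
    # k x k matrix: identity with the last row replaced by (1, 0, ..., 0, -2)
    M = np.eye(k)
    M[k - 1] = 0.0
    M[k - 1, 0], M[k - 1, k - 1] = 1.0, -2.0
    return M

def chain(m):
    # matrices of the scalar branch, from input to output: A_2, B_3, A_3, B_4, ..., A_{m+1}, B_{m+2}, S_{m+2}
    mats = []
    for k in range(2, m + 2):
        mats += [A(k), B(k + 1)]
    S = np.zeros(m + 2)
    S[1] = 1.0
    S[2:] = -4.0 ** -np.arange(1, m + 1)
    return mats + [S.reshape(1, -1)]

m = 6
mats = chain(m)

def net(x):
    # apply the absolute value activation after every matrix except the last one
    v = np.array([1.0, x])
    for W in mats[:-1]:
        v = np.abs(W @ v)
    return float(mats[-1] @ v)

xs = np.linspace(0.0, 1.0, 2001)
print(max(abs(net(x) - x**2) for x in xs), 2.0 ** (-2 * m - 2))   # error within the bound

# product of absolute values of the matrices, compared with the closed form above
P = np.abs(mats[0])
for W in mats[1:]:
    P = np.abs(W) @ P
closed = [sum((2.0**(k + 1) - 2.0) / 4.0**k for k in range(1, m + 1)), 2.0 - 2.0**-m]
print(P.ravel(), closed)
```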
Lemma 4.
For any positive integer m, there exists a neural network $\mathrm{Mult}^r_m\in\mathcal{F}(L,\mathbf{p})$, with $L = (2m+5)\lceil\log_2 r\rceil + 1$, $p_0 = r+1$, $p_{L+1} = 1$ and $\|\mathbf{p}\|_\infty \le 6r(m+2)+1$, such that
$$\Big|\mathrm{Mult}^r_m(x) - \prod_{i=1}^r x_i\Big| \le r^2\, 4^{-m} \quad \text{for all } x = (x_1,\ldots,x_r)\in[0,1]^r,$$
and, for the $(r+1)$-dimensional vector $J^r_m$ obtained by multiplication of the absolute values of the matrices presented in $\mathrm{Mult}^r_m$, we have that $\|J^r_m\|_\infty \le 144\, r^4$.
Proof. 
First, for a given $k\in\mathbb{N}$, we construct a network $N^k_m\in\mathcal{F}(L,\mathbf{p})$ with $L = 2m+4$, $p_0 = 2k+1$ and $p_{L+1} = k+1$, such that
$$N^k_m(x_1, x_2, \ldots, x_{2k-1}, x_{2k}) = \big(1,\ \mathrm{Mult}_m(x_1, x_2),\ \ldots,\ \mathrm{Mult}_m(x_{2k-1}, x_{2k})\big).$$
In the first layer, we obtain a vector whose first coordinate is 1, followed by the triples $(1, x_{2l-1}, x_{2l})$, $l = 1,\ldots,k$, that is, the vector $(1, 1, x_1, x_2, 1, x_3, x_4, \ldots, 1, x_{2k-1}, x_{2k})$. $N^k_m$ is then obtained by applying in parallel the network $\mathrm{Mult}_m$ to each triple $(1, x_{2l-1}, x_{2l})$ while keeping the first coordinate equal to 1. The product of the absolute values of the matrices presented in this construction is a matrix of size $(k+1)\times(2k+1)$ having the form
$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ a_m & b_m & b_m & 0 & 0 & \cdots & 0 & 0 \\ a_m & 0 & 0 & b_m & b_m & \cdots & 0 & 0 \\ \vdots & & & & & \ddots & & \vdots \\ a_m & 0 & 0 & 0 & 0 & \cdots & b_m & b_m \end{pmatrix},$$
where $a_m = 3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}}$ and $b_m = 2 - 2^{-m}$ are the coordinates obtained in the previous lemma. Let us now construct the network $\mathrm{Mult}^r_m$. The first hidden layer of $\mathrm{Mult}^r_m$ computes
$$(1, x_1, \ldots, x_r) \mapsto \big(1, x_1, \ldots, x_r, \underbrace{1, 1, \ldots, 1}_{2^q - r}\big),$$
where $q = \lceil\log_2 r\rceil$. We then subsequently apply the networks $N^{2^{q-1}}_m, N^{2^{q-2}}_m, \ldots, N^{1}_m$ and, in the last layer, we multiply the outcome by $(0, 1)$. From Lemma 3 and the triangle inequality, we have that $|\mathrm{Mult}_m(x, y) - tz| \le 3\cdot 2^{-2m-3} + |x - t| + |y - z|$, for $x, y, t, z\in[0,1]$. Hence, by induction on q, we obtain that $|\mathrm{Mult}^r_m(x) - \prod_{i=1}^r x_i| \le 3^q\, 2^{-2m-3} \le 3r^2\, 2^{-2m-3} \le r^2\, 4^{-m}$.
Note that the product of the absolute values of the matrices in each network $N^k_m$ has the above form, that is, in each row, it has at most three nonzero values, each of which is less than 2. As the matrices given in the first and the last layer of $\mathrm{Mult}^r_m$ also satisfy this property, each entry of the product of the absolute values of all matrices of $\mathrm{Mult}^r_m$ will not exceed $12^{\,q+2} \le 144\, r^4$. □
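The pairing scheme behind $\mathrm{Mult}^r_m$ is a balanced binary product tree. The sketch below (illustrative only, with an abstract pairwise `mult` standing in for the network $\mathrm{Mult}_m$) shows the padding to $2^q$ factors and the $q = \lceil\log_2 r\rceil$ reduction stages, which is where the depth factor $\lceil\log_2 r\rceil$ and the per-stage error recursion $E_j \le 3\cdot 2^{-2m-3} + 2E_{j-1}$ (from the inequality above) come from.

```python
import math

def tree_product(xs, mult):
    """Multiply the entries of xs by pairwise reduction, mirroring the layout of Mult_m^r."""
    r = len(xs)
    q = math.ceil(math.log2(r))
    xs = list(xs) + [1.0] * (2**q - r)      # pad with ones up to the next power of two
    for _ in range(q):                      # q reduction stages
        xs = [mult(xs[2 * i], xs[2 * i + 1]) for i in range(len(xs) // 2)]
    return xs[0]

exact = lambda x, y: x * y                  # stand-in for the network Mult_m
vals = [0.9, 0.8, 0.7, 0.6, 0.5]
print(tree_product(vals, exact), math.prod(vals))   # both ~0.1512
```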
Proof of Lemma 1.
We have that, if $\|\mathbf{k}\|_1 = 0$, then $x^{\mathbf{k}} = 1$, and if $\|\mathbf{k}\|_1 = 1$, then $\mathbf{k}$ has only one non-zero coordinate, say $k_j$, which is equal to 1, and $x^{\mathbf{k}} = x_j$. Denote $N = C_{d,\gamma} - d - 1$ and let $\mathbf{k}^1, \ldots, \mathbf{k}^N$ be the multi-indices satisfying $1 < \|\mathbf{k}^i\|_1 < \gamma$, $i = 1,\ldots,N$. For $\mathbf{k} = (k_1,\ldots,k_d)$ with $\|\mathbf{k}\|_1 > 1$, denote by $x_{\mathbf{k}}$ the $(\|\mathbf{k}\|_1 + 1)$-dimensional vector of the form
$$x_{\mathbf{k}} = \big(1,\ \underbrace{x_1, \ldots, x_1}_{k_1},\ \ldots,\ \underbrace{x_d, \ldots, x_d}_{k_d}\big).$$
The first layer of $\mathrm{Mon}^d_{m,\gamma}$ computes the $\big(d + 1 + \sum_{i=1}^N (\|\mathbf{k}^i\|_1 + 1)\big)$-dimensional vector
$$\big(1,\ x,\ x_{\mathbf{k}^1},\ \ldots,\ x_{\mathbf{k}^N}\big)$$
by multiplying the input vector by a matrix Γ of size $\big(d + 1 + \sum_{i=1}^N (\|\mathbf{k}^i\|_1 + 1)\big)\times(d+1)$. In the following layers, we do not change the first $d+1$ coordinates (by multiplying them by $I_{d+1}$), and, to each $x_{\mathbf{k}^i}$, we apply in parallel the network $\mathrm{Mult}^{\|\mathbf{k}^i\|_1}_m$. Recall that, in Lemma 4, $J^r_m$ denotes the $(r+1)$-dimensional vector obtained from the product of the absolute values of the matrices of $\mathrm{Mult}^r_m$. We then have that the product of the absolute values of the matrices of $\mathrm{Mon}^d_{m,\gamma}$ has the form
$$M = \begin{pmatrix} I_{d+1} & & & \\ & J^{\|\mathbf{k}^1\|_1}_m & & \\ & & \ddots & \\ & & & J^{\|\mathbf{k}^N\|_1}_m \end{pmatrix}\cdot|\Gamma|.$$
As the matrix Γ only contains the entries 0 and 1, then, applying Lemma 4, we obtain that the entries of M are bounded by
$$\max_{1\le i\le N}\big\|J^{\|\mathbf{k}^i\|_1}_m\big\|_1 \le 144\,(\gamma+1)^5. \qquad\square$$
Proof of Theorem 1.
Let $\gamma = \Big\lceil\frac{\log_2\big((2F+16)/\varepsilon\big)}{\log_2\big((1-\delta)^{-1}\big)}\Big\rceil$. Then, for $x\in(0, 1-\delta]^d$, we have that
$$\Big|f(x) - \sum_{\|\mathbf{k}\|_1 < \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}\Big| = \Big|\sum_{\|\mathbf{k}\|_1 \ge \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}\Big| \le (1-\delta)^{\gamma} F \le \frac{\varepsilon F}{2F+16} \le \frac{\varepsilon}{2} \le \frac{\varepsilon}{2\delta^2}. \tag{10}$$
Applying Lemma 1 with $m = \big\lceil\log_2\frac{4F+16}{\varepsilon}\big\rceil$, we obtain that, for all $x\in[0,1]^d$,
$$\Big\|\mathrm{Mon}^d_{m,\gamma}(x) - \big(x^{\mathbf{k}}\big)_{\|\mathbf{k}\|_1<\gamma}\Big\|_\infty \le \gamma^2\, 4^{-m} \le \frac{4\log_2^2\big(\frac{2F+16}{\varepsilon}\big)}{\log_2^2\big((1-\delta)^{-1}\big)}\left(\frac{\varepsilon}{4F+16}\right)^2 \le \frac{4(2F+16)\,\varepsilon}{\delta^2\,(4F+16)^2} \le \frac{\varepsilon}{2F\delta^2}, \tag{11}$$
where we used the inequalities $\log_2\big((1-\delta)^{-1}\big) \ge \delta$, $\delta\in(0,1)$, and $\log_2^2 r \le r$ for $r \ge 16$. In order to approximate the partial sum $\sum_{\|\mathbf{k}\|_1 < \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}$, we add one last layer with the coefficients of that partial sum to the network $\mathrm{Mon}^d_{m,\gamma}$. As the sum of the absolute values of those coefficients is bounded by F, then, combining (10) and (11), for the obtained network $F_\varepsilon$ we obtain
$$|F_\varepsilon(x) - f(x)| \le \frac{\varepsilon}{\delta^2}, \quad \text{for all } x\in(0, 1-\delta]^d.$$
From Lemma 1, it follows that
$$\|F_\varepsilon\|_\times \le 144\,(d+1)\,F\,(\gamma+1)^5 \le 10^4\, d\, F\left(\frac{\log_2\big((2F+16)/\varepsilon\big)}{\delta}\right)^{5}. \qquad\square$$
Let us now present the result from [19] that will be used to derive Lemma 2. First, if $f\in\mathcal{A}_d(\rho, F)$, then ([20], Theorem 4.1) f has a unique representation as an absolutely and uniformly convergent multivariate Chebyshev series
$$f(x) = \sum_{k_1=0}^{\infty}\cdots\sum_{k_d=0}^{\infty} a_{k_1,\ldots,k_d}\, T_{k_1}(x_1)\cdots T_{k_d}(x_d), \quad x\in[0,1]^d.$$
Note that, for $\mathbf{k} := (k_1,\ldots,k_d)$, the degree of the d-dimensional polynomial $T_{k_1}(x_1)\cdots T_{k_d}(x_d)$ is $\|\mathbf{k}\|_1 = k_1 + \cdots + k_d$. Then, for any non-negative integers $n_1,\ldots,n_d$, the partial sum
$$p(x) = \sum_{k_1=0}^{n_1}\cdots\sum_{k_d=0}^{n_d} a_{\mathbf{k}}\, T_{k_1}(x_1)\cdots T_{k_d}(x_d) \tag{12}$$
is a polynomial truncation of the multivariate Chebyshev series of f of degree $d(p) = n_1 + \cdots + n_d$. It is shown in [19] that
Theorem 3.
For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ such that the multivariate Chebyshev coefficients of f satisfy
$$|a_{\mathbf{k}}| \le C\rho^{-\|\mathbf{k}\|_2} \tag{13}$$
and, for the polynomial truncations p of the multivariate Chebyshev series of f, we have that
$$\inf_{d(p)\le\gamma}\big\|f - p\big\|_{[0,1]^d} \le C\rho^{-\gamma/\sqrt{d}}.$$
Proof of Lemma 2.
Note that, from the recursive definition of the Chebyshev polynomials, it follows that, for any $k\ge 0$, the coefficients of the Chebyshev polynomial $T_k(x)$ are all bounded by $2^k$. Let p now be a polynomial given by (12) with degree $d(p)\le\gamma$. As the number of summands on the right-hand side of (12) is bounded by $(\gamma+1)^d$, then, using (13), we obtain that p can be rewritten as
$$p(x) = \sum_{\|\mathbf{k}\|_1\le\gamma} b_{\mathbf{k}}\, x^{\mathbf{k}},$$
with
$$|b_{\mathbf{k}}| \le C\,(\gamma+1)^d\, 2^{\|\mathbf{k}\|_1}\,\rho^{-\|\mathbf{k}\|_2} \le C\,(\gamma+1)^d\, 2^{\sqrt{d}\,\|\mathbf{k}\|_2}\,\rho^{-\|\mathbf{k}\|_2} \le C\,(\gamma+1)^d,$$
where the last inequality follows from the condition $\rho\ge 2^{\sqrt{d}}$. □
Proof of Theorem 2.
The proof follows from Lemmas 1 and 2 by taking $\gamma = m = \lceil\log_2\frac{1}{\varepsilon}\rceil$ and adding, to the network $\mathrm{Mon}^d_{m,\gamma+1}$, a last layer with the coefficients of the polynomial $p(x)$ from Lemma 2. For the obtained network $F_\varepsilon$, we have that
$$\|F_\varepsilon\|_\times \le 144\,C\,(d+1)\,C_{d,\gamma+1}\,(\gamma+2)^d\,(\gamma+2)^5 \le 144\,C\,(d+1)\,(\gamma+2)^{2d+5},$$
where C is the constant from Lemma 2. □

5. Discussion

Although various activation functions, including the ReLU, the sigmoid and the Gaussian function, have already been used in the literature for neural network approximations of smooth and analytic functions (see [3,8,21]), the approximating properties of neural networks with an absolute value activation function, which is a built-in activation function of neuroevolution frameworks such as NEAT-Python ([11]), have barely been covered previously. Whereas the algorithms developed in [12,13] allow us to train neural networks with an absolute value activation function, in the present paper, we study the capability of those networks to approximate analytic functions. While popular types of constraints imposed on approximating neural networks either control the $\ell_p$ norms of the network weights or restrict their architectures, in the present work, we study the approximating properties of neural networks with regularized path norms and show that networks with an absolute value activation function and with network path norms having logarithmic dependence on $1/\varepsilon$ can $\varepsilon$-approximate functions that are analytic on certain regions of $\mathbb{C}^d$. The sizes and the weights of the constructed networks also have logarithmic dependence on $1/\varepsilon$.

Funding

This research was funded by NWO Vidi grant: “Statistical foundation for multilayer neural networks”: VI.Vidi.192.021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author would like to thank Johannes Schmidt-Hieber for support and valuable suggestions. The author is also grateful to the referees for the evaluation of the paper and for constructive comments.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Scarselli, F.; Tsoi, A.C. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Netw. 1998, 11, 15–37.
  2. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The expressive power of neural networks: A view from the width. Adv. Neural Inf. Process. Syst. 2017, 30, 6231–6239.
  3. E, W.; Wang, Q. Exponential convergence of the deep neural network approximation for analytic functions. Sci. China Math. 2018, 61, 1733–1740.
  4. Neyshabur, B.; Tomioka, R.; Srebro, N. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), Paris, France, 3–6 July 2015; pp. 1376–1401.
  5. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897.
  6. Taheri, M.; Xie, F.; Lederer, J. Statistical guarantees for regularized neural networks. Neural Netw. 2021, 142, 148–161.
  7. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
  8. Opschoor, J.A.A.; Schwab, C.; Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constr. Approx. 2021, 55, 537–582.
  9. Barron, A.; Klusowski, J. Approximation and estimation for high-dimensional deep learning networks. arXiv 2018, arXiv:1809.03090.
  10. Zheng, S.; Meng, Q.; Zhang, H.; Chen, W.; Yu, N.; Liu, T. Capacity control of ReLU neural networks by basis-path norm. arXiv 2019, arXiv:1809.07122.
  11. Overview of Builtin Activation Functions. Available online: https://neat-python.readthedocs.io/en/latest/activation.html (accessed on 5 July 2022).
  12. Batruni, R. A multilayer neural network with piecewise-linear structure and backpropagation learning. IEEE Trans. Neural Netw. 1991, 2, 395–403.
  13. Lin, J.-N.; Unbehauen, R. Canonical piecewise-linear neural networks. IEEE Trans. Neural Netw. 1995, 6, 43–50.
  14. Bartlett, P.L.; Harvey, N.; Liaw, C.; Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 2019, 20, 2285–2301.
  15. He, F.; Wang, B.; Tao, D. Piecewise linear activations substantially shape the loss surfaces of neural networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
  16. Mason, J.C.; Handscomb, D.C. Chebyshev Polynomials; Chapman and Hall/CRC: New York, NY, USA, 2002.
  17. Trefethen, L.N. Approximation Theory and Approximation Practice; SIAM: Philadelphia, PA, USA, 2013.
  18. Bernstein, S. Sur la meilleure approximation de |x| par des polynomes de degrés donnés. Acta Math. 1914, 37, 1–57.
  19. Trefethen, L.N. Multivariate polynomial approximation in the hypercube. Proc. Am. Math. Soc. 2017, 145, 4837–4844.
  20. Mason, J.C. Near-best multivariate approximation by Fourier series, Chebyshev series and Chebyshev interpolation. J. Approx. Theory 1980, 28, 349–358.
  21. Mhaskar, H.N. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 1996, 8, 164–177.