Article

Analytic Function Approximation by Path-Norm-Regularized Deep Neural Networks

by
Aleksandr Beknazaryan
Institute of Environmental and Agricultural Biology (X-BIO), University of Tyumen, Volodarskogo 6, 625003 Tyumen, Russia
Entropy 2022, 24(8), 1136; https://doi.org/10.3390/e24081136
Submission received: 6 July 2022 / Revised: 11 August 2022 / Accepted: 15 August 2022 / Published: 16 August 2022
(This article belongs to the Special Issue Entropy in Soft Computing and Machine Learning Algorithms II)

Abstract

We show that neural networks with an absolute value activation function and with network path norm, network sizes and network weights having logarithmic dependence on $1/\varepsilon$ can $\varepsilon$-approximate functions that are analytic on certain regions of $\mathbb{C}^d$.

1. Introduction

Deep neural networks have found broad applications in many areas and disciplines, such as computer vision, speech and audio recognition and natural language processing. Two of the main characteristics of a given class of neural networks are its complexity and its approximation capability. Once the activation function is selected, a class of networks is determined by the specification of the network architecture (namely, its depth and width) and the choice of network weights. Hence, the complexity of a given class is estimated by regularizing (one of) those parameters, and the approximation properties of the resulting regularized classes of networks are then investigated.
The capability of shallow networks of depth 1 to approximate continuous functions is shown in the universal approximation theorem ([1]), and approximations of integrable functions by networks with fixed width are presented in [2]. Network-architecture-constrained approximations of analytic functions are given in [3], where it is shown that ReLU networks with depth depending logarithmically on $1/\varepsilon$ and width $d+4$ can $\varepsilon$-approximate analytic functions on the closed subcubes of $(-1,1)^d$.
The weight regularization of networks is usually carried out by imposing an $\ell_p$-related constraint on the network weights, $p \ge 0$. The most popular types of such constraints include the $\ell_0$, $\ell_1$ and the path norm regularizations (see, respectively, [4,5,6] and references therein). Approximations of $\beta$-smooth functions on $[0,1]^d$ by $\ell_0$-regularized sparse ReLU networks are given in [5,7], and exponential rates of approximation of analytic functions by $\ell_0$-regularized networks are derived in [8].
Path-norm-regularized classes of deep ReLU networks are considered in [4], where, together with other characteristics, the Rademacher complexities of those classes are estimated. The network size independence of those estimates makes the path norm regularization particularly remarkable. As the estimation only uses the Lipschitz continuity (with Lipschitz constant 1), the idempotency and the non-negative homogeneity of the ReLU function, it can be extended to networks with the absolute value activation function. Network characteristics similar to the path norm are also considered in the works [9,10], where they are called, respectively, a variation and a basis-path norm, and statistical features of classes of networks are described in terms of those characteristics.
The objective of the present paper is the construction of path-norm-regularized networks that approximate analytic functions exponentially fast. Our goal is to achieve such convergence rates with activations that are idempotent, non-negative homogeneous and Lipschitz continuous with Lipschitz constant 1, so that the constructed path-norm-regularized networks fall within the scope of the network classes studied in [4]. It turns out that networks with an absolute value activation function may suit this goal better than networks with a ReLU activation function. More precisely, we show that analytic functions can be $\varepsilon$-approximated by networks with an absolute value activation function $a(x)$ and with the path norm, the depth, the width and the weights all depending logarithmically on $1/\varepsilon$. Such an approximation holds (i) on any subset $(0, 1-\delta]^d \subset (0,1)^d$ for functions that are analytic on $(0,1)^d$ with absolutely convergent power series; (ii) on the whole hypercube $[0,1]^d$ for functions that can be analytically continued to certain subsets of $\mathbb{C}^d$. Since the network weights, as well as the total number of weights, depend logarithmically on $1/\varepsilon$, the $\ell_1$ weight norms of the constructed approximating deep networks also depend logarithmically on $1/\varepsilon$.
Note that the absolute value activation function considered in this paper is among the common built-in activation functions of the neuroevolution framework NEAT-Python ([11]). Training algorithms for networks with an absolute value activation function are developed in [12,13]. In addition, the VC-dimensions and the structures of the loss surfaces of neural networks with piecewise linear activation functions, including the absolute value function, are described in [14,15].
Notation: For a matrix $W \in \mathbb{R}^{d_1\times d_2}$, we denote by $|W| \in \mathbb{R}^{d_1\times d_2}$ the matrix obtained by taking the absolute values of the entries of W: $|W|_{ij} = |W_{ij}|$. For brevity, we will say that the matrix $|W|$ is the absolute value of the matrix W (note that other definitions of the absolute value of a matrix also appear in the literature). The path norm of a neural network f is denoted by $\|f\|_\times$. For $x = (x_1,\ldots,x_d)\in\mathbb{R}^d$ and $\mathbf{k} = (k_1,\ldots,k_d)\in\mathbb{N}_0^d$, the degree of the monomial $x^{\mathbf{k}} = x_1^{k_1}\cdots x_d^{k_d}$ is defined to be $\|\mathbf{k}\|_1 = \sum_{i=1}^d k_i$. To ensure that the matrix–vector multiplications are well defined, vectors from $\mathbb{R}^d$ may, according to the context, be treated as matrices from either $\mathbb{R}^{d\times 1}$ or $\mathbb{R}^{1\times d}$.

2. The Class of Approximant Networks

Neural networks are composed of weight matrices, biases and nonlinear activation functions acting neuron-wise in the hidden layers. The biases, also called shift vectors, can be omitted by adding a fixed coordinate 1 to the input vector and correspondingly modifying the weight matrices. As the definition of the path norm of networks does not assume the presence of shift vectors, we will add a coordinate 1 to the input vector x and will consider classes of neural networks of the form
$$\mathcal{F}_\alpha(L, \mathbf{p}) = \Big\{ f: [0,1]^p \to \mathbb{R}^{p_{L+1}} \;\Big|\; f(x) = W_L\,\alpha\, W_{L-1}\,\alpha \cdots \alpha\, W_0\,(1, x)^\top \Big\},$$
where $W_i \in \mathbb{R}^{p_{i+1}\times p_i}$ are the weight matrices, $i = 0,\ldots,L$, and $\mathbf{p} = (p_0, p_1, \ldots, p_{L+1})$ is the width vector, with $p_0 = p+1$. The number of hidden layers L determines the depth of networks from $\mathcal{F}_\alpha(L, \mathbf{p})$ and, in each layer, the activation function $\alpha:\mathbb{R}\to\mathbb{R}$ acts element-wise on the input vector. For $f \in \mathcal{F}_\alpha(L,\mathbf{p})$ given by
$$f(x) = W_L\,\alpha\, W_{L-1}\,\alpha \cdots \alpha\, W_0\,(1, x)^\top,$$
let
$$\|f\|_\times := \bigg\| \prod_{i=0}^{L} |W_i| \bigg\|_1$$
be the path norm of f, where $\|\cdot\|_1$ denotes the $\ell_1$ norm of the $p_0\,(=p+1)$-dimensional vector $\prod_{i=0}^{L}|W_i|$ obtained as a product of the absolute values of the weight matrices of f. For $B > 0$, let
$$\mathcal{F}_\alpha(L, \mathbf{p}, B) = \big\{ f \in \mathcal{F}_\alpha(L,\mathbf{p}) :\ \|f\|_\times \le B \big\}$$
be a path-norm-regularized subclass of $\mathcal{F}_\alpha(L,\mathbf{p})$. As the results obtained in [4] indicate, the path norm regularizations are particularly well-suited for networks whose activation function α is
  • Lipschitz continuous with Lipschitz constant 1;
  • Idempotent, that is, $\alpha(\alpha(x)) = \alpha(x)$, $x\in\mathbb{R}$;
  • Non-negative homogeneous, that is, $\alpha(cx) = c\,\alpha(x)$ for $c \ge 0$, $x\in\mathbb{R}$.
We therefore aim to choose an activation α possessing those properties such that analytic functions can be approximated by networks from $\mathcal{F}_\alpha(L,\mathbf{p},B)$ with a small path norm constraint B. The most popular activation functions satisfying the above conditions are the ReLU function $\sigma(x) = \max\{0, x\}$ and the absolute value function $a(x) = |x|$. Below, we show that, with the absolute value activation function, the path norms of approximant networks may be significantly smaller than the path norms of the ReLU networks.
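As a concrete illustration of the quantity being regularized, the following minimal NumPy sketch (an illustration added here, not code from the paper) evaluates a small network of the displayed form with the absolute value activation and computes its path norm $\|f\|_\times = \big\|\prod_{i=0}^{L}|W_i|\big\|_1$.

```python
import numpy as np

def forward(weights, x, act=np.abs):
    """Evaluate f(x) = W_L a(W_{L-1} a(... a(W_0 (1, x)))) for a scalar input x."""
    v = np.array([1.0, x])              # input with the appended constant coordinate 1
    for W in weights[:-1]:
        v = act(W @ v)                  # hidden layers: affine map followed by the activation
    return weights[-1] @ v              # no activation after the last weight matrix

def path_norm(weights):
    """Path norm: l1 norm of the product of entrywise absolute values |W_L| ... |W_0|."""
    P = np.abs(weights[0])
    for W in weights[1:]:
        P = np.abs(W) @ P
    return np.abs(P).sum()

# toy two-hidden-layer network with the absolute value activation
W0 = np.array([[1.0, 0.0], [-0.5, 1.0]])   # maps (1, x) -> (1, x - 1/2)
W1 = np.array([[1.0, 0.0], [1.0, -2.0]])   # maps (1, |x - 1/2|) -> (1, 1 - 2|x - 1/2|)
W2 = np.array([[0.0, 1.0]])                # reads off the last coordinate
weights = [W0, W1, W2]

print(forward(weights, 0.3))   # [0.6] = 1 - 2|0.3 - 0.5|
print(path_norm(weights))      # 4.0
```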
The standard technique of neural network function approximation relies on approximating the product function $(x,y)\mapsto xy$, which then allows us to approximate monomials and polynomials of any desired degree. In [7], the approximation of the product $xy = \big((x+y)^2 - x^2 - y^2\big)/2$ is achieved by approximating the function $x\mapsto x^2$. The latter is based on the observation that, for the triangle wave
$$g_s(x) = \underbrace{g\circ g\circ\cdots\circ g}_{s\ \text{times}}(x), \tag{3}$$
where $g:[0,1]\to[0,1]$ is defined by
$$g(x) = \begin{cases} 2x, & 0\le x < 1/2, \\ 2(1-x), & 1/2 \le x \le 1, \end{cases}$$
and for any positive integer m,
$$|x^2 - f_m(x)| \le 2^{-2m-2},$$
where
$$f_m(x) := x - \sum_{s=1}^m \frac{g_s(x)}{2^{2s}}. \tag{4}$$
The approximation of $x^2$ by networks with the ReLU activation function $\sigma(x)$ then follows from the representation
$$g(x) = 2\sigma(x) - 4\sigma(x - 1/2). \tag{5}$$
Thus, in this case, we will obtain matrices containing the weights 2 and 4, which will make the path norm of the approximant networks large. Note that the same approach is also used in [3] for constructing ReLU network approximations of analytic functions. In [5], the approximation of the product
$$xy = h\Big(\frac{x - y + 1}{2}\Big) - h\Big(\frac{x+y}{2}\Big) + \frac{x+y}{2} - \frac{1}{4}$$
is achieved by approximating the function $h(x) := x(1-x)$, which, in turn, is based on the observation that, for the triangle wave
$$R_k = T_k\circ T_{k-1}\circ\cdots\circ T_1,$$
where $T_k:[0, 2^{2-2k}]\to[0, 2^{-2k}]$ is defined by
$$T_k(x) := \sigma(x/2) - \sigma\big(x - 2^{1-2k}\big), \tag{6}$$
and for any positive integer m,
$$\Big| h(x) - \sum_{k=1}^m R_k(x) \Big| \le 2^{-m}, \quad x\in[0,1].$$
Although in the representation (6) the coefficients (weights) are all in $[-1,1]$, the approximant $\sum_{k=1}^m R_k(x)$ in this case does not have the factors $2^{-2s}$ present in the approximant $f_m(x)$ in (4), which, again, will result in large values of the path norms. Therefore, in order to take advantage of the presence of those reducing weights, we would like to represent the function $g(x)$ in (5) by a linear combination of activation functions with smaller coefficients. This is possible if, instead of $\sigma(x)$, we deploy the absolute value activation function $a(x)$. Indeed, in this case, $g(x)$ can be represented on $[0,1]$ as
$$g(x) = 1 - 2\,a(x - 1/2). \tag{7}$$
In the next section, we use the above representation (7) to show that analytic functions can be $\varepsilon$-approximated by networks from $\mathcal{F}_a(L, \mathbf{p}, B)$ with each of L, $\mathbf{p}$ and B, as well as the network weights, having logarithmic dependence on $1/\varepsilon$. As all networks will have the same activation function $a(x) = |x|$, in the following, the subscript a will be omitted.
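The following short NumPy check (a sketch added for illustration; it is not part of the original paper) evaluates the approximant $f_m$ from (4) using the absolute value representation (7) of g and verifies the bound $|x^2 - f_m(x)| \le 2^{-2m-2}$ on a grid. Note that every coefficient involved lies in $[-2, 2]$, in contrast with the weights 2 and −4 appearing in (5).

```python
import numpy as np

def g(x):
    # triangle map on [0, 1] written with the absolute value activation, as in (7)
    return 1.0 - 2.0 * np.abs(x - 0.5)

def f_m(x, m):
    # f_m(x) = x - sum_{s=1}^m g_s(x) / 2^{2s}, where g_s is the s-fold composition of g
    approx = np.array(x, dtype=float)
    gs = np.array(x, dtype=float)
    for s in range(1, m + 1):
        gs = g(gs)                      # g_s(x)
        approx = approx - gs / 4.0**s   # subtract g_s(x) / 2^{2s}
    return approx

x = np.linspace(0.0, 1.0, 10001)
for m in range(1, 8):
    err = np.max(np.abs(x**2 - f_m(x, m)))
    print(m, err, 2.0**(-2 * m - 2))    # observed error vs. the bound 2^{-2m-2}
```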

3. Results

We first construct a neural network with activation function $a(x)$ that, for given $\gamma, m\in\mathbb{N}$, simultaneously approximates all d-dimensional monomials of degree less than γ up to an error of $\gamma^2\, 4^{-m}$. The depth of this network has order $m\log_2\gamma$ and its width is of order $m\,\gamma^{d+1}$. Moreover, the entries of the product of the absolute values of the matrices of the network have an order of at most $\gamma^5$ (note the independence of m).
For $\gamma > 0$, let $C_{d,\gamma}$ denote the number of d-dimensional monomials $x^{\mathbf{k}}$ with degree $\|\mathbf{k}\|_1 < \gamma$. Then, $C_{d,\gamma} < (\gamma+1)^d$ and the following holds:
Lemma 1.
There is a neural network $\mathrm{Mon}^d_{m,\gamma}\in\mathcal{F}(L,\mathbf{p})$ with $L \le \lceil\log_2\gamma\rceil(2m+5)+2$, $p_0 = d+1$, $p_{L+1} = C_{d,\gamma}$ and $\|\mathbf{p}\|_\infty \le 6\gamma(m+2)\,C_{d,\gamma}$ such that
$$\Big\|\mathrm{Mon}^d_{m,\gamma}(x) - \big(x^{\mathbf{k}}\big)_{\|\mathbf{k}\|_1<\gamma}\Big\|_\infty \le \gamma^2\, 4^{-m}, \quad x\in[0,1]^d.$$
Moreover, the entries of the $C_{d,\gamma}\times(d+1)$-dimensional matrix obtained by multiplying the absolute values of the matrices presented in $\mathrm{Mon}^d_{m,\gamma}$ are all bounded by $144\,(\gamma+1)^5$.
Taking $\gamma, m = \lceil\log_2\frac{1}{\varepsilon}\rceil$ in the above lemma, we obtain a neural network from $\mathcal{F}(L,\mathbf{p})$, with L and $\mathbf{p}$ having logarithmic dependence on $1/\varepsilon$, which simultaneously approximates the monomials of degree less than γ with error ε (up to a logarithmic factor). Moreover, the entries of the product of the absolute values of the matrices of this network also have logarithmic dependence on $1/\varepsilon$. Below, we use this property to construct neural network approximations of analytic and analytically continuable functions with approximation error ε and with network parameters of logarithmic order.
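To make these scalings concrete, the sketch below (illustrative only) plugs a concrete ε and d into the depth, width and error bounds stated in Lemma 1, counting $C_{d,\gamma}$ by direct enumeration.

```python
import math
from itertools import product

def C_d_gamma(d, gamma):
    # number of d-dimensional monomials x^k with ||k||_1 < gamma; always < (gamma + 1)^d
    return sum(1 for k in product(range(gamma), repeat=d) if sum(k) < gamma)

d, eps = 2, 1e-3
gamma = m = math.ceil(math.log2(1.0 / eps))

depth = math.ceil(math.log2(gamma)) * (2 * m + 5) + 2    # depth bound from Lemma 1
width = 6 * gamma * (m + 2) * C_d_gamma(d, gamma)        # max width bound from Lemma 1
error = gamma**2 * 4.0**(-m)                             # uniform approximation error bound

print(C_d_gamma(d, gamma), (gamma + 1)**d)   # 55 < 121 for d = 2, gamma = 10
print(depth, width, error)                   # 102, 39600, ~9.5e-05
```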
Theorem 1.
Let $f(x) = \sum_{\mathbf{k}\in\mathbb{N}_0^d} a_{\mathbf{k}}\, x^{\mathbf{k}}$ be an analytic function on $(0,1)^d$ with $\sum_{\mathbf{k}\in\mathbb{N}_0^d}|a_{\mathbf{k}}| \le F$. Then, for any $\varepsilon, \delta\in(0,1)$, there is a constant $C = C(d, F)$ and a neural network $F_\varepsilon\in\mathcal{F}(L,\mathbf{p},B)$ with $L \le C\big(\log_2\frac{1}{\delta}\big)\big(\log_2^2\frac{1}{\varepsilon}\big)$, $\|\mathbf{p}\|_\infty \le C\,\delta^{-(d+1)}\big(\log_2\frac{1}{\varepsilon}\big)^{d+2}$ and
$$B \le 10^4\, d\, F\left(\frac{\log_2\big((2F+16)/\varepsilon\big)}{\delta}\right)^{5},$$
such that
$$|F_\varepsilon(x) - f(x)| \le \frac{\varepsilon}{\delta^2}, \quad \text{for all } x\in(0, 1-\delta]^d.$$
Note that an exponential convergence rate of deep ReLU network approximants on subintervals $(0, 1-\delta]^d$ is also given in [3]. In our case, however, not only the depth and the width but also the path norm $\|F_\varepsilon\|_\times$ of the constructed network $F_\varepsilon$ has logarithmic dependence on $1/\varepsilon$. Note that, in the above theorem, as δ approaches 0, both $\mathbf{p}$ and B, as well as the approximation error, grow polynomially in $1/\delta$. In the next theorem, we use the properties of Chebyshev series to derive an exponential convergence rate on the whole hypercube $[0,1]^d$. Recall that the Chebyshev polynomials are defined as $T_0(x) = 1$, $T_1(x) = x$ and
$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x).$$
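The geometric decay of Chebyshev coefficients for functions analytic in a Bernstein ellipse (recalled below) can also be observed numerically. The following sketch (an added illustration under these classical facts, not code from the paper) generates the $T_n$ by the recurrence and fits the Chebyshev coefficients of the analytic function $f(x) = 1/(2-x)$ on $[-1,1]$, for which the ellipse parameter is $\rho = 2+\sqrt{3}$.

```python
import numpy as np
from numpy.polynomial import chebyshev as C
from numpy.polynomial import polynomial as P

# Chebyshev polynomials T_n in the monomial basis via T_{n+1} = 2x T_n - T_{n-1}
T = [np.array([1.0]), np.array([0.0, 1.0])]          # T_0 = 1, T_1 = x
for n in range(1, 9):
    T.append(P.polysub(P.polymulx(2.0 * T[n]), T[n - 1]))
print(T[5])                              # [0, 5, 0, -20, 0, 16]: coefficients of T_5

# Chebyshev coefficients of an analytic function decay geometrically (Bernstein)
f = lambda x: 1.0 / (2.0 - x)            # pole at x = 2, so rho = 2 + sqrt(3)
deg = 25
nodes = np.cos(np.pi * (np.arange(deg + 1) + 0.5) / (deg + 1))   # Chebyshev nodes on [-1, 1]
a = C.chebfit(nodes, f(nodes), deg)
rho = 2.0 + np.sqrt(3.0)
print(np.abs(a[:8]))
print(rho ** -np.arange(8))              # the coefficients track rho^{-k} up to a constant
```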
Chebyshev polynomials play an important role in approximation theory ([16]) and, in particular, it is known ([17], Theorem 3.1) that if f is Lipschitz continuous on $[-1,1]$, then it has a unique representation as an absolutely and uniformly convergent Chebyshev series
$$f(x) = \sum_{k=0}^{\infty} a_k T_k(x).$$
Moreover, in case f can be analytically continued to an ellipse $E_\rho\subset\mathbb{C}$ with foci 1 and −1 and with the sum of the semimajor and semiminor axes equal to $\rho > 1$, the partial sums of the above Chebyshev series converge to f at a geometric rate and the coefficients $a_k$ also decay at a geometric rate. This result was first derived by Bernstein in [18], and its extension to the multivariate case was given in [19]. Note that the condition $z\in E_\rho$ implies that $z^2\in N_{1, h^2}$, where $h = (\rho - \rho^{-1})/2$ and, for $d, a > 0$, $N_{d,a}\subset\mathbb{C}$ denotes the open ellipse with foci 0 and d and leftmost point $-a$. For $F > 0$, $\rho > 1$ and $h = (\rho - \rho^{-1})/2$, let $\mathcal{A}_d(\rho, F)$ be the space of functions $f:[0,1]^d\to\mathbb{R}$ that can be analytically continued to the region $\{z\in\mathbb{C}^d : z_1^2 + \cdots + z_d^2 \in N_{d, h^2}\}$ and are bounded there by F. Using the extension of Bernstein’s theorem to the multivariate case, we obtain
Lemma 2.
Let $\rho \ge 2^{\sqrt{d}}$. For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ and a polynomial
$$p(x) = \sum_{\|\mathbf{k}\|_1 \le \gamma} b_{\mathbf{k}}\, x^{\mathbf{k}}, \quad x\in[0,1]^d,$$
with
$$|b_{\mathbf{k}}| \le C\,(\gamma+1)^d \tag{8}$$
and
$$|f(x) - p(x)| \le C\rho^{-\gamma/\sqrt{d}}, \quad \text{for all } x\in[0,1]^d.$$
Combining Lemma 1 and Lemma 2, we obtain the following.
Theorem 2.
Let $\varepsilon\in(0,1)$ and let $\rho \ge 2^{\sqrt{d}}$. For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ and a neural network $F_\varepsilon\in\mathcal{F}(L,\mathbf{p},B)$ with $L \le C\log_2^2\frac{1}{\varepsilon}$, $\|\mathbf{p}\|_\infty \le C\big(\log_2\frac{1}{\varepsilon}\big)^{d+2}$ and $B \le C\big(\log_2\frac{1}{\varepsilon}\big)^{2d+5}$ such that
$$|F_\varepsilon(x) - f(x)| \le \varepsilon, \quad \text{for all } x\in[0,1]^d.$$
We conclude this part by estimating the $\ell_1$ weight norm of the networks constructed in Theorem 2. First, the total number of weights in those networks is bounded by $(L+1)\|\mathbf{p}\|_\infty^2 = O\big(\log_2\frac{1}{\varepsilon}\big)^{2d+6}$. From (7), it follows that all of the weights of the network $\mathrm{Mon}^d_{m,\gamma}$ from Lemma 1 are in $[-2,2]$. In Theorem 2, the network $F_\varepsilon$ is obtained by adding to a network $\mathrm{Mon}^d_{m,\gamma}$, with $\gamma = m = O\big(\log_2\frac{1}{\varepsilon}\big)$, a layer containing the coefficients of the approximating polynomial of the target function. Thus, using (8), we obtain that the $\ell_1$ weight norm of the network $F_\varepsilon$ constructed in Theorem 2 has order $O\big(\log_2\frac{1}{\varepsilon}\big)^{4d+6}$.

4. Proofs

In the following proofs, $I_k$ denotes the identity matrix of size $k\times k$ and all of the networks have the activation $a(x) = |x|$. The proof of Lemma 1 is based on the following two lemmas.
Lemma 3.
For any positive integer m, there exists a neural network $\mathrm{Mult}_m\in\mathcal{F}(2m+3, \mathbf{p})$, with $p_0 = 3$, $p_{L+1} = 1$ and $\|\mathbf{p}\|_\infty = 3(m+2)$, such that
$$|\mathrm{Mult}_m(x, y) - xy| \le 3\cdot 2^{-2m-3}, \quad \text{for all } x, y\in[0,1], \tag{9}$$
and the product of the absolute values of the matrices presented in $\mathrm{Mult}_m$ is equal to
$$\left(3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}},\ \ 2 - 2^{-m},\ \ 2 - 2^{-m}\right).$$
Proof. 
For $k\ge 2$, let $R_k$ denote a row of length k with first entry equal to $-1/2$, last entry equal to 1 and all other entries equal to 0. Let $A_k$ be the matrix of size $(k+1)\times k$ obtained by appending the row $R_k$ to the identity matrix $I_k$ as its $(k+1)$-th row. That is,
$$A_k = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ -\frac{1}{2} & 0 & \cdots & 1 \end{pmatrix}.$$
In addition, let $B_k$ denote the matrix of size $k\times k$ that coincides with the identity matrix $I_k$ except for its last row, which equals $(1, 0, \ldots, 0, -2)$:
$$B_k = \begin{pmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ 1 & 0 & \cdots & 0 & -2 \end{pmatrix}.$$
It then follows from (7) that
$$B_{m+2}\, a\, A_{m+1}\cdots B_3\, a\, A_2 \begin{pmatrix}1\\ x\end{pmatrix} = \begin{pmatrix}1\\ x\\ g_1(x)\\ g_2(x)\\ \vdots\\ g_m(x)\end{pmatrix},$$
where $g_s(x)$ is the function defined in (3), $s = 1,\ldots,m$. Thus, if $S_{m+2}$ is a row of length $m+2$ defined as
$$S_{m+2} = \Big(0,\ 1,\ -\frac{1}{2^{2\cdot 1}},\ -\frac{1}{2^{2\cdot 2}},\ \ldots,\ -\frac{1}{2^{2\cdot m}}\Big),$$
then
$$S_{m+2}\, a\, B_{m+2}\, a\, A_{m+1}\cdots a\, B_3\, a\, A_2 \begin{pmatrix}1\\ x\end{pmatrix} = f_m(x),$$
where $f_m$ is defined by (4). We have that
$$|S_{m+2}|\cdot|B_{m+2}|\cdot|A_{m+1}|\cdots|B_3|\cdot|A_2| = \left(\sum_{k=1}^m \frac{2^{k+1} - 2}{2^{2k}},\ \ 2 - 2^{-m}\right).$$
As $xy = \frac{1}{2}\big((x+y)^2 - x^2 - y^2\big)$, in the first layer of $\mathrm{Mult}_m$ we will obtain the vector
$$\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \end{pmatrix}\begin{pmatrix}1\\ x\\ y\end{pmatrix} := C\begin{pmatrix}1\\ x\\ y\end{pmatrix} = \begin{pmatrix}1\\ x\\ 1\\ y\\ 1\\ x+y\end{pmatrix}$$
and will then apply, in parallel, the network from the first part of the proof to each of the pairs (1, x), (1, y) and (1, x + y). More precisely, for a given matrix M of size $p\times q$, let $\widetilde{M}$ be the matrix of size $3p\times 3q$ defined as
$$\widetilde{M} = \begin{pmatrix} M & 0 & 0 \\ 0 & M & 0 \\ 0 & 0 & M \end{pmatrix}.$$
Then, for the network
$$\mathrm{Mult}_m(x, y) = \Big({-\tfrac{1}{2}},\ -\tfrac{1}{2},\ \tfrac{1}{2}\Big)\, a\, \widetilde{S}_{m+2}\, a\, \widetilde{B}_{m+2}\, a\, \widetilde{A}_{m+1}\cdots \widetilde{B}_3\, a\, \widetilde{A}_2\, a\, C\begin{pmatrix}1\\ x\\ y\end{pmatrix},$$
we have that
$$\mathrm{Mult}_m(x, y) = \frac{1}{2}\big(f_m(x+y) - f_m(x) - f_m(y)\big),$$
which, together with $|f_m(x) - x^2| < 2^{-2m-2}$ and the triangle inequality, implies (9). It remains to be noted that the product of the absolute values of the matrices presented in $\mathrm{Mult}_m$ is equal to
$$\Big(\tfrac{1}{2},\ \tfrac{1}{2},\ \tfrac{1}{2}\Big)\cdot|\widetilde{S}_{m+2}|\cdot|\widetilde{B}_{m+2}|\cdot|\widetilde{A}_{m+1}|\cdots|\widetilde{B}_3|\cdot|\widetilde{A}_2|\cdot|C| = \left(3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}},\ \ 2 - 2^{-m},\ \ 2 - 2^{-m}\right),$$
which completes the proof of the lemma. □
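The scalar branch of this construction can be checked numerically. The sketch below (illustrative code written against the matrices $A_k$, $B_k$ and $S_{m+2}$ displayed above; it is not supplied with the paper) confirms that the chain evaluates $f_m$ on $[0,1]$, that the error against $x^2$ is within $2^{-2m-2}$, and that the product of absolute values matches the closed form stated in the proof.

```python
import numpy as np

def A(k):
    # (k+1) x k matrix: identity with the extra row R_k = (-1/2, 0, ..., 0, 1) appended
    M = np.vstack([np.eye(k), np.zeros(k)])
    M[k, 0], M[k, k - 1] = -0.5, 1.0
    return M

def B(k):
    # k x k matrix: identity with the last row replaced by (1, 0, ..., 0, -2)
    M = np.eye(k)
    M[k - 1] = 0.0
    M[k - 1, 0], M[k - 1, k - 1] = 1.0, -2.0
    return M

def chain(m):
    # matrices of the scalar branch, from input to output: A_2, B_3, A_3, B_4, ..., A_{m+1}, B_{m+2}, S_{m+2}
    mats = []
    for k in range(2, m + 2):
        mats += [A(k), B(k + 1)]
    S = np.zeros(m + 2)
    S[1] = 1.0
    S[2:] = -4.0 ** -np.arange(1, m + 1)
    return mats + [S.reshape(1, -1)]

m = 6
mats = chain(m)

def net(x):
    # apply the absolute value activation after every matrix except the last one
    v = np.array([1.0, x])
    for W in mats[:-1]:
        v = np.abs(W @ v)
    return float(mats[-1] @ v)

xs = np.linspace(0.0, 1.0, 2001)
print(max(abs(net(x) - x**2) for x in xs), 2.0 ** (-2 * m - 2))   # error within the bound

# product of absolute values of the matrices, compared with the closed form above
P = np.abs(mats[0])
for W in mats[1:]:
    P = np.abs(W) @ P
closed = [sum((2.0**(k + 1) - 2.0) / 4.0**k for k in range(1, m + 1)), 2.0 - 2.0**-m]
print(P.ravel(), closed)
```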
Lemma 4.
For any positive integer m, there exists a neural network $\mathrm{Mult}^r_m\in\mathcal{F}(L,\mathbf{p})$, with $L = (2m+5)\lceil\log_2 r\rceil + 1$, $p_0 = r+1$, $p_{L+1} = 1$ and $\|\mathbf{p}\|_\infty \le 6r(m+2)+1$, such that
$$\Big|\mathrm{Mult}^r_m(x) - \prod_{i=1}^r x_i\Big| \le r^2\, 4^{-m} \quad \text{for all } x = (x_1,\ldots,x_r)\in[0,1]^r,$$
and, for the $(r+1)$-dimensional vector $J^r_m$ obtained by multiplication of the absolute values of the matrices presented in $\mathrm{Mult}^r_m$, we have that $\|J^r_m\|_\infty \le 144\, r^4$.
Proof. 
First, for a given $k\in\mathbb{N}$, we construct a network $N^k_m\in\mathcal{F}(L,\mathbf{p})$ with $L = 2m+4$, $p_0 = 2k+1$ and $p_{L+1} = k+1$, such that
$$N^k_m(x_1, x_2, \ldots, x_{2k-1}, x_{2k}) = \big(1,\ \mathrm{Mult}_m(x_1, x_2),\ \ldots,\ \mathrm{Mult}_m(x_{2k-1}, x_{2k})\big).$$
In the first layer, we obtain a vector whose first coordinate is 1, followed by the triples $(1, x_{2l-1}, x_{2l})$, $l = 1,\ldots,k$, that is, the vector $(1, 1, x_1, x_2, 1, x_3, x_4, \ldots, 1, x_{2k-1}, x_{2k})$. $N^k_m$ is then obtained by applying in parallel the network $\mathrm{Mult}_m$ to each triple $(1, x_{2l-1}, x_{2l})$ while keeping the first coordinate equal to 1. The product of the absolute values of the matrices presented in this construction is a matrix of size $(k+1)\times(2k+1)$ having the form
$$\begin{pmatrix} 1 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 \\ a_m & b_m & b_m & 0 & 0 & \cdots & 0 & 0 \\ a_m & 0 & 0 & b_m & b_m & \cdots & 0 & 0 \\ \vdots & & & & & \ddots & & \vdots \\ a_m & 0 & 0 & 0 & 0 & \cdots & b_m & b_m \end{pmatrix},$$
where $a_m = 3\sum_{k=1}^m \frac{2^k - 1}{2^{2k}}$ and $b_m = 2 - 2^{-m}$ are the coordinates obtained in the previous lemma. Let us now construct the network $\mathrm{Mult}^r_m$. The first hidden layer of $\mathrm{Mult}^r_m$ computes
$$(1, x_1, \ldots, x_r) \mapsto \big(1, x_1, \ldots, x_r, \underbrace{1, 1, \ldots, 1}_{2^q - r}\big),$$
where $q = \lceil\log_2 r\rceil$. We then subsequently apply the networks $N^{2^{q-1}}_m, N^{2^{q-2}}_m, \ldots, N^{1}_m$ and, in the last layer, we multiply the outcome by $(0, 1)$. From Lemma 3 and the triangle inequality, we have that $|\mathrm{Mult}_m(x, y) - tz| \le 3\cdot 2^{-2m-3} + |x - t| + |y - z|$, for $x, y, t, z\in[0,1]$. Hence, by induction on q, we obtain that $|\mathrm{Mult}^r_m(x) - \prod_{i=1}^r x_i| \le 3^q\, 2^{-2m-3} \le 3r^2\, 2^{-2m-3} \le r^2\, 4^{-m}$.
Note that the product of the absolute values of the matrices in each network $N^k_m$ has the above form, that is, in each row, it has at most three nonzero values, each of which is less than 2. As the matrices given in the first and the last layer of $\mathrm{Mult}^r_m$ also satisfy this property, each entry of the product of the absolute values of all matrices of $\mathrm{Mult}^r_m$ will not exceed $12^{\,q+2} \le 144\, r^4$. □
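The pairing scheme behind $\mathrm{Mult}^r_m$ is a balanced binary product tree. The sketch below (illustrative only, with an abstract pairwise `mult` standing in for the network $\mathrm{Mult}_m$) shows the padding to $2^q$ factors and the $q = \lceil\log_2 r\rceil$ reduction stages, which is where the depth factor $\lceil\log_2 r\rceil$ and the per-stage error recursion $E_j \le 3\cdot 2^{-2m-3} + 2E_{j-1}$ (from the inequality above) come from.

```python
import math

def tree_product(xs, mult):
    """Multiply the entries of xs by pairwise reduction, mirroring the layout of Mult_m^r."""
    r = len(xs)
    q = math.ceil(math.log2(r))
    xs = list(xs) + [1.0] * (2**q - r)      # pad with ones up to the next power of two
    for _ in range(q):                      # q reduction stages
        xs = [mult(xs[2 * i], xs[2 * i + 1]) for i in range(len(xs) // 2)]
    return xs[0]

exact = lambda x, y: x * y                  # stand-in for the network Mult_m
vals = [0.9, 0.8, 0.7, 0.6, 0.5]
print(tree_product(vals, exact), math.prod(vals))   # both ~0.1512
```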
Proof of Lemma 1.
We have that, if $\|\mathbf{k}\|_1 = 0$, then $x^{\mathbf{k}} = 1$, and if $\|\mathbf{k}\|_1 = 1$, then $\mathbf{k}$ has only one non-zero coordinate, say $k_j$, which is equal to 1, and $x^{\mathbf{k}} = x_j$. Denote $N = C_{d,\gamma} - d - 1$ and let $\mathbf{k}^1, \ldots, \mathbf{k}^N$ be the multi-indices satisfying $1 < \|\mathbf{k}^i\|_1 < \gamma$, $i = 1,\ldots,N$. For $\mathbf{k} = (k_1,\ldots,k_d)$ with $\|\mathbf{k}\|_1 > 1$, denote by $x_{\mathbf{k}}$ the $(\|\mathbf{k}\|_1 + 1)$-dimensional vector of the form
$$x_{\mathbf{k}} = \big(1,\ \underbrace{x_1, \ldots, x_1}_{k_1},\ \ldots,\ \underbrace{x_d, \ldots, x_d}_{k_d}\big).$$
The first layer of $\mathrm{Mon}^d_{m,\gamma}$ computes the $\big(d + 1 + \sum_{i=1}^N (\|\mathbf{k}^i\|_1 + 1)\big)$-dimensional vector
$$\big(1,\ x,\ x_{\mathbf{k}^1},\ \ldots,\ x_{\mathbf{k}^N}\big)$$
by multiplying the input vector by a matrix Γ of size $\big(d + 1 + \sum_{i=1}^N (\|\mathbf{k}^i\|_1 + 1)\big)\times(d+1)$. In the following layers, we do not change the first $d+1$ coordinates (by multiplying them by $I_{d+1}$), and, to each $x_{\mathbf{k}^i}$, we apply in parallel the network $\mathrm{Mult}^{\|\mathbf{k}^i\|_1}_m$. Recall that, in Lemma 4, $J^r_m$ denotes the $(r+1)$-dimensional vector obtained from the product of the absolute values of the matrices of $\mathrm{Mult}^r_m$. We then have that the product of the absolute values of the matrices of $\mathrm{Mon}^d_{m,\gamma}$ has the form
$$M = \begin{pmatrix} I_{d+1} & & & \\ & J^{\|\mathbf{k}^1\|_1}_m & & \\ & & \ddots & \\ & & & J^{\|\mathbf{k}^N\|_1}_m \end{pmatrix}\cdot|\Gamma|.$$
As the matrix Γ only contains the entries 0 and 1, then, applying Lemma 4, we obtain that the entries of M are bounded by
$$\max_{1\le i\le N}\big\|J^{\|\mathbf{k}^i\|_1}_m\big\|_1 \le 144\,(\gamma+1)^5. \qquad\square$$
Proof of Theorem 1.
Let $\gamma = \Big\lceil\frac{\log_2\big((2F+16)/\varepsilon\big)}{\log_2\big((1-\delta)^{-1}\big)}\Big\rceil$. Then, for $x\in(0, 1-\delta]^d$, we have that
$$\Big|f(x) - \sum_{\|\mathbf{k}\|_1 < \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}\Big| = \Big|\sum_{\|\mathbf{k}\|_1 \ge \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}\Big| \le (1-\delta)^{\gamma} F \le \frac{\varepsilon F}{2F+16} \le \frac{\varepsilon}{2} \le \frac{\varepsilon}{2\delta^2}. \tag{10}$$
Applying Lemma 1 with $m = \big\lceil\log_2\frac{4F+16}{\varepsilon}\big\rceil$, we obtain that, for all $x\in[0,1]^d$,
$$\Big\|\mathrm{Mon}^d_{m,\gamma}(x) - \big(x^{\mathbf{k}}\big)_{\|\mathbf{k}\|_1<\gamma}\Big\|_\infty \le \gamma^2\, 4^{-m} \le \frac{4\log_2^2\big(\frac{2F+16}{\varepsilon}\big)}{\log_2^2\big((1-\delta)^{-1}\big)}\left(\frac{\varepsilon}{4F+16}\right)^2 \le \frac{4(2F+16)\,\varepsilon}{\delta^2\,(4F+16)^2} \le \frac{\varepsilon}{2F\delta^2}, \tag{11}$$
where we used the inequalities $\log_2\big((1-\delta)^{-1}\big) \ge \delta$, $\delta\in(0,1)$, and $\log_2^2 r \le r$ for $r \ge 16$. In order to approximate the partial sum $\sum_{\|\mathbf{k}\|_1 < \gamma} a_{\mathbf{k}}\, x^{\mathbf{k}}$, we add one last layer with the coefficients of that partial sum to the network $\mathrm{Mon}^d_{m,\gamma}$. As the sum of the absolute values of those coefficients is bounded by F, then, combining (10) and (11), for the obtained network $F_\varepsilon$ we obtain
$$|F_\varepsilon(x) - f(x)| \le \frac{\varepsilon}{\delta^2}, \quad \text{for all } x\in(0, 1-\delta]^d.$$
From Lemma 1, it follows that
$$\|F_\varepsilon\|_\times \le 144\,(d+1)\,F\,(\gamma+1)^5 \le 10^4\, d\, F\left(\frac{\log_2\big((2F+16)/\varepsilon\big)}{\delta}\right)^{5}. \qquad\square$$
Let us now present the result from [19] that will be used to derive Lemma 2. First, if $f\in\mathcal{A}_d(\rho, F)$, then ([20], Theorem 4.1) f has a unique representation as an absolutely and uniformly convergent multivariate Chebyshev series
$$f(x) = \sum_{k_1=0}^{\infty}\cdots\sum_{k_d=0}^{\infty} a_{k_1,\ldots,k_d}\, T_{k_1}(x_1)\cdots T_{k_d}(x_d), \quad x\in[0,1]^d.$$
Note that, for $\mathbf{k} := (k_1,\ldots,k_d)$, the degree of the d-dimensional polynomial $T_{k_1}(x_1)\cdots T_{k_d}(x_d)$ is $\|\mathbf{k}\|_1 = k_1 + \cdots + k_d$. Then, for any non-negative integers $n_1,\ldots,n_d$, the partial sum
$$p(x) = \sum_{k_1=0}^{n_1}\cdots\sum_{k_d=0}^{n_d} a_{\mathbf{k}}\, T_{k_1}(x_1)\cdots T_{k_d}(x_d) \tag{12}$$
is a polynomial truncation of the multivariate Chebyshev series of f of degree $d(p) = n_1 + \cdots + n_d$. It is shown in [19] that
Theorem 3.
For $f\in\mathcal{A}_d(\rho, F)$, there is a constant $C = C(d,\rho,F)$ such that the multivariate Chebyshev coefficients of f satisfy
$$|a_{\mathbf{k}}| \le C\rho^{-\|\mathbf{k}\|_2} \tag{13}$$
and, for the polynomial truncations p of the multivariate Chebyshev series of f, we have that
$$\inf_{d(p)\le\gamma}\big\|f - p\big\|_{[0,1]^d} \le C\rho^{-\gamma/\sqrt{d}}.$$
Proof of Lemma 2.
Note that, from the recursive definition of the Chebyshev polynomials, it follows that, for any $k\ge 0$, the coefficients of the Chebyshev polynomial $T_k(x)$ are all bounded by $2^k$. Let p now be a polynomial given by (12) with degree $d(p)\le\gamma$. As the number of summands on the right-hand side of (12) is bounded by $(\gamma+1)^d$, then, using (13), we obtain that p can be rewritten as
$$p(x) = \sum_{\|\mathbf{k}\|_1\le\gamma} b_{\mathbf{k}}\, x^{\mathbf{k}},$$
with
$$|b_{\mathbf{k}}| \le C\,(\gamma+1)^d\, 2^{\|\mathbf{k}\|_1}\,\rho^{-\|\mathbf{k}\|_2} \le C\,(\gamma+1)^d\, 2^{\sqrt{d}\,\|\mathbf{k}\|_2}\,\rho^{-\|\mathbf{k}\|_2} \le C\,(\gamma+1)^d,$$
where the last inequality follows from the condition $\rho\ge 2^{\sqrt{d}}$. □
Proof of Theorem 2.
The proof follows from Lemmas 1 and 2 by taking $\gamma = m = \lceil\log_2\frac{1}{\varepsilon}\rceil$ and adding, to the network $\mathrm{Mon}^d_{m,\gamma+1}$, a last layer with the coefficients of the polynomial $p(x)$ from Lemma 2. For the obtained network $F_\varepsilon$, we have that
$$\|F_\varepsilon\|_\times \le 144\,C\,(d+1)\,C_{d,\gamma+1}\,(\gamma+2)^d\,(\gamma+2)^5 \le 144\,C\,(d+1)\,(\gamma+2)^{2d+5},$$
where C is the constant from Lemma 2. □

5. Discussion

Although various activation functions, including the ReLU, the sigmoid and the Gaussian function, have already been used in the literature for neural network approximations of smooth and analytic functions (see [3,8,21]), the approximating properties of neural networks with an absolute value activation function, which is a built-in activation function of neuroevolution frameworks such as NEAT-Python ([11]), have barely been covered previously. Whereas the algorithms developed in [12,13] allow us to train neural networks with an absolute value activation function, in the present paper, we study the capability of those networks to approximate analytic functions. While popular types of constraints imposed on approximating neural networks either control the $\ell_p$ norms of the network weights or restrict their architectures, in the present work, we study the approximating properties of neural networks with regularized path norms and show that networks with an absolute value activation function and with network path norms having logarithmic dependence on $1/\varepsilon$ can $\varepsilon$-approximate functions that are analytic on certain regions of $\mathbb{C}^d$. The sizes and the weights of the constructed networks also have logarithmic dependence on $1/\varepsilon$.

Funding

This research was funded by NWO Vidi grant: “Statistical foundation for multilayer neural networks”: VI.Vidi.192.021.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author would like to thank Johannes Schmidt-Hieber for support and valuable suggestions. The author is also grateful to the referees for the evaluation of the paper and for constructive comments.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Scarselli, F.; Tsoi, A.C. Universal approximation using feedforward neural networks: A survey of some existing methods, and some new results. Neural Netw. 1998, 11, 15–37.
  2. Lu, Z.; Pu, H.; Wang, F.; Hu, Z.; Wang, L. The expressive power of neural networks: A view from the width. Adv. Neural Inf. Process. Syst. 2017, 30, 6231–6239.
  3. E, W.; Wang, Q. Exponential convergence of the deep neural network approximation for analytic functions. Sci. China Math. 2018, 61, 1733–1740.
  4. Neyshabur, B.; Tomioka, R.; Srebro, N. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), Paris, France, 3–6 July 2015; pp. 1376–1401.
  5. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897.
  6. Taheri, M.; Xie, F.; Lederer, J. Statistical guarantees for regularized neural networks. Neural Netw. 2021, 142, 148–161.
  7. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
  8. Opschoor, J.A.A.; Schwab, C.; Zech, J. Exponential ReLU DNN expression of holomorphic maps in high dimension. Constr. Approx. 2021, 55, 537–582.
  9. Barron, A.; Klusowski, J. Approximation and estimation for high-dimensional deep learning networks. arXiv 2018, arXiv:1809.03090.
  10. Zheng, S.; Meng, Q.; Zhang, H.; Chen, W.; Yu, N.; Liu, T. Capacity control of ReLU neural networks by basis-path norm. arXiv 2019, arXiv:1809.07122.
  11. Overview of Builtin Activation Functions. Available online: https://neat-python.readthedocs.io/en/latest/activation.html (accessed on 5 July 2022).
  12. Batruni, R. A multilayer neural network with piecewise-linear structure and backpropagation learning. IEEE Trans. Neural Netw. 1991, 2, 395–403.
  13. Lin, J.-N.; Unbehauen, R. Canonical piecewise-linear neural networks. IEEE Trans. Neural Netw. 1995, 6, 43–50.
  14. Bartlett, P.L.; Harvey, N.; Liaw, C.; Mehrabian, A. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res. 2019, 20, 2285–2301.
  15. He, F.; Wang, B.; Tao, D. Piecewise linear activations substantially shape the loss surfaces of neural networks. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
  16. Mason, J.C.; Handscomb, D.C. Chebyshev Polynomials; Chapman and Hall/CRC: New York, NY, USA, 2002.
  17. Trefethen, L.N. Approximation Theory and Approximation Practice; SIAM: Philadelphia, PA, USA, 2013.
  18. Bernstein, S. Sur la meilleure approximation de |x| par des polynomes de degrés donnés. Acta Math. 1914, 37, 1–57.
  19. Trefethen, L.N. Multivariate polynomial approximation in the hypercube. Proc. Am. Math. Soc. 2017, 145, 4837–4844.
  20. Mason, J.C. Near-best multivariate approximation by Fourier series, Chebyshev series and Chebyshev interpolation. J. Approx. Theory 1980, 28, 349–358.
  21. Mhaskar, H.N. Neural networks for optimal approximation of smooth and analytic functions. Neural Comput. 1996, 8, 164–177.