Article

On the Universally Optimal Activation Function for a Class of Residual Neural Networks

1 Department of Electronics Engineering, Tsinghua University, Beijing 100089, China
2 Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen 518000, China
* Author to whom correspondence should be addressed.
AppliedMath 2022, 2(4), 574-584; https://doi.org/10.3390/appliedmath2040033
Submission received: 19 September 2022 / Revised: 8 October 2022 / Accepted: 12 October 2022 / Published: 16 October 2022

Abstract:
While non-linear activation functions play vital roles in artificial neural networks, it is generally unclear how the non-linearity improves the quality of function approximations. In this paper, we present a theoretical framework to rigorously analyze the performance gain of using non-linear activation functions for a class of residual neural networks (ResNets). In particular, we show that when the input features of the ResNet are uniformly chosen and orthogonal to each other, using non-linear activation functions to generate the ResNet output outperforms using linear activation functions on average, and the performance gain can be computed explicitly. Moreover, we show that when the activation functions are chosen as polynomials whose degree is much less than the dimension of the input features, the optimal activation functions can be expressed precisely in the form of Hermite polynomials. This demonstrates the role of Hermite polynomials in function approximations with ResNets.

1. Introduction

Consider a function approximation problem in which we use functions $f_1, \ldots, f_k$ to approximate a target function $g$. Assume $g, f_i$ are bijective functions whose ranges consist of $n\,(>k)$ elements of $\mathbb{R}$, and, to obtain different functions, we sample the elements of the ranges of $g, f_i$ from the Gaussian distribution. When we only use linear combinations $\sum_{i=1}^k w_i f_i$ to approximate $g$, the expectation of the minimal residual sum of squares (RSS) equals $1-\frac{k}{n}$. However, if we introduce a non-linear function $\sigma$ and use $\sigma(\sum_{i=1}^k w_i f_i)$ to approximate $g$, we can achieve an averaged RSS lower than $1-\frac{k}{n}$. This paper quantitatively investigates how much lower an RSS the non-linear approximator achieves than its linear counterpart.
The function approximation problem described above can also be formulated in the terminology of neural networks. It is well known that, with only one hidden layer and the sigmoid as the activation function, neural networks can approximate any continuous function [1]. However, this universal approximation theorem requires the non-linear activation function to satisfy certain regularity conditions, and it does not tell us which kind of activation function performs best for a given problem. In contrast, this paper shows that the Hermite polynomial is the optimal candidate for achieving the minimal RSS in the problem mentioned above.
Moreover, note that most previous works along this direction focus on providing upper and lower bounds [2] on the function approximation performance without closed-form characterizations. Our formulation, on the other hand, not only illustrates the role that non-linearity plays in the activation function but, more importantly, provides a closed-form solution for the non-linear activation function that achieves the minimal RSS. To the best of our knowledge, this is the first work that demonstrates, in a tight formulation, how profitable non-linear activation functions can be.
In particular, our work is not restricted to any specific family of activation functions; it only requires the activation function to be a perturbation of a linear function. This assumption allows us to study the non-linear gain when a non-linear term is added to the network. Such a gain can be expressed quantitatively as a second-order correction to $1-\frac{k}{n}$. To simplify the technical analysis, we use the one-node neural network, which outputs a one-dimensional feature [3]. The network we consider can be regarded as a special ResNet model, which is widely used as a building block in deep neural network architectures. We use the RSS to measure the network loss, from which we obtain the universally optimal activation function for the given network; the averaged loss is minimized when the activation function is a Hermite polynomial, which validates previous empirical results [4].
The rest of this paper is organized as follows. In Section 3, we formulate the universally optimal activation function problem mathematically. Under specific assumptions, we derive the optimal solution in the form of Hermite polynomials and establish the error rate in Section 4. Furthermore, in Section 5, numerical experiments are conducted to verify our theoretical results. Section 6 concludes the paper. The detailed proofs of our results are provided in Appendix A.
Throughout this paper, we use $X$, $\underline{X}$ and $\mathbf{X}$ to represent random variables, random vectors and random matrices, and we use $x$, $\underline{x}$, $\mathbf{x}$ (or $\mathbf{M}$) to represent a numerical value, a numerical vector and a numerical matrix. In addition, we use $\mathbf{I}_n$ to denote the $n$-dimensional identity matrix, and $[\cdot]^T$ is the transpose operation on matrices. Moreover, we use $1_{m=n}$ to denote the indicator function and $\gamma(i,j) \triangleq (i+j+1) \bmod 2$, which equals 1 if $i+j$ is even and 0 otherwise. Finally, $n!!$ is the double factorial, which equals $n \times (n-2) \times \cdots \times 1$ if $n$ is odd and $n \times (n-2) \times \cdots \times 2$ if $n$ is even; we additionally define $(-1)!! = 1$ and $(-3)!! = -1$.
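As a concrete illustration of this notation, the following minimal Python sketch (ours; the helper names are hypothetical, not from the paper) implements $\gamma(i,j)$ and the double factorial with the boundary values above.

```python
# Minimal helpers for the notation used in this paper (hypothetical names).
def gamma_parity(i: int, j: int) -> int:
    """gamma(i, j) = (i + j + 1) mod 2: indicator that i + j is even."""
    return (i + j + 1) % 2

def double_factorial(n: int) -> int:
    """n!! = n (n - 2) ... down to 1 or 2, with (-1)!! = 1 and (-3)!! = -1."""
    if n == -1:
        return 1
    if n == -3:
        return -1
    result = 1
    while n > 0:
        result *= n
        n -= 2
    return result

assert [double_factorial(n) for n in (5, 6, 0)] == [15, 48, 1]
assert gamma_parity(2, 4) == 1 and gamma_parity(1, 2) == 0
```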

2. Related Works

To explore the influence of network structures, most existing theoretical works focus on the number of layers and hidden units. For example, Kuri-Morales determines the minimal number of hidden units for a multilayer perceptron [5]. ReLU is used in [6], which focuses on the number of layers and hidden units. We note that these works are restricted to a specific activation function and seldom explore the best performance that a non-linear activation function can achieve for a given network architecture.
To compensate for this shortcoming, our study of the activation function adopts the approach of statistical learning theory [7], which regards the learning problem from the view of probability. Our theoretical result can explain previous empirical findings such as:
  • Fixing the network structure, optimizing over a parameterized activation function can improve the overall performance [8].
  • Using Hermite orthogonal polynomials as activation functions works better than the sigmoid under certain conditions [4].

3. Models and Methods

We consider a one-node neural network, whose structure is shown in Figure 1. Its input is a $k$-dimensional random vector $\underline{X} = (X_1, \ldots, X_k)$ and its output is a random variable $Y$. The one-node neural network serves as the output layer of a larger neural network for a regression task, where $\underline{X}$ is the feature vector extracted by the previous layers. Although our study focuses on this specific network structure, we enlarge the choice of the activation term $\sigma$ to gain insights into the best choice of activation functions. Our goal is to predict $Y$ using $\sigma(\sum_{i=1}^k w_i X_i)$. For a given sampling result, we have $n$ data pairs $(\underline{x}_1, y_1), \ldots, (\underline{x}_n, y_n)$, and the $\ell_2$ loss function is given by $\ell_2(\mathbf{x}, \underline{y}) = \|\underline{y} - \sigma(\sum_{i=1}^k w_i \underline{x}_i)\|^2$, where $\underline{x}_i = (x_{1i}, \ldots, x_{ni})$ is the $i$-th column of the feature matrix $\mathbf{x}$ and $\underline{y} = (y_1, \ldots, y_n)$ is the label vector; $\sigma$ is applied to a vector elementwise. The optimal weight for a given activation function $\sigma$ is the minimizer of $\ell_2(\mathbf{x}, \underline{y})$.
This type of network arises from function approximation problems, where a target function $g$, defined on a finite set, is estimated by $\sigma(\sum_{i=1}^k w_i f_i)$. Suppose the ranges of $f_i, g$ have cardinality $n$; then, the functions themselves can be completely determined by $n$-dimensional vectors. Furthermore, we assume each element of the ranges of $f_i, g$ is sampled from a distribution, since we are considering the average performance over different $f_i, g$ pairs. Then, we can establish a correspondence between $f_i$ and $\underline{x}_i$, and between $g$ and $\underline{y}$.
In our model, we suppose $\mathbf{x}, \underline{y}$ are drawn from a joint distribution $G$; that is, $\mathbf{x}, \underline{y}$ are samples of a random matrix $\mathbf{X}$ and a random vector $\underline{Y}$. We are interested in finding a function $\sigma$ which minimizes the expectation $\mathbb{E}[\min_{\underline{w}} \ell_2(\mathbf{X}, \underline{Y})]$. We require the activation function $\sigma$ to be of the special form $\sigma(z) = z + \epsilon\,\xi(z)$, where $\epsilon$ is a small constant. Such a form of $\sigma$ can be regarded as a special kind of ResNet [9]. To draw a fair comparison between different non-linear terms $\xi$ for a given $\epsilon$, we apply the normalization constraint $\mathbb{E}[\|\xi(\mathbf{X}\underline{w})\|^2] = 1$ to $\xi$, where $\mathbf{X}\underline{w} = \sum_{i=1}^k w_i \underline{X}_i$ is the matrix-vector product with $\underline{w} = (w_1, \ldots, w_k)$.
We formulate the universally optimal activation function σ as follows:
Definition 1.
Assume $(\mathbf{X}, \underline{Y})$ follows the distribution $G$, and let $\mathcal{F}$ be a function space for $\sigma(z)$. Then, we define the averaged residual error $\mathcal{E}(\sigma)$ as
$$\mathcal{E}(\sigma) \triangleq \mathbb{E}\Big[\min_{\underline{w}} \|\underline{Y} - \sigma(\mathbf{X}\underline{w})\|^2\Big] \tag{1}$$
The function $\sigma^*$ which minimizes $\mathcal{E}(\sigma)$ is called the universally optimal activation function for the one-node neural network, i.e., $\sigma^* = \arg\min_{\sigma\in\mathcal{F}} \mathcal{E}(\sigma)$.
Definition 1 is a general formulation for any distribution space G. To obtain analytical insights, we should choose some specific distribution. Therefore, in the following analysis, we assume:
(1) $\underline{Y}$ follows the Gaussian distribution $\mathcal{N}(0, \frac{1}{n}\mathbf{I}_n)$.
(2) $\mathbf{X}$ is an $n \times k$ uniformly distributed random orthogonal matrix.
(3) $\underline{Y}$ and $\mathbf{X}$ are independent.
For assumption (2), the uniformly distributed random orthogonal matrix is defined as $\mathbf{X} \triangleq \mathbf{X}'(\mathbf{X}'^T\mathbf{X}')^{-1/2}$, where the elements of $\mathbf{X}'$ are i.i.d. $\mathcal{N}(0,1)$ random variables ([10], Proposition 7.1). Since $\mathbf{X}^T\mathbf{X} = \mathbf{I}_k$, $\mathbf{X}$ is indeed an orthogonal matrix. The definition of $\mathbf{X}$ can be regarded as the result of post-processing the network input $\mathbf{X}'$ and weight $\underline{w}'$ by PCA (Principal Component Analysis), because under the transformation $\mathbf{X} = \mathbf{X}'(\mathbf{X}'^T\mathbf{X}')^{-1/2}$, $\underline{w} = (\mathbf{X}'^T\mathbf{X}')^{1/2}\underline{w}'$ we have $\mathbf{X}\underline{w} = \mathbf{X}'\underline{w}'$. For assumption (3), notice that when we project a random vector $\underline{Y}$ onto a fixed linear subspace, $\underline{Y}$ is independent of that fixed subspace. When evaluating the representability of a non-linear activation, we also choose a scenario in which the input and output are independent, to demonstrate that it is the non-linearity, rather than the correlation, that helps to decrease the residual error. Based on the distribution space $G$ given by (1)-(3), we have the following conclusion for linear activation functions:
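The sampling procedure behind assumptions (1)-(3) can be sketched in a few lines of Python (our illustration, not the authors' code); `sample_X_Y` is a hypothetical helper that draws $\mathbf{X} = \mathbf{X}'(\mathbf{X}'^T\mathbf{X}')^{-1/2}$ together with $\underline{Y} \sim \mathcal{N}(0, \frac{1}{n}\mathbf{I}_n)$.

```python
# A minimal sketch of assumptions (1)-(3): sample a uniformly distributed
# n x k random orthogonal matrix X = X'(X'^T X')^{-1/2}, where X' has
# i.i.d. N(0, 1) entries, and an independent Y ~ N(0, I_n / n).
import numpy as np

def sample_X_Y(n: int, k: int, rng: np.random.Generator):
    Xp = rng.standard_normal((n, k))            # X' with i.i.d. N(0, 1) entries
    evals, evecs = np.linalg.eigh(Xp.T @ Xp)    # eigendecomposition of the Gram matrix
    inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T   # (X'^T X')^{-1/2}
    X = Xp @ inv_sqrt                           # satisfies X^T X = I_k
    Y = rng.standard_normal(n) / np.sqrt(n)     # Y ~ N(0, I_n / n), independent of X
    return X, Y

rng = np.random.default_rng(0)
X, Y = sample_X_Y(n=180, k=120, rng=rng)
assert np.allclose(X.T @ X, np.eye(120), atol=1e-8)
```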
Proposition 1.
We have $\mathcal{E}(\sigma_0) = 1 - \frac{k}{n}$, where $\sigma_0(z) = z$ is the identity function.
Proposition 1 says that the averaged residual error for the linear (identity) activation equals $1-\frac{k}{n}$, which is consistent with our intuition, since we use $k$ degrees of freedom to estimate an arbitrary vector in an $n$-dimensional space.
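Proposition 1 is easy to check numerically. The sketch below (ours) estimates $\mathbb{E}[\min_{\underline{w}}\|\underline{Y}-\mathbf{X}\underline{w}\|^2]$ by Monte Carlo; it uses a QR factorization to obtain an orthonormal basis of a uniformly random column space, which is all that the projection $\mathbf{A} = \mathbf{X}\mathbf{X}^T$ depends on.

```python
# A quick Monte Carlo sanity check (a sketch, not the paper's code) of
# Proposition 1: with a linear activation, E[min_w ||Y - Xw||^2] = 1 - k/n.
import numpy as np

def averaged_linear_rss(n: int, k: int, trials: int = 200, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((n, k)))  # orthonormal basis of a random subspace
        Y = rng.standard_normal(n) / np.sqrt(n)           # Y ~ N(0, I_n / n)
        residual = Y - X @ (X.T @ Y)                      # optimal linear fit uses w_0 = X^T Y
        total += residual @ residual
    return total / trials

n, k = 180, 120
print(averaged_linear_rss(n, k), 1 - k / n)               # both close to 1/3
```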
We have analyzed the averaged error for the linear function. To extend our result to non-linear functions, we need to choose a proper function space $\mathcal{F}$. In this paper, we consider a local region that contains non-linear functions perturbed from the linear one, where the distance from a non-linear function to the linear function is quantified by a positive value $\epsilon$. Thus, we consider $\mathcal{F}_{\sigma_0}(\epsilon) \triangleq \{\sigma \,|\, \|\sigma - \sigma_0\| \le \epsilon\}$. Then, all smooth functions can be treated as $\bigcup_{\epsilon>0}\mathcal{F}_{\sigma_0}(\epsilon)$. The norm of a function is chosen as the expectation with respect to the distribution space $G$; that is, $\|\sigma - \sigma_0\|^2 \triangleq \mathbb{E}[\|(\sigma - \sigma_0)(\mathbf{A}\underline{Y})\|^2]$, where $\mathbf{A} = \mathbf{X}\mathbf{X}^T$.
We are particularly interested in how the averaged error changes when $\sigma$ contracts to $\sigma_0$ along a certain direction in $\mathcal{F}_{\sigma_0}(\epsilon)$. That is, given a function $\xi$, we construct $\sigma = \sigma_0 + \epsilon\,\xi$, where $\xi \in \mathcal{F}_0 \triangleq \{\xi \,|\, \|\xi\| \le 1\}$. Then, $\sigma \to \sigma_0$ is equivalent to $\epsilon \to 0$.
To measure the rate of change of $\mathcal{E}(\sigma)$ as $\sigma \to \sigma_0$, we introduce the concept of the asymptotic error rate as follows:
Definition 2.
Let $\xi \in \mathcal{F}_0$; then, the asymptotic error rate for $\xi$ is
$$C[\xi] \triangleq \lim_{\substack{\epsilon \to 0 \\ \sigma = \sigma_0 + \epsilon\xi}} \frac{\mathcal{E}(\sigma) - \mathcal{E}(\sigma_0)}{\epsilon^2} \tag{2}$$
$C[\xi]$ represents the rate of change of the error under a perturbation away from the linear function. If $C[\xi]$ is negative for a given $\xi$, then $\mathcal{E}(\sigma)$ decreases at the rate $|C[\xi]|$ along the perturbation direction $\xi$, which can be seen more clearly if we rewrite (2) as $\mathcal{E}(\sigma) = \mathcal{E}(\sigma_0) + C[\xi]\,\epsilon^2 + o(\epsilon^2)$. In this form, we also see that $C[\xi]$ is the coefficient of the second-order term of $\mathcal{E}(\sigma)$.
To justify the definition of $C[\xi]$, we need to show that the limit in Equation (2) exists, which is guaranteed by the following proposition:
Proposition 2.
Let $\xi'(\underline{z}) \triangleq \mathrm{diag}[\xi'(\underline{z}_1), \ldots, \xi'(\underline{z}_n)]$. Then, we have
$$C[\xi] = \mathbb{E}\Big[\|\xi(\mathbf{A}\underline{Y})\|^2 - \|\mathbf{X}^T\xi(\mathbf{A}\underline{Y})\|^2 - \|\mathbf{X}^T\xi'(\mathbf{A}\underline{Y})(\underline{Y} - \mathbf{A}\underline{Y})\|^2\Big] \tag{3}$$
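The closed form in Proposition 2 can be estimated by straightforward Monte Carlo sampling. The sketch below (ours, assuming the reconstruction of Equation (3) above, with $\xi$ and $\xi'$ supplied as elementwise callables) also illustrates the sanity check that a linear perturbation $\xi(z) = z$ yields $C[\xi] = 0$, since such a perturbation can be absorbed into the weights.

```python
# A Monte Carlo sketch (ours, not the authors' code) of the closed form in
# Proposition 2. xi and xi_prime are elementwise callables.
import numpy as np

def estimate_C(xi, xi_prime, n, k, trials=500, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal columns
        Y = rng.standard_normal(n) / np.sqrt(n)            # Y ~ N(0, I_n / n)
        Z = X @ (X.T @ Y)                                   # Z = A Y
        resid = Y - Z                                       # (I - A) Y
        vals.append(np.sum(xi(Z) ** 2)
                    - np.sum((X.T @ xi(Z)) ** 2)
                    - np.sum((X.T @ (xi_prime(Z) * resid)) ** 2))
    return float(np.mean(vals))

# Sanity check: a linear perturbation gives no second-order gain, C[z] = 0.
print(estimate_C(lambda z: z, lambda z: np.ones_like(z), n=180, k=120))   # ~0
```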
Our goal is to obtain the fastest decreasing direction $\xi$ of $\mathcal{E}(\sigma)$ around $\mathcal{E}(\sigma_0)$; this is equivalent to solving the optimization problem $\min C[\xi]$ subject to $\xi \in \mathcal{F}_0$. It is hard to optimize $\xi$ over $\mathcal{F}_0$ directly, so we consider $\mathcal{F}_{0,m} \triangleq \mathcal{F}_0 \cap \mathcal{P}_m$ instead, where $\mathcal{P}_m$ consists of polynomials of degree no greater than $m$; we have $\mathcal{F}_0 = \lim_{m\to\infty}\mathcal{F}_{0,m}$.
For $\xi \in \mathcal{F}_{0,m}$, we have the following result:
Proposition 3.
If $\xi(z) = \sum_{i=0}^m \underline{q}_i z^i$ and $m \ll k$, then we have
$$C[\xi] = -\Big(1-\frac{k}{n}\Big)\,\underline{p}^T\mathbf{M}\underline{p}, \quad \text{where}\quad \underline{p}_i = \frac{k^{i/2}}{n^{-1/2+i}}\,\underline{q}_i, \qquad \mathbf{M}_{ij} = \gamma(i,j)\,(i-1)(j-1)\,(i+j-3)!! \tag{4}$$
The constraint $\xi \in \mathcal{F}_{0,m}$ is equivalent to
$$\underline{p}^T\mathbf{N}\underline{p} = 1, \quad \text{with}\quad \mathbf{N}_{ij} = \gamma(i,j)\,(i+j-1)!! \tag{5}$$
Proposition 3 gives a feasible approach to choosing the optimal $\xi$ by solving a quadratic optimization problem. We make the assumption $m \ll k$ in order to write $\mathbf{M}_{ij}$ in a concise form. This assumption requires that we choose low-degree polynomials as activation functions and that the number of nodes $k$ be relatively large to represent the features. Since the computational cost is proportional to the degree of the polynomial and feature dimensions are usually high, this assumption does not lose practicality.
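Under the reconstruction of Equation (4) above, choosing the optimal $\xi \in \mathcal{F}_{0,m}$ amounts to maximizing $\underline{p}^T\mathbf{M}\underline{p}$ subject to $\underline{p}^T\mathbf{N}\underline{p} = 1$, i.e., a generalized eigenvalue problem. The sketch below (ours) builds $\mathbf{M}$ and $\mathbf{N}$ and solves it with SciPy.

```python
# A sketch (assuming the sign convention of Proposition 3 as reconstructed
# above): build M and N and solve min_{p^T N p = 1} -(1 - k/n) p^T M p.
import numpy as np
from scipy.linalg import eigh

def double_factorial(x: int) -> int:
    if x == -1:
        return 1
    if x == -3:
        return -1
    out = 1
    while x > 0:
        out, x = out * x, x - 2
    return out

def gamma_parity(i, j):
    return (i + j + 1) % 2

def optimal_rate(m: int, n: int, k: int):
    idx = np.arange(m + 1)
    M = np.array([[gamma_parity(i, j) * (i - 1) * (j - 1) * double_factorial(i + j - 3)
                   for j in idx] for i in idx], dtype=float)
    N = np.array([[gamma_parity(i, j) * double_factorial(i + j - 1)
                   for j in idx] for i in idx], dtype=float)
    evals, evecs = eigh(M, N)               # generalized eigenvalues of (M, N), ascending
    # Minimizing C = -(1 - k/n) p^T M p under p^T N p = 1 picks the largest eigenvalue.
    return -(1 - k / n) * evals[-1], evecs[:, -1]

rate, p = optimal_rate(m=4, n=180, k=120)
print(rate, -(1 - 120 / 180) * (4 - 1))      # both approximately -1.0
```

The returned coefficient vector is normalized so that $\underline{p}^T\mathbf{N}\underline{p} = 1$ and, per Proposition 4, should match the Hermite coefficients up to sign.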

4. Hermite Polynomials

One of our major results is that Proposition 3 leads naturally to Hermite polynomial solutions. The (probabilists') Hermite polynomials, denoted by $H_m(x)$, $m = 1, 2, \ldots$, are defined as a series of polynomials satisfying the orthogonality property $\mathbb{E}[H_m(X)H_n(X)] = n!\,1_{m=n}$, where $X$ is a standard normal random variable. The degree of $H_m(x)$ is $m$. Using Hermite polynomials, the minimizer of $C[\xi]$ is expressed in the following way:
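Numerically, the probabilists' Hermite polynomials are available in NumPy as the `hermite_e` basis; the sketch below (ours) checks the orthogonality property $\mathbb{E}[H_m(X)H_n(X)] = n!\,1_{m=n}$ with Gauss-Hermite quadrature.

```python
# A small numerical check (a sketch) of the orthogonality of the probabilists'
# Hermite polynomials He_m under the standard normal distribution.
import math
import numpy as np
from numpy.polynomial import hermite_e as He

x, w = He.hermegauss(60)                    # nodes/weights for the weight exp(-x^2/2)
w = w / math.sqrt(2.0 * math.pi)            # normalize to the N(0, 1) density

def inner(m: int, n: int) -> float:
    cm = np.zeros(m + 1); cm[m] = 1.0       # coefficient vector selecting He_m
    cn = np.zeros(n + 1); cn[n] = 1.0
    return float(np.sum(w * He.hermeval(x, cm) * He.hermeval(x, cn)))

print(inner(3, 3), math.factorial(3))       # ~6 and 6
print(inner(3, 5))                          # ~0
```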
Proposition 4.
Let
$$\xi_m(z) \triangleq \frac{1}{\sqrt{m!\,n}}\, H_m\!\Big(\frac{n}{k^{1/2}}\, z\Big) \tag{6}$$
Then $C[\xi_m] = -\big(1-\frac{k}{n}\big)(m-1) \le C[\xi]$ for all $\xi \in \mathcal{F}_{0,m}$.
Proposition 4 also gives the expression of the minimal asymptotic error rate, which is linear in the degree of the Hermite polynomial. This result is obtained under the assumption $m \ll k$. In the general case, when $m \ll k$ does not hold, we find by numerical simulation that $-\big(1-\frac{k}{n}\big)(m-1) < C[\xi]$, while $C[\xi]$ still decreases in an approximately linear way.

5. Experimental Validation and Discussions

To verify that $C[\xi]$ changes linearly as we increase the maximal polynomial degree $m$, we conduct a simple experiment with $n = 180$, $k = 120$. We use $\xi_m(z)$ from Proposition 4 and the normalized polynomials containing only the highest-degree term, $\tilde{\xi}_m(z) = \frac{(nz/\sqrt{k})^m}{\sqrt{(2m-1)!!\,n}}$, as activation terms. Using the Monte Carlo method to compute $C[\xi]$ from Proposition 2, we obtain the experimental results shown in Figure 2. As can be seen, the Hermite polynomial results coincide with the theoretical lower bound in Proposition 4. $\tilde{\xi}_m$ does not achieve the minimum value, but we can still observe a nearly linear relationship between $C[\tilde{\xi}_m]$ and the degree $m$. From Equation (4), we also notice that for $\tilde{\xi}_m$ the highest-degree term contributes $-\big(1-\frac{k}{n}\big)\frac{(m-1)^2}{2m-1}$ to $C[\xi]$, which is approximately $\frac{1}{2}C[\xi_m]$. This phenomenon suggests that the contributions of the individual polynomial degrees to $C[\xi]$ are not evenly distributed, and higher-degree terms contribute more.
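A sketch of this experiment (ours, not the authors' code; it relies on the reconstructed Proposition 2 estimator and on $\xi_m$ from Proposition 4) is given below; the printed Monte Carlo estimates of $C[\xi_m]$ can be compared against the bound $-(1-\frac{k}{n})(m-1)$.

```python
# A sketch of the experiment in this section: estimate C[xi_m] for the Hermite
# choice xi_m(z) = He_m(n z / sqrt(k)) / sqrt(m! n) and compare with the bound.
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def estimate_C(xi, xi_prime, n, k, trials=500, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(trials):
        X, _ = np.linalg.qr(rng.standard_normal((n, k)))
        Y = rng.standard_normal(n) / np.sqrt(n)
        Z = X @ (X.T @ Y)
        resid = Y - Z
        vals.append(np.sum(xi(Z) ** 2)
                    - np.sum((X.T @ xi(Z)) ** 2)
                    - np.sum((X.T @ (xi_prime(Z) * resid)) ** 2))
    return float(np.mean(vals))

n, k = 180, 120
for m in range(2, 6):
    cm = np.zeros(m + 1); cm[m] = 1.0                    # selects He_m
    scale = n / math.sqrt(k)
    norm = math.sqrt(math.factorial(m) * n)
    xi = lambda z, cm=cm: He.hermeval(scale * z, cm) / norm
    xi_p = lambda z, cm=cm: scale * He.hermeval(scale * z, He.hermeder(cm)) / norm
    print(m, estimate_C(xi, xi_p, n, k), -(1 - k / n) * (m - 1))
```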

6. Conclusions

In this paper, we have investigated the mechanism of non-linearity in a one-node neural network. If we add a polynomial perturbation to a linear activation function, the network loss decreases linearly as the degree of the polynomial increases. This work only investigates a one-node neural network; how non-linearity works in neural networks with more neurons will be considered in future work.

Author Contributions

Methodology, validation, formal analysis, writing—original draft preparation: F.Z.; Conceptualization, writing—review and editing, supervision, project administration, funding acquisition: S.-L.H. These authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the Shenzhen Science and Technology Program under Grant KQTD20170810150821146, National Key R&D Program of China under Grant 2021YFA0715202 and High-end Foreign Expert Talent Introduction Plan under Grant G2021032013L.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RSS   Residual Sum of Squares
ResNets   Residual Neural Networks
PCA   Principal Component Analysis

Appendix A. Proofs

Lemma A1.
Suppose $\mathbf{X}$ is a uniformly distributed random orthogonal matrix and $\mathbf{A} = \mathbf{X}\mathbf{X}^T$; then $\mathbb{E}[\mathbf{A}] = \frac{k}{n}\mathbf{I}_n$.
Proof of Lemma A1.
By the symmetry property, we only need to show that $\mathbb{E}[A_{11}] = \frac{k}{n}$ and $\mathbb{E}[A_{12}] = 0$. For the first equation, we have $n\,\mathbb{E}[A_{11}] = \sum_{i=1}^n \mathbb{E}[A_{ii}] = \sum_{i=1}^n \mathbb{E}\big[\sum_{j=1}^k X_{ij}^2\big] = \sum_{j=1}^k \mathbb{E}\big[\sum_{i=1}^n X_{ij}^2\big]$. Since each column $\underline{X}_j$ has unit norm, the whole summation evaluates to $k$. Therefore, $\mathbb{E}[A_{11}] = \frac{k}{n}$.
On the other hand, $\mathbf{X}$ can be treated as an $n\times k$ sub-block of an $n\times n$ random orthogonal matrix $\bar{\mathbf{X}}$ with the property that $X_{ij} = \bar{X}_{ij}$ for $j \le k$. For $\bar{\mathbf{X}}$, the first and second rows are orthogonal, so $\mathbb{E}\big[\sum_{j=1}^n \bar{X}_{1j}\bar{X}_{2j}\big] = 0$; since the columns of $\bar{\mathbf{X}}$ are exchangeable, this leads to $\mathbb{E}[X_{1j}X_{2j}] = 0$ for $j \le k$. Hence, $\mathbb{E}[A_{12}] = \mathbb{E}\big[\sum_{j=1}^k X_{1j}X_{2j}\big] = 0$. □
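Lemma A1 can also be checked numerically; the short sketch below (ours) averages $\mathbf{A} = \mathbf{X}\mathbf{X}^T$ over many draws.

```python
# A quick numerical check (sketch) of Lemma A1: averaging A = X X^T over many
# uniformly drawn random orthogonal X gives approximately (k / n) I_n.
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 30, 10, 4000
acc = np.zeros((n, n))
for _ in range(trials):
    X, _ = np.linalg.qr(rng.standard_normal((n, k)))
    acc += X @ X.T
acc /= trials
print(np.abs(acc - (k / n) * np.eye(n)).max())   # small, and shrinks as trials grow
```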
Proof of Proposition 1.
When $\epsilon = 0$, the optimal weight is $\underline{w}_0 = \mathbf{X}^T\underline{Y}$. Therefore, $\mathbb{E}[\|\underline{Y} - \sigma_0(\mathbf{X}\underline{w}_0)\|^2] = \mathbb{E}[\|\underline{Y} - \mathbf{X}\mathbf{X}^T\underline{Y}\|^2]$. The expectation can be taken over $\underline{Y}$ first and then over $\mathbf{X}$. Let $\underline{Z} = (\mathbf{I} - \mathbf{A})\underline{Y}$, which is a Gaussian random vector for given $\mathbf{X}$. Therefore, $\mathbb{E}_{\underline{Y}}[\|\underline{Z}\|^2] = \mathrm{tr}\big[(\mathbf{I}-\mathbf{A})^T\,\mathrm{cov}[\underline{Y}]\,(\mathbf{I}-\mathbf{A})\big] = \frac{1}{n}\mathrm{tr}[\mathbf{I}-\mathbf{A}]$. Using Lemma A1, $\mathcal{E}(\sigma_0) = 1 - \frac{k}{n}$ follows. □
Proof of Proposition 2.
For the problem $\min_{\underline{w}}\|\underline{Y} - \sigma(\mathbf{X}\underline{w})\|^2$ with $\sigma = \sigma_0 + \epsilon\xi$, we write $\underline{w} = \underline{w}_0 + \epsilon\hat{\underline{w}} + \epsilon^2\tilde{\underline{w}} + o(\epsilon^2)$, where $\underline{w}_0 = \mathbf{X}^T\underline{Y}$ is the minimizer for $\epsilon = 0$; that is, we assume $\underline{w}$ can be expanded near $\underline{w}_0$ in powers of $\epsilon$. We can expand $\|\underline{Y} - \sigma(\mathbf{X}\underline{w})\|^2$ up to $o(\epsilon^2)$ as
$$\begin{aligned}
&\|\underline{Y} - \mathbf{X}\underline{w}_0 - \epsilon\mathbf{X}\hat{\underline{w}} - \epsilon^2\mathbf{X}\tilde{\underline{w}} - \epsilon\,\xi(\mathbf{X}\underline{w}_0 + \epsilon\mathbf{X}\hat{\underline{w}})\|^2 \\
&\quad= \|\underline{Y} - \mathbf{X}\underline{w}_0 - \epsilon\big(\mathbf{X}\hat{\underline{w}} + \xi(\mathbf{X}\underline{w}_0)\big) - \epsilon^2\big(\mathbf{X}\tilde{\underline{w}} + \xi'(\mathbf{X}\underline{w}_0)\mathbf{X}\hat{\underline{w}}\big)\|^2 \\
&\quad= \|\underline{Y} - \mathbf{X}\underline{w}_0\|^2 - 2\epsilon\big(\mathbf{X}\hat{\underline{w}} + \xi(\mathbf{X}\underline{w}_0)\big)^T(\underline{Y} - \mathbf{X}\underline{w}_0) \\
&\qquad+ \epsilon^2\Big(\|\mathbf{X}\hat{\underline{w}} + \xi(\mathbf{X}\underline{w}_0)\|^2 - 2\big(\mathbf{X}\tilde{\underline{w}} + \xi'(\mathbf{X}\underline{w}_0)\mathbf{X}\hat{\underline{w}}\big)^T(\underline{Y} - \mathbf{X}\underline{w}_0)\Big)
\end{aligned}$$
Here $\mathbf{X}\underline{w}_0 = \mathbf{X}\mathbf{X}^T\underline{Y}$ is the projection of $\underline{Y}$ onto the linear subspace spanned by the columns of $\mathbf{X}$.
We first show that, in the expansion of $\mathcal{E}(\sigma)$, the coefficient of the $\epsilon$ term is zero. We define $\tilde{\underline{Y}} = -\underline{Y} + 2\mathbf{X}\mathbf{X}^T\underline{Y}$, which is the mirror of $\underline{Y}$ about the linear subspace spanned by the columns of $\mathbf{X}$. Then, we have $\mathbf{X}\mathbf{X}^T\underline{Y} = \mathbf{X}\mathbf{X}^T\tilde{\underline{Y}}$ and $(\underline{Y} - \mathbf{X}\mathbf{X}^T\underline{Y}) = -(\tilde{\underline{Y}} - \mathbf{X}\mathbf{X}^T\tilde{\underline{Y}})$. The density function $p_{\underline{Y}}$ satisfies $p_{\underline{Y}}(\underline{Y}) = p_{\underline{Y}}(\tilde{\underline{Y}})$. These symmetry properties lead to $\mathbb{E}_{\underline{Y}}[\xi(\mathbf{X}\underline{w}_0)^T(\underline{Y} - \mathbf{X}\underline{w}_0)] = 0$. On the other hand, since $\mathbf{X}^T\mathbf{X} = \mathbf{I}_k$, we have $(\mathbf{X}\hat{\underline{w}})^T(\underline{Y} - \mathbf{X}\underline{w}_0) = \hat{\underline{w}}^T(\mathbf{X}^T\underline{Y} - \underline{w}_0) = 0$.
Next, we minimize the coefficient of $\epsilon^2$ over $\hat{\underline{w}}$, which simplifies to
$$\|\mathbf{X}\hat{\underline{w}} + \xi(\mathbf{X}\underline{w}_0)\|^2 - 2\big(\xi'(\mathbf{X}\underline{w}_0)\mathbf{X}\hat{\underline{w}}\big)^T(\underline{Y} - \mathbf{X}\underline{w}_0) \tag{A1}$$
due to $(\mathbf{X}\tilde{\underline{w}})^T(\underline{Y} - \mathbf{X}\underline{w}_0) = 0$. The expression in (A1) is quadratic in $\hat{\underline{w}}$, and the minimum is achieved at
$$\hat{\underline{w}} = \mathbf{X}^T\big(\xi'(\mathbf{X}\mathbf{X}^T\underline{Y})(\underline{Y} - \mathbf{X}\mathbf{X}^T\underline{Y}) - \xi(\mathbf{X}\mathbf{X}^T\underline{Y})\big)$$
Substituting this expression for $\hat{\underline{w}}$ into (A1), we obtain the minimum value of the coefficient of $\epsilon^2$, which is exactly $C[\xi]$. □
Lemma A2.
Suppose $(X, Y)$ follows a two-dimensional Gaussian distribution $\mathcal{N}(0, \Sigma)$; then, for the case where $i + j$ is even, we have
$$\mathbb{E}[X^iY^j] = \sum_{k=0}^{\min\{i,j\}}\gamma(k,i)\,\frac{i!\,j!}{k!\,(i-k)!!\,(j-k)!!}\,\Sigma_{12}^k\,\Sigma_{11}^{(i-k)/2}\,\Sigma_{22}^{(j-k)/2}$$
Proof of Lemma A2.
Using Isserlis' theorem [11], we obtain $\mathbb{E}[X^iY^j] = \sum_{k=0}^{\min\{i,j\}} C_k\,\Sigma_{12}^k\Sigma_{11}^{(i-k)/2}\Sigma_{22}^{(j-k)/2}$. The power of $\Sigma_{11}$ must be an integer; therefore, if $\gamma(k,i) = 0$, then $C_k = 0$. For $\gamma(k,i) = 1$, we choose $k$ copies of $X$ and of $Y$ to form $\Sigma_{12} = \mathbb{E}[XY]$, which can be done in $\binom{i}{k}\binom{j}{k}$ ways. The remaining $(i-k)$ copies of $X$ must be paired among themselves; there are $\frac{1}{(\frac{i-k}{2})!}\prod_{l=1}^{(i-k)/2}\binom{2l}{2} = \frac{(i-k)!}{(i-k)!!}$ possible pairings, and for the $(j-k)$ copies of $Y$ there are $\frac{(j-k)!}{(j-k)!!}$. Finally, the $k$ chosen copies of $X$ can be matched with the $k$ chosen copies of $Y$ in $k!$ ways. The product $\binom{i}{k}\binom{j}{k}\frac{(i-k)!}{(i-k)!!}\frac{(j-k)!}{(j-k)!!}\,k! = \frac{i!\,j!}{k!\,(i-k)!!\,(j-k)!!}$ is the coefficient $C_k$. □
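The moment formula of Lemma A2 can be spot-checked against two classical bivariate Gaussian moments; the sketch below (ours, with `moment` a hypothetical helper implementing the right-hand side) does so.

```python
# A sketch verifying Lemma A2 on two low-order cases where the bivariate
# Gaussian moments are classical: E[X^2 Y^2] = S11 S22 + 2 S12^2 and
# E[X Y^3] = 3 S12 S22.
from math import factorial

def dfact(x):
    out = 1
    while x > 0:
        out, x = out * x, x - 2
    return out

def moment(i, j, S11, S22, S12):
    total = 0.0
    for k in range(min(i, j) + 1):
        if (k + i) % 2 or (k + j) % 2:       # gamma(k, i) gamma(k, j) = 0
            continue
        coeff = factorial(i) * factorial(j) / (factorial(k) * dfact(i - k) * dfact(j - k))
        total += coeff * S12 ** k * S11 ** ((i - k) // 2) * S22 ** ((j - k) // 2)
    return total

S11, S22, S12 = 2.0, 3.0, 0.5
assert abs(moment(2, 2, S11, S22, S12) - (S11 * S22 + 2 * S12 ** 2)) < 1e-12
assert abs(moment(1, 3, S11, S22, S12) - 3 * S12 * S22) < 1e-12
```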
Lemma A3.
$$\gamma(i,j)\,(i+j-1)!! = \sum_{k=0}^{\min\{i,j\}}\gamma(k,i)\gamma(k,j)\,\frac{i!\,j!}{(i-k)!!\,(j-k)!!\,k!} \tag{A2}$$
$$\gamma(i,j)\,(i-1)(j-1)(i+j-3)!! = \sum_{k=0}^{\min\{i,j\}}\gamma(k,i)\gamma(k,j)\,(k-1)\,\frac{i!\,j!}{(i-k)!!\,(j-k)!!\,k!} \tag{A3}$$
Proof of Lemma A3.
If $i + j$ is odd, then $\gamma(i,j) = 0$ and $\gamma(k,i)\gamma(k,j) = 0$ for every $k$, so Equations (A2) and (A3) hold. Therefore, we only need to consider the case where $i + j$ is even. First, we define $A(i,j) = \sum_{k=0}^{\min\{i,j\}}\gamma(k,i)\gamma(k,j)\frac{i!\,j!}{(i-k)!!\,(j-k)!!\,k!}$. Then, we can show that
$$\begin{aligned} A(2i+1, 2j+1) &= (2i+1)\,A(2i, 2j) + 2j\,A(2i+1, 2j-1) \\ A(2i, 2j) &= (2j-1)\,A(2i, 2j-2) + 2i\,A(2i-1, 2j-1) \end{aligned}$$
Then, using mathematical induction, we have $A(i,j) = (i+j-1)!!$ when $i+j$ is even. Combining this with the discussion of the parity of $i+j$, in general we have $A(i,j) = \gamma(i,j)(i+j-1)!!$, which is Equation (A2). Equation (A3) follows by applying (A2): the extra factor $k$ in the summand turns the sum into $ij\,A(i-1, j-1) = ij\,(i+j-3)!!$, and $ij\,(i+j-3)!! - (i+j-1)!! = (i-1)(j-1)(i+j-3)!!$. □
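Both identities can be verified by brute force for small $i, j$; the sketch below (ours) checks (A2) and (A3) exactly in integer arithmetic.

```python
# A brute-force check (sketch) of the combinatorial identities (A2) and (A3)
# over a small range of (i, j) with i + j even.
from math import factorial

def dfact(x):
    if x == -1:
        return 1
    if x == -3:
        return -1
    out = 1
    while x > 0:
        out, x = out * x, x - 2
    return out

def rhs(i, j, weight):
    total = 0
    for k in range(min(i, j) + 1):
        if (k + i) % 2 == 0 and (k + j) % 2 == 0:
            total += weight(k) * factorial(i) * factorial(j) // (
                dfact(i - k) * dfact(j - k) * factorial(k))
    return total

for i in range(0, 8):
    for j in range(0, 8):
        if (i + j) % 2 == 0:
            assert rhs(i, j, lambda k: 1) == dfact(i + j - 1)                          # (A2)
            assert rhs(i, j, lambda k: k - 1) == (i - 1) * (j - 1) * dfact(i + j - 3)  # (A3)
```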
Proof of Proposition 3.
Let $\underline{Z} = \mathbf{A}\underline{Y}$ and $\underline{Z}' = (\mathbf{I} - \mathbf{A})\underline{Y}$. Given $\mathbf{X}$, $\underline{Z}$ is a Gaussian vector with covariance matrix $\mathbf{A}\,\mathrm{cov}[\underline{Y}]\,\mathbf{A}^T = \frac{\mathbf{A}}{n}$; similarly, $\mathrm{cov}[\underline{Z}'] = \frac{\mathbf{I}-\mathbf{A}}{n}$. Using $\mathbf{A}^2 = \mathbf{A}$, we have $\mathbb{E}[\underline{Z}\,\underline{Z}'^T] = 0$. That is, $\underline{Z}$ and $\underline{Z}'$ are independent, and we have $\mathbb{E}[f(\underline{Z}_i)g(\underline{Z}'_j)] = \mathbb{E}[f(\underline{Z}_i)]\,\mathbb{E}[g(\underline{Z}'_j)]$ for arbitrary functions $f, g$.
Let $C_1 = \mathbb{E}_{\underline{Y}}[\|\xi(\mathbf{A}\underline{Y})\|^2]$, $C_2 = \mathbb{E}_{\underline{Y}}[\|\mathbf{X}^T\xi(\mathbf{A}\underline{Y})\|^2]$, $C_3 = \mathbb{E}_{\underline{Y}}[\|\mathbf{X}^T\xi'(\mathbf{A}\underline{Y})(\underline{Y} - \mathbf{A}\underline{Y})\|^2]$, and $C[\mathbf{X},\xi] := C_1 - C_2 - C_3$. Then, $C[\xi] = \mathbb{E}_{\mathbf{X}}[C[\mathbf{X},\xi]]$, and we have
$$\begin{aligned}
C_1 &= \mathbb{E}[\|\xi(\underline{Z})\|^2] = \sum_{i=1}^n \mathbb{E}[\xi^2(\underline{Z}_i)] \\
C_2 &= \sum_{i,j=1,\, i\neq j}^n A_{ij}\,\mathbb{E}[\xi(\underline{Z}_i)\xi(\underline{Z}_j)] + \sum_{i=1}^n A_{ii}\,\mathbb{E}[\xi^2(\underline{Z}_i)] \\
C_3 &= \sum_{i,j=1,\, i\neq j}^n A_{ij}\,\Sigma_{ij} + \sum_{i=1}^n A_{ii}\,\Sigma_{ii}, \quad \text{where} \\
\Sigma_{ii} &= \mathbb{E}\big[(\underline{Z}'_i)^2(\xi'(\underline{Z}_i))^2\big] = \frac{1 - A_{ii}}{n}\,\mathbb{E}[\xi'^2(\underline{Z}_i)] \\
\Sigma_{ij} &= \mathbb{E}\big[\underline{Z}'_i\underline{Z}'_j\,\xi'(\underline{Z}_i)\xi'(\underline{Z}_j)\big] = -\frac{A_{ij}}{n}\,\mathbb{E}[\xi'(\underline{Z}_i)\xi'(\underline{Z}_j)]
\end{aligned}$$
Combining the above equations, we have
$$C[\mathbf{X},\xi] = \sum_{i=1}^n (1 - A_{ii})\Big(\mathbb{E}[\xi^2(\underline{Z}_i)] - \frac{A_{ii}}{n}\mathbb{E}[\xi'^2(\underline{Z}_i)]\Big) - \sum_{i,j=1,\, i\neq j}^n A_{ij}\Big(\mathbb{E}[\xi(\underline{Z}_i)\xi(\underline{Z}_j)] - \frac{A_{ij}}{n}\mathbb{E}[\xi'(\underline{Z}_i)\xi'(\underline{Z}_j)]\Big)$$
Each term of the summations contributes equally to $\mathbb{E}_{\mathbf{X}}[C[\mathbf{X},\xi]]$. Therefore, with a slight abuse of notation, we can rewrite $C[\mathbf{X},\xi]$ as:
$$C[\mathbf{X},\xi] = n(1 - A_{11})\Big(\mathbb{E}[\xi^2(\underline{Z}_1)] - \frac{A_{11}}{n}\mathbb{E}[\xi'^2(\underline{Z}_1)]\Big) - n(n-1)A_{12}\Big(\mathbb{E}[\xi(\underline{Z}_1)\xi(\underline{Z}_2)] - \frac{A_{12}}{n}\mathbb{E}[\xi'(\underline{Z}_1)\xi'(\underline{Z}_2)]\Big)$$
Since $\xi(z) = \sum_{i=0}^m \underline{q}_i z^i$, we can rewrite $C[\mathbf{X},\xi]$ as the quadratic form $\underline{q}^T\mathbf{M}\underline{q}$, where $\mathbf{M}$ is an $(m+1)\times(m+1)$ random matrix whose element $\mathbf{M}_{ij}$ is given by
$$\mathbf{M}_{ij} = n(1 - A_{11})\Big(\mathbb{E}[\underline{Z}_1^{i+j}] - ij\frac{A_{11}}{n}\mathbb{E}[\underline{Z}_1^{i+j-2}]\Big) - n(n-1)A_{12}\Big(\mathbb{E}[\underline{Z}_1^i\underline{Z}_2^j] - ij\frac{A_{12}}{n}\mathbb{E}[\underline{Z}_1^{i-1}\underline{Z}_2^{j-1}]\Big)$$
Since $\underline{Z}_1$ is Gaussian, we have $\mathbb{E}[\underline{Z}_1^{2t}] = \frac{A_{11}^t}{n^t}(2t-1)!!$. If $\gamma(i,j) = 0$, then $\mathbf{M}_{ij} = 0$. Otherwise, let $2t = i + j$; expanding $\mathbb{E}[\underline{Z}_1^i\underline{Z}_2^j]$ by Lemma A2, we have:
$$\mathbf{M}_{ij} = -\frac{n(1 - A_{11})A_{11}^t}{n^t}(i-1)(j-1)(2t-3)!! - \frac{n(n-1)A_{12}}{n^t}\sum_{s=0}^{\min\{i,j\}}\gamma(s,i)(1-s)\frac{i!\,j!}{s!\,(i-s)!!\,(j-s)!!}\,A_{12}^s A_{11}^{\frac{i-s}{2}} A_{22}^{\frac{j-s}{2}}$$
Taking the expectation over $\mathbf{X}$, $C[\xi] = \mathbb{E}_{\mathbf{X}}[C[\mathbf{X},\xi]] = \underline{q}^T\bar{\mathbf{M}}\underline{q}$, where $\bar{\mathbf{M}}_{ij} = \mathbb{E}_{\mathbf{X}}[\mathbf{M}_{ij}] = M_1 + M_2$ and
$$\begin{aligned} M_1 &= -\frac{\mathbb{E}[A_{11}^t] - \mathbb{E}[A_{11}^{t+1}]}{n^{t-1}}\,(i-1)(j-1)(2t-3)!! \\ M_2 &= \frac{n-1}{n^{t-1}}\sum_{s=0}^{\min\{i,j\}}\gamma(s,i)(s-1)\frac{i!\,j!}{s!\,(i-s)!!\,(j-s)!!}\,\mathbb{E}\big[A_{12}^{s+1}A_{11}^{\frac{i-s}{2}}A_{22}^{\frac{j-s}{2}}\big] \end{aligned}$$
Below, we consider the regime where $k, n \to +\infty$ while $m$ remains finite. Since $\mathbf{A} = \mathbf{X}\mathbf{X}^T$, from Proposition 7.2 of [10], $A_{11}$ follows a beta distribution with parameters $B(\frac{k}{2}, \frac{n-k}{2})$. From Theorem 1 of [12], $\mathbb{E}[A_{11}^t] - \mathbb{E}[A_{11}^{t+1}] = (1-r)\prod_{s=0}^{t-1}\frac{2s+k}{2s+n+2} \approx (1-r)r^t$, where $r \triangleq \frac{k}{n}$. Therefore, $M_1 \approx -\frac{(1-r)r^t}{n^{t-1}}(i-1)(j-1)(2t-3)!!$. For $M_2$, only the terms with $s+1$ even contribute, and $\mathbb{E}[A_{12}^{s+1}A_{11}^{\frac{i-s}{2}}A_{22}^{\frac{j-s}{2}}] \le \sqrt{\mathbb{E}[A_{12}^{s+1}A_{11}^{i-s}]\,\mathbb{E}[A_{12}^{s+1}A_{22}^{j-s}]}$. When $k, n$ are sufficiently large, we can treat $A_{11}, A_{12}$ as jointly Gaussian; from Theorem 2 of [12], $\mathbb{E}[A_{12}^{s+1}A_{11}^{i-s}] \approx s!!\,r^{i-s}\big(\frac{r(1-r)}{n}\big)^{(s+1)/2}$ and, similarly, $\mathbb{E}[A_{12}^{s+1}A_{22}^{j-s}] \approx s!!\,r^{j-s}\big(\frac{r(1-r)}{n}\big)^{(s+1)/2}$. This implies $\mathbb{E}[A_{12}^{s+1}A_{11}^{\frac{i-s}{2}}A_{22}^{\frac{j-s}{2}}] \le s!!\,r^{t-s}\big(\frac{r(1-r)}{n}\big)^{(s+1)/2}$. We can then obtain an upper bound on $M_2$ using Equation (A3): $|M_2| \le \frac{1}{n^{t-2}}(i-1)(j-1)(2t-3)!!\,\mathbb{E}\big[A_{12}^{4}A_{11}^{\frac{i-3}{2}}A_{22}^{\frac{j-3}{2}}\big] \le \frac{3(1-r)^2r^{t-1}}{n^{t}}(i-1)(j-1)(2t-3)!! = \frac{3(1-r)}{k}|M_1|$. When $k \to \infty$, $M_2$ can therefore be ignored and $\bar{\mathbf{M}}_{ij} = -\gamma(i,j)\frac{(1-r)r^t}{n^{t-1}}(i-1)(j-1)(2t-3)!!$. By the change of variables in Equation (4), we obtain the correspondence $\frac{n^{2t-1}}{k^t}\bar{\mathbf{M}}_{ij} = -(1-r)\,\gamma(i,j)(i-1)(j-1)(2t-3)!!$, so that $C[\xi] = -\big(1-\frac{k}{n}\big)\underline{p}^T\mathbf{M}\underline{p}$ with $\mathbf{M}$ as in Equation (4).
To obtain the constraint on $\underline{q}$, we impose $\|\xi\| = 1$, which is equivalent to $\mathbb{E}_{\mathbf{X}}[C_1] = 1$. By a similar computation, we obtain the expression of $\mathbf{N}$ in Equation (5) when $m \ll k$. □
Lemma A4.
Let $\mathbf{U}$ be the $(m+1)\times(m+1)$ upper triangular matrix defined by $U_{ij} = \gamma(i,j)\frac{j!}{(j-i)!!\,\sqrt{i!}}$ for $0 \le i \le j \le m$, let $\underline{e}_m$ be the $(m+1)$-dimensional vector whose elements are all zero except that the last element is 1, and let $\underline{p}_i = \gamma(i,m)(-1)^{(m-i)/2}\frac{\sqrt{m!}}{i!\,(m-i)!!}$. Then, we have $\mathbf{U}\underline{p} = \underline{e}_m$.
Proof of Lemma A4.
For $i = m$, $U_{mm} = \sqrt{m!}$, and we can verify that $U_{mm}\underline{p}_m = 1$. For $0 \le i < m$, we will show that $\sum_{j=i}^m U_{ij}\underline{p}_j = 0$, i.e., in specific form,
$$\sum_{j=i}^m \gamma(i,j)\frac{j!}{(j-i)!!\,\sqrt{i!}}\;\gamma(j,m)(-1)^{(m-j)/2}\frac{\sqrt{m!}}{j!\,(m-j)!!} = 0 \tag{A4}$$
If $\gamma(i,m) = 0$, then $\gamma(i,j)\gamma(j,m) = 0$ for every $j$ and Equation (A4) holds. Otherwise, let $2m' = m - i$; then Equation (A4) is equivalent to
$$\sum_{\substack{j=i \\ \gamma(i,j)=1}}^{m}\frac{(-1)^{(m-j)/2}}{(m-j)!!\,(j-i)!!} = 0 \;\Longleftrightarrow\; \sum_{\substack{j=0 \\ j\ \mathrm{even}}}^{2m'}\frac{(-1)^{(2m'-j)/2}}{(2m'-j)!!\,j!!} = 0 \;\Longleftrightarrow\; \sum_{j=0}^{m'}\frac{(-1)^{m'-j}}{(2m'-2j)!!\,(2j)!!} = 0 \;\Longleftrightarrow\; \sum_{j=0}^{m'}\frac{(-1)^{m'-j}}{(m'-j)!\,j!} = 0 \;\Longleftrightarrow\; \sum_{j=0}^{m'}(-1)^{m'-j}\binom{m'}{j} = 0,$$
which holds since $(1+x)^{m'} = 0$ at $x = -1$. □
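The identity $\mathbf{U}\underline{p} = \underline{e}_m$ can also be confirmed numerically; the sketch below (ours, assuming the reconstructed definitions of $\mathbf{U}$ and $\underline{p}$ above) does so for $m = 6$.

```python
# A numerical check (sketch) of Lemma A4: with U_ij = gamma(i,j) j!/((j-i)!! sqrt(i!))
# and p_i = gamma(i,m) (-1)^((m-i)/2) sqrt(m!)/(i! (m-i)!!), we get U p = e_m.
import math
import numpy as np

def dfact(x):
    out = 1
    while x > 0:
        out, x = out * x, x - 2
    return out

m = 6
U = np.zeros((m + 1, m + 1))
p = np.zeros(m + 1)
for i in range(m + 1):
    for j in range(i, m + 1):
        if (i + j) % 2 == 0:
            U[i, j] = math.factorial(j) / (dfact(j - i) * math.sqrt(math.factorial(i)))
    if (i + m) % 2 == 0:
        p[i] = (-1) ** ((m - i) // 2) * math.sqrt(math.factorial(m)) / (
            math.factorial(i) * dfact(m - i))

e_m = np.zeros(m + 1); e_m[m] = 1.0
assert np.allclose(U @ p, e_m)
```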
Proof of Proposition 4.
Let $\mathbf{U}$ be the matrix defined in Lemma A4. From Lemma A3 (Equation (A2)), we have $\mathbf{N}_{ij} = \sum_{k=0}^m U_{ki}U_{kj}$; therefore, $\mathbf{N} = \mathbf{U}^T\mathbf{U}$. Let $\Lambda = \mathrm{diag}[-1, 0, 1, \ldots, m-1]$; from Equation (A3), we have $\mathbf{M}_{ij} = \sum_{k=0}^m U_{ki}\Lambda_{kk}U_{kj}$, and therefore $\mathbf{M} = \mathbf{U}^T\Lambda\mathbf{U}$. Then,
$$\min_{\underline{p}^T\mathbf{N}\underline{p}=1}\big(-\underline{p}^T\mathbf{M}\underline{p}\big) = \min_{\tilde{\underline{p}}^T\tilde{\underline{p}}=1}\sum_{k=0}^m (1-k)\,\tilde{\underline{p}}_k^2 = 1 - m,$$
where we have made the invertible transformation $\tilde{\underline{p}} = \mathbf{U}\underline{p}$. Since the minimum is achieved at $\tilde{\underline{p}} = \underline{e}_m$, from Lemma A4 we have $\underline{p}_i = \gamma(i,m)(-1)^{(m-i)/2}\frac{\sqrt{m!}}{i!\,(m-i)!!}$. Equation (6) follows by transforming $\underline{p}$ back to $\underline{q}$ using Equation (4) and comparing the expression for $\underline{q}$ with the explicit formula for the probabilists' Hermite polynomials (see [13], Equation (18.5.13)). □
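Finally, the two factorizations used in this proof can be checked numerically under the sign convention reconstructed above; the sketch below (ours) verifies $\mathbf{N} = \mathbf{U}^T\mathbf{U}$ and $\mathbf{M} = \mathbf{U}^T\Lambda\mathbf{U}$ for $m = 5$.

```python
# A sketch checking N = U^T U and M = U^T diag(-1, 0, 1, ..., m-1) U.
import math
import numpy as np

def dfact(x):
    if x == -1:
        return 1
    if x == -3:
        return -1
    out = 1
    while x > 0:
        out, x = out * x, x - 2
    return out

m = 5
idx = range(m + 1)
gamma = lambda i, j: (i + j + 1) % 2
N = np.array([[gamma(i, j) * dfact(i + j - 1) for j in idx] for i in idx], float)
M = np.array([[gamma(i, j) * (i - 1) * (j - 1) * dfact(i + j - 3) for j in idx] for i in idx], float)
U = np.array([[gamma(i, j) * math.factorial(j) / (dfact(j - i) * math.sqrt(math.factorial(i)))
               if j >= i else 0.0 for j in idx] for i in idx])
Lam = np.diag(np.arange(m + 1) - 1.0)          # diag(-1, 0, 1, ..., m-1)
assert np.allclose(U.T @ U, N)
assert np.allclose(U.T @ Lam @ U, M)
```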

References

  1. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 1989, 2, 303–314.
  2. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
  3. Dey, S.S.; Wang, G.; Xie, Y. An Approximation Algorithm for training One-Node ReLU Neural Network. arXiv 2018, arXiv:1810.03592.
  4. Ma, L.; Khorasani, K. Constructive feedforward neural networks using Hermite polynomial activation functions. IEEE Trans. Neural Netw. 2005, 16, 821–833.
  5. Kuri-Morales, A. Closed determination of the number of neurons in the hidden layer of a multi-layered perceptron network. Soft Comput. 2017, 21, 597–609.
  6. Arora, R.; Basu, A.; Mianjy, P.; Mukherjee, A. Understanding Deep Neural Networks with Rectified Linear Units. In Proceedings of the 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, 30 April–3 May 2018.
  7. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: New York, NY, USA, 1999.
  8. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for activation functions. arXiv 2017, arXiv:1710.05941.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  10. Eaton, M.L. Chapter 7: Random orthogonal matrices. In Group Invariance Applications in Statistics; Regional Conference Series in Probability and Statistics; Institute of Mathematical Statistics and American Statistical Association: Hayward, CA, USA; Alexandria, VA, USA, 1989; Volume 1, pp. 100–107.
  11. Isserlis, L. On a formula for the product-moment coefficient of any order of a normal frequency distribution in any number of variables. Biometrika 1918, 12, 134–139.
  12. Zhao, F. Moments of the multivariate Beta distribution. arXiv 2020, arXiv:2007.02541.
  13. NIST Digital Library of Mathematical Functions. Release 1.0.25 of 2019-12-15. Available online: http://dlmf.nist.gov/ (accessed on 5 September 2021).
Figure 1. The structure of a one-node neural network. The activation function $\sigma$ takes the form $\sigma(z) = z + \epsilon\,\xi(z)$.
Figure 2. Illustration of the linearly decreasing property of $C[\xi]$. The dotted red line represents the theoretical lower bound $-(1-\frac{k}{n})(m-1)$, while the dotted blue line is the least-squares fit of the empirical results for polynomials with only the highest-degree term.