Next Article in Journal
Validation of HiG-Flow Software for Simulating Two-Phase Flows with a 3D Geometric Volume of Fluid Algorithm
Previous Article in Journal
Enhanced Non-Maximum Suppression for the Detection of Steel Surface Defects
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Nonparametric Estimation for High-Dimensional Space Models Based on a Deep Neural Network

School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3899; https://doi.org/10.3390/math11183899
Submission received: 5 August 2023 / Revised: 10 September 2023 / Accepted: 12 September 2023 / Published: 13 September 2023

Abstract

:
With high dimensionality and dependence in spatial data, traditional parametric methods suffer from the curse of dimensionality problem. The theoretical properties of deep neural network estimation methods for high-dimensional spatial models with dependence and heterogeneity have been investigated only in a few studies. In this paper, we propose a deep neural network with a ReLU activation function to estimate unknown trend components, considering both spatial dependence and heterogeneity. We prove the compatibility of the estimated components under spatial dependence conditions and provide an upper bound for the mean squared error ( M S E ). Simulations and empirical studies demonstrate that the convergence speed of neural network methods is significantly better than that of local linear methods.

1. Introduction

Spatial data arise in many fields, including environmental science, econometrics, epidemiology, image analysis, oceanography, geography, geology, plant ecology, archaeology, agriculture, and psychology. Spatial correlation and spatial heterogeneity are two significant features of spatial data. Various spatial modeling methods have been applied to explore the effect of spatial heterogeneity. Notably, numerous local spatial techniques have been proposed to accommodate spatial heterogeneity. For example, Hallin et al. [1] and Biau and Cadre [2] proposed a local linear method for the modeling of spatial heterogeneity. Bentsen et al. [3] used a graph neural network architecture to extract spatial dependencies with different update functions to learn temporal correlations.
Spatial data often exhibit high dimensionality, a large scale, heterogeneity, and strong complexity. These challenges often make traditional statistical methods ineffective. Statistical machine learning methods can effectively address such challenges. Du et al. [4] pointed out the issues of traditional spatial data being large-scale and generally complex and summarized the effectiveness and application potential of four advanced machine learning methods—support vector machine (SVM)-based kernel learning, semi-supervised and active learning, ensemble learning, and deep learning—in handling complex spatial data. Farrell et al. [5] highlighted the challenges that high-dimensional spatial data, large volumes of data, and multicollinearity among covariates pose to traditional statistical models in variable selection. Three machine learning algorithms—maximum entropy (MaxEnt), random forests (RFs), and support vector machines (SVMs)—were employed to mitigate the issues of multicollinearity in high-dimensional spatial data. Nikparvar et al. [6] pointed out that the properties of spatially explicit data are often ignored or inadequately addressed in machine learning applications within spatial domains. They argued that the future prospects for spatial machine learning are very promising.
Statistical machine learning methods have advanced rapidly, while theoretical ones are not well established. Schmidt-Hieber [7] investigated the following nonparametric model:
Y i = f 0 ( X i ) + ϵ i , i = 1 , , n ,
where the noise variables ϵ i are assumed to be i.i.d., X i [ 0 , 1 ] d , i = 1 , , n , are independently and identically distributed, Y i R , i = 1 , , n and are independently and identically distributed. It was shown that estimators based on sparsely connected deep neural networks with the ReLU activation function and a properly chosen network architecture achieve minimax rates of convergence (up to log n-factors) under a general composition assumption on the regression function.
Considering the dependencies and heterogeneity of spatial models, we study nonparametric high-dimensional spatial models as follows:
Y i = m X i + R i ,
where i Λ n = i 1 , i 2 , i N : i j = 1 , 2 , , n j , j = 1 , 2 , N , Y i R , X i R d , m X i represents the trend function, R i satisfies the α -mixing condition (the definition of the α -mixing condition can be found at the beginning of Section 2.2), n = n 1 + + n N . For example, Y denotes the hourly ozone concentrations; X is a 5 × 1 vector that consists of the following explanatory variables observed at each station: wind speed, air pressure, air temperature, relative humidity, and elevation. The observation locations are recorded in longitude i 1 and latitude i 2 . In this case, d = 5 and N = 2 ; see [8].
Under general assumptions, we prove the consistency of the estimator and provide bounds for the mean squared error M S E . In the simulation aspect, a comparison with the local linear regression method demonstrates that the neural network method converges much faster than does the local linear regression. In the empirical study, considering the air pollution index, air pollutants, and environmental factors, the effectiveness of the neural network is demonstrated through a comparison with the local linear regression method, especially in small sample cases.
Throughout the rest of the paper, bold letters are used to represent vectors; for example, x : = x 1 , , x d . We define | x | : = max i x i , | x | 0 : = i I x i 0 , where I represents the indicator function:
I A ( w ) = 0 if w A , 1 if w A ,
| x | 0 denotes the total number of x i , which is not equal to zero. | x | p : = i = 1 d x i p 1 / p , and we write | f | p : = | f | L p ( D ) as the L p norm on D; D is some domain, and different situations may be different. For two sequences a n n and b n n , we write a n b n if there exists a constant C such that for all n, a n C b n . Moreover, a n b n means a n b n and b n a n . log 2 denotes the logarithm base 2, ln denotes the logarithm base e, x represents the smallest integer x , and x represents the largest integer x .

2. Nonparametric High-Dimensional Space Model Estimation

2.1. Mathematical Modeling of Deep Network Features

Definition 1. 
Fitting a multilayer neural network requires the choice of an activation function σ : R R and the network architecture. Motivated by its importance in deep learning, we study the rectifier linear unit (ReLU) activation function; see [7].
σ ( x ) = max ( x , 0 ) .
For v = v 1 , , v r R r , the displacement activation function σ v : R r R r is defined as follows:
σ v : R r R r ,
σ v y 1 y r = σ y 1 v 1 σ y r v r .
The neural network architecture ( L , p ) consists of a positive integer L known as the number of hidden layers or depth and a width vector p = p 0 , , p L + 1 N L + 2 . A neural network with the network structure ( L , p ) is any function of the following form:
f : R p 0 R p L + 1 , x f ( x ) = W L σ v L W L 1 σ v L 1 W 1 σ v 1 W 0 x ,
where W i is a p i + 1 × p i weight matrix and v i R p i is a displacement vector, where i = 1 , 2 , , L . Therefore, the network function is constructed by alternating matrix vector multiplications and the action of nonlinear activation functions σ v . In Equation (3), the shift vectors can also be omitted by considering the input as ( x , 1 ) and augmenting the weight matrices with an additional row and column. To fit the network to data generated by a d-dimensional nonparametric regression model, it is required to have p 0 = d and p L + 1 = 1 .
Given a network function f ( x ) = W L σ v L W L 1 σ v L 1 W 1 σ v 1 W 0 x , the network parameters are the elements of the matrices W j j = 0 , , L and the vectors v j j = 1 , , L . These parameters need to be estimated/learned from the data. In this context, “estimate” and “learn” can be used interchangeably, as the process of estimating the parameters from data is often referred to as learning in the context of neural networks and machine learning.
The purpose of this paper is to consider a framework that encompasses the fundamental characteristics of modern deep network architectures. In particular, in this paper, we allow for a large depth L and a significant number of potential network parameters without requiring an upper bound on the number of network parameters for the main results. Consequently, this approach deals with high-dimensional settings that have more parameters than training data. Another characteristic of trained networks is that the learned network parameters are typically not very large; see [7]. In practice, the weights of trained networks often do not differ significantly from the initialized weights. As all elements in orthogonal matrices are bounded by 1, the weights of trained networks also do not become excessively large. However, existing theoretical results often demand that the size of the network parameters tends to infinity. To be more consistent with what is observed in practice, all parameters considered in this paper are bounded by 1. By projecting the network parameters at each iteration onto the interval [ 1 , 1 ] , this constraint can be easily incorporated into deep learning algorithms.
Let W j denote the maximum element norm of W j , and let us consider the network function space with a given network structure and network parameters bounded by 1 as follows:
F ( L , p ) : = f in the form of ( 2.1 ) : max j = 0 , , L W j v j 1 ,
where v 0 is a vector with all components being 0.
In this work, we model the network sparsity assuming that there are only a few nonzero/active network parameters. If | | W j | | 0 denotes the number of nonzero entries of W j and | | | f | | | stands for the supnorm of the function x | f ( x ) | , then the s-sparse networks are given by
F ( L , p , s ) : = F ( L , p , s , F ) : = f F ( L , p ) : j = 0 L W j 0 + v j 0 s , | f | F ,
where F is a constant; the upper bound on the uniform norm of the function f is often unnecessary and is thus omitted in the notation. Here, we consider cases where the number of network parameters s is very small compared to the total number of parameters in the network.
For any estimate m ^ n that returns a network in class F ( L , p , s , F ) , the corresponding quantity is defined as follows:
Δ n m ^ n , m : = E m 1 n i Λ n Y i m ^ n X i 2 inf f F ( L , p , s , F ) 1 n i Λ n Y i f X i 2 .
The sequence Δ n m ^ n , m measures the discrepancy between the expected empirical risk of m ^ n and the global minimum of this class over all networks. The subscript m in E m indicates the sample expectation with respect to the nonparametric regression model generated by the regression function m. Notice that Δ n m ^ n , m 0 , and Δ n m ^ n , m = 0 if m ^ n is an empirical risk minimizer.
Therefore, Δ n m ^ n , m is a critical quantity that, together with the minimax estimation rates, determines the convergence rate of m ^ n .
To evaluate the statistical performance of m ^ n under general assumptions, the mean squared error of the estimator is defined as
R m ^ n , m : = E m m ^ n ( X ) m ( X ) 2 .

2.2. Estimation and Theoretical Properties

In order to obtain asymptotic results, we will assume throughout this paper that X i , i Λ n satisfies the following α -mixing condition: there exists a function φ ( t ) as t with φ ( 0 ) = 1 , such that whenever Ξ , ( Ξ ) ˜ Λ n are finite sets, it is the case that
τ ( B ( Ξ ) , B ( Ξ ) ˜ ) = s u p { | P r ( A B ) P r ( B ) | A B ( ( Ξ ) ) , B B ( Ξ ) ˜ } ψ ( c a r d ( Ξ ) , ( Ξ ) ˜ ) φ ( d ( Ξ , Ξ ) ) ,
where B ( Ξ ) denotes the Borel σ -field generated by { X i , i Ξ } , card ( Ξ ) is the cardinality of Ξ , and d( Ξ , ( Ξ ) ˜ ) = m i n { | | i i | | : i Ξ , i ( Ξ ) ˜ } is the distance between Ξ and ( Ξ ) ˜ , where | | i | | = ( i 1 2 + + i N 2 ) 1 2 stands for the Euclidean norm and ψ : N 2 R + is a symmetric positive function that is nondecreasing in each variable; see [8].
The theoretical performance of neural networks depends on the underlying function class, and a classic approach in nonparametric statistics is to assume that the regression function is β -smooth. In this paper, we assume that the regression function m ( X i ) is a composition of multiple functions, i.e.,
m = g q g q 1 g 1 g 0 ,
where g i : a i , b i d i a i + 1 , b i + 1 d i + 1 . We denote the components of g i as g i = g i j j = 1 , , d i + 1 , and we let t i be the maximum variable that each g i depends on. Thus, each g i is a function with t i variables.
If all partial derivatives up to order β of a function exist and are bounded, the β -th order partial derivatives are β - β Hölder, where β represents the largest integer strictly less than β . Then, the ball of β -Hölder functions with radius K is defined as follows:
C r β ( D , K ) = f : D R r R : α : | α | < β α f + α : | α | = β sup x , y D x y α f ( x ) α f ( y ) | x y | β β K ,
where we use multi-index notation, i.e., α = α 1 α r , where α = ( α 1 , , α r ) N r ; see [7].
We assume that each function g i j has Hölder smoothness β i . Since g i j is also a function of t i variables, g i j C t i β i a i , b i t i , K i , the underlying function space is then defined as
G ( q , d , t , β , K ) : = m = g q g 0 : g i = g i j j : a i , b i d i a i + 1 , b i + 1 d i + 1 , g i j C t i β i a i , b i t i , K , a i , b i K ,
where d : = d 0 , , d q + 1 , t : = t 0 , , t q , β : = β 0 , , β q .
Theorem 1. 
We consider the nonparametric regression model with d variables for the composite regression function m = g q g q 1 g 1 g 0 in the class G ( q , d , t , β , K ) , as described in Equation (2). Let m ^ n be an estimator from the function class F L , p i i = 0 , , L + 1 , s , F satisfying the following conditions:
(1) 
F max ( K , 1 ) ,
(2) 
i = 0 q log 2 4 t i 4 β i log 2 n L n ϕ n ,
(3) 
n ϕ n min i = 1 , , L p i ,
(4) 
s n ϕ n ln n ,
where ϕ n is a positive sequence; then, there exist constants C and C depending only on q , d , t , β , F , such that if Δ n m ^ n , m C ϕ n L ln 2 n , then
R m ^ n , m C ϕ n L ln 2 n ,
if Δ n m ^ n , m C ϕ n L ln 2 n , then
1 C Δ n m ^ n , m R m ^ n , m C Δ n m ^ n , m .
To minimize ϕ n L ln 2 n , let Δ n m ^ n , m C ϕ n ln 3 n ; then,
R m ^ n , m C ϕ n ln 3 n .
The convergence rate in Theorem 1 depends on ϕ n and Δ n m ^ n , m . The following reasoning shows that ϕ n serves as a lower bound for the supremum infimum estimation risk over this class. For any empirical risk minimizer, where the definition of the Δ n term becomes zero, the following corollary holds.
Corollary 1. 
Let m ˜ n arg min f F ( L , p , s , F ) i Λ n Y i m X i 2 be an empirical risk minimizer under the same conditions as in Theorem 1. There exists a constant C , depending only on q , d , t , β , F , such that
R m ˜ n , m C ϕ n L ln 2 n .
Condition (1) in Theorem 1 is very mild and states only that the network functions should have at least the same supremum norm as the regression function. From the other assumptions in Theorem 1, it becomes clear that there is a lot of flexibility in selecting a good network architecture as long as the number of active parameters s is taken to be in the right order.
In a fully connected network, the number of network parameters is i = 0 L p i p i + 1 . This implies that Theorem 1 requires a sparse network. More precisely, the network must have at least i = 1 L p i s completely inactive nodes, meaning that all incoming signals are zero. Condition (4) chooses s n ϕ n ln n to balance the mean squared error and variance. From the proof of this theorem (Appendix B), convergence rates for various orders of s can also be derived.
Deep learning excels over other methods only in the large sample regime. This suggests that the method may be adaptable to the underlying structures in the data. This may produce rapid convergence rates, but with larger constants or remainders, which can lead to relatively poor performance in small sample scenarios.
The proof of the risk bounds in Theorem 1 is based on the following oracle-type inequality.
Theorem 2. 
Let us consider the d-dimensional nonparametric regression model given by Equation (2) with an unknown regression function m, where F 1 and m F , let m ^ n be an arbitrary estimator taking values in the class F ( L , p , s , F ) , and let
Δ n m ^ n , m : = E m 1 n i Λ n Y i m ^ n X i 2 inf f F ( L , p , s , F ) 1 n i Λ n Y i f X i 2 ,
and for any ε ( 0 , 1 ] , there exists a constant C ε , depending only on ε, such that
τ ε , n : = C ε F 2 ( s + 1 ) ln n ( s + 1 ) L p 0 p L + 1 n ,
and we have
( 1 ε ) 2 Δ n m ^ n , m τ ε , n R m ^ n , m ( 1 + ε ) 2 inf f F ( L , p , s , F ) f m 2 + Δ n m ^ n , m + τ ε , n .
In the context of oracle-type inequalities, an increase in the number of layers can lead to a deterioration in the upper bound on the risk. In practice, it has also been observed that having too many layers can result in a decline in performance. We refer to Section 4.4 in He et al. [9] and He and Sun [10] for more details.
The proof relies on two specific properties of the ReLU activation function rather than other activation functions. The first property is its projection property, which is expressed as
σ σ = σ ,
where the composite of the ReLU activation function is considered, given that the foundation of approximation theory lies in constructing smaller networks to perform simpler tasks, which may not all require the same network depth. To combine these subnetworks, it is necessary to synchronize the network depth by adding hidden layers that do not alter the output. This can be achieved by selecting weight matrices in the network (assuming an equal width for consecutive layers) and by utilizing the projection property of the ReLU activation function, given by σ σ = σ . This property is beneficial not only theoretically but also in practice, as it greatly aids in passing a result to deeper layers through skip connections.
Next, we prove that ϕ n serves as a lower bound for the supremum infimum estimation risk over class G ( q , d , t , β , K ) with t i min d 0 , , d i 1 . This means that in the composition of functions, no additional dimensions are added at deeper abstract layers. In particular, this approach avoids the case where t i exceeds the input dimension d 0 .
Theorem 3. 
Let us consider the nonparametric regression model (2), where X i is drawn from a distribution with a Lebesgue density on [ 0 , 1 ] d , and the lower and upper bounds of this distribution are positive constants. For any nonnegative integer q, arbitrary dimension vectors d and t , and for all i such that t i min d 0 , , d i 1 , and any smoothness vector β, along with all sufficiently large constants K > 0 , there exists a positive constant c such that
inf m ^ n sup m G ( q , d , t , β , K ) R m ^ n , m c ϕ n ,
where inf is taken over all estimators m ^ n .
By combining the supremum infimum lower bound with the oracle-type inequality, we can easily obtain the following result.
Lemma 1. 
Given β , K > 0 , and d N , there exist constants c 1 , c 2 depending only on β , K , d , such that for ε c 2 , we have
s c 1 ε d / β L ln ( 1 / ε ) ,
and then, for any width vector p , where p 0 = d and p L + 1 = 1 , we know that
sup m C d β [ 0 , 1 ] d , K inf f F ( L , p , s ) f m ε .

2.3. Suboptimality of Wavelet Series Estimation

In this section, we show that wavelet series estimators are unable to take advantage of the underlying composition structure in the regression function and achieve, in some setups, much slower convergence rates. Wavelet estimation is susceptible to the curse of dimensionality, whereas neural networks can achieve faster convergence rates.
We consider a compressed wavelet system ψ λ , λ Λ , restricted to L 2 [ 0 , 1 ] from L 2 ( R ) , as referred to in Cohen et al. [11]. Here, Λ = ( j , k ) : j = 1 , 0 , 1 , ; k I j , and ψ 1 , k : = ϕ ( · k ) denotes the shift-scaled function. For any function f L 2 [ 0 , 1 ] d , we have
f ( x ) = λ 1 , , λ d Λ × × Λ d λ 1 λ d ( f ) r = 1 d ψ λ r x r ,
and the convergence on L 2 [ 0 , 1 ] entails wavelet coefficients.
d λ 1 λ d ( f ) : = f ( x ) r = 1 d ψ λ r x r d x .
To construct a counterexample, it is sufficient to consider the nonparametric regression model Y i = m X i + R i , i Λ n = i 1 , i 2 , , i N : i j = 1 , 2 , , n j , j = 1 , 2 , , N , X i : = U i , 1 , , U i , d U [ 0 , 1 ] d . The empirical wavelet coefficients are obtained; furthermore,
d ^ λ 1 λ d f 0 = 1 n i Λ n Y i r = 1 d ψ λ r U i , r .
Since E d ^ λ 1 λ d f 0 = d λ 1 λ d f 0 , an unbiased estimate for the wavelet coefficients is obtained; furthermore,
Var d ^ λ 1 λ d f 0 = 1 n Var Y 1 r = 1 d ψ λ r U 1 , r 1 n E Var Y 1 r = 1 d ψ λ r U 1 , r U 1 , 1 , , , U 1 , d = 1 n .
We study the estimators of the following form
f ^ n ( x ) = λ 1 , , λ d I d ^ λ 1 λ d f 0 r = 1 d ψ λ r x r ,
and for any subset I Λ × × Λ , we have
R f ^ n , f 0 = λ 1 , , λ d I E d ^ λ 1 λ d f 0 d λ 1 λ d f 0 2 + λ 1 , , λ d I c d λ 1 λ d f 0 2 λ 1 , , λ d Λ × × Λ 1 n d λ 1 λ d f 0 2 .
ψ L 2 ( R ) possesses compact support; thus, without loss of generality, we assume that ψ is zero outside 0 , 2 q for some integer q > 0 .
Lemma 2. 
For any integer q > 0 , ν : = log 2 d + 1 , and any 0 < α 1 and K > 0 , there exists a nonzero constant c ( ψ , d ) , which depends solely on d and the properties of the wavelet function ψ. Thus, for any j, we can find a function f j , α ( x ) = h j , α x 1 + + x d , where h j , α C 1 α ( [ 0 , d ] , K ) , such that for all p 1 , , p d 0 , 1 , , 2 j q ν 1 , we have
d j , 2 q + ν p 1 j , 2 q + ν p d f j , α = c ( ψ , d ) K 2 j 2 ( 2 α + d ) .
Theorem 4. 
If m ^ n represents a wavelet estimator with compact support ψ and an arbitrary index set I,
m ^ n ( x ) = λ 1 , , λ d I d ^ λ 1 λ d f 0 r = 1 d ψ λ r x r ,
therefore, for any 0 < α 1 and any Hölder radius K > 0 , we have
sup m ( x ) = h r = 1 d x r , h C 1 α ( [ 0 , d ] , K ) R m ^ n , m n 2 α 2 α + d .
As a result, the convergence rate of the wavelet series estimation is slower than n 2 α / ( 2 α + d ) . If d is large, this rate becomes significantly slower. Therefore, wavelet estimation is sensitive to the curse of dimensionality, while neural networks can achieve rapid convergence.

3. Simulation Experiments and Case Study

3.1. Simulation Experiments

In this section, we conduct a comparative study through 100 repeated experimental simulations, evaluating the mean squared error of the estimation using both the local linear regression method and deep neural networks with the ReLU activation function. We consider the following models.
Model 1:
Y i = 2 + 2 X i β + e i ,
where X i follows a uniform distribution on [ 1 , 1 ] , β is the coefficient, simulated from a uniform distribution on [ 0 , 1 ] , and e i is the noise variable following a standard normal distribution.
A neural network with a depth of 3 and a width of 32 was created, where the first two layers are fully connected layers, and the output layer uses the ReLU activation function.
We consider four scenarios with the same sample size but different dimensions.
Scenario 1 : dim = 3 , n = [ 200 , 600 , 1000 , 1400 , 1800 , 2200 ] ;
Scenario 2 : dim = 5 , n = [ 200 , 600 , 1000 , 1400 , 1800 , 2200 ] ;
Scenario 3 : dim = 8 , n = [ 200 , 600 , 1000 , 1400 , 1800 , 2200 ] ;
Scenario 4 : dim = 10 , n = [ 200 , 600 , 1000 , 1400 , 1800 , 2200 ] ;
where dim represents the dimension and n is the sample size. For each scenario, the mean squared error ( M S E ) is calculated to compare the performance of the local linear regression method and the deep neural network method. The M S E in this study is defined as follows:
M S E = 1 n i = 1 n ( y ( i ) y ^ ( i ) ) 2 .
Table 1 presents the estimates at different values of n.
In this table, N E represents the M S E of the nonparametric estimation method, and D N N represents the M S E of the deep neural network method. It is evident that as M S E approaches 0, the estimation accuracy increases. From Table 1, we observe that for the same dimension, as the sample size increases, D N N tends towards 0. For the same sample size, D N N is significantly smaller than N E , and with the increase in dimension, the superiority of the neural network method over the local linear regression method becomes more pronounced. Therefore, the neural network method achieves much higher estimation accuracy, especially for large sample sizes and high dimensions.
As shown in Figure 1, with the increase in dimension, the performance of the neural network fitting surpasses the local linear regression method, where the x-axis represents the sample size n, and the y-axis represents the mean squared error ( M S E ). In higher dimensions, the M S E of the neural network method approaches almost zero, which is attributed to the avoidance of the curse of dimensionality by deep neural networks. This demonstrates that the convergence rate of the deep neural network is superior to that of the local linear regression method and approaches the optimal convergence rate.
Next, we consider high-dimensional spatial models with dependency structures to compare the M S E s of the two methods mentioned above.
Model 2:
Y i , j = 2 sin ( 2 π X i , j ) + R i , j ,
where i = 1 , , n 1 , j = 1 , , n 2 , and X i , j follows a zero-mean second-order stationary process. R i , j follows a standard normal distribution. Similarly to the work of Cressie and Wikle [12], high-dimensional spatial processes X i , j are generated using spectral methods.
X i , j = ( 2 / M ) 1 / 2 k = 1 M cos ( w ( 1 , k ) i + w ( 2 , k ) j + r ( k ) ) .
In this case, w ( i , k ) for i = 1 , 2 follows a standard normal distribution and is independent of r ( k ) for k = 1 , , M , where r ( k ) are independently and identically distributed from a uniform distribution on [ π , π ] . As n , X i , j converges to a Gaussian random process. We consider the case with dimension 5 and sample sizes [200, 600, 1000, 1400, 1800, 2200]. The network structure is the same as in Model 1.
As shown in Table 2, it can be observed that the convergence rate of the high-dimensional spatial model with dependence is worse than the convergence rate of the high-dimensional spatial model without dependence. However, in comparison to the local linear regression method, the M S E values of the neural network are much smaller, indicating that the neural network achieves better convergence performance.
As shown in Figure 2, we see that in the case of large sample sizes and high-dimensional spatial models, the neural network achieves superior convergence compared to that of the local linear regression method, where the x-axis represents the sample size n, and the y-axis represents the mean squared error ( M S E ).

3.2. Case Study

To compare the consistency of the local linear regression method and deep neural network for high-dimensional spatial models, we consider the relationship between air pollution and respiratory diseases in the New Territories East of Hong Kong from 1 January 2000 to 15 January 2001, as studied by Wang et al. [13]. There is a dataset consisting of 821 observations, where we mainly consider the air pollution index X 1 and five pollutants, sulfur dioxide ( g / m 3 ) X 2 , inhalable particulate matter ( g / m 3 ) X 3 , nitrogen compounds ( g / m 3 ) X 4 , nitrogen dioxide ( g / m 3 ) X 5 , and ozone ( g / m 3 ) X 6 , as well as two environmental factors: temperature ( ° C) X 7 and relative humidity (%) X 8 . In this section, we examine the relationship between the levels of chemical pollutants in the New Territories East of Hong Kong and the daily hospital admissions for respiratory diseases (Y). The specific parameter settings are the same as those in the numerical simulation part.
After dimensionless and standardization processing, we use 397 data points for the case study. Among these, 80% of the data are used to train the model, and the remaining 20% are used to evaluate the quality of the trained model. The M S E values are shown in Table 3. It can be observed that the M S E 2 values are closer to 0, indicating that in real-world cases, the mean squared error of the deep neural network method is much smaller than that of the local linear regression method. Therefore, the deep neural network shows better convergence performance.
Figure 3 presents a visual representation of the M S E values for both methods, where the x-axis represents the sample size n, and the y-axis represents the mean squared error ( M S E ). From the graph, it is evident that the deep neural network method exhibits a faster convergence rate compared to that of the local linear regression method.

4. Conclusions

In this study, we employ neural networks with ReLU activation functions for nonparametric estimation in high-dimensional spaces. By constructing suitable network architectures, we estimate unknown trend functions and prove the consistency of the estimators while also comparing and analyzing the deep neural network approach with traditional nonparametric methods.
The focus is on high-dimensional space models with unknown error distributions. Considering the spatial dependencies and heterogeneity in the space models, a deep neural network with ReLU activation functions is used to estimate the unknown trend functions. Under general assumptions, the consistency of the estimators is established, and bounds for the mean squared error ( M S E ) are provided. The estimators exhibit a convergence rate that is related to the sample size but independent of the dimensionality d, thereby avoiding the curse of dimensionality. Moreover, the proposed estimators achieve convergence speeds close to optimality.
Considering the spatial dependencies in high-dimensional settings with large sample sizes, the deep neural network method outperforms traditional nonparametric estimation methods.

Author Contributions

Methodology, H.W.; software & writing, X.J.; editing, H.H.; review J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Social Science Fund of China, grant number 22BTJ021.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Embedding Property of Network Function Classes

To approximate functions by using neural networks, we first construct smaller networks to compute simpler objects. Let p = p 0 , , p L + 1 and p = p 0 , , p L + 1 . To merge networks, the following rules are commonly used in this paper.
Enlargement: F ( L , p , s ) F L , q , s , p q and s s .
Composition: Let f F ( L , p ) and g F L , p , where p L + 1 = p 0 . For a vector v R p L + 1 , in the space F L + L + 1 , p 0 , , p L + 1 , p 0 , , p L + 1 , i.e., F L + L + 1 , p , p 1 , , p L + 1 , the composition network g σ v ( f ) is defined.
Additional Layer/Depth Synchronization: To synchronize the number of hidden layers in two networks, an additional layer can be added with a unit weight matrix, such that
F ( L , p , s ) F L + q , ( p 0 , , p 0 q , p ) , s + q p 0 .
Parallelization: Let f and g be two networks with the same number of hidden layers and the same input dimension, i.e., f F ( L , p ) and g F L , p , where p 0 = p 0 . The parallel network ( f , g ) computes both f and g simultaneously in the joint network class F ( L , ( p 0 , p 1 + p 1 , , p L + 1 + p L + 1 ) ) .
Removing Inactive Nodes: We have
F ( L , p , s ) = F L , p 0 , p 1 s , p 2 s , , p L s , p L + 1 , s .
In this context, we have p i s = m i n p i , s , i = 1 , , L . Let f ( x ) = W L σ v L W L 1 σ v 1 W 0 x F ( L , p , s ) . If all entries in the j-th column of W i are zero, we can remove this column along with the j-th row of W i 1 and the j-th element of v i without changing the function. This implies that f F L , p 0 , , p i 1 , p i 1 , p i + 1 , , p L + 1 , s . Since there are s active parameters, for any i = 1 , , L , we need to iterate at least p i s times. This proves that f F L , p 0 , p 1 s , p 2 s , , p L s , p L + 1 , s .
In this paper, we often utilize the following fact. For a fully connected network in F ( L , p ) , there are = 0 L p p + 1 weight matrix parameters and = 1 L p network parameters from bias vectors. Therefore, the total number of parameters is
= 0 L p + 1 p + 1 p L + 1 .

Appendix B. Approximation by Polynomial Neural Networks

We construct a network with all parameters bounded by 1 to approximate the calculation of x y for given inputs x and y. Let T k : 0 , 2 2 2 k 0 , 2 2 k , where k is a positive integer.
T k ( x ) : = ( x / 2 ) 2 1 2 k x / 2 = ( x / 2 ) + x 2 1 2 k + ,
and R k : [ 0 , 1 ] 0 , 2 2 k , where
R k : = T k T k 1 T 1 .
Next, we prove that k = 1 m R k ( x ) converges exponentially to x ( 1 x ) as m increases, especially in L [ 0 , 1 ] : x ( 1 x ) = k = 1 R k ( x ) . This lemma can be seen as a variation of Lemma 2.4 in Telgarsky’s work [14] and Proposition 2 in Yarotsky’s work [15]. Compared to existing results, this result allows us to construct networks with parameters equal to 1 and provides an explicit bound on the approximation error.
Lemma A1. 
For any positive integer m,
x ( 1 x ) k = 1 m R k ( x ) 2 m .
Proof. 
Step 1: We prove by induction that R k ( x ) is a triangular wave. More precisely, R k ( x ) is piecewise linear on the intervals / 2 k , ( + 1 ) / 2 k , where is an integer. If is odd, the endpoints are R k / 2 k = 2 2 k , and if is even, the endpoints are R k / 2 k = 0 .
When k = 1 , the equality R 1 = T 1 holds obviously.
We assume that the statement holds for k, and we let r be divisible by 4, denoted as r mod 4 . Consider x in the interval / 2 k + 1 , ( + 1 ) / 2 k + 1 . When 0 mod 4 , then R k ( x ) = 2 k x / 2 k + 1 . When 2 mod 4 , then R k ( x ) = 2 2 k 2 k x / 2 k + 1 . When 1 mod 4 , where + 1 2 mod 4 , then R k ( x ) = 2 k x / 2 k 2 2 k . When 3 mod 4 , then R k ( x ) = 2 2 k 2 k x / 2 k . Then, for k + 1 , the statement also holds.
R k + 1 ( x ) = T k + 1 R k ( x ) = R k ( x ) 2 I R k ( x ) 2 2 k 1 + 2 2 k 1 R k ( x ) 2 I R k ( x ) > 2 2 k 1 .
The statement holds for k + 1 , and the induction is complete.
Step 2: For convenience, let us denote g ( x ) = x ( 1 x ) . We now prove that for any m 1 and 0 , 1 , , 2 m , the following holds:
g 2 m = k = 1 m R k 2 m .
To prove this, we use mathematical induction on m. For m = 1 , when = 0 , we have g ( 0 ) = 0 and R 1 ( 0 ) = 0 . When = 1 , we have g ( 1 / 2 ) = 1 / 4 and R 1 ( 1 / 2 ) = 1 / 2 . When = 2 , we have g ( 1 ) = 0 and R 1 ( 1 ) = 0 . g 2 1 = R k 2 1 holds. Therefore, for the inductive step, assuming that it holds for m, when m + 1 is considered, if is even, then R m + 1 2 m 1 = 0 , which implies g 2 m 1 = k = 1 m R k 2 m 1 = k = 1 m + 1 R k 2 m 1 . If is odd, then the function x k = 1 m R k ( x ) is linear over the interval ( 1 ) 2 m 1 , ( + 1 ) 2 m 1 . Furthermore, for any t, we have
g ( x ) g ( x + t ) + g ( x t ) 2 = t 2 .
Since x = 2 m 1 and t = 2 m 1 , and considering such that R m + 1 2 m 1 = 2 2 m 2 , we can deduce
g 2 m 1 = 2 2 m 2 + k = 1 m R k 2 m 1 = k = 1 m + 1 R k 2 m 1 ,
and the result also holds for m + 1 , completing the induction.
Thus, the interpolation of k = 1 m R k ( x ) at the point 2 m with function g has been proven, and it is linear over the interval 2 m , ( + 1 ) 2 m . Let k = 1 m R k ( x ) = y ; g is a Lipschitz function with Lipschitz constant 1. Therefore, for any x, there exists an determined by
g ( 2 m ( + 1 ) ) y 2 m ( + 1 ) x = y g ( 2 m ) x 2 m ,
and we have
y = 2 m g ( 2 m ( + 1 ) ) x g ( 2 m ( + 1 ) ) ( + 1 ) g ( 2 m ) + 2 m x g ( 2 m ) ,
which implies
g ( x ) k = 1 m R k ( x ) = g ( x ) 2 m x g ( + 1 ) 2 m + 1 2 m x g 2 m 2 m ,
thus proving the lemma.
Let g ( x ) = x ( 1 x ) . As proven above, to construct a network that takes inputs x and y and approximates the product x y , we use polar-type identities.
g x y + 1 2 g x + y 2 + x + y 2 1 4 = x y .
Lemma A2. 
For any positive integer m, there exists a network M u l t m F ( m + 4 , ( 2 , 6 , 6 , , 6 , 1 ) ) such that Mult m ( x , y ) [ 0 , 1 ] for all x , y [ 0 , 1 ] , and
Mult m ( x , y ) x y 2 m ,
and Mult m ( 0 , y ) = Mult m ( x , 0 ) = 0 .
Proof. 
Let T k ( x ) = ( x / 2 ) + x 2 1 2 k + = T + ( x ) T k ( x ) , where T + ( x ) = ( x / 2 ) + and T k ( x ) = x 2 1 2 k + . Consider a nonnegative function h : [ 0 , 1 ] [ 0 , ) .
Step 1: We prove the existence of a network N m with m hidden layers and width vector ( 3 , 3 , , 3 , 1 ) to compute this function:
T + ( u ) , T 1 ( u ) , h ( u ) k = 1 m + 1 R k ( u ) + h ( u ) ,
for all u [ 0 , 1 ] , as shown in Figure A1; it is worth noting that all parameters in this network are bounded by 1.
Figure A1. Network T + ( u ) , T 1 ( u ) , h ( u ) k = 1 m + 1 R k ( u ) + h ( u ) .
Figure A1. Network T + ( u ) , T 1 ( u ) , h ( u ) k = 1 m + 1 R k ( u ) + h ( u ) .
Mathematics 11 03899 g0a1
Step 2: We prove the existence of a network with m hidden layers to compute the following functions:
( x , y ) k = 1 m + 1 R k x y + 1 2 k = 1 m + 1 R k x + y 2 + x + y 2 1 4 + 1 .
Given the input ( x , y ) , the computation of this network in the first layer is as follows:
T + x y + 1 2 , T 1 x y + 1 2 , x + y 2 + , T + x + y 2 , T 1 x + y 2 , 1 4 .
Applying the network N m on the first three elements and the last three elements, we obtain a network with m + 1 hidden layers and a width vector of ( 2 , 6 , , 6 , 2 ) , and we compute
( x , y ) k = 1 m + 1 R k x y + 1 2 + x + y 2 , k = 1 m + 1 R k x + y 2 + 1 4 ,
applying the two-hidden-layer network ( u , v ) ( 1 ( 1 ( u v ) ) + ) + = ( u v ) + 1 to the output. Therefore, the composite network Mult m ( x , y ) has m + 4 hidden layers and computes
( x , y ) k = 1 m + 1 R k x y + 1 2 k = 1 m + 1 R k x + y 2 + x + y 2 1 4 + 1 ,
and this implies that the output is always in the interval [ 0 , 1 ] . According to Equation (A4) and Lemma A1, we can obtain the following:
Mult m ( x , y ) x y = k = 1 m + 1 R k x y + 1 2 k = 1 m + 1 R k x + y 2 + x + y 2 1 4 g x y + 1 2 + g x + y 2 x + y 2 + 1 4 k = 1 m + 1 R k x y + 1 2 g x y + 1 2 + k = 1 m + 1 R k x + y 2 g x + y 2 2 m 1 + 2 m 1 2 m .
For all 0 u 1 , we have R 1 ( ( 1 u ) / 2 ) = R 1 ( ( 1 + u ) / 2 ) and R 2 ( ( 1 + u ) / 2 ) = R 2 ( u / 2 ) . Therefore, when k is odd, R k ( ( 1 u ) / 2 ) = R k ( ( 1 + u ) / 2 ) ; when k is even, R k ( ( 1 + u ) / 2 ) = R k ( u / 2 ) . For all input pairs ( 0 , y ) and ( 0 , x ) , the output in Equation (A5) becomes zero. □
Lemma A3. 
For any positive integer m, there exists a network
Mult m r F ( m + 5 ) log 2 r , ( r , 6 r , 6 r , , 6 r , 1 ) ,
such that Mult m r [ 0 , 1 ] , for all x = ( x 1 , , x r ) [ 0 , 1 ] r , and we have
Mult m r ( x ) i = 1 r x i r 2 2 m .
Furthermore, if one of the components of x is 0, then Mult m r ( x ) = 0 .
Proof. 
Let q : = log 2 ( r ) . We now construct the network Mult m r and perform calculations in the first hidden layer.
x 1 , , x r ( x 1 , , x r , 1 , , 1 2 q r ) .
We apply the network Mult m from Lemma A2 to each pair x 1 , x 2 , x 3 , x 4 , , ( 1 , 1 ) to compute Mult m x 1 , x 2 , Mult m x 3 , x 4 , , and Mult m ( 1 , 1 ) R 2 q 1 . Now, we pair adjacent terms and apply Mult m again. We continue this process until only one term remains. The resulting network is denoted as Mult m r , which has q ( m + 5 ) hidden layers, and all parameters are bounded by 1.
If a , b , c , d [ 0 , 1 ] , then by Lemma A2 and the triangle inequality, we have
Mult m ( a , b ) c d Mult m ( a , b ) a b | + | a b c d | 2 m + a c | + | b d ;
therefore, by iteration and induction, we obtain
Mult m r ( x ) i = 1 r x i 3 q 1 2 m 3 log 2 ( r ) 1 2 m 3 log 2 ( r ) 2 m 4 log 2 ( r ) 2 m r 2 2 m .
By using Lemma A2 and the construction described above, it is evident that if one of the components of x is 0, then Mult m r ( x ) = 0 .
We construct a sufficiently large network to approximate all monomials x 1 α 1 · · x r α r for nonnegative integers α i up to a certain specified degree. Typically, we use multi-index notation: x α : = x 1 α 1 · · x r α r , where α = α 1 , , α r and | α | : = = 1 r α represents the degree of the monomial.
The number of monomials with degrees satisfying | α | < γ is denoted by C r , γ , and since each α i takes values in { 0 , 1 , , γ } , we have C r , γ ( γ + 1 ) r . □
Lemma A4. 
For γ > 0 and positive integers m, there exists a network
Mon m , γ r F 1 + ( m + 5 ) log 2 ( γ 1 ) , r , 6 γ C r , γ , , 6 γ C r , γ , C r , γ ,
such that Mon m , γ r [ 0 , 1 ] C r , γ , and for all x [ 0 , 1 ] r , we have
Mon m , γ r ( x ) x α | α | < γ γ 2 2 m .
Proof. 
For | α | 1 , the monomials are either linear or constant functions. There exists a shallow network in the class F ( 1 , ( 1 , 1 , 1 ) ) that precisely represents the monomial x α .
Considering the multiplicities in Equation (A6), Lemma A3 can be directly extended to monomials. For | α | 2 , this implies that in the class
F ( m + 5 ) log 2 | α | , ( r , 6 | α | , , 6 | α | , 1 ) ,
there exists a network in the class that takes values within the interval [ 0 , 1 ] and approximates x α to a supnorm error of | α | 2 2 m . By utilizing the parallelization and depth synchronization properties discussed in Appendix B, the proof of Lemma A4 can be established.
Following the classical local Taylor approximation, previously used for network approximation by Yarotsky [15], for a vector a [ 0 , 1 ] r , we define
P a β f ( x ) = 0 | α | < β α f ( a ) ( x a ) α α ! .
According to Taylor’s theorem for multivariable functions, for an appropriate ξ [ 0 , 1 ] , we have
f ( x ) = α : | α | < β 1 α f ( a ) ( x a ) α α ! + β 1 | α | < β α f ( a + ξ ( x a ) ) ( x a ) α α ! .
We have ( x a ) α = i x i a i α i | x a | | α | . Therefore, for f C r β [ 0 , 1 ] r , K , we have
f ( x ) P a β f ( x ) β 1 | α | < β ( x a ) α α ! α f ( a + ξ ( x a ) ) α f ( a ) K | x a | β .
We can express Equation (A7) as a linear combination of monomials.
P a β f ( x ) = 0 | γ | < β x γ c γ ,
for suitable coefficients c γ , and, for convenience, the dependence on a in c γ is omitted here. Since γ P a β f ( x ) x = 0 = γ ! c γ , it follows that
c γ = γ α & | α | < β α f ( a ) ( a ) α γ γ ! ( α γ ) ! ,
and since a [ 0 , 1 ] r and f C r β [ 0 , 1 ] r , K , we have
c γ K / γ ! and γ 0 c γ K j = 1 r γ j 0 1 γ j ! = K e r .
We consider the grid points D ( M ) : = x = j / M j = 1 , , r : = 1 , , r { 0 , 1 , , M } r . The number of elements in this set is ( M + 1 ) r . Let x = x j represent the elements of X . We define
P β f ( x ) : = x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + .
Lemma A5. 
If f C r β [ 0 , 1 ] r , K , then P β f f L [ 0 , 1 ] r K M β .
Proof. 
For all x = x 1 , , x r [ 0 , 1 ] r , we have
x D ( M ) j = 1 r 1 M x j x j + = j = 1 r = 0 M 1 M x j / M + = 1 ,
and we use mathematical induction, assuming M = 1 . The left-hand side of (A11) is
1 x 1 0 + = 1 x 1 , 1 x 1 1 + = x 1 = 1 ,
and after summing, we obtain
( 1 x 1 ) + x 1 = 1 ,
while the middle-hand side, we have
1 x j 0 + + 1 x j 1 + = ( 1 x 1 ) + x 1 = 1 ,
and therefore the equation holds when M = 1 .
Assuming M = 2 , we have x 1 ( 0 , 1 2 ) , x 2 ( 1 2 , 1 ) , with x 1 = 1 M , 1 = 0 , 1 , 2 ; x 2 = 2 M , 2 = 0 , 1 , 2 . Then, the left-hand side is
1 2 x 1 0 2 + 1 2 x 2 0 2 + = 0 ,
1 2 x 1 0 + 1 2 x 2 1 2 + = ( 1 2 x 1 ) ( 1 2 x 2 + 1 ) = ( 1 2 x 1 ) ( 2 2 x 2 ) ,
1 2 x 1 0 + 1 2 x 2 1 + = ( 1 2 x 1 ) ( 1 2 + 2 x 2 ) = ( 1 2 x 1 ) ( 2 x 2 1 ) ,
1 2 x 1 1 2 + 1 2 x 2 0 + = 0 ,
1 2 x 1 1 2 + 1 2 x 2 1 2 + = 2 x 1 ( 2 2 x 2 ) ,
1 2 x 1 1 2 + 1 2 x 2 1 + = 2 x 1 ( 2 x 2 1 ) ,
1 2 x 1 1 + 1 2 x 2 0 + = 0 ,
1 2 x 1 1 + 1 2 x 2 1 2 + = 0 ,
1 2 x 1 1 + 1 2 x 2 1 + = 0 ,
and after summation, we obtain
( 1 2 x 1 ) ( 2 2 x 2 ) + ( 1 2 x 1 ) ( 2 x 2 1 ) + 2 x 1 ( 2 2 x 2 ) + 2 x 1 ( 2 x 2 1 ) = 1 .
In the middle, we have
j = 1 2 1 2 x j 0 + + 1 2 x j 1 2 + + 1 2 x j 1 + = 1 2 x 1 0 + + 1 2 x 1 1 2 + + 1 2 x 1 1 + · 1 2 x 2 0 + + 1 2 x 2 1 2 + + 1 2 x 2 1 + = 1 2 x 1 + 2 x 1 2 2 x 2 + 2 x 2 1 = 1 ,
and, therefore, when M = 2 , the equation holds true.
Next, we calculate the second equation in (A11); when x j 0 , 1 M , we have
1 M x j 0 M + = 1 M x j ,
1 M x j 1 M + = 1 1 + M x j + = M x j ,
1 M x j 2 M + = 1 2 + M x j + = ( M x j 1 ) + = 0 ,
and when takes values 3 , 4 , , M , the sum above is zero, resulting in 1. By analogy, for x j k M , k + 1 M , where k = 0 , , M 1 , the same holds true. Therefore, we can deduce that j = 1 r = 0 M 1 M x j / M + = 1 .
By using f ( x ) = x D ( M ) : x x 1 / M f ( x ) j = 1 r 1 M x j x j + and Equation (A8), we obtain
P β f ( x ) f ( x ) = x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + x D ( M ) : x x < 1 / M f ( x ) j = 1 r 1 M x j x j + max x D ( M ) : x 1 / M P x β f ( x ) f ( x ) K x x β K M β .
Then, we describe how to construct a network that approximates P β f . □
Lemma A6. 
For any positive integers M and m, there exists a network
H a t r F 2 + ( m + 5 ) log 2 r , r , 6 r ( M + 1 ) r , , 6 r ( M + 1 ) r , ( M + 1 ) r , s , 1 , where s 49 r 2 ( M + 1 ) r 1 + ( m + 5 ) log 2 r , such that Hat r [ 0 , 1 ] ( M + 1 ) r , and for any x = x 1 , , x r [ 0 , 1 ] r , we have
Hat r ( x ) j = 1 r 1 / M x j x j + x D ( M ) r 2 2 m .
For any x D ( M ) , the support of the function x Hat r ( x ) x is contained within the support of the function x j = 1 r 1 / M x j x j + .
Proof. 
The first hidden layer uses 2 r ( M + 1 ) units and 4 r ( M + 1 ) nonzero parameters to compute the functions x j / M + and / M x j + . The second hidden layer uses r ( M + 1 ) units and 3 r ( M + 1 ) nonzero parameters to compute the function 1 / M x j / M + = ( 1 / M x j / M + / M x j + + . These functions take values in the interval [ 0 , 1 ] , and the result holds when r = 1 .
For r > 1 , we combine the obtained network with the network approximating the product j = 1 r 1 / M x j / M + . According to Lemma A3, there exists a network Mult m r in the following class:
F ( m + 5 ) log 2 r , ( r , 6 r , 6 r , , 6 r , 1 ) .
We compute j = 1 r 1 / M x j x j + with an error bounded by r 2 2 m . From Equation (A3), it follows that a Mult m r network has nonzero parameters
( 36 r 2 + 6 r ) ( 1 + ( ( m + 5 ) log 2 r ) 42 r 2 1 + ( m + 5 ) log 2 r ,
as a bound, and since these networks have ( M + 1 ) r parallel instances, each hidden layer requires 6 ( M + 1 ) r units and 42 r 2 ( M + 1 ) r 1 + ( m + 5 ) log 2 r nonzero parameters for multiplication operations. Adding the 7 ( M + 1 ) r nonzero parameters from the first two layers, the total bound on the number of nonzero parameters is
49 r 2 ( M + 1 ) r 1 + ( m + 5 ) log 2 r .
According to Lemma A3, if one of the components of x is zero, then Mult m r ( x ) = 0 . This implies that for any x D ( M ) , the support of the function x Hat r ( x ) x is contained within the support of the function x j = 1 r 1 / M x j x j + . □
Theorem A1. 
For any function f C r β [ 0 , 1 ] r , K and any integers m 1 and N ( β + 1 ) r ( K + 1 ) e r , there exists a network
f ˜ F ( L , ( r , 6 ( r + β ) N , , 6 ( r + β ) N , 1 ) , s , ) ,
with a depth of
L = 8 + ( m + 5 ) 1 + log 2 ( r β ) ,
and the number of parameters
s 141 ( r + β + 1 ) 3 + r N ( m + 6 ) ,
such that
f ˜ f L [ 0 , 1 ] r ( 2 K + 1 ) 1 + r 2 + β 2 6 r N 2 m + K 3 β N β r .
Proof. 
In this proof, all constructed networks take the form F ( L , p , s ) = F ( L , p , s , ) , where F = . Let M be the largest integer such that ( M + 1 ) r N , and we define L * : = ( m + 5 ) log 2 ( β r ) . With the help of Equations (A9) and (A10), and Lemma A4, we can add a hidden layer to the network Mon m , β r , resulting in a new network
Q 1 F 2 + L * , r , 6 β C r , β , , 6 β C r , β , C r , β , ( M + 1 ) r ,
such that Q 1 ( x ) [ 0 , 1 ] ( M + 1 ) r and for any x [ 0 , 1 ] r , we have
Q 1 ( x ) P x β f ( x ) B + 1 2 x D ( M ) β 2 2 m ,
where B : = 2 K e r and e is the natural logarithm. According to Equation (A3), the number of nonzero parameters in network Q 1 is bounded by 6 r ( β + 1 ) C r , β + 42 ( β + 1 ) 2 C r , β 2 ( L * + 1 ) + C r , β ( M + 1 ) r .
According to Lemma A6, the network Hat r calculates the product j = 1 r 1 / M x j x j + with an error bounded by r 2 2 m . It requires at most 49 r 2 N 1 + L * active parameters. Now, consider the parallel network Q 1 , Hat r . Based on the definition of C r , β and the assumption on N, we observe that C r , β ( β + 1 ) r N . According to Lemma A6, networks Q 1 and Hat r can be embedded into a joint network Q 1 , Hat r with 2 + L * hidden layers. The weight vector r , 6 ( r + β ) N , , 6 ( r + β ) N , 2 ( M + 1 ) r and all parameters are bounded by 1. By using C r , β ( M + 1 ) r N , the bound on the number of nonzero parameters in the combined network Q 1 , Hat r is
6 r ( β + 1 ) C r , β + 42 ( β + 1 ) 2 C r , β 2 L * + 1 + C r , β ( M + 1 ) r + 49 r 2 N 1 + L * 49 ( r + β + 1 ) 2 C r , β N 1 + L * 98 ( r + β + 1 ) 3 + r N ( m + 5 ) ,
where, for the last inequality, we use C r , β ( β + 1 ) r , the definition of L * , and the property that for any x 1 , we have 1 + log 2 ( x ) 2 + log 2 ( x ) 2 ( 1 + ln ( x ) ) 2 x .
Next, we pair the outputs of Q 1 and Hat r corresponding to the x term and apply the Mult m network described in Lemma A2 to each of the ( M + 1 ) r pairs. In the final layer, we sum all the terms together. According to Lemma A2, this requires at most 42 ( m + 5 ) ( M + 1 ) r + ( M + 1 ) r 43 ( m + 5 ) N active parameters for the total ( M + 1 ) r multiplications. By using Lemmas A2 and A6, Equation (A12), and the triangle inequality, we can construct a network
Q 2 F ( 3 + ( m + 5 ) ( 1 + log 2 ( β r ) , ( r , 6 ( r + β ) N , , 6 ( r + β ) N , 1 ) ,
such that, for any x [ 0 , 1 ] r , we have
Q 2 ( x ) x D ( M ) P x β f ( x ) B + 1 2 j = 1 r 1 M x j x j + x D ( M ) : x x 1 / M 1 + r 2 + β 2 2 m 1 + r 2 + β 2 2 r m ,
where the first inequality follows from the fact that the support of Hat r ( x ) x is contained within the support of j = 1 r 1 / M x j x j + , as stated in Lemma A6. Due to Equation (A3), the network Q 2 has at most
98 ( r + β + 1 ) 3 + r N ( m + 5 ) + 43 ( m + 5 ) N 141 ( r + β + 1 ) 3 + r N ( m + 5 ) .
To obtain the network reconstruction of the function f, it is necessary to apply scaling and shifting to the output terms. This is primarily due to the finite parameter weights in the network. We recall that B = 2 K e r . The network x B M r x belongs to the class F 3 , 1 , M r , 1 , 2 K e r , 1 , where the shift vectors v j are all zero and all entries of the weight matrices W j are equal to 1. Since N ( K + 1 ) e r , the number of parameters in this network is bounded by 2 M r + 2 2 K e r 6 N . This implies the existence of a network in the class F 4 , 1 , 2 , 2 M r , 2 , 2 2 K e r , 1 , which computes a B M r ( a c ) , where c : = 1 / 2 M r . This network computes ( a c ) + and ( c a ) + in the first hidden layer, and then applies the network x B M r x to these two units. In the output layer, the first value is subtracted from the second value. This requires at most 6 + 12 N active parameters.
Due to Equations (A11) and (A14), there exists a network Q 3 in the following class
F 8 + ( m + 5 ) 1 + log 2 ( r β ) , ( r , 6 ( r + β ) N , , 6 ( r + β ) N , 1 ) ,
and for all x [ 0 , 1 ] r , we have
Q 3 ( x ) x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + ( 2 K + 1 ) M r 1 + r 2 + β 2 ( 2 e ) r 2 m .
Under the condition of Equation (A15), the bound for the nonzero parameters of Q 3 is
141 ( r + β + 1 ) 3 + r N ( m + 6 ) .
By constructing ( M + 1 ) r N ( M + 2 ) r ( 3 M ) r , it follows that M β N β / r 3 β . Combined with Lemma A5, we have
f ˜ f L [ 0 , 1 ] r = Q 3 f ( x ) = Q 3 ( x ) x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + + x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + f ( x ) Q 3 ( x ) x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + + x D ( M ) P x β f ( x ) j = 1 r 1 M x j x j + f ( x ) ( 2 K + 1 ) 1 + r 2 + β 2 ( 2 e ) r M r 2 m + K M β ( 2 K + 1 ) 1 + r 2 + β 2 6 r N 2 m + K 3 β N β r .
Thus, the result is proven.
Based on Theorem A1, we can now construct a network that approximates f = g q g 0 . In the first step, we show that f can always be represented as a composition of functions defined on hypercubes [ 0 , 1 ] t i . As in the previous theorem, let g i j C t i β i a i , b i t i , K i , and we assume K i 1 for i = 1 , , q 1 . Define
h 0 : = g 0 2 K 0 + 1 2 , h i : = g i 2 K i 1 · K i 1 2 K i + 1 2 , h q = g q 2 K q 1 · K q 1 ,
where 2 K i 1 x K i 1 means applying the transformation 2 K i 1 x j K i 1 to all j. It is evident that
f = g q g 0 = h q h 0 .
From the definition of the Hölder ball C r β ( D , K ) , we can see that h 0 j takes values in the interval [ 0 , 1 ] .
h 0 j C t 0 β 0 [ 0 , 1 ] t 0 , 1 , h i j C t i β i [ 0 , 1 ] t i , 2 K i 1 β i ,
where, for i = 1 , , q 1 , we have h q j C t q β q [ 0 , 1 ] t q , K q 2 K q 1 β q . Without loss of generality, we can always assume that the radius of the Hölder ball is at least 1, i.e., K i 1 . □
Lemma A7. 
Let h i j be as defined above, with K i 1 . Then, for any function h ˜ i = h ˜ i j j T , where h ˜ i j : [ 0 , 1 ] t i [ 0 , 1 ] , we have
h q h 0 h ˜ q h ˜ 0 L [ 0 , 1 ] d K q = 0 q 1 2 K β + 1 i = 0 q h i h ˜ i L [ 0 , 1 ] d i = i + 1 β 1 .
Proof. 
Let H i = h i h 0 and H ˜ i = h ˜ i h ˜ 0 . If Q i is an upper bound of the Hölder seminorm of h i j for j = 1 , , d i + 1 , then, by the triangle inequality, we have
H i ( x ) H ˜ i ( x ) h i H i 1 ( x ) h i H ˜ i 1 ( x ) + h i H ˜ i 1 ( x ) h ˜ i H ˜ i 1 ( x ) Q i H i 1 ( x ) H ˜ i 1 ( x ) β i 1 + h i h ˜ i L [ 0 , 1 ] d i .
Combining this with the inequality ( y + z ) α y α + z α , which holds for all y , z 0 and all α [ 0 , 1 ] , the lemma is proven. □
Proof of Theorem 1. 
Here, all n are assumed to be sufficiently large. Throughout the entire proof, C is a constant that depends only on the variation of ( q , d , t , β , F ) . Combining Theorem 2 with the bounds on the depth L and network sparsity s assumed, for n 3 , we have
1 4 Δ n m ^ n , m C ϕ n L ln 2 n R m ^ , m 4 inf m * F ( L , p , s , F ) m * m 2 + 4 Δ n m ^ n , m + C ϕ n L ln 2 n ,
where, for the lower bound, we set ε = 1 / 2 , and for the upper bound, we set ε = 1 . We take C = 8 C ; then, when Δ n m ^ n , m C ϕ n L ln 2 n , we have 1 8 Δ n m ^ n , m C ϕ n L ln 2 n . Substituting this into the left-hand side of Equation (A17), we obtain
1 4 Δ n m ^ n , m 1 8 Δ n m ^ n , m R m ^ , m ,
that is,
1 8 Δ n m ^ n , m R m ^ , m .
Thus, the lower bound for Equation (8) is established.
To obtain upper bounds for Equations (7) and (8), it is necessary to constrain the approximation error. For this purpose, the regression function m is rewritten as Equation (A16), i.e., m = h q h 0 , where h i = h i j j T and h i j is defined on [ 0 , 1 ] t i and maps to [ 0 , 1 ] for any i < q .
Here, we apply Theorem A1 to each function h i j separately. Let m = log 2 n and consider the following.
L i : = 8 + log 2 n + 5 1 + log 2 t i β i ;
this means that there exists a network
h ˜ i j F L i , t i , 6 t i + β i N , , 6 t i + β i N , 1 , s i ,
where s i 141 t i + β i + 1 3 + t i N log 2 n + 6 , such that
h ˜ i j h i j L [ 0 , 1 ] t i 2 Q i + 1 1 + t i 2 + β i 2 6 t i N n 1 + Q i 3 β i N β i t i ,
where Q i is the Hölder norm upper bound of h i j . If i < q , two additional layers 1 ( 1 x ) + are applied to the output, requiring four additional parameters. The resulting network is denoted as
h i j * F L i + 2 , t i , 6 t i + β i N , , 6 t i + β i N , 1 , s i + 4 ,
and it is observed that σ h i j * = h ˜ i j ( x ) 0 1 . Since h i j ( x ) [ 0 , 1 ] , we have
σ h i j h i j L [ 0 , 1 ] r h ˜ i j h i j L [ 0 , 1 ] r .
If the network h i j is parallelized, h i = h i j * j = 1 , , d i + 1 belongs to the class
σ h i j * h i j L [ 0 , 1 ] r h ˜ i j h i j L [ 0 , 1 ] r ,
where r i : = max i d i + 1 t i + β i . Finally, constructing the composite network m * = h ˜ q 1 σ h q 1 * σ h 0 * , according to the construction rules in Appendix A, we can realize it in the following class:
F E , d , 6 r i N , , 6 r i N , 1 , i = 0 q d i + 1 s i + 4 ,
where E : = 3 ( q 1 ) + i = 0 q L i . By observation, there exists an A n bounded by n such that
E = A n + log 2 n i = 0 q log 2 t i β i + 1 .
For all sufficiently large n, utilizing the inequality x < x + 1 , we have E i = 0 q ( log 2 4 + log 2 ( t i β i ) ) log 2 n L , according to Equation (A1), and for sufficiently large n, the space defined in Equation (A20) can be embedded into F ( L , p , s ) , where L , p , s satisfies the assumptions of the theorem. We choose N = c max i = 0 , , q n t i 2 β i * + t i with a sufficiently small constant c > 0 , depending only on q , d , t , β . Combining Theorem A1 with Equations (A18) and (A19), we have
inf f * F ( L , p , s ) m * m 2 = inf f * F ( L , p , s ) h ˜ q 1 σ h q 1 * σ h 0 * h q h 0 2 inf f * F ( L , p , s ) h ˜ q 1 h ˜ q 1 h ˜ 0 h q h 0 2 K q = 0 q 1 2 K β + 1 i = 0 q h i h ˜ i 2 2 Q i + 1 1 + t i 2 + β i 2 6 t i N n 1 + Q i 3 β i N β i t i 2 C max i = 0 , , q N 2 β i * t i C max i = 0 , , q c 2 β i * t i n 2 β i * 2 β i * + t i .
For the approximation error in Equation (A17), we need a network function that is bounded in sup-norm by $F$. According to the previous inequalities, there exists a sequence of functions $(\tilde m_n)_n$ such that, for all sufficiently large $n$, $\tilde m_n\in\mathcal F(L,\mathbf p,s)$ and $\|\tilde m_n-m\|_\infty^2\le C\max_{i=0,\dots,q}c^{-2\beta_i^*/t_i}\,n^{-2\beta_i^*/(2\beta_i^*+t_i)}$. Let us define $m_n^*=\tilde m_n\big(\|m\|_\infty/\|\tilde m_n\|_\infty\wedge 1\big)$. Then, $\|m_n^*\|_\infty\le\|m\|_\infty=\|g_q\|_\infty\le K\le F$, where the last inequality is based on Assumption (1). Additionally, $m_n^*\in\mathcal F(L,\mathbf p,s,F)$. We can write $m_n^*-m=m_n^*-\tilde m_n+\tilde m_n-m$, and we have $\|m_n^*-m\|_\infty\le\|m_n^*-\tilde m_n\|_\infty+\|\tilde m_n-m\|_\infty$, which implies $\|m_n^*-m\|_\infty\le 2\|\tilde m_n-m\|_\infty$. This shows that if we take the infimum over the smaller space $\mathcal F(L,\mathbf p,s,F)$, then Equation (A21) also holds with an adjusted constant. Combining this with the upper bound of Equation (A17), we obtain, when $\Delta_n(\hat m_n,m)\le C\phi_n L\ln^2 n$, that
$$R\big(\hat m_n,m\big)\le C\,\phi_n L\ln^2 n,$$
and, when $\Delta_n(\hat m_n,m)\ge C\phi_n L\ln^2 n$, that
$$R\big(\hat m_n,m\big)\le C\,\Delta_n\big(\hat m_n,m\big);$$
therefore, the upper bounds in Equations (7) and (8) hold with a sufficiently large constant $C>0$. This completes the proof. □
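The rate bookkeeping used in this proof can be made concrete with a short sketch. The following Python snippet (illustrative only, not the authors' code) computes the effective smoothness indices, the rate $\phi_n=\max_i n^{-2\beta_i^*/(2\beta_i^*+t_i)}$, and the width parameter $N=\lceil c\max_i n^{t_i/(2\beta_i^*+t_i)}\rceil$ for a hypothetical composition structure; the usual definition $\beta_i^*=\beta_i\prod_{\ell=i+1}^{q}(\beta_\ell\wedge1)$ is assumed.

```python
import numpy as np

# A minimal sketch of the rate bookkeeping in the proof of Theorem 1:
# effective smoothness beta_i*, the rate phi_n, and the width parameter N.
# The composition structure (q, t, beta) below is an illustrative assumption.

def effective_smoothness(beta):
    """beta_i* = beta_i * prod_{l=i+1}^{q} min(beta_l, 1)."""
    q = len(beta) - 1
    return [beta[i] * np.prod([min(beta[l], 1.0) for l in range(i + 1, q + 1)])
            for i in range(q + 1)]

def rate_and_width(n, t, beta, c=1.0):
    beta_star = effective_smoothness(beta)
    exponents = [2 * bs / (2 * bs + ti) for bs, ti in zip(beta_star, t)]
    phi_n = max(n ** (-e) for e in exponents)          # the slowest component dominates
    N = int(np.ceil(c * max(n ** (ti / (2 * bs + ti))
                            for bs, ti in zip(beta_star, t))))
    return phi_n, N

# Example: q = 1, h0 acts on t0 = 2 coordinates with smoothness 3,
# h1 is univariate with smoothness 1 (hypothetical values).
print(rate_and_width(n=10_000, t=[2, 1], beta=[3.0, 1.0]))
```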
We begin by recalling several oracle inequalities for least squares estimators, as presented in Györfi et al. [16] and related work [17,18,19,20]. However, these inequalities assume bounded response variables, an assumption that is violated in the nonparametric regression model with Gaussian measurement noise. Additionally, we provide a lower bound for the risk and a proof that can easily be generalized to other noise distributions. Let $\mathcal N(\delta,\mathcal F,\|\cdot\|_\infty)$ denote the covering number, i.e., the minimum number of $\|\cdot\|_\infty$-balls with radius $\delta$ needed to cover $\mathcal F$ (the centers do not necessarily have to lie in $\mathcal F$).
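To make the covering-number notion concrete, the following sketch (illustrative only, not from the paper) greedily constructs a sup-norm $\delta$-cover of a hypothetical finite function class evaluated on a grid; since the greedy centers are taken inside the class, the count it returns is an upper bound on $\mathcal N(\delta,\mathcal F,\|\cdot\|_\infty)$.

```python
import numpy as np

# Greedy sup-norm covering of a hypothetical class of ridge functions
# f_a(x) = sin(a * x), a on a fine grid; returns an upper bound on N(delta, F, ||.||_inf).
grid = np.linspace(0, 1, 200)
F = np.array([np.sin(a * grid) for a in np.linspace(0.0, 5.0, 500)])

def greedy_covering_number(F, delta):
    centers = []
    uncovered = np.ones(len(F), dtype=bool)
    while uncovered.any():
        j = np.flatnonzero(uncovered)[0]          # first uncovered function becomes a center
        centers.append(j)
        dist = np.max(np.abs(F - F[j]), axis=1)   # sup-norm distance on the grid
        uncovered &= dist > delta
    return len(centers)

for delta in (0.5, 0.2, 0.1, 0.05):
    print(delta, greedy_covering_number(F, delta))
```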
Lemma A8. 
We consider the nonparametric regression model in $d$-dimensional variables given by Equation (2) with an unknown regression function $m$. Let $\hat m$ be an arbitrary estimator taking values in $\mathcal F$. Let us define
$$\Delta_n:=\Delta_n\big(\hat m_n,m,\mathcal F\big):=\mathbb E_m\Big[\frac1n\sum_{i\in\Lambda_n}\big(Y_i-\hat m(X_i)\big)^2-\inf_{f\in\mathcal F}\frac1n\sum_{i\in\Lambda_n}\big(Y_i-f(X_i)\big)^2\Big].$$
For $F\ge 1$, assume $\|m\|_\infty\le F$ and $\mathcal F\subseteq\{f:[0,1]^d\to[-F,F]\}$. If $N_n:=\mathcal N\big(\delta,\mathcal F,\|\cdot\|_\infty\big)\ge 3$, then, for any $\delta,\varepsilon\in(0,1]$,
$$(1-\varepsilon)^2\Delta_n - F^2\,\frac{18\ln N_n+76}{n\varepsilon} - 38\,\delta F \;\le\; R\big(\hat m,m\big) \;\le\; (1+\varepsilon)^2\Big[\inf_{f\in\mathcal F}\mathbb E\big[(f(X)-m(X))^2\big] + F^2\,\frac{18\ln N_n+72}{n\varepsilon} + 32\,\delta F + \Delta_n\Big].$$
Proof. 
Throughout the proof, let $\mathbb E=\mathbb E_m$. Define $\|g\|_n^2:=\frac1n\sum_{i\in\Lambda_n}g(X_i)^2$. For any estimator $\tilde m$, we introduce the empirical risk $\hat R_n(\tilde m,m):=\mathbb E\big[\|\tilde m-m\|_n^2\big]$.
Step 1: We first show that it suffices to prove the result under the restriction $\ln N_n\le n$. Since $R(\hat m,m)\le 4F^2$, the upper bound holds trivially when $\ln N_n\ge n$. For the lower bound, let $\tilde m\in\arg\min_{f\in\mathcal F}\sum_{i\in\Lambda_n}\big(Y_i-f(X_i)\big)^2$ be an empirical risk minimizer. We observe that
$$\begin{aligned}
\hat R_n\big(\hat m,m\big)-\hat R_n\big(\tilde m,m\big) &= \mathbb E\big[\|\hat m-m\|_n^2\big]-\mathbb E\big[\|\tilde m-m\|_n^2\big]\\
&=\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\big(\hat m(X_i)-Y_i+R_i\big)^2\Big]-\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\big(\tilde m(X_i)-Y_i+R_i\big)^2\Big]\\
&=\Delta_n+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\hat m(X_i)-Y_i\big)\Big]+\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}R_i^2\Big]-\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\tilde m(X_i)-Y_i\big)\Big]-\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}R_i^2\Big]\\
&=\Delta_n+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\hat m(X_i)-m(X_i)-R_i\big)\Big]-\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\tilde m(X_i)-m(X_i)-R_i\big)\Big]\\
&=\Delta_n+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\hat m(X_i)\Big]-\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\tilde m(X_i)\Big]. \qquad\text{(A22)}
\end{aligned}$$
From this identity, we see that $\Delta_n\le 8F^2$; hence, the lower bound of the lemma is also trivially satisfied when $\ln N_n\ge n$.
Therefore, in what follows we assume that $\ln N_n\le n$. The proof is divided into four parts, denoted as (I)–(IV).
(I)
Establishing a connection between the risk $R(\hat m,m)=\mathbb E\big[(\hat m(X)-m(X))^2\big]$ and its empirical counterpart $\hat R_n(\hat m,m)$ through the inequalities
$$(1-\varepsilon)\,\hat R_n\big(\hat m,m\big)-\frac{F^2}{n\varepsilon}\big(15\ln N_n+75\big)-26\,\delta F \;\le\; R\big(\hat m,m\big) \;\le\; (1+\varepsilon)\,\hat R_n\big(\hat m,m\big)+(1+\varepsilon)\frac{F^2}{n\varepsilon}\big(12\ln N_n+70\big)+26\,\delta F.$$
(II)
For any estimator $\tilde m$ taking values in $\mathcal F$, we have
$$\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\tilde m(X_i)\Big]\le 2\sqrt{\hat R_n\big(\tilde m,m\big)\,\frac{3\ln N_n+1}{n}}+6\,\delta.$$
(III)
We have
$$\hat R_n\big(\hat m,m\big)\le(1+\varepsilon)\Big[\inf_{f\in\mathcal F}\mathbb E\big[(f(X)-m(X))^2\big]+6\,\delta+F^2\,\frac{6\ln N_n+2}{n\varepsilon}+\Delta_n\Big].$$
(IV)
We have
$$\hat R_n\big(\hat m,m\big)\ge(1-\varepsilon)\Delta_n-\frac{3\ln N_n+1}{n\varepsilon}-12\,\delta.$$
Since $F\ge 1$, the lower bound of the lemma can be obtained by combining (I) and (IV), while the upper bound can be obtained from (I) and (III).
(I) Given a minimal $\delta$-covering of $\mathcal F$, let $f_j$ denote the centers of the balls. By construction, there exists a random index $j^*$ such that $\|\hat m-f_{j^*}\|_\infty\le\delta$. Without loss of generality, we can assume that $\|f_j\|_\infty\le F$. The random variables $X_i'$, $i\in\Lambda_n$, have the same distribution as $X$ and are independent of $(X_i,Y_i)$, $i\in\Lambda_n$. Using $\|f_j\|_\infty,\|m\|_\infty\le F$ and $\delta\le F$, we obtain
$$\begin{aligned}
R\big(\hat m,m\big)-\hat R_n\big(\hat m,m\big)&=\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\big(\hat m(X_i')-m(X_i')\big)^2-\frac1n\sum_{i\in\Lambda_n}\big(\hat m(X_i)-m(X_i)\big)^2\Big]\\
&=\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\Big(\big(\hat m(X_i')-f_{j^*}(X_i')\big)+\big(f_{j^*}(X_i')-m(X_i')\big)\Big)^2-\frac1n\sum_{i\in\Lambda_n}\Big(\big(\hat m(X_i)-f_{j^*}(X_i)\big)+\big(f_{j^*}(X_i)-m(X_i)\big)\Big)^2\Big]\\
&\le\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}g_{j^*}\big(X_i,X_i'\big)\Big]+9\,\delta F,
\end{aligned}$$
where $g_j(X_i,X_i'):=\big(f_j(X_i')-m(X_i')\big)^2-\big(f_j(X_i)-m(X_i)\big)^2$ and $g_{j^*}$ is defined in the same way with $f_{j^*}$ in place of $f_j$; the cross terms are bounded by using $\|\hat m-f_{j^*}\|_\infty\le\delta$ and $\|f_{j^*}\|_\infty,\|m\|_\infty,\delta\le F$. Similarly, we set $r_j:=\sqrt{n^{-1}\ln N_n}\vee\mathbb E^{1/2}\big[(f_j(X)-m(X))^2\big]$, and define $r_*$ as $r_j$ for $j=j^*$, so that
$$r_*=\sqrt{n^{-1}\ln N_n}\vee\mathbb E^{1/2}\Big[\big(f_{j^*}(X)-m(X)\big)^2\,\Big|\,(X_i,Y_i)_{i\in\Lambda_n}\Big]\le\sqrt{n^{-1}\ln N_n}+\mathbb E^{1/2}\Big[\big(\hat m(X)-m(X)\big)^2\,\Big|\,(X_i,Y_i)_{i\in\Lambda_n}\Big]+\delta.$$
In the last step, we used the triangle inequality and $\|f_{j^*}-\hat m\|_\infty\le\delta$.
For random variables $U$ and $T$, the Cauchy–Schwarz inequality states that $\mathbb E[UT]\le\mathbb E^{1/2}[U^2]\,\mathbb E^{1/2}[T^2]$. Let
$$U:=\mathbb E^{1/2}\Big[\big(\hat m(X)-m(X)\big)^2\,\Big|\,(X_i,Y_i)_{i\in\Lambda_n}\Big],$$
and
$$T:=\max_j\frac{\big|\sum_{i\in\Lambda_n}g_j\big(X_i,X_i'\big)\big|}{r_j\,F}.$$
By using $\mathbb E[U^2]=R(\hat m,m)$, we have
$$\begin{aligned}
R\big(\hat m,m\big)-\hat R_n\big(\hat m,m\big)&\le\mathbb E\Big[\frac Fn\,T\,r_{j^*}\Big]+9\,\delta F\le\mathbb E\Big[\frac Fn\,T\Big(U+\sqrt{\frac{\ln N_n}{n}}+\delta\Big)\Big]+9\,\delta F\\
&\le\frac Fn\,R\big(\hat m,m\big)^{1/2}\,\mathbb E^{1/2}\big[T^2\big]+\frac Fn\Big(\sqrt{\frac{\ln N_n}{n}}+\delta\Big)\mathbb E[T]+9\,\delta F.
\end{aligned}$$
Observing that $\mathbb E\big[g_j(X_i,X_i')\big]=0$,
$$\big|g_j\big(X_i,X_i'\big)\big|=\Big|\big(f_j(X_i')-m(X_i')\big)^2-\big(f_j(X_i)-m(X_i)\big)^2\Big|\le 4F^2,$$
and
$$\operatorname{Var}\big(g_j(X_i,X_i')\big)=2\operatorname{Var}\Big(\big(f_j(X_i)-m(X_i)\big)^2\Big)\le 2\,\mathbb E\Big[\big(f_j(X_i)-m(X_i)\big)^4\Big]\le 8F^2r_j^2.$$
Bernstein's inequality states that, for independent and centered random variables $U_1,\dots,U_n$ with $|U_i|\le M$, it holds that [21]
$$P\Big(\Big|\sum_{i\in\Lambda_n}U_i\Big|\ge t\Big)\le 2\exp\Big(-\frac{t^2}{2Mt/3+2\sum_{i\in\Lambda_n}\operatorname{Var}(U_i)}\Big).$$
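As a quick numerical illustration of this tail bound (not part of the proof), the following sketch checks it by Monte Carlo for bounded, centered uniform variables; the sample size and thresholds are arbitrary choices.

```python
import numpy as np

# Monte Carlo check of the Bernstein bound for centered variables bounded by M:
#   P(|sum U_i| >= t) <= 2 exp(-t^2 / (2*M*t/3 + 2*sum Var(U_i))).
rng = np.random.default_rng(1)
n, M, reps = 200, 1.0, 20_000
U = rng.uniform(-M, M, size=(reps, n))            # centered, bounded by M
S = U.sum(axis=1)
var_sum = n * (2 * M) ** 2 / 12                   # sum of variances of Uniform(-M, M)
for t in (15.0, 25.0, 35.0):
    empirical = np.mean(np.abs(S) >= t)
    bound = 2 * np.exp(-t ** 2 / (2 * M * t / 3 + 2 * var_sum))
    print(f"t={t:5.1f}  empirical={empirical:.5f}  Bernstein bound={bound:.5f}")
```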
Combining Bernstein's inequality with a union bound, we obtain
$$P(T\ge t)\le 1\wedge 2N_n\max_j\exp\Big(-\frac{t^2}{8t/(3r_j)+16n}\Big),$$
and since $r_j\ge\sqrt{n^{-1}\ln N_n}$, for all $t\ge 6\sqrt{n\ln N_n}$, we have
$$P(T\ge t)\le 2N_n\exp\Big(-\frac{6t\sqrt{n\ln N_n}}{48\sqrt{n\ln N_n}\big/\big(3\sqrt{n^{-1}\ln N_n}\big)+16n}\Big)\le 2N_n\exp\Big(-\frac{3t\sqrt{\ln N_n}}{16\sqrt n}\Big).$$
Thus, for large values of t, the denominator in the exponential is dominated by the first term. We have
$$\mathbb E[T]=\int_0^\infty P(T\ge t)\,dt=\int_0^{6\sqrt{n\ln N_n}}P(T\ge t)\,dt+\int_{6\sqrt{n\ln N_n}}^\infty P(T\ge t)\,dt\le 6\sqrt{n\ln N_n}+\int_{6\sqrt{n\ln N_n}}^\infty 2N_n\exp\Big(-\frac{3t\sqrt{\ln N_n}}{16\sqrt n}\Big)dt\le 6\sqrt{n\ln N_n}+\frac{32}{3}\sqrt{\frac n{\ln N_n}}.$$
According to the assumption, $N_n\ge 3$; hence, $\ln N_n\ge 1$. By using a similar approach to that for the upper bound on $\mathbb E[T]$, we can treat the quadratic case:
$$\mathbb E\big[T^2\big]=\int_0^\infty P\big(T^2\ge u\big)\,du=\int_0^\infty P\big(T\ge\sqrt u\big)\,du\le 36\,n\ln N_n+\int_{36n\ln N_n}^\infty 2N_n\exp\Big(-\frac{3\sqrt u\,\sqrt{\ln N_n}}{16\sqrt n}\Big)du\le 36\,n\ln N_n+2^8n.$$
Here, we use the identity $\int_{b^2}^\infty e^{-\sqrt u\,a}\,du=2\int_b^\infty s\,e^{-sa}\,ds=2(ba+1)e^{-ba}/a^2$; setting $a=\frac{3\sqrt{\ln N_n}}{16\sqrt n}$ and $b=6\sqrt{n\ln N_n}$ yields the last inequality. Combining the previous displays, we obtain
$$\begin{aligned}
R\big(\hat m,m\big)-\hat R_n\big(\hat m,m\big)&\le\frac Fn\,R\big(\hat m,m\big)^{1/2}\big(36\,n\ln N_n+2^8n\big)^{1/2}+\frac Fn\Big(\sqrt{\frac{\ln N_n}{n}}+\delta\Big)\Big(6\sqrt{n\ln N_n}+\frac{32}{3}\sqrt{\frac n{\ln N_n}}\Big)+9\,\delta F\\
&\le\frac Fn\,R\big(\hat m,m\big)^{1/2}\big(36\,n\ln N_n+2^8n\big)^{1/2}+F\,\frac{6\ln N_n+11}{n}+26\,\delta F. \qquad\text{(A24)}
\end{aligned}$$
Let $a,b,c,d$ be positive real numbers such that $|a-b|\le 2\sqrt a\,c+d$. We have
$$b-2\sqrt a\,c-d\le a\le 2\sqrt a\,c+b+d.$$
Using $2\sqrt a\,c\le\frac{\varepsilon}{1-\varepsilon}\,a+\frac{1-\varepsilon}{\varepsilon}\,c^2$ on the left and $2\sqrt a\,c\le\frac{\varepsilon}{1+\varepsilon}\,a+\frac{1+\varepsilon}{\varepsilon}\,c^2$ on the right, this gives
$$b-d-\frac{\varepsilon}{1-\varepsilon}\,a-\frac{1-\varepsilon}{\varepsilon}\,c^2\le a\le\frac{\varepsilon}{1+\varepsilon}\,a+\frac{1+\varepsilon}{\varepsilon}\,c^2+b+d.$$
Consequently, for any $\varepsilon\in(0,1]$ (the left-hand inequality being trivial for $\varepsilon=1$), we have
$$(1-\varepsilon)\,b-d-\frac{c^2}{\varepsilon}\le a\le(1+\varepsilon)(b+d)+\frac{(1+\varepsilon)^2}{\varepsilon}\,c^2. \qquad\text{(A25)}$$
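This elementary inequality can also be checked numerically. The following sketch (not part of the proof) samples random positive $a,c,d$, constructs $b$ so that the assumption $|a-b|\le 2\sqrt a\,c+d$ holds, and verifies both sides of Equation (A25).

```python
import numpy as np

# Numerical sanity check of Equation (A25): if |a - b| <= 2*sqrt(a)*c + d with
# a, b, c, d > 0, then for any eps in (0, 1],
#   (1 - eps)*b - d - c**2/eps <= a <= (1 + eps)*(b + d) + (1 + eps)**2 * c**2 / eps.
rng = np.random.default_rng(2)
violations = 0
for _ in range(100_000):
    a, c, d = rng.uniform(0.01, 10.0, size=3)
    slack = (2 * np.sqrt(a) * c + d) * rng.uniform(-1.0, 1.0)
    b = a + slack                                  # guarantees |a - b| <= 2*sqrt(a)*c + d
    if b <= 0:
        continue                                   # keep b positive, as assumed
    eps = rng.uniform(1e-3, 1.0)
    lower = (1 - eps) * b - d - c ** 2 / eps
    upper = (1 + eps) * (b + d) + (1 + eps) ** 2 * c ** 2 / eps
    if not (lower - 1e-9 <= a <= upper + 1e-9):
        violations += 1
print("violations:", violations)   # expected: 0
```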
According to Equation (A24), we take $a=R(\hat m,m)$, $b=\hat R_n(\hat m,m)$, $c=F\big(9n\ln N_n+64n\big)^{1/2}/n$, and $d=F\,(6\ln N_n+11)/n+26\,\delta F$. Substituting $a,b,c,d$ into Equation (A25) completes the proof of (I).
(II) Given an estimator $\tilde m$ taking values in $\mathcal F$, let $j$ be such that $\|\tilde m-f_j\|_\infty\le\delta$. Then, $\mathbb E\big[\big|\sum_{i\in\Lambda_n}R_i\big(\tilde m(X_i)-f_j(X_i)\big)\big|\big]\le\delta\,\mathbb E\big[\sum_{i\in\Lambda_n}|R_i|\big]\le n\delta$. Since $\mathbb E\big[R_i\,m(X_i)\big]=\operatorname{cov}\big(m(X_i),R_i\big)=0$, we have
$$\begin{aligned}
\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\tilde m(X_i)\Big]&=\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\tilde m(X_i)-m(X_i)\big)\Big]+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,m(X_i)\Big]\\
&=\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\tilde m(X_i)-f_j(X_i)\big)\Big]+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(f_j(X_i)-m(X_i)\big)\Big]+\frac2n\sum_{i\in\Lambda_n}\operatorname{cov}\big(m(X_i),R_i\big)\\
&\le 2\delta+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(f_j(X_i)-m(X_i)\big)\Big]\\
&\le 2\delta+\frac2{\sqrt n}\,\mathbb E\Big[\big(\|\tilde m-m\|_n+\delta\big)\,\xi_j\Big], \qquad\text{(A26)}
\end{aligned}$$
where
$$\xi_j:=\frac{\sum_{i\in\Lambda_n}R_i\big(f_j(X_i)-m(X_i)\big)}{\sqrt n\,\|f_j-m\|_n}.$$
Conditionally on $(X_i)_{i\in\Lambda_n}$, $\xi_j\sim N(0,1)$. According to Lemma A9, we obtain $\mathbb E\big[\xi_j^2\big]\le\mathbb E\big[\max_j\xi_j^2\big]\le 3\ln N_n+1$. By using the Cauchy–Schwarz inequality, we have
$$\mathbb E\Big[\big(\|\tilde m-m\|_n+\delta\big)\,\xi_j\Big]\le\Big(\hat R_n\big(\tilde m,m\big)^{1/2}+\delta\Big)\sqrt{3\ln N_n+1}. \qquad\text{(A27)}$$
Since $\ln N_n\le n$, we have $2n^{-1/2}\delta\sqrt{3\ln N_n+1}\le 4\delta$. Combining Equations (A26) and (A27), we have
$$\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\tilde m(X_i)\Big]\le 2\delta+\frac2{\sqrt n}\Big(\hat R_n\big(\tilde m,m\big)^{1/2}+\delta\Big)\sqrt{3\ln N_n+1}\le 2\sqrt{\hat R_n\big(\tilde m,m\big)\,\frac{3\ln N_n+1}{n}}+6\,\delta.$$
(II) is proven.
(III) For any fixed $f\in\mathcal F$, we have $\mathbb E\big[\frac1n\sum_{i\in\Lambda_n}(Y_i-\hat m(X_i))^2\big]\le\mathbb E\big[\frac1n\sum_{i\in\Lambda_n}(Y_i-f(X_i))^2\big]+\Delta_n$. Since $X_i\stackrel{\mathcal D}{=}X$ and $f$ is deterministic, we have
$\mathbb E\big[\|f-m\|_n^2\big]=\mathbb E\big[(f(X)-m(X))^2\big]$. Since $\mathbb E\big[R_i\,m(X_i)\big]=\operatorname{cov}\big(m(X_i),R_i\big)=0$, we have
$$\begin{aligned}
\hat R_n\big(\hat m,m\big)&=\mathbb E\big[\|\hat m-m\|_n^2\big]=\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\big(\hat m(X_i)-Y_i+R_i\big)^2\Big]\\
&=\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}\big(\hat m(X_i)-Y_i\big)^2\Big]+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\hat m(X_i)-Y_i\big)\Big]+\mathbb E\Big[\frac1n\sum_{i\in\Lambda_n}R_i^2\Big]\\
&\le\mathbb E\big[\|f-m\|_n^2\big]+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\big(\hat m(X_i)-m(X_i)\big)\Big]+\Delta_n\\
&\le\mathbb E\big[(f(X)-m(X))^2\big]+\Delta_n+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\hat m(X_i)\Big]-\frac2n\sum_{i\in\Lambda_n}\operatorname{cov}\big(m(X_i),R_i\big)\\
&\le\mathbb E\big[(f(X)-m(X))^2\big]+2\sqrt{\hat R_n\big(\hat m,m\big)\,\frac{3\ln N_n+1}{n}}+6\,\delta+\Delta_n.
\end{aligned}$$
By setting $a:=\hat R_n(\hat m,m)$, $b:=0$, $c:=\sqrt{(3\ln N_n+1)/n}$, and $d:=\mathbb E\big[(f(X)-m(X))^2\big]+6\,\delta+\Delta_n$ in Equation (A25), we obtain the result for (III).
(IV) Let $\tilde m\in\arg\min_{f\in\mathcal F}\sum_{i\in\Lambda_n}\big(Y_i-f(X_i)\big)^2$ be the empirical risk minimizer. By using Equation (A22), part (II), and the identity $(1-\varepsilon)/\varepsilon+1=1/\varepsilon$, we have
$$\begin{aligned}
\hat R_n\big(\hat m,m\big)-\hat R_n\big(\tilde m,m\big)&=\Delta_n+\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\hat m(X_i)\Big]-\mathbb E\Big[\frac2n\sum_{i\in\Lambda_n}R_i\,\tilde m(X_i)\Big]\\
&\ge\Delta_n-2\sqrt{\hat R_n\big(\hat m,m\big)\,\frac{3\ln N_n+1}{n}}-2\sqrt{\hat R_n\big(\tilde m,m\big)\,\frac{3\ln N_n+1}{n}}-12\,\delta\\
&\ge\Delta_n-\frac{\varepsilon}{1-\varepsilon}\,\hat R_n\big(\hat m,m\big)-\frac{1-\varepsilon}{\varepsilon}\cdot\frac{3\ln N_n+1}{n}-\hat R_n\big(\tilde m,m\big)-\frac{3\ln N_n+1}{n}-12\,\delta\\
&\ge\Delta_n-\frac{\varepsilon}{1-\varepsilon}\,\hat R_n\big(\hat m,m\big)-\hat R_n\big(\tilde m,m\big)-\frac{3\ln N_n+1}{n\varepsilon}-12\,\delta.
\end{aligned}$$
After rearranging, we have $\hat R_n(\hat m,m)\ge(1-\varepsilon)\Delta_n-\frac{3\ln N_n+1}{n\varepsilon}-12\,\delta$, which completes the proof of (IV). □
Lemma A9. 
Let $\eta_1,\dots,\eta_M$ be $N(0,1)$ random variables. Then, $\mathbb E\big[\max_{j=1,\dots,M}\eta_j^2\big]\le 3\ln M+1$.
Proof. 
Let $Z=\max_{j=1,\dots,M}\eta_j^2$. Since $Z\le\sum_j\eta_j^2$, we have $\mathbb E[Z]\le M$. For $M\in\{1,2,3\}$, it is evident that $M\le 3\ln M+1$ holds. Therefore, we consider the case $M\ge 4$. By using Mill's ratio, we obtain $P\big(\eta_1^2\ge t\big)=2P\big(\eta_1\ge\sqrt t\big)\le 2e^{-t/2}/\sqrt{2\pi t}$. For any $T$, we have
$$\mathbb E[Z]=\int_0^\infty P(Z\ge t)\,dt\le T+\int_T^\infty P(Z\ge t)\,dt\le T+M\int_T^\infty P\big(\eta_1^2\ge t\big)\,dt\le T+M\int_T^\infty\frac2{\sqrt{2\pi t}}\,e^{-t/2}\,dt\le T+\frac{2M}{\sqrt{2\pi T}}\int_T^\infty e^{-t/2}\,dt=T+\frac{4M}{\sqrt{2\pi T}}\,e^{-T/2}.$$
For $T=2\ln M$ and $M\ge 4$, we have
$$\mathbb E[Z]\le 2\ln M+\frac{2}{\sqrt{\pi\ln M}}\le 3\ln M+1.$$
Indeed, $2/\sqrt{\pi\ln M}\le\ln M+1$ is equivalent to $2/\sqrt\pi\le\sqrt{\ln M}\,(\ln M+1)$; the right-hand side is monotonically increasing in $M$ and the inequality holds at $M=4$, so it holds for all $M\ge 4$. □
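The bound of Lemma A9 is easy to illustrate numerically. The following minimal sketch (not from the paper) estimates $\mathbb E[\max_j\eta_j^2]$ by Monte Carlo for i.i.d. standard normal variables, one simple instance of the lemma, and compares it with $3\ln M+1$.

```python
import numpy as np

# Monte Carlo illustration of Lemma A9: E[max_j eta_j^2] <= 3*ln(M) + 1.
rng = np.random.default_rng(3)
for M in (2, 10, 100, 1000):
    eta = rng.standard_normal(size=(10_000, M))
    lhs = np.mean(np.max(eta ** 2, axis=1))
    print(f"M={M:5d}  E[max eta^2] ~ {lhs:6.3f}   bound = {3 * np.log(M) + 1:6.3f}")
```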
Lemma A10. 
If $V:=\prod_{\ell=0}^{L+1}(p_\ell+1)$, then, for any $\delta>0$, we have
$$\ln\mathcal N\big(\delta,\mathcal F(L,\mathbf p,s,\infty),\|\cdot\|_\infty\big)\le(s+1)\ln\big(2\,\delta^{-1}(L+1)V^2\big).$$
Proof. 
Given a network
$$f(x)=W_L\,\sigma_{v_L}W_{L-1}\,\sigma_{v_{L-1}}\cdots W_1\,\sigma_{v_1}W_0\,x,$$
define, for $k\in\{1,\dots,L\}$, $A_k^+f:[0,1]^d\to\mathbb R^{p_k}$,
$$A_k^+f(x)=\sigma_{v_k}W_{k-1}\,\sigma_{v_{k-1}}\cdots W_1\,\sigma_{v_1}W_0\,x,$$
and $A_k^-f:\mathbb R^{p_{k-1}}\to\mathbb R^{p_{L+1}}$,
$$A_k^-f(y)=W_L\,\sigma_{v_L}W_{L-1}\,\sigma_{v_{L-1}}\cdots W_k\,\sigma_{v_k}W_{k-1}\,y.$$
Let $A_0^+f(x)=A_{L+2}^-f(x)=x$, and note that, for $f\in\mathcal F(L,\mathbf p)$, we have $|A_k^+f(x)|_\infty\le\prod_{\ell=0}^{k-1}(p_\ell+1)$. A multivariate function $h(\cdot)$ is said to be Lipschitz if, for all $x,y$ in its domain, $|h(x)-h(y)|\le L'|x-y|$, where the smallest such $L'$ is the Lipschitz constant. The composition of two Lipschitz functions with Lipschitz constants $L_1$ and $L_2$ is Lipschitz with constant $L_1L_2$. Therefore, the Lipschitz constant of $A_k^-f$ is bounded by $\prod_{\ell=k}^{L}p_\ell$. Given $\varepsilon>0$, let $f,f^*\in\mathcal F(L,\mathbf p,s)$ be two network functions whose parameters differ entrywise by at most $\varepsilon$. Let $f$ have parameters $(v_k,W_k)$ and $f^*$ have parameters $(v_k^*,W_k^*)$. Then, we have
$$\begin{aligned}
\big|f(x)-f^*(x)\big|_\infty&=\big|W_L\,\sigma_{v_L}\cdots\sigma_{v_1}W_0\,x-W_L^*\,\sigma_{v_L^*}\cdots\sigma_{v_1^*}W_0^*\,x\big|_\infty\\
&\le\sum_{k=1}^{L+1}\big|A_{k+1}^-f\big(\sigma_{v_k}W_{k-1}A_{k-1}^+f^*(x)\big)-A_{k+1}^-f\big(\sigma_{v_k^*}W_{k-1}^*A_{k-1}^+f^*(x)\big)\big|_\infty\\
&\le\sum_{k=1}^{L+1}\prod_{\ell=k}^{L}p_\ell\,\big|\sigma_{v_k}W_{k-1}A_{k-1}^+f^*(x)-\sigma_{v_k^*}W_{k-1}^*A_{k-1}^+f^*(x)\big|_\infty\\
&\le\sum_{k=1}^{L+1}\prod_{\ell=k}^{L}p_\ell\,\Big(\big|\big(W_{k-1}-W_{k-1}^*\big)A_{k-1}^+f^*(x)\big|_\infty+\big|v_k-v_k^*\big|_\infty\Big)\\
&\le\sum_{k=1}^{L+1}\prod_{\ell=k}^{L}p_\ell\,\varepsilon\,\Big(p_{k-1}\big|A_{k-1}^+f^*(x)\big|_\infty+1\Big)\\
&\le\varepsilon\sum_{k=1}^{L+1}\prod_{\ell=k}^{L}p_\ell\prod_{\ell=0}^{k-1}(p_\ell+1)\le\varepsilon\,V\,(L+1).
\end{aligned}$$
The final step uses $V:=\prod_{\ell=0}^{L+1}(p_\ell+1)$. Therefore, according to Equation (A3), the total number of parameters is bounded by $T:=\sum_{\ell=0}^{L}(p_\ell+1)\,p_{\ell+1}\le V$, and there are at most $\binom{T}{s}\le V^s$ combinations for selecting $s$ nonzero parameters.
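The parameter-perturbation bound derived above can be illustrated numerically. The sketch below (illustrative only; the architecture and $\varepsilon$ are arbitrary choices, and the bound is taken in the form reconstructed above) builds a small ReLU network with parameters bounded by 1, perturbs every weight and shift by at most $\varepsilon$, and compares the observed sup-norm change of the output on $[0,1]^{p_0}$ with $\varepsilon(L+1)V$.

```python
import numpy as np

# Checks |f(x) - f*(x)|_inf <= eps * (L + 1) * V with V = prod_l (p_l + 1)
# for two ReLU networks whose parameters differ entrywise by at most eps.
rng = np.random.default_rng(4)
p = [3, 8, 8, 1]                 # widths p_0, ..., p_{L+1}; here L = 2 hidden layers
L = len(p) - 2
V = np.prod([pl + 1 for pl in p])
eps = 1e-3

def init_params():
    Ws = [rng.uniform(-1, 1, size=(p[l + 1], p[l])) for l in range(L + 1)]
    vs = [rng.uniform(-1, 1, size=p[l + 1]) for l in range(L)]   # shift vectors before ReLU
    return Ws, vs

def forward(Ws, vs, x):
    h = x
    for l in range(L):
        h = np.maximum(Ws[l] @ h - vs[l], 0.0)    # shifted ReLU layer sigma_v(W h)
    return Ws[L] @ h                              # final linear layer

Ws, vs = init_params()
Ws_p = [W + rng.uniform(-eps, eps, size=W.shape) for W in Ws]
vs_p = [v + rng.uniform(-eps, eps, size=v.shape) for v in vs]

X = rng.uniform(0, 1, size=(2000, p[0]))
gap = max(np.max(np.abs(forward(Ws, vs, x) - forward(Ws_p, vs_p, x))) for x in X)
print(f"observed sup gap ~ {gap:.2e}   bound eps*(L+1)*V = {eps * (L + 1) * V:.2e}")
```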
Since all parameters are bounded by 1 in absolute value, we can discretize the nonzero parameters by using a grid of width $\delta/\big(2(L+1)V\big)$, and the covering number satisfies
$$\mathcal N\big(\delta,\mathcal F(L,\mathbf p,s,\infty),\|\cdot\|_\infty\big)\le\sum_{s^*\le s}V^{s^*}\Big(\frac{2(L+1)V}{\delta}\Big)^{s^*}\le\Big(\frac{2(L+1)V^2}{\delta}\Big)^{s+1}.$$
Taking the logarithm yields the claim. □
Note 1: Similarly, applying Equation (A4) to Lemma A10 gives
$$\ln\mathcal N\big(\delta,\mathcal F(L,\mathbf p,s,\infty),\|\cdot\|_\infty\big)\le(s+1)\ln\Big(2^{2L+5}\,\delta^{-1}(L+1)\,p_0^2\,p_{L+1}^2\,s^{2L}\Big).$$
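For concreteness, the following sketch (not from the paper) evaluates both entropy bounds, the one in Lemma A10 and the one in Note 1 as reconstructed above, for a hypothetical architecture; the widths, sparsity, and $\delta$ are illustrative assumptions.

```python
import numpy as np

# Evaluate the metric-entropy bounds of Lemma A10 and Note 1 for a concrete architecture.
def entropy_bound_lemma(p, s, delta):
    L = len(p) - 2
    V = np.prod([pl + 1.0 for pl in p])
    return (s + 1) * np.log(2.0 * (L + 1) * V ** 2 / delta)

def entropy_bound_note1(p, s, delta):
    L = len(p) - 2
    return (s + 1) * np.log(2.0 ** (2 * L + 5) * (L + 1)
                            * p[0] ** 2 * p[-1] ** 2 * s ** (2 * L) / delta)

p = (10, 64, 64, 64, 1)      # hypothetical widths p_0, ..., p_{L+1}
s, delta = 500, 1e-3         # hypothetical sparsity and covering radius
print(entropy_bound_lemma(p, s, delta), entropy_bound_note1(p, s, delta))
```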
Proof of Theorem 2. 
Let δ = 1 / n . The proof follows directly from Lemmas A8 and A10, and Note 1. □
Proof of Theorem 3. 
In this proof, we write $\|\cdot\|_2=\|\cdot\|_{L^2([0,1]^d)}$. We assume that there exist positive constants $\gamma\le\Gamma$ such that the Lebesgue density of $X$ on $[0,1]^d$ is bounded below by $\gamma$ and above by $\Gamma$. For this design, we have $R(\hat m_n,m)\ge\gamma\,\|\hat m_n-m\|_2^2$. Let $P_f$ denote the distribution of the data in the nonparametric regression model given by Equation (13) with regression function $f$. For the Kullback–Leibler divergence, we have $\mathrm{KL}(P_f,P_g)=n\,\mathbb E\big[(f(X_1)-g(X_1))^2\big]\le\Gamma\,n\,\|f-g\|_2^2$. Theorem 2.7 of Tsybakov [22] states that if, for $M\ge 1$ and $\kappa>0$, there exist $f^{(0)},\dots,f^{(M)}\in\mathcal G(q,d,t,\beta,K)$ such that
(i) $\|f^{(j)}-f^{(k)}\|_2\ge\kappa\sqrt{\phi_n}$ for all $0\le j<k\le M$;
(ii) $n\sum_{j=1}^{M}\|f^{(j)}-f^{(0)}\|_2^2\le M\ln(M)/(9\,\Gamma)$,
then there exists a positive constant $c=c(\kappa,\gamma)$ such that
$$\inf_{\hat m_n}\sup_{m\in\mathcal G(q,d,t,\beta,K)}R\big(\hat m_n,m\big)\ge c\,\phi_n.$$
In the next step, we construct functions f ( 0 ) , , f ( M ) G ( q , d , t , β , K ) satisfying (i) and (ii). We define
$$i^*\in\arg\min_{i=0,\dots,q}\frac{\beta_i^*}{2\beta_i^*+t_i}.$$
The index $i^*$ determines the rate of estimation, i.e., $\phi_n=n^{-\frac{2\beta^{**}}{2\beta^{**}+t^*}}$. For convenience, we denote $\beta^*:=\beta_{i^*}$, $\beta^{**}:=\beta^*_{i^*}$, and $t^*:=t_{i^*}$; note the distinction between $\beta^*$ and $\beta^{**}$. Let $K(\cdot)\in L^2(\mathbb R)\cap\mathcal C_1^{\beta^*}(\mathbb R,1)$ be supported on $[0,1]$; it is easy to see that such a function exists. Furthermore, we define $m_n:=\big\lceil\rho\,n^{\frac1{2\beta^{**}+t^*}}\big\rceil$ and $h_n:=1/m_n$, where $\rho$ is a constant chosen so that $n\,h_n^{2\beta^{**}+t^*}\le\frac1{72\,\Gamma\,\|K^B\|_2^{2t^*}}$, with $B:=\prod_{\ell=i^*+1}^{q}(\beta_\ell\wedge1)$. For any $u=(u_1,\dots,u_{t^*})\in\mathcal U_n:=\big\{(u_1,\dots,u_{t^*}):u_i\in\{0,h_n,2h_n,\dots,(m_n-1)h_n\}\big\}$, we define
$$\psi_u\big(x_1,\dots,x_{t^*}\big):=h_n^{\beta^*}\prod_{j=1}^{t^*}K\Big(\frac{x_j-u_j}{h_n}\Big),$$
and, for any $\alpha$ with $|\alpha|<\beta^*$, we have $\|\partial^\alpha\psi_u\|_\infty\le 1$ by using the fact that $K\in\mathcal C_1^{\beta^*}(\mathbb R,1)$.
For $\alpha=(\alpha_1,\dots,\alpha_{t^*})$ with $|\alpha|=\lfloor\beta^*\rfloor$, the triangle inequality and the property $K\in\mathcal C_1^{\beta^*}(\mathbb R,1)$ give
$$h_n^{\beta^*-\lfloor\beta^*\rfloor}\,\Big|\prod_{j=1}^{t^*}K^{(\alpha_j)}\Big(\frac{x_j-u_j}{h_n}\Big)-\prod_{j=1}^{t^*}K^{(\alpha_j)}\Big(\frac{y_j-u_j}{h_n}\Big)\Big|\le\max_i|x_i-y_i|^{\beta^*-\lfloor\beta^*\rfloor}\,t^*;$$
therefore, $\psi_u\in\mathcal C_{t^*}^{\beta^*}\big([0,1]^{t^*},(\beta^*)^{t^*}+t^*\big)$. For a vector $w=(w_u)_{u\in\mathcal U_n}\in\{0,1\}^{|\mathcal U_n|}$, we define
$$\phi_w=\sum_{u\in\mathcal U_n}w_u\,\psi_u.$$
By construction, $\psi_u$ and $\psi_{u'}$ have disjoint supports for $u\ne u'$, which ensures that $\phi_w\in\mathcal C_{t^*}^{\beta^*}\big([0,1]^{t^*},(\beta^*)^{t^*}+t^*\big)$.
For $i<i^*$, let $g_i(x):=\big(x_1,\dots,x_{d_{i+1}}\big)$. For $i=i^*$, define $g_{i^*,w}(x)=\big(\phi_w(x_1,\dots,x_{t_{i^*}}),0,\dots,0\big)$. For $i>i^*$, let $g_i(x):=\big(x_1^{\beta_i\wedge1},0,\dots,0\big)$. Here, $B=\prod_{\ell=i^*+1}^{q}(\beta_\ell\wedge1)$, and we frequently use $\beta^{**}=\beta^*B$. Since $t_i\le\min(d_0,\dots,d_{i-1})$ and the $\psi_u$ have mutually disjoint supports,
$$f_w(x)=g_q\circ\cdots\circ g_{i^*+1}\circ g_{i^*,w}\circ g_{i^*-1}\circ\cdots\circ g_0(x)=\Big(\phi_w\big(x_1,\dots,x_{t_{i^*}}\big)\Big)^B=\Big(\sum_{u\in\mathcal U_n}w_u\,\psi_u\big(x_1,\dots,x_{t_{i^*}}\big)\Big)^B,$$
and, by choosing a sufficiently large $K$ in the definition of the class, we ensure that $f_w\in\mathcal G(q,d,t,\beta,K)$.
For all $u$, $\|\psi_u^B\|_2^2=h_n^{2\beta^{**}+t^*}\|K^B\|_2^{2t^*}$. Let $\mathrm{Ham}(w,w')=\sum_{u\in\mathcal U_n}\mathbb 1\{w_u\ne w_u'\}$ denote the Hamming distance; then, since the $\psi_u$ have disjoint supports,
$$\|f_w-f_{w'}\|_2^2=\mathrm{Ham}(w,w')\,h_n^{2\beta^{**}+t^*}\,\|K^B\|_2^{2t^*}.$$
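This identity is easy to verify numerically in the simplest configuration. The sketch below (illustrative only; the kernel, smoothness, and grid size are hypothetical choices) takes $t^*=1$ and $B=1$, so that $\beta^{**}=\beta^*$, builds the bumps $\psi_u$ on a grid, and compares $\|f_w-f_{w'}\|_2^2$ with $\mathrm{Ham}(w,w')\,h^{2\beta^*+1}\|K\|_2^2$.

```python
import numpy as np

# Verify ||f_w - f_w'||_2^2 = Ham(w, w') * h^{2*beta* + 1} * ||K||_2^2
# for disjointly supported bumps psi_u(x) = h^{beta*} K((x - u)/h) on [0, 1].
beta_star, m = 2.0, 8
h = 1.0 / m
x = np.linspace(0.0, 1.0, 400_000, endpoint=False)
dx = x[1] - x[0]
K = lambda u: np.where((u >= 0) & (u < 1), (u * (1 - u)) ** 2, 0.0)   # hypothetical kernel
psi = np.array([h ** beta_star * K((x - j * h) / h) for j in range(m)])

rng = np.random.default_rng(5)
w, w2 = rng.integers(0, 2, size=m), rng.integers(0, 2, size=m)
lhs = np.sum((w @ psi - w2 @ psi) ** 2) * dx                          # Riemann sum for the L2 norm
K_norm_sq = np.sum(K(x) ** 2) * dx
rhs = np.sum(w != w2) * h ** (2 * beta_star + 1) * K_norm_sq
print(f"lhs = {lhs:.6e},  rhs = {rhs:.6e}")
```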
According to the Varshamov–Gilbert bound (see [22], Lemma 2.9) and $m_n^{t^*}=|\mathcal U_n|$, there exists a subset $\mathcal W\subseteq\{0,1\}^{m_n^{t^*}}$ with cardinality $|\mathcal W|\ge 2^{m_n^{t^*}/8}$ such that $\mathrm{Ham}(w,w')\ge m_n^{t^*}/8$ for all $w,w'\in\mathcal W$ with $w\ne w'$. This implies that, with $\kappa=\|K^B\|_2^{t^*}\big/\big(\sqrt8\,\rho^{\beta^{**}}\big)$, for all $w,w'\in\mathcal W$ with $w\ne w'$, we have
$$\|f_w-f_{w'}\|_2^2=\mathrm{Ham}(w,w')\,h_n^{2\beta^{**}+t^*}\|K^B\|_2^{2t^*}\ge\frac18\,m_n^{t^*}\,h_n^{2\beta^{**}+t^*}\|K^B\|_2^{2t^*}=\frac18\,h_n^{2\beta^{**}}\|K^B\|_2^{2t^*}\ge\kappa^2\,\phi_n.$$
According to the definitions of $h_n$ and $\rho$, we have
$$n\,\|f_w-f_{w'}\|_2^2=n\,\mathrm{Ham}(w,w')\,h_n^{2\beta^{**}+t^*}\|K^B\|_2^{2t^*}\le n\,m_n^{t^*}h_n^{2\beta^{**}+t^*}\|K^B\|_2^{2t^*}\le m_n^{t^*}\,\frac{\|K^B\|_2^{2t^*}}{72\,\Gamma\,\|K^B\|_2^{2t^*}}=\frac{m_n^{t^*}}{72\,\Gamma}\le\frac{\ln|\mathcal W|}{9\,\Gamma}.$$
This shows that the functions $f_w$ with $w\in\mathcal W$ satisfy (i) and (ii); thus, the theorem is proven. □
Proof of Lemma 1. 
Let $c_2\le 1$. Since $\|m\|_\infty\le K$, we only need to consider the lower bound over $\mathcal F(L,\mathbf p,s,F)$ with $F=K+1$. Let $\tilde m_n$ be the empirical risk minimizer, and recall that $\Delta_n(\tilde m_n,m)=0$. Due to the minimax lower bound in Theorem 3, there exists a constant $c_3$ such that, for all sufficiently large $n$, $c_3\,n^{-2\beta/(2\beta+d)}\le\sup_{m\in\mathcal C_d^\beta([0,1]^d,K)}R(\tilde m_n,m)$. Since $p_0=d$ and $p_{L+1}=1$, by Theorem 2, we can conclude that
$$c_3\,n^{-\frac{2\beta}{2\beta+d}}\le\sup_{m\in\mathcal C_d^\beta([0,1]^d,K)}R\big(\tilde m_n,m\big)\le 4\sup_{m\in\mathcal C_d^\beta([0,1]^d,K)}\inf_{f\in\mathcal F(L,\mathbf p,s,K+1)}\|f-m\|_\infty^2+C\,(K+1)^2\,\frac{(s+1)\ln\big(n(s+1)Ld\big)}{n},$$
where $C$ is a constant. Given $\varepsilon$, let $n_\varepsilon:=\big\lceil(8\varepsilon/c_3)^{-(2\beta+d)/\beta}\big\rceil$. If $\varepsilon\le c_3/8$, then $n_\varepsilon\le 2\,(8\varepsilon/c_3)^{-(2\beta+d)/\beta}$ and $8\,\varepsilon^2\le c_3\,n_\varepsilon^{-2\beta/(2\beta+d)}$; hence, for a sufficiently small $c_2>0$ and all $\varepsilon\le c_2$, inserting $n_\varepsilon$ into the previous inequalities yields
$$8\,\varepsilon^2\le 4\sup_{m\in\mathcal C_d^\beta([0,1]^d,K)}\inf_{f\in\mathcal F(L,\mathbf p,s,K+1)}\|f-m\|_\infty^2+C_1\,\varepsilon^{\frac{2\beta+d}{\beta}}\,s\,\big(\ln(\varepsilon^{-1})\,s\,L+C_2\big).$$
The constants $C_1$ and $C_2$ depend only on $K$, $\beta$, and $d$. By using the condition $s\le c_1\,\varepsilon^{-d/\beta}\big/\big(L\ln(1/\varepsilon)\big)$ and choosing a sufficiently small $c_1$, the proof is completed. □
Proof of Lemma 2. 
Let $r$ be the smallest positive integer such that $\mu_r:=\int x^r\psi(x)\,dx\ne0$. Such an $r$ exists because the span of $\{x^r:r=0,1,\dots\}$ is dense in $L^2[0,A]$ and $\psi$ is not the zero function. If $h\in L^2(\mathbb R)$, then, for the wavelet coefficients, we have
$$\int h\big(x_1+\cdots+x_d\big)\prod_{\ell=1}^{d}\psi_{j,k_\ell}(x_\ell)\,dx=2^{-\frac{jd}2}\int_{[0,2^q]^d}h\Big(2^{-j}\sum_{\ell=1}^{d}\big(x_\ell+k_\ell\big)\Big)\prod_{\ell=1}^{d}\psi(x_\ell)\,dx. \qquad\text{(A28)}$$
For a real number u, let { u } denote the fractional part of u.
We separately consider the cases $\mu_0\ne0$ and $\mu_0=0$. If $\mu_0\ne0$, we define $g(u)=r^{-1}\{u\}^r\,I_{[0,1/2]}(\{u\})+r^{-1}(1-\{u\})^r\,I_{(1/2,1]}(\{u\})$. We note that $g$ is a Lipschitz function with Lipschitz constant 1. Let $h_{j,\alpha}(u)=K2^{-j\alpha-1}g\big(2^{j-q-\nu}u\big)$, where $q>0$ and $\nu:=\lceil\log_2 d\rceil+1$. For a $V$-periodic function $u\mapsto s(u)$ and $\alpha\le1$, the $\alpha$-Hölder seminorm can be expressed as
$$\sup_{u\ne v,\,|u-v|\le V}\frac{|s(u)-s(v)|}{|u-v|^\alpha}.$$
Since $g$ is a 1-Lipschitz function, for any $u$ and $v$ such that $|u-v|\le2^{q+\nu-j}$, we have
$$\big|h_{j,\alpha}(u)-h_{j,\alpha}(v)\big|=K2^{-j\alpha-1}\big|g\big(2^{j-q-\nu}u\big)-g\big(2^{j-q-\nu}v\big)\big|\le K2^{-j\alpha-1}2^{j-q-\nu}|u-v|\le K2^{-j\alpha-1}2^{j-q-\nu}\,2^{(q+\nu-j)(1-\alpha)}|u-v|^\alpha=\frac K2\,2^{-\alpha(q+\nu)}|u-v|^\alpha\le\frac K2\,|u-v|^\alpha.$$
Therefore, $\|h_{j,\alpha}\|_\infty\le K/2$ and $h_{j,\alpha}\in\mathcal C_1^\alpha([0,d],K)$. Let $f_{j,\alpha}(x)=h_{j,\alpha}(x_1+\cdots+x_d)$. The support of $\psi$ is contained in $[0,2^q]$, and $2^\nu\ge2d$. Based on the definition of the wavelet coefficients in Equation (A28), the definition of $h_{j,\alpha}$, and $\mu_r=\int x^r\psi(x)\,dx$, for $p_1,\dots,p_d\in\{0,1,\dots,2^{j-q-\nu}-1\}$, we have
$$\begin{aligned}
d_{(j,2^{q+\nu}p_1),\dots,(j,2^{q+\nu}p_d)}\big(f_{j,\alpha}\big)&=2^{-\frac{jd}2}\int_{[0,2^q]^d}h_{j,\alpha}\Big(2^{-j}\sum_{\ell=1}^{d}\big(x_\ell+2^{q+\nu}p_\ell\big)\Big)\prod_{\ell=1}^{d}\psi(x_\ell)\,dx\\
&=K2^{-\frac{jd}2}2^{-j\alpha-1}\int_{[0,2^q]^d}g\Big(2^{-q-\nu}\sum_{\ell=1}^{d}x_\ell+\sum_{\ell=1}^{d}p_\ell\Big)\prod_{\ell=1}^{d}\psi(x_\ell)\,dx\\
&=r^{-1}2^{-qr-\nu r-1}K\,2^{-\frac j2(2\alpha+d)}\int_{[0,2^q]^d}\big(x_1+\cdots+x_d\big)^r\prod_{\ell=1}^{d}\psi(x_\ell)\,dx\\
&=d\,r^{-1}2^{-qr-\nu r-1}K\,\mu_0^{d-1}\mu_r\,2^{-\frac j2(2\alpha+d)}.
\end{aligned}$$
In the last equality, we used the definition of $r$, i.e., $\mu_1=\cdots=\mu_{r-1}=0$, so that only the terms in which a single coordinate carries the full power $r$ survive.
In the case $\mu_0=0$, we take $g(u)=(dr)^{-1}\{u\}^{dr}I_{[0,1/2]}(\{u\})+(dr)^{-1}(1-\{u\})^{dr}I_{(1/2,1]}(\{u\})$. Following the same reasoning as above and by using the multinomial theorem, we obtain
$$d_{(j,2^{q+\nu}p_1),\dots,(j,2^{q+\nu}p_d)}\big(f_{j,\alpha}\big)=\binom{dr}{r,\dots,r}(dr)^{-1}2^{-dqr-d\nu r-1}K\,\mu_r^{d}\,2^{-\frac j2(2\alpha+d)};$$
therefore, the lemma is proven. □
Proof of Theorem 4. 
We define c ( ψ , d ) as in Lemma 2. We choose an integer j * such that
$$\frac1n\le c(\psi,d)^2K^2\,2^{-j^*(2\alpha+d)}\le\frac{2^{2\alpha+d}}n.$$
This implies that $2^{j^*}\ge\frac12\big(c(\psi,d)^2K^2\,n\big)^{1/(2\alpha+d)}$. According to Lemma 2, there exists a function $f_{j^*,\alpha}$ of the form $h(x_1+\cdots+x_d)$, where $h\in\mathcal C_1^\alpha([0,d],K)$, such that
$$R\big(\hat m_n,f_{j^*,\alpha}\big)\ge\sum_{p_1,\dots,p_d\in\{0,1,\dots,2^{j^*-q-\nu}-1\}}\frac1n=\frac1n\,2^{(j^*-q-\nu)d}\ge\frac1n\cdot\frac{\big(c(\psi,d)^2K^2\,n\big)^{d/(2\alpha+d)}}{2^{d}\,2^{qd+\nu d}}=\frac{\big(c(\psi,d)^2K^2\big)^{d/(2\alpha+d)}}{2^{d(1+q+\nu)}}\,n^{-\frac{2\alpha}{2\alpha+d}}\gtrsim n^{-\frac{2\alpha}{2\alpha+d}};$$
therefore, the theorem is proven. □
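As a quick sanity check on the choice of $j^*$ (illustrative only; the values of $c(\psi,d)$, $K$, $\alpha$, and $d$ below are hypothetical), the following sketch selects the smallest integer $j^*$ satisfying the upper inequality and verifies that the lower inequality then holds as well.

```python
import numpy as np

# Choose the smallest j* with c^2 K^2 2^{-j*(2a+d)} <= 2^{2a+d}/n and check 1/n <= c^2 K^2 2^{-j*(2a+d)}.
def choose_j_star(n, c, K, alpha, d):
    j = 0
    while c ** 2 * K ** 2 * 2.0 ** (-j * (2 * alpha + d)) > 2.0 ** (2 * alpha + d) / n:
        j += 1
    return j

n, c, K, alpha, d = 100_000, 0.5, 1.0, 2.0, 3       # hypothetical values
j_star = choose_j_star(n, c, K, alpha, d)
val = c ** 2 * K ** 2 * 2.0 ** (-j_star * (2 * alpha + d))
print(j_star, 1.0 / n <= val <= 2.0 ** (2 * alpha + d) / n)
```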

References

  1. Hallin, M.; Lu, Z.; Tran, L.T. Local linear spatial regression. Ann. Stat. 2004, 32, 2469–2500.
  2. Biau, G.; Cadre, B. Nonparametric spatial prediction. Stat. Inference Stoch. Process. 2004, 7, 327–349.
  3. Bentsen, L.; Warakagoda, N.D.; Stenbro, R.; Engelstad, P. Spatio-temporal wind speed forecasting using graph networks and novel Transformer architectures. Appl. Energy 2023, 333, 120565.
  4. Du, P.; Bai, X.; Tan, K.; Xue, Z.; Samat, A.; Xia, J.; Li, E.; Su, H.; Liu, W. Advances of four machine learning methods for spatial data handling: A review. J. Geovis. Spat. Anal. 2020, 4, 13.
  5. Farrell, A.; Wang, G.; Rush, S.A.; Martin, J.A.; Belant, J.L.; Butler, A.B.; Godwin, D. Machine learning of large-scale spatial distributions of wild turkeys with high-dimensional environmental data. Ecol. Evol. 2019, 9, 5938–5949.
  6. Nikparvar, B.; Thill, J.C. Machine learning of spatial data. ISPRS Int. J. Geo-Inf. 2021, 10, 600.
  7. Schmidt-Hieber, J. Nonparametric regression using deep neural networks with ReLU activation function. Ann. Stat. 2020, 48, 1875–1897.
  8. Wang, H.; Wu, Y.; Chan, E. Efficient estimation of nonparametric spatial models with general correlation structures. Aust. N. Z. J. Stat. 2017, 59, 215–233.
  9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  10. He, K.; Sun, J. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5353–5360.
  11. Cohen, A.; Daubechies, I.; Vial, P. Wavelets on the interval and fast wavelet transforms. Appl. Comput. Harmon. Anal. 1993, 1, 54–81.
  12. Cressie, N.; Wikle, C.K. Statistics for Spatio-Temporal Data; John Wiley & Sons: Hoboken, NJ, USA, 2015.
  13. Wang, H.X.; Lin, J.G.; Huang, X.F. Local modal regression for the spatio-temporal model. Sci. Sin. Math. 2021, 51, 615–630. (In Chinese)
  14. Telgarsky, M. Benefits of depth in neural networks. In Proceedings of the Conference on Learning Theory, PMLR, Hamilton, New Zealand, 16–18 November 2016; pp. 1517–1539.
  15. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. 2017, 94, 103–114.
  16. Györfi, L.; Kohler, M.; Krzyzak, A.; Walk, H. A Distribution-Free Theory of Nonparametric Regression; Springer: New York, NY, USA, 2002.
  17. Giné, E.; Koltchinskii, V.; Wellner, J.A. Ratio limit theorems for empirical processes. In Stochastic Inequalities and Applications; Birkhäuser: Basel, Switzerland, 2003; pp. 249–278.
  18. Hamers, M.; Kohler, M. Nonasymptotic bounds on the L2 error of neural network regression estimates. Ann. Inst. Stat. Math. 2006, 58, 131–151.
  19. Koltchinskii, V. Local Rademacher complexities and oracle inequalities in risk minimization. Ann. Stat. 2006, 34, 2593–2656.
  20. Massart, P. Concentration Inequalities and Model Selection: Ecole d'Eté de Probabilités de Saint-Flour XXXIII-2003; Springer: Berlin/Heidelberg, Germany, 2007.
  21. Wellner, J. Weak Convergence and Empirical Processes: With Applications to Statistics; Springer Science and Business Media: Berlin/Heidelberg, Germany, 2013.
  22. Tsybakov, A.B. Introduction to Nonparametric Estimation; Springer Series in Statistics; Springer: New York, NY, USA, 2009.
Figure 1. MSEs of the local linear regression method (dashed blue line) and the neural network method (solid orange line) for dimensions 3, 5, 8, and 10. (a) Three dimensions. (b) Five dimensions. (c) Eight dimensions. (d) Ten dimensions.
Figure 2. Comparison of MSEs between the two methods for Model 2.
Figure 3. Comparison of MSEs between the two methods for this case.
Table 1. MSE values of the two methods for Model 1 across different dimensions.

        Scenario 1          Scenario 2          Scenario 3          Scenario 4
n       NE       DNN        NE       DNN        NE       DNN        NE       DNN
200     0.1452   0.0363     0.9120   0.0695     4.4421   0.1119     7.5481   0.1520
600     0.0438   0.0150     0.2422   0.0280     1.3758   0.0488     2.7657   0.0655
1000    0.0242   0.0110     0.1343   0.0207     0.8347   0.0375     1.7616   0.0473
1400    0.0186   0.0097     0.0964   0.0187     0.6102   0.0336     1.2408   0.0414
1800    0.0146   0.0080     0.0755   0.0171     0.4768   0.0298     1.0533   0.0358
2200    0.0112   0.0075     0.0613   0.0153     0.4156   0.0258     0.9199   0.0357
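For readers who wish to reproduce a comparison of this kind, the following is a minimal sketch (not the paper's simulation code) of how MSE values like those in Table 1 can be produced for the network estimator; the trend function, dimension, noise level, network size, and sample sizes are illustrative assumptions, and the spatial dependence structure of the paper's experiments is not modeled here.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Simulate a nonparametric regression on [0,1]^d, fit a ReLU network,
# and evaluate the MSE of the fitted trend on an independent test sample.
rng = np.random.default_rng(6)
d = 5
m = lambda X: np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + np.prod(X[:, 2:4], axis=1)

for n in (200, 600, 1000):
    X = rng.uniform(0, 1, size=(n, d))
    y = m(X) + 0.5 * rng.standard_normal(n)                  # i.i.d. noise for simplicity
    net = MLPRegressor(hidden_layer_sizes=(64, 64), activation="relu",
                       max_iter=5000, random_state=0).fit(X, y)
    X_test = rng.uniform(0, 1, size=(5000, d))
    mse = np.mean((net.predict(X_test) - m(X_test)) ** 2)
    print(f"n={n:5d}  DNN MSE ~ {mse:.4f}")
```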
Table 2. The MSE values of both methods for Model 2.

n       200       600       1000      1400      1800      2200
NE      36.8606   32.4013   17.6990   14.0570   10.7740   5.1166
DNN     2.0305    2.0078    2.0074    2.0047    2.0041    2.0020
Table 3. The MSE values of both methods for this case.

n       50        100       150       200       250       300
NE      3.1912    2.2667    2.0146    1.8464    1.7926    1.7485
DNN     1.7958    1.5066    1.3923    1.3338    1.2983    1.2901