# Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations

## Abstract


## 1. Introduction

- **Q1.** Is the estimate in the previous line sharp?
- **Q2.** How efficiently can ReLU nets of a given width $w\ge w_{\mathrm{min}}(d)$ approximate a given continuous function of $d$ variables?

## 2. Statement of Results

**Theorem 1.**

- **1. ($f$ is continuous)** There exists a sequence of feed-forward neural nets $\mathcal{N}_k$ with ReLU activations, input dimension $d,$ hidden layer width $d+2,$ and output dimension $1,$ such that: $$\lim_{k\to\infty}\|f-f_{\mathcal{N}_k}\|_{C^0}=0.$$ In particular, $w_{\mathrm{min}}(d)\le d+2.$ Moreover, write $\omega_f$ for the modulus of continuity of $f,$ and fix $\epsilon>0.$ There exists a feed-forward neural net $\mathcal{N}_\epsilon$ with ReLU activations, input dimension $d,$ hidden layer width $d+3,$ output dimension $1,$ and: $$\mathrm{depth}(\mathcal{N}_\epsilon)=\frac{2\cdot d!}{\omega_f(\epsilon)^d},\qquad \|f-f_{\mathcal{N}_\epsilon}\|_{C^0}\le\epsilon.$$
- **2. ($f$ is convex)** There exists a sequence of feed-forward neural nets $\mathcal{N}_k$ with ReLU activations, input dimension $d,$ hidden layer width $d+1,$ and output dimension $1,$ such that: $$\lim_{k\to\infty}\|f-f_{\mathcal{N}_k}\|_{C^0}=0.$$ Hence, $w_{\mathrm{min}}^{\mathrm{conv}}(d)\le d+1.$ Further, there exists $C>0$ such that if $f$ is both convex and Lipschitz with Lipschitz constant $L,$ then the nets $\mathcal{N}_k$ in (8) can be taken to satisfy: $$\mathrm{depth}(\mathcal{N}_k)=k+1,\qquad \|f-f_{\mathcal{N}_k}\|_{C^0}\le CL\,d^{3/2}\,k^{-2/d}.$$
- **3. ($f$ is smooth)** There exist a constant $K$ depending only on $d$ and a constant $C$ depending only on the maximum of the first $K$ derivatives of $f$ such that for every $k\ge 3,$ the width $d+2$ nets $\mathcal{N}_k$ in (5) can be chosen so that: $$\mathrm{depth}(\mathcal{N}_k)=k,\qquad \|f-f_{\mathcal{N}_k}\|_{C^0}\le C\,(k-2)^{-1/d}.$$
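One way to build intuition for why width $d+1$ can suffice in the convex case is to note that a convex function can be approximated by a maximum of affine functions, and that such a maximum can be computed by a narrow ReLU net. The following is a minimal sketch under stated assumptions, not the article's own construction: inputs are assumed to lie in $[0,1]^d$ (so that ReLU acts as the identity on the $d$ neurons carrying $x$), and the names `max_affine_net`, `affine`, and `Affine` are illustrative. Each hidden layer uses $d$ neurons to pass $x$ forward and one extra neuron to update a running maximum via $\max(m,\ell)=\ell+\mathrm{ReLU}(m-\ell).$

```python
from typing import List, Sequence, Tuple

# An affine function x -> a.x + b, stored as (a, b). Illustrative type alias.
Affine = Tuple[Sequence[float], float]


def relu(t: float) -> float:
    return t if t > 0.0 else 0.0


def affine(piece: Affine, x: Sequence[float]) -> float:
    a, b = piece
    return sum(ai * xi for ai, xi in zip(a, x)) + b


def max_affine_net(pieces: List[Affine], x: Sequence[float]) -> float:
    """Forward pass of a width d+1 ReLU net computing max_j pieces[j](x).

    Assumes every coordinate of x is nonnegative (e.g. x in [0,1]^d), so the
    d "carry" neurons reproduce x exactly at every hidden layer.
    """
    # First hidden layer: d neurons copy x.
    carried = [relu(xi) for xi in x]
    if len(pieces) == 1:
        return affine(pieces[0], carried)
    # Extra neuron of the first hidden layer: ReLU(l_0(x) - l_1(x)),
    # so that l_1(x) + h equals max(l_0(x), l_1(x)).
    h = relu(affine(pieces[0], x) - affine(pieces[1], x))
    for j in range(2, len(pieces)):
        # The running max l_{j-1}(x) + h is affine in the previous layer,
        # so the next extra neuron is a ReLU of an affine map of that layer.
        h_next = relu(affine(pieces[j - 1], carried) + h - affine(pieces[j], carried))
        carried = [relu(ci) for ci in carried]
        h = h_next
    # Output layer: affine readout of the last hidden layer.
    return affine(pieces[-1], carried) + h
```

With $k$ affine pieces this sketch uses $k-1$ hidden layers of width $d+1$; the depth convention in the theorem may count layers differently.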

**Theorem 2.**

## 3. Relation to Previous Work

- Theorems 1 and 2 are “deep and narrow” analogs of the well-known “shallow and wide” universal approximation results (e.g., Cybenko [12] and Hornik-Stinchcombe-White [13]) for feed-forward neural nets. Those articles show that essentially any scalar function $f:[0,1]^d\to\mathbb{R}$ on the $d$-dimensional unit cube can be approximated arbitrarily well by a feed-forward neural net with a single hidden layer of arbitrary width. Such results hold for a wide class of nonlinear activations, but are not particularly illuminating from the point of view of understanding the expressive advantages of depth in neural nets.
- The results in this article complement the work of Liao-Mhaskar-Poggio [3] and Mhaskar-Poggio [5], who considered the advantages of depth for representing certain hierarchical or compositional functions by neural nets with both ReLU and non-ReLU activations. Their results (e.g., Theorem 1 in [3] and Theorem 3.1 in [5]) give bounds on the width for approximation both for shallow and certain deep hierarchical nets.
- Theorems 1 and 2 are also quantitative analogs of Corollary 2.2 and Theorem 2.4 in the work of Arora-Basu-Mianjy-Mukherjee [2]. Their results give bounds on the depth of a ReLU net needed to compute exactly a piecewise linear function of $d$ variables. However, except when $d=1,$ they do not obtain an estimate on the number of neurons in such a network and hence cannot bound the width of the hidden layers.
- Our results are related to Theorems II.1 and II.4 of Rolnick-Tegmark [16], which are themselves extensions of Lin-Rolnick-Tegmark [4]. Their results give lower bounds on the total size (number of neurons) of a neural net (with non-ReLU activations) that approximates sparse multivariable polynomials. Their bounds do not imply a control on the width of such networks that depends only on the number of variables, however.
- This work was inspired in part by questions raised in the work of Telgarsky [8,9,10]. In particular, in Theorems 1.1 and 1.2 of [8], Telgarsky constructed interesting examples of sawtooth functions that can be computed efficiently by deep width 2 ReLU nets but cannot be well approximated by shallower networks with a similar number of parameters.
- Theorems 1 and 2 are quantitative statements about the expressive power of depth without the aid of width. This topic, usually without considering bounds on the width, has been taken up by many authors. We refer the reader to [6,7] for several interesting quantitative measures of the complexity of functions computed by deep neural nets.
- Finally, we refer the reader to the interesting work of Yarotsky [11], which provides bounds on the total number of parameters in a ReLU net needed to approximate a given class of functions (mainly balls in various Sobolev spaces).
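The sawtooth functions mentioned in the list above can be illustrated with a short sketch. It uses the standard tent map $T(x)=2\,\mathrm{ReLU}(x)-4\,\mathrm{ReLU}(x-\tfrac12)$ on $[0,1]$, which is offered here as an assumption about the flavor of such examples and is not necessarily the exact construction in [8]. Composing $T$ with itself $k$ times needs only $k$ hidden layers of width 2 and produces a sawtooth with $2^{k-1}$ teeth. The function name `sawtooth` is illustrative.

```python
def relu(t: float) -> float:
    return t if t > 0.0 else 0.0


def sawtooth(x: float, k: int) -> float:
    """k-fold composition of the tent map on [0, 1], as a width-2 ReLU net.

    Each hidden layer holds two neurons, ReLU(t) and ReLU(t - 1/2), where t
    is an affine function of the previous layer's two activations.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    a, b = relu(x), relu(x - 0.5)          # first hidden layer
    for _ in range(k - 1):
        t = 2.0 * a - 4.0 * b              # tent map value, affine in (a, b)
        a, b = relu(t), relu(t - 0.5)      # next hidden layer
    return 2.0 * a - 4.0 * b               # affine output layer


# Peaks of the k = 3 sawtooth sit at odd multiples of 1/8.
peaks = [x / 8.0 for x in (1, 3, 5, 7)]
values_at_peaks = [sawtooth(x, 3) for x in peaks]
```

The number of linear pieces doubles with each added layer while the parameter count grows only linearly in $k$, which is the kind of depth advantage the cited results quantify.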

## 4. Proof of Theorem 2

**Proof of Theorem 2.**

**Lemma 1.**

**Proof.**

**Lemma 2.**

**Proof.**

## 5. Proof of Theorem 1

**Proof of Theorem 1.**

**Proposition 1.**

**Lemma 3.**

**Proof.**

## 6. Conclusions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Bengio, Y.; Hinton, G.; LeCun, Y. Deep learning. Nature **2015**, 521, 436–444.
2. Arora, R.; Basu, A.; Mianjy, P.; Mukherjee, A. Understanding deep neural networks with Rectified Linear Units. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
3. Liao, Q.; Mhaskar, H.; Poggio, T. Learning functions: When is deep better than shallow. arXiv **2016**, arXiv:1603.00988v4.
4. Lin, H.; Rolnick, D.; Tegmark, M. Why does deep and cheap learning work so well? arXiv **2016**, arXiv:1608.08225v3.
5. Mhaskar, H.; Poggio, T. Deep vs. shallow networks: An approximation theory perspective. Anal. Appl. **2016**, 14, 829–848.
6. Poole, B.; Lahiri, S.; Raghu, M.; Sohl-Dickstein, J.; Ganguli, S. Exponential expressivity in deep neural networks through transient chaos. Adv. Neural Inf. Process. Syst. **2016**, 29, 3360–3368.
7. Raghu, M.; Poole, B.; Kleinberg, J.; Ganguli, S.; Sohl-Dickstein, J. On the expressive power of deep neural nets. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 2847–2854.
8. Telgarsky, M. Representation benefits of deep feedforward networks. arXiv **2015**, arXiv:1509.08101.
9. Telgarsky, M. Benefits of depth in neural nets. In Proceedings of the JMLR: Workshop and Conference Proceedings, New York, NY, USA, 19 June 2016; Volume 49, pp. 1–23.
10. Telgarsky, M. Neural networks and rational functions. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 3387–3393.
11. Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Netw. **2017**, 94, 103–114.
12. Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control. Signals Syst. (MCSS) **1989**, 2, 303–314.
13. Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. J. Neural Netw. **1989**, 2, 359–366.
14. Hanin, B.; Sellke, M. Approximating Continuous Functions by ReLU Nets of Minimal Width. arXiv **2017**, arXiv:1710.11278.
15. Balázs, G.; György, A.; Szepesvári, C. Near-optimal max-affine estimators for convex regression. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 9–12 May 2015; Volume 38, pp. 56–64.
16. Rolnick, D.; Tegmark, M. The power of deeper networks for expressing natural functions. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
17. Shang, Y. A combinatorial necessary and sufficient condition for cluster consensus. Neurocomputing **2016**, 216, 611–616.
18. Mossel, E. Mathematical Aspects of Deep Learning. Available online: http://elmos.scripts.mit.edu/mathofdeeplearning/mathematical-aspects-of-deep-learning-intro/ (accessed on 10 September 2019).

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hanin, B.
Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations. *Mathematics* **2019**, *7*, 992.
https://doi.org/10.3390/math7100992
