# A Natural Gradient Algorithm for Stochastic Distribution Systems


## Abstract


## 1. Introduction


## 2. Model Description

The stochastic noise ω is assumed to have a known PDF $p_{\omega}(x)$. Therefore, the SDCSs can be expressed as:

$$y_k = f(u_k, \omega_k), \qquad (1)$$

where $u_k \in \mathbb{R}^n$ is the control input and $y_k \in \mathbb{R}^1$ is the output (see Figure 2). The shape of the output PDF is determined by the input $u_k$. Thus, according to $p_{\omega}(x)$ and Equation (1), the output PDFs of the system can be expressed by:

$$p(y; u) = p_{\omega}\big(f^{-1}(y, u)\big)\left|\frac{\partial f^{-1}(y, u)}{\partial y}\right|.$$

For instance, if $y = u^2 + \omega$, then the output PDF can be obtained as:

$$p(y; u) = p_{\omega}(y - u^2).$$
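As a sanity check on this change-of-variables formula, the sketch below compares $p(y; u) = p_{\omega}(y - u^2)$ with a histogram of simulated outputs. The zero-mean Gaussian noise PDF and the concrete input value are illustrative assumptions, not choices made in the paper:

```python
import numpy as np

def p_omega(x, sigma=1.0):
    """Assumed noise PDF: zero-mean Gaussian with standard deviation sigma."""
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def output_pdf(y, u):
    """p(y; u) = p_omega(f^{-1}(y, u)) |d f^{-1}/dy| = p_omega(y - u^2)
    for the example map y = f(u, omega) = u^2 + omega."""
    return p_omega(y - u**2)

# Monte Carlo check: histogram of simulated outputs vs. the formula.
rng = np.random.default_rng(0)
u = 0.7
y_samples = u**2 + rng.normal(0.0, 1.0, size=200_000)

hist, edges = np.histogram(y_samples, bins=60, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
max_err = np.max(np.abs(hist - output_pdf(centers, u)))
print(f"max |histogram - formula| = {max_err:.3f}")
```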

The following assumptions are made:

- (1) The inverse function of y = f(u, ω) with respect to ω exists and is denoted by $\omega = f^{-1}(y, u)$, which is at least $C^2$ with respect to all variables (y, u).
- (2) The output PDF p(y; u) is at least $C^2$ with respect to all variables (y, u).

The control objective is to find an optimal control input $u_*$, so that $p(y; u_*)$ is as close as possible to the target PDF h(y). To formulate it in the frame of information geometry, we first define the relevant statistical manifold.
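The closeness of the output PDF to the target will be measured by the Kullback–Leibler divergence. A minimal numerical sketch of that measure follows; the two densities used here (exponentials on a truncated grid) are illustrative assumptions:

```python
import numpy as np

def kl_divergence(h, p, y):
    """Numerical KL divergence D(h || p) = integral of h log(h/p) dy on a grid y."""
    dy = y[1] - y[0]
    mask = h > 0  # 0 * log(0) is taken as 0
    return np.sum(h[mask] * np.log(h[mask] / p[mask])) * dy

y = np.linspace(1e-3, 10, 4000)
h = np.exp(-y)              # illustrative target: exponential with rate 1
p = 0.5 * np.exp(-0.5 * y)  # illustrative "output": exponential with rate 1/2

print(kl_divergence(h, h, y))  # ≈ 0 for identical densities
print(kl_divergence(h, p, y))  # strictly positive otherwise
```

For these two exponentials the divergence has the closed form log 2 − 1/2, which the grid computation recovers to a few decimal places.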

#### Definition 1

The family of output PDFs $S = \{p(y; u)\}$ is called the statistical manifold of the system, in which $u = (u^1, \dots, u^n)^T \in \mathbb{R}^n$ plays the role of a coordinate system for S. Thus, S is an n-dimensional manifold.

#### Definition 2 ([5,6,24])

The Kullback–Leibler divergence of two PDFs p and q on S is defined by:

$$D(p \,\|\, q) = \int p(y) \log \frac{p(y)}{q(y)}\, dy.$$

For two nearby points, it agrees with half the squared Riemannian distance when terms of order $o(\|\Delta u\|^{3})$ are neglected. Thus, the Kullback–Leibler divergence is a distance-like measure of two points on a statistical manifold and has been widely applied, for example, in information theory. Here, $(g_{ij})$ is the Fisher metric equipped on manifold S, whose components are expressed as:

$$g_{ij}(u) = \int p(y; u)\, \frac{\partial \log p(y; u)}{\partial u^i}\, \frac{\partial \log p(y; u)}{\partial u^j}\, dy.$$
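The Fisher metric components can be approximated directly from this integral. The sketch below (the Gaussian family with coordinates u = (μ, σ) is an assumed example, not the paper's system) recovers the well-known diagonal metric diag(1/σ², 2/σ²):

```python
import numpy as np

def fisher_metric(log_p, u, y, eps=1e-5):
    """Approximate g_ij(u) = ∫ p(y;u) ∂_i log p ∂_j log p dy on a grid y,
    using central finite differences for the partial derivatives."""
    n = len(u)
    dy = y[1] - y[0]
    p = np.exp(log_p(y, u))
    dlogp = []
    for i in range(n):
        up, um = u.copy(), u.copy()
        up[i] += eps
        um[i] -= eps
        dlogp.append((log_p(y, up) - log_p(y, um)) / (2 * eps))
    G = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            G[i, j] = np.sum(p * dlogp[i] * dlogp[j]) * dy
    return G

def gaussian_log_pdf(y, u):
    """Gaussian family with coordinates u = (mu, sigma) (assumed example)."""
    mu, sigma = u
    return -0.5 * np.log(2 * np.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

y = np.linspace(-20, 20, 8001)
u = np.array([0.0, 1.5])
G = fisher_metric(gaussian_log_pdf, u, y)
print(np.round(G, 4))  # close to diag(1/sigma^2, 2/sigma^2)
```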

Taking the Kullback–Leibler divergence between the target PDF and the output PDF as the performance function,

$$J(u) = \int h(y) \log \frac{h(y)}{p(y; u)}\, dy,$$

the optimal control input $u_*$ minimizes J(u), namely, $u_* = \arg\min_{u} J(u)$.

## 3. Natural Gradient Algorithm

Let $\{\omega \in \mathbb{R}^{n}\}$ be a parameter space on which a function L is defined.

#### Lemma 1 ([15])

The steepest descent direction of L(ω) in a Riemannian space is given by:

$$-\widetilde{\nabla} L(\omega) = -G^{-1} \nabla L(\omega),$$

where $G^{-1} = (g^{ij})$ is the inverse of the Riemannian metric $G = (g_{ij})$ and ∇L(ω) is the ordinary gradient:

$$\nabla L(\omega) = \left(\frac{\partial L}{\partial \omega^1}, \dots, \frac{\partial L}{\partial \omega^n}\right)^T.$$
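Lemma 1 can be illustrated numerically: among all directions of equal Riemannian length, $G^{-1}\nabla L$ gives the fastest change of L. In the toy computation below, the metric G and the gradient are arbitrary illustrative values:

```python
import numpy as np

# Illustrative anisotropic Riemannian metric and ordinary gradient (assumed).
G = np.array([[4.0, 0.0],
              [0.0, 1.0]])
grad_L = np.array([2.0, 2.0])

# Natural gradient (Lemma 1): G^{-1} ∇L.
nat_grad = np.linalg.solve(G, grad_L)

# The steepest ascent rate among unit-length directions v (v^T G v = 1)
# is attained by v ∝ G^{-1} ∇L; compare against random unit directions.
v_star = nat_grad / np.sqrt(nat_grad @ G @ nat_grad)
best_rate = grad_L @ v_star

rng = np.random.default_rng(1)
rates = []
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.sqrt(v @ G @ v)  # normalize to unit Riemannian length
    rates.append(grad_L @ v)

print(f"natural-gradient rate {best_rate:.4f} vs best random rate {max(rates):.4f}")
```

Note that the natural direction $G^{-1}\nabla L = (0.5, 2)^T$ differs from the ordinary gradient $(2, 2)^T$: the metric rescales the components.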

#### Proposition 1

#### Proof

The components of the ordinary gradient $\nabla J(u)$, $\partial J(u)/\partial u^{i}$ (i = 1, 2, ..., n), are given by:

$$\frac{\partial J(u)}{\partial u^i} = -\int h(y)\, \frac{\partial \log p(y; u)}{\partial u^i}\, dy.$$

#### Theorem 1

The iterative formula of the control input is given by:

$$u_{k+1} = u_k - \varepsilon\, G_k^{-1} \nabla J(u_k), \qquad (7)$$

where $G_k = G|_{u = u_k}$, and ɛ is a sufficiently small positive constant, which determines the step size.

#### Proof

Let $P_k$ and $P_{k+1}$ be two close points on S corresponding to the functions $\log p(y; u_k)$ and $\log p(y; u_{k+1})$, whose coordinates are given by ${u}_{k}={({u}_{k}^{1},\dots ,{u}_{k}^{n})}^{T}$ and $u_{k+1} = u_k + \Delta u_k$, respectively, where ${u}_{k+1}={({u}_{k+1}^{1},\dots ,{u}_{k+1}^{n})}^{T}$, and $\mathrm{\Delta}{u}_{k}={(\mathrm{\Delta}{u}_{k}^{1},\dots ,\mathrm{\Delta}{u}_{k}^{n})}^{T}$. Therefore, our purpose is to formulate an iterative formula with respect to $u_{k+1}$. Assume that the vector $\overrightarrow{{P}_{k}{P}_{k+1}}\in {T}_{{P}_{k}}S$ has a fixed length, namely,

$$\Delta u_k^{T}\, G_k\, \Delta u_k = \varepsilon^{2},$$

measured with the Fisher metric on the tangent space ${T}_{{P}_{k}}S$ at $P_k$. We denote $\Delta u_k = \varepsilon\,\mathbf{a}$ with **a** $= (a^1, \dots, a^n)^T$, so that the tangent vector **v** $= \mathbf{a}$ satisfies:

$$\mathbf{a}^{T} G_k\, \mathbf{a} = 1,$$

where $G_k$ means $G|_{u = u_k}$.

Expanding $\log p(y; u_{k+1})$ around $\log p(y; u_k)$ between the sample times k and k + 1, the following equation holds approximately:

$$\log p(y; u_{k+1}) \approx \log p(y; u_k) + \varepsilon\, \mathbf{a}^{T} \nabla_u \log p(y; u_k),$$

in which $\nabla J(u_k)$ is known at the sample time k + 1. Here, **a** $= (a^1, \dots, a^n)^T$ should be selected such that the following performance function:

$$\xi(\mathbf{a}) = J(u_k) + \varepsilon\, \nabla J(u_k)^{T}\, \mathbf{a} + \frac{\lambda}{2}\, \mathbf{a}^{T} G_k\, \mathbf{a} \qquad (14)$$

is minimized, where the second term measures the first-order change of $J(u_k)$ at the sample time k, while the third term is a natural quadratic constraint for **a** $= (a^1, \dots, a^n)^T$. Then, the optimal vector **a** can be obtained as:

$$\mathbf{a} = -\frac{\varepsilon}{\lambda}\, G_k^{-1} \nabla J(u_k), \qquad (15)$$

which, after absorbing the constants into the step size, yields the iterative formula (7) for **a**. Now, let us consider the sufficient condition of Equation (15) to minimize the performance function (14). The Hessian of $\xi(a^1, \dots, a^n)$ with respect to the vector **a** $= (a^1, \dots, a^n)^T$ is given by:

$$\frac{\partial^{2} \xi}{\partial \mathbf{a}\, \partial \mathbf{a}^{T}} = \lambda\, G_k.$$

Since the Fisher metric $G_k$ is positive definite, the Hessian matrix is positive definite. This guarantees that the vector **a** in the form of Equation (15) minimizes the performance function (14) naturally. □

- (1) Initialize $u_0$.
- (2) At the sample time k − 1, formulate $\nabla J(u_{k-1})$ and use Equation (1) to give the inverse ${G}_{k-1}^{-1}$ of the Fisher metric $G_{k-1}$.
- (3) Compute $u_k$ by the iterative formula in Equation (7).
- (4) If $J(u_k) < \delta$, where δ is a positive constant determined by the precision needed, stop; at the sample time k, the output PDF $p(y; u_k)$ is the final one. If not, turn to Step 5.
- (5) Increase k by one and go back to Step 2.
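Steps (1)–(5) can be sketched end-to-end. In the snippet below, the output family is taken to be Gaussian with u = (μ, σ) purely for illustration, since then J(u), its gradient and the Fisher metric all have closed forms; the target parameters, step size ε and tolerance δ are likewise assumptions, not values from the paper:

```python
import numpy as np

# Target PDF h(y): Gaussian with (assumed) parameters.
M_STAR, S_STAR = 1.0, 0.8

def J(u):
    """KL divergence D(h || p(.; u)) for Gaussian h and Gaussian p(.; u)."""
    mu, sigma = u
    return (np.log(sigma / S_STAR)
            + (S_STAR**2 + (M_STAR - mu)**2) / (2 * sigma**2) - 0.5)

def grad_J(u):
    """Ordinary gradient of J with respect to u = (mu, sigma)."""
    mu, sigma = u
    return np.array([(mu - M_STAR) / sigma**2,
                     1 / sigma - (S_STAR**2 + (M_STAR - mu)**2) / sigma**3])

def fisher_inv(u):
    """Inverse Fisher metric of the Gaussian family: diag(sigma^2, sigma^2/2)."""
    _, sigma = u
    return np.diag([u[1]**2, u[1]**2 / 2])

u = np.array([0.7, 2.5])   # Step (1): initialize u_0
eps, delta = 0.6, 1e-8     # step size and stopping tolerance (assumed)
for k in range(1000):
    if J(u) < delta:       # Step (4): stop at the required precision
        break
    # Steps (2)-(3): natural gradient update u_{k+1} = u_k - eps G_k^{-1} grad J
    u = u - eps * fisher_inv(u) @ grad_J(u)

print(f"stopped at k = {k}, u = {np.round(u, 4)}, J(u) = {J(u):.2e}")
```

With these choices the iterates converge to the target parameters (μ, σ) = (1.0, 0.8) in a few dozen steps.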

## 4. Convergence of the Algorithm

#### Lemma 2

Let $f: M \to \mathbb{R}^{n}$ be a continuous mapping on a compact set D of M and the set Ω = {x ∈ D | f(x) = 0} be finite. If the sequence ${\{{x}^{m}\}}_{m=1}^{\infty}\subset D$ satisfies:

$$\lim_{m \to \infty} f(x^{m}) = 0 \quad \text{and} \quad \lim_{m \to \infty} \|x^{m+1} - x^{m}\| = 0,$$

then there exists an $x_* \in \Omega$, such that:

$$\lim_{m \to \infty} x^{m} = x_*.$$

#### Proof

Write $\Omega = \{a^{1}, \dots, a^{s}\}$. First, suppose that for some $\varepsilon_0 > 0$, we have that for arbitrary K > 0, there exists an m > K, such that ${x}^{m}\notin \underset{i=1}{\overset{s}{\cup}}B({a}^{i},{\varepsilon}_{0})$; then, for K = 1, we get an $m_1 > 1$ satisfying ${x}^{{m}_{1}}\notin \underset{i=1}{\overset{s}{\cup}}B({a}^{i},{\varepsilon}_{0})$. Moreover, for $K = m_1$, we get an $m_2 > m_1$, such that ${x}^{{m}_{2}}\notin \underset{i=1}{\overset{s}{\cup}}B({a}^{i},{\varepsilon}_{0})$. Following this way, we get a subsequence $\{x^{m_j}\}$ of $\{x^{m}\}$, satisfying ${x}^{{m}_{j}}\notin \underset{i=1}{\overset{s}{\cup}}B({a}^{i},{\varepsilon}_{0})$ for arbitrary j.

Since D is compact, $\{x^{m_j}\}$ must have a convergent subsequence $\{x^{m_{j_i}}\}$, namely, $x^{m_{j_i}} \to x_*$ as i → ∞, and since f is continuous with $f(x^{m}) \to 0$, we get $f(x_*) = 0$, so that $x_* \in \Omega$ according to the process of the conclusion above. This contradicts the construction of the subsequence, so the sequence eventually enters the union of the balls. Without loss of generality, take $\varepsilon_0$ so small that $B(a^{i}, \varepsilon_0) \cap B(a^{j}, \varepsilon_0) = \varnothing$, when i ≠ j.

It remains to show that there exists a K, such that $x^{m} \in B(x_*, \varepsilon)$ for any m ≥ K. On the one hand, there exists a $K_1 > 0$, such that:

$${x}^{m}\in \underset{i=1}{\overset{s}{\cup}}B({a}^{i},\varepsilon) \qquad (16)$$

for any m ≥ $K_1$. Meanwhile, we also have:

$$\inf \|\alpha - \beta\| > 0, \qquad (17)$$

where $\alpha \in B(a^{i}, \varepsilon)$ and $\beta \in B(a^{j}, \varepsilon)$ are arbitrary, when i ≠ j. On the other hand, since $\|x^{m+1} - x^{m}\| \to 0$, for this $\varepsilon_0 > 0$, there exists a $K_2 > 0$, such that:

$$\|x^{m+1} - x^{m}\| < \inf \|\alpha - \beta\| \qquad (19)$$

for any m ≥ $K_2$.

Take $\overline{K} = \max\{K_1, K_2, m_L\}$; then, we set $K=\underset{k}{\text{min}}\{{m}_{k}\mid {m}_{k}\ge \overline{K}\}$, so that $x^{K} \in \{x^{m_k}\}$.

Finally, we proceed by induction. Suppose $x^{N} \in B(x_*, \varepsilon)$; then, when m = N + 1, we see that $x^{N+1}$ should be contained in the union of the s open balls from Equation (16), while from Equations (17) and (19), we also get that $x^{N}$ and $x^{N+1}$ must be in the same ball, i.e., $x^{N+1} \in B(x_*, \varepsilon)$. □

#### Lemma 3

Let J(u) be at least $C^{2}$ with respect to u. For an initial value $u_0 \in \mathbb{R}^{n}$, suppose that the level set $L = \{u \in \mathbb{R}^{n} \mid J(u) \le J(u_0)\}$ is compact. The sequence $\{u_k\}$ in Equation (7) has the following property: for a certain $k_0$, either ${G}_{{k}_{0}}^{-1}\nabla J({u}_{{k}_{0}})=0$ or, when k → ∞, ${G}_{k}^{-1}\nabla J({u}_{k})\to 0$, where ${G}_{k}^{-1}$ is the inverse of the Fisher metric $G_k$.

#### Proof

Denote $c_k = G_k^{-1} \nabla J(u_k)$ and suppose that $c_k \ne 0$ for any sample time k. Now, let us give a proof by contradiction. Suppose that, when k → ∞, $c_k \to 0$ does not hold; that is, there exists an $\varepsilon_0 > 0$, so that the norm of $c_k$ satisfies:

$$\|c_k\| \ge \varepsilon_0.$$

By the Taylor expansion, for a step α > 0,

$$J(u_k - \alpha c_k) = J(u_k) - \alpha\, \nabla J(u_k)^{T} c_k + \frac{\alpha^{2}}{2}\, c_k^{T}\, \nabla^{2} J(v_k)\, c_k,$$

where $c(u) = G^{-1}(u) \nabla J(u)$, and $v_k$ lies on the segment between $u_k$ and $u_k - \alpha c_k$. For α small enough,

$$\|u_k - \alpha c_k - u_k\| = \|\alpha c_k\| \le \beta,$$

and the first-order term dominates, since $\nabla J(u_k)^{T} c_k = \nabla J(u_k)^{T} G_k^{-1} \nabla J(u_k) > 0$ by the positive definiteness of $G_k^{-1}$. Hence, $J(u_k) - J(u_{k-1}) < 0$, i.e., $\{J(u_k)\}$ is monotone decreasing with respect to k.

Since the level set L is compact, $\{J(u_k)\}$ is bounded below, so $\lim_{k \to \infty} J(u_k)$ exists, namely, $J(u_k) - J(u_{k-1}) \to 0$. This contradicts the uniform decrease implied by $\|c_k\| \ge \varepsilon_0$, which completes the proof. □

#### Theorem 2

Suppose the conditions of Lemma 3 hold and the set $\Omega = \{u \in L \mid G^{-1}(u) \nabla J(u) = 0\}$ is finite. Then, for the sequence $\{u_k\}$ generated by Equation (7), there exists a $u_* \in \Omega$, such that:

$$\lim_{k \to \infty} u_k = u_*.$$

#### Proof

Apply Lemma 2 to the continuous mapping $f(u) = G^{-1}(u) \nabla J(u)$ on the compact level set L. By Lemma 3, $f(u_k) \to 0$, and by Equation (7), $\|u_{k+1} - u_k\| = \varepsilon\, \|f(u_k)\| \to 0$. Hence, the conclusion follows. □

## 5. Simulations

Consider an SDCS of the form of Equation (1) whose output satisfies $y_k \in [0, +\infty)$, where (μ, σ) is the input vector. Here, the stochastic noise $\omega_k$ is a random process with a given PDF $p_{\omega}(x)$. The initial input is taken as $u_0 = (\mu_0, \sigma_0)^{T} = (0.7, 2.5)^{T}$. The weights ɛ and λ are taken as 0.6 and 0.8, respectively. As a result, the response of the output PDFs is shown in Figure 3, in which y denotes the output of the system, p(y; μ, σ) denotes the PDF of the output y and k denotes the sample time.

Note that the distance $\underset{x\in A,\, y\in B}{\inf}\, d(x, y)$ between two disjoint sets A and B may be larger than zero. Actually, in our simulation, the target PDF is in the set of second-order polynomials, while the PDF p(y; μ, σ) of the output y is exponential. Therefore, a non-zero steady-state error persists at all times.
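This kind of steady-state error can be reproduced in a small experiment; all concrete choices below (a second-order polynomial target on [0, 3] and a one-parameter exponential output family, whose Fisher information is 1/λ²) are illustrative assumptions, not the paper's setup. The natural gradient iteration drives λ to the KL-optimal value 1/E_h[y], yet the residual divergence stays strictly positive because the family cannot represent the target:

```python
import numpy as np

# Target PDF: a normalized second-order polynomial on [0, 3] (assumed).
y = np.linspace(1e-6, 3 - 1e-6, 6000)
dy = y[1] - y[0]
h = y * (3 - y) / 4.5            # integrates to 1 on [0, 3]
m = np.sum(y * h) * dy           # mean of the target, E_h[y] = 1.5

# One-parameter exponential output family p(y; lam) = lam exp(-lam y).
# Here dJ/dlam = m - 1/lam and the Fisher information is 1/lam^2, so the
# natural gradient step is lam - eps * lam^2 * (m - 1/lam).
lam, eps = 0.2, 0.3
for _ in range(200):
    lam = lam - eps * lam**2 * (m - 1 / lam)

p = lam * np.exp(-lam * y)
steady_kl = np.sum(h * np.log(h / p)) * dy
print(f"lam* = {lam:.4f} (vs 1/E_h[y] = {1/m:.4f}), steady KL = {steady_kl:.4f}")
```

Even at the optimum λ* = 1/E_h[y], the divergence settles at a strictly positive value, mirroring the non-zero steady error observed in Figure 3.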

## 6. Conclusions

- (1) By the statistical characterization of stochastic distribution control systems, we formulate the controller design in the frame of information geometry. By virtue of the natural gradient, a steepest descent algorithm is proposed.
- (2) The convergence of the obtained algorithm is proven.
- (3) An example is discussed in detail to demonstrate our algorithm.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Rao, C.R. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc
**1945**, 37, 81–91. [Google Scholar] - Efron, B. Defining the curvature of a statistical problem. Ann. Stat
**1975**, 3, 1189–1242. [Google Scholar] - Efron, B. The geometry of exponential families. Ann. Stat
**1978**, 6, 362–376. [Google Scholar] - Chentsov, N.N. Statistical Decision Rules and Optimal Inference; AMS: Providence, RI, USA, 1982. [Google Scholar]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000. [Google Scholar]
- Amari, S. Differential Geometrical Methods in Statistics; Springer-Verlag: Berlin/Heidelberg, Germany, 1990. [Google Scholar]
- Amari, S. Information geometry of the EM and em algorithm for neural networks. Neural Netw
**1995**, 8, 1379–1408. [Google Scholar] - Amari, S.; Kurata, K.; Nagaoka, H. Information geometry of Boltzmann machines. IEEE Trans. Neural Netw
**1992**, 3, 260–271. [Google Scholar] - Amari, S. Differential geometry of a parametric family of invertible linear systems-Riemannian metric, dual affine connections, and divergence. Math. Syst. Theory
**1987**, 20, 53–83. [Google Scholar] - Zhang, Z.; Sun, H.; Zhong, F. Natural gradient-projection algorithm for distribution control. Optim. Control Appl. Methods
**2009**, 30, 495–504. [Google Scholar] - Zhong, F.; Sun, H.; Zhang, Z. An information geometry algorithm for distribution control. Bull. Braz. Math. Soc
**2008**, 39, 1–10. [Google Scholar] - Zhang, Z.; Sun, H.; Peng, L. Natural gradient algorithm for stochastic distribution systems with output feedback. Differ. Geom. Appl
**2013**, 31, 682–690. [Google Scholar] - Peng, L.; Sun, H.; Sun, D.; Yi, J. The geometric structures and instability of entropic dynamical models. Adv. Math
**2011**, 227, 459–471. [Google Scholar] - Peng, L.; Sun, H.; Xu, G. Information geometric characterization of the complexity of fractional Brownian motions. J. Math. Phys
**2012**, 53, 123305. [Google Scholar] - Amari, S. Natural gradient works efficiently in learning. Neural Comput
**1998**, 10, 251–276. [Google Scholar] - Amari, S. Natural gradient learning for over- and under-complete bases in ICA. Neural Comput
**1999**, 11, 1875–1883. [Google Scholar] - Park, H.; Amari, S.; Fukumizu, K. Adaptive natural gradient learning algorithms for various stochastic models. Neural Netw
**2000**, 13, 755–764. [Google Scholar] - Guo, L.; Wang, H. Stochastic Distribution Control System Design: A Convex Optimization Approach; Springer: London, UK, 2010. [Google Scholar]
- Wang, H. Control of conditional output probability density functions for general nonlinear and non-Gaussian dynamic stochastic systems. IEE Proc. Control Theory Appl
**2003**, 150, 55–60. [Google Scholar] - Guo, L.; Wang, H. Minimum entropy filtering for multivariate stochastic systems with non-Gaussian noises. IEEE Trans. Autom. Control
**2006**, 51, 695–700. [Google Scholar] - Wang, A.; Afshar, P.; Wang, H. Complex stochastic systems modelling and control via iterative machine learning. Neurocomputing
**2008**, 71, 2685–2692. [Google Scholar] - Dodson, C.T.J.; Wang, H. Iterative approximation of statistical distributions and relation to information geometry. Stat. Inference Stoch. Process
**2001**, 4, 307–318. [Google Scholar] - Wang, A.; Wang, H.; Guo, L. Recent Advances on Stochastic Distribution Control: Probability Density Function Control. Proceedings of the CCDC 2009: Chinese Control and Decision Conference, Guilin, China, 17–19 June 2009. [CrossRef]
- Sun, H.; Peng, L.; Zhang, Z. Information geometry and its applications. Adv. Math. (China)
**2011**, 40, 257–269. (In Chinese) [Google Scholar]

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Zhang, Z.; Sun, H.; Peng, L.; Jiu, L.
A Natural Gradient Algorithm for Stochastic Distribution Systems. *Entropy* **2014**, *16*, 4338-4352.
https://doi.org/10.3390/e16084338
