# Black-Box Optimization Using Geodesics in Statistical Manifolds

## Abstract


## 1. Introduction

Let ${({P}_{\theta})}_{\theta \in \mathrm{\Theta}}$ be a family of probability distributions (which will be given a Riemannian manifold structure, following [2]) on X, and let ${P}_{{\theta}^{0}}$ be an initial probability distribution. Now, we replace f by F: Θ → ℝ (for example $F(\theta )={E}_{x\sim{P}_{\theta}}[f(x)]$), and we optimize F by gradient descent, corresponding to the gradient flow:

## 2. Definitions: IGO, GIGO

- The gradient depends on the parametrization of our space of probability distributions (see Section 2.3 for an example).
- The equation is not invariant under monotone transformations of f. For example, the optimization for 10f moves ten times faster than the optimization for f.
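The second point can be illustrated with a one-line computation (vanilla gradient descent on f(x) = x²; the function, learning rate and starting point are purely illustrative):

```python
# Sketch: a gradient descent step on 10*f is ten times larger than the
# step on f, illustrating the lack of invariance under monotone
# transformations of f. (Illustrative example only.)

def grad_step(x, grad, lr=0.1):
    """One vanilla gradient descent step."""
    return x - lr * grad(x)

f_grad = lambda x: 2 * x          # gradient of f(x) = x**2
g_grad = lambda x: 10 * 2 * x     # gradient of 10*f

x0 = 1.0
step_f = x0 - grad_step(x0, f_grad)    # displacement under f
step_g = x0 - grad_step(x0, g_grad)    # displacement under 10*f

assert abs(step_g - 10 * step_f) < 1e-12
```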

#### 2.1. Invariance under Reparametrization of θ: Fisher Metric

**Definition 1.** Let P, Q be two probability distributions on X. The Kullback–Leibler divergence of Q from P is defined by:

If $\theta \mapsto {P}_{\theta}(x)$ is C^2, then a second-order expansion yields:

This metric endows the family ${({P}_{\theta})}_{\theta \in \mathrm{\Theta}}$ with a Riemannian manifold structure: a Riemannian manifold M is a differentiable manifold, which can be seen as pieces of ℝ^n glued together, with a metric. The metric at x is a symmetric positive-definite quadratic form on the tangent space of M at x: it indicates how expensive it is to move in a given direction on the manifold. We will think of the updates of the algorithms that we will be studying as paths on M.
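The second-order expansion above can be checked numerically for one-dimensional Gaussians, whose Fisher information in the (μ, σ) parametrization is I(μ, σ) = diag(1/σ², 2/σ²) (a standard fact assumed here, not derived in the text):

```python
import math

# Numeric check of KL(P_{θ+dθ} || P_θ) ≈ (1/2) dθᵀ I(θ) dθ for
# one-dimensional Gaussians, with I(μ, σ) = diag(1/σ², 2/σ²).
# (Illustrative check; the closed-form KL below is a standard identity.)

def kl_gauss(mu0, s0, mu1, s1):
    """KL divergence KL(N(mu1, s1²) || N(mu0, s0²))."""
    return math.log(s0 / s1) + (s1**2 + (mu1 - mu0)**2) / (2 * s0**2) - 0.5

mu, s = 0.0, 1.0
dmu, ds = 1e-3, 1e-3
kl = kl_gauss(mu, s, mu + dmu, s + ds)
quad = 0.5 * (dmu**2 / s**2 + 2 * ds**2 / s**2)  # (1/2) dθᵀ I dθ
assert abs(kl - quad) / quad < 1e-2
```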

**Proposition 1.** Let ${({P}_{\theta})}_{\theta \in \mathrm{\Theta}}$ be a family of normal probability distributions: ${P}_{\theta}=N(\mu (\theta ),\mathrm{\Sigma}(\theta ))$. If μ and Σ are C^1, the Fisher metric is given by:

**Notation 1.** ${\mathbb{G}}_{d}$ is the manifold of Gaussian distributions in dimension d, equipped with the Fisher metric. ${\tilde{\mathbb{G}}}_{d}$ is the manifold of Gaussian distributions in dimension d with the covariance matrix proportional to the identity in the canonical basis of ℝ^d, equipped with the Fisher metric.

#### 2.2. IGO Flow, IGO Algorithm

In practice, at each step, values (x_1, …, x_N) are sampled from the distribution ${P}_{{\theta}^{t}}$, and the integral becomes a weighted sum over the samples, with rank-based weights ${\widehat{w}}_{i}$. Writing rk(x_i) = |{j, f(x_j) < f(x_i)}|, it can be proven (see [1]) that ${\mathrm{lim}}_{N\to \infty}N{\widehat{w}}_{i}={W}_{f}^{{\theta}^{t}}({x}_{i})$ (here again, we are assuming that there are no ties).
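A minimal sketch of these rank-based weights, assuming the definition ${\widehat{w}}_{i}=w\left(\frac{\mathrm{rk}({x}_{i})+1/2}{N}\right)/N$ from [1] (the selection scheme w used below is only an example):

```python
# Sketch of the rank-based IGO weights: each sample x_i receives weight
# w((rk(x_i) + 1/2) / N) / N, where rk(x_i) = |{j, f(x_j) < f(x_i)}|,
# assuming no ties. The selection scheme here (indicator of the best
# quarter) is only an example.

def rank_weights(fvals, w):
    """Return the IGO weights ŵ_i for samples with values fvals."""
    N = len(fvals)
    ranks = [sum(1 for g in fvals if g < f) for f in fvals]
    return [w((r + 0.5) / N) / N for r in ranks]

w = lambda q: 4.0 if q <= 0.25 else 0.0   # example selection scheme
fvals = [3.0, 1.0, 2.0, 4.0]              # f(x_i) for N = 4 samples
weights = rank_weights(fvals, w)
assert weights == [0.0, 1.0, 0.0, 0.0]    # only the best sample is selected
```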

**Definition 1**. The IGO update associated with parametrization θ, sample size N, step size δt and selection scheme w is given by the following update rule:

Replacing f by F defined via the distributions ${P}_{\theta}$ and optimizing over θ using the natural gradient have already been discussed. For example, in the case of a function f defined on {0, 1}^n, IGO with the Bernoulli distributions yields the PBIL algorithm [9]. Another similar approach (stochastic relaxation) is given in [10]. For a continuous function, as we will see later, the IGO framework recovers several known rank-based natural gradient algorithms, such as pure rank-μ CMA-ES [11], xNES or SNES (Separable Natural Evolution Strategies) [12]. See [13] or [14] for other, not necessarily gradient-based, optimization algorithms on manifolds.

#### 2.3. Geodesic IGO

As noted above, the IGO update depends on the chosen parametrization: the difference between two IGO updates in different parametrizations is O(δt^2), whereas the difference between two vanilla gradient descents with different parametrizations is O(δt).

Consider, for instance, the parametrizations (μ, σ) and (μ, c = σ^2) of Gaussian distributions in dimension one; the resulting updates differ by O(δt^2). We suppose that the IGO speed for the first algorithm is $(\dot{\mu},\dot{\sigma})$. The corresponding IGO speed in the second parametrization is given by the identity $\dot{c}=2\sigma \dot{\sigma}$. Therefore, the first algorithm gives the standard deviation ${\sigma}_{\mathrm{new},1}={\sigma}_{\mathrm{old}}+\delta t\dot{\sigma}$ and the variance ${c}_{\mathrm{new},1}={({\sigma}_{\mathrm{new},1})}^{2}={c}_{\mathrm{old}}+2\delta t{\sigma}_{\mathrm{old}}\dot{\sigma}+\delta {t}^{2}{\dot{\sigma}}^{2}={c}_{\mathrm{new},2}+\delta {t}^{2}{\dot{\sigma}}^{2}$.
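The computation above, numerically (the values of δt, σ and σ̇ are arbitrary):

```python
# Numeric illustration of the O(δt²) discrepancy computed above: one step
# in the (μ, σ) parametrization versus one step in (μ, c = σ²), starting
# from the same IGO speed, with ċ = 2σσ̇.

dt = 0.1
sigma_old = 2.0
sigma_dot = 0.5

# Parametrization 1: update σ, then square.
c_new_1 = (sigma_old + dt * sigma_dot) ** 2
# Parametrization 2: update c = σ² directly, with ċ = 2σσ̇.
c_new_2 = sigma_old**2 + dt * 2 * sigma_old * sigma_dot

# The two updates differ by exactly δt²σ̇².
assert abs((c_new_1 - c_new_2) - dt**2 * sigma_dot**2) < 1e-12
```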

**Definition 2**(GIGO). The geodesic IGO update (GIGO) associated with sample size N, step size δt and selection scheme w is given by the following update rule:

i.e., the new parameter is the point reached by following the geodesic of Θ starting at θ^t, with initial speed Y, after a time δt. By definition, this update does not depend on the parametrization θ.

Notice that while the GIGO update is compatible with the IGO flow (when δt → 0 and N → ∞, a parameter θ^t updated according to the GIGO algorithm is a solution of Equation (9), the equation defining the IGO flow), it is not necessarily an IGO update. More precisely, the GIGO update is an IGO update if and only if the geodesics of Θ are straight lines for some parametrization (by Beltrami's theorem, this is equivalent to Θ having constant curvature).

## 3. Riemannian Geometry, Noether’s Theorem

#### 3.1. Riemannian Geometry

**Definition 3** (Motion in a Lagrangian system). Let M be a differentiable manifold, TM the set of tangent vectors on M (a tangent vector is identified by the point at which it is tangent and a vector in the tangent space) and $\begin{array}{r}\mathcal{L}:TM\to \mathbb{R}\\ (q,v)\mapsto \mathcal{L}(q,v)\end{array}$ a differentiable function called the Lagrangian function (in general, it could depend on t). A "motion in the Lagrangian system (M, $\mathcal{L}$) from x to y" is a map γ : [t_0, t_1] → M, such that:

- γ(t_0) = x;
- γ(t_1) = y;
- γ is a local extremum of the functional $$\mathrm{\Phi}(\gamma )={\displaystyle {\int}_{{t}_{0}}^{{t}_{1}}\mathcal{L}(\gamma (t),\dot{\gamma}(t))\mathrm{d}t,}$$ among all curves c : [t_0, t_1] → M, such that c(t_0) = x and c(t_1) = y.

For example, the length of a curve γ between γ(t_0) and γ(t_1) is the integral of $\sqrt{g(\dot{\gamma},\dot{\gamma})}$: a curve of shortest length is a motion in the Lagrangian system for curves satisfying γ(t_0) = x and γ(t_1) = y, and the corresponding Lagrangian function is $(q,v)\mapsto \sqrt{g(v,v)}$. However, any curve following the shortest trajectory will have minimum length. For example, if γ_1 : [a, b] → M is a curve of the shortest path, so is γ_2 : t ↦ γ_1(t^2): these two curves define the same trajectory in M, but they do not travel along this trajectory at the same speed. This leads us to the following definition:

**Definition 4** (Geodesics). Let I be an interval of ℝ and (M, g) be a Riemannian manifold. A curve γ: I → M is called a geodesic if for all ${t}_{0},{t}_{1}\in I$, $\gamma {|}_{[{t}_{0},{t}_{1}]}$ is a motion in the Lagrangian system (M, $\mathcal{L}$) from γ(t_0) to γ(t_1), where:

**Definition 5.** Let (M, g) be a Riemannian manifold. We call the exponential of M the application exp: TM → M defined by exp_x(v) = γ(1), where γ is the geodesic of M starting at x with initial speed v.

#### 3.2. Noether’s Theorem

**Definition 6.** Let h: M → M be a diffeomorphism. We say that the Lagrangian system (M, $\mathcal{L}$) admits the symmetry h if for any (q, v) ∈ TM,

**Theorem 1** (Noether's Theorem). If the Lagrangian system (M, $\mathcal{L}$) admits the one-parameter group of symmetries h^s: M → M, s ∈ ℝ, then the following quantity remains constant during motions in the system (M, $\mathcal{L}$). Namely,

## 4. GIGO in ${\tilde{\mathbb{G}}}_{d}$

**Proposition 2.** Let M be a Riemannian manifold; let d ∈ ℕ; let Φ be the Riemannian exponential of M^d; and let φ be the Riemannian exponential of M. We have:

Consequently, a GIGO update in a product manifold M^d using the samples (x_i) is equivalent to d separate one-dimensional GIGO updates using the same samples. Moreover, ${\mathbb{G}}_{1}\cong {\tilde{\mathbb{G}}}_{1}$, the geodesics of which are given below.

#### 4.1. Preliminaries: Poincaré Half-Plane, Hyperbolic Space

**Definition 7**(Poincaré half-plane). We call the “Poincaré half-plane” the Riemannian manifold:

**Proposition 3**(Geodesics of the Poincaré half-plane). The geodesics of the Poincaré half-plane are exactly the:

**Definition 8**(Hyperbolic space). We call the “hyperbolic space of dimension n” the Riemannian manifold:

The metric of the hyperbolic space does not depend on the coordinates x_i, so by Noether's theorem, its geodesics stay in a plane containing the direction y and the initial speed. The induced metric on this plane is the metric of the Poincaré half-plane. The geodesics are therefore given by the following proposition:

**Proposition 4** (Geodesics of the hyperbolic space). If γ : t ↦ (x_1(t), …, x_{n−1}(t), y(t)) = (x(t), y(t)) is a geodesic of $\mathscr{H}_n$, then there exist a, b, c, d ∈ ℝ, such that ad − bc = 1, and v > 0, such that $x(t)=x(0)+\frac{{\dot{x}}_{0}}{\Vert {\dot{x}}_{0}\Vert}\tilde{x}(t)$, $y(t)=\mathrm{Im}({\gamma}_{\mathbb{C}}(t))$, with $\tilde{x}(t)=\mathrm{Re}({\gamma}_{\mathbb{C}}(t))$ and:

#### 4.2. Computing the GIGO Update in ${\tilde{\mathbb{G}}}_{d}$

Applying Proposition 1 in the parametrization (μ, σ) ↦ N(μ, σ^2 I), we find:

**Proposition 5.**In${\tilde{\mathbb{G}}}_{d}$, the IGO speed Y is given by:

**Proof.** We recall that the IGO speed is defined by $Y={I}^{-1}({\theta}^{t}){\displaystyle {\sum}_{i=1}^{N}{\widehat{w}}_{i}\frac{\partial \mathrm{ln}{P}_{\theta}({x}_{i})}{\partial \theta}}$. Since ${P}_{\mu ,\sigma}(x)={(2\pi {\sigma}^{2})}^{-d/2}\mathrm{exp}(-\frac{{(x-\mu )}^{T}(x-\mu )}{2{\sigma}^{2}})$, we have:

**Theorem 2** (Geodesics of ${\tilde{\mathbb{G}}}_{d}$). If $\gamma :t\mapsto \mathcal{N}(\mu (t),\sigma {(t)}^{2}I)$ is a geodesic of ${\tilde{\mathbb{G}}}_{d}$, then there exist a, b, c, d ∈ ℝ, such that ad − bc = 1, and v > 0, such that $\mu (t)=\mu (0)+\sqrt{2d}\frac{{\dot{\mu}}_{0}}{\Vert {\dot{\mu}}_{0}\Vert}\tilde{r}(t)$, $\sigma (t)=\mathrm{Im}({\gamma}_{\mathbb{C}}(t))$, with $\tilde{r}(t)=\mathrm{Re}({\gamma}_{\mathbb{C}}(t))$ and:

In order to implement the GIGO update, one also needs the geodesic starting at an arbitrary point (μ_0, σ_0) with an initial speed $({\dot{\mu}}_{0},{\dot{\sigma}}_{0})$. This is a tedious but easy computation, the result of which is given in Proposition 17.
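Since ${\mathbb{G}}_{1}\cong {\tilde{\mathbb{G}}}_{1}$ reduces to the Poincaré half-plane, its geodesics can be sketched numerically. The equations below are derived from the half-plane metric (dx² + dy²)/y² (they are not quoted from the text), and the Noether invariant ẋ/y² comes from the translation symmetry x ↦ x + s:

```python
# Sketch: Euler integration of the Poincaré half-plane geodesic equations
# ẍ = 2ẋẏ/y and ÿ = (ẏ² − ẋ²)/y, checking the Noether invariant ẋ/y²
# (from x-translation invariance) along the motion.

def geodesic_step(x, y, vx, vy, h):
    """One Euler step of the half-plane geodesic equations."""
    ax = 2 * vx * vy / y
    ay = (vy**2 - vx**2) / y
    return x + h * vx, y + h * vy, vx + h * ax, vy + h * ay

x, y, vx, vy = 0.0, 1.0, 1.0, 0.0   # start at (0, 1) with horizontal speed
inv0 = vx / y**2                     # Noether invariant at t = 0
h, n = 1e-4, 10000                   # integrate up to t = 1
for _ in range(n):
    x, y, vx, vy = geodesic_step(x, y, vx, vy, h)

assert abs(vx / y**2 - inv0) < 1e-2  # invariant approximately conserved
assert 0 < y < 1.0                   # the half-circle bends downwards
```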

## 5. GIGO in ${\mathbb{G}}_{d}$

#### 5.1. Obtaining a First Order Differential Equation for the Geodesics of ${\mathbb{G}}_{d}$

**Theorem 3.** Let $\gamma :t\mapsto \mathcal{N}({\mu}_{t},{\mathrm{\Sigma}}_{t})$ be a geodesic of ${\mathbb{G}}_{d}$. Then, the following quantities do not depend on t:

**Proof**. This is a direct application of Noether’s theorem, with suitable groups of diffeomorphisms. By Proposition 1, the Lagrangian associated with the geodesics of ${\mathbb{G}}_{d}$ is:

Consider the diffeomorphisms ${h}_{{\mu}_{0},A}$ : (μ, Σ) ↦ (Aμ + μ_0, AΣA^T), with μ_0 ∈ ℝ^d and A ∈ GL_d(ℝ). A direct computation shows that the Lagrangian is invariant under ${h}_{{\mu}_{0},A}$ for any μ_0 ∈ ℝ^d, A ∈ GL_d(ℝ). In particular:

- Translations of the mean vector. For any i ∈ [1, d], let ${h}_{i}^{s}:(\mu ,\mathrm{\Sigma})\mapsto (\mu +s{e}_{i},\mathrm{\Sigma})$, where e_i is the i-th basis vector. We have $\frac{\mathrm{d}{h}_{i}^{s}}{\mathrm{d}s}{|}_{s=0}=({e}_{i},0)$, so by Noether's theorem, $$\frac{\partial \mathcal{L}}{\partial \dot{\theta}}({e}_{i},0)=2{\dot{\mu}}^{T}{\mathrm{\Sigma}}^{-1}{e}_{i}=2{e}_{i}^{T}{\mathrm{\Sigma}}^{-1}\dot{\mu}$$ is constant. That J_μ is an invariant immediately follows.
- Linear base changes. For any i, j ∈ [1, d], let ${h}_{i,j}^{s}:(\mu ,\mathrm{\Sigma})\mapsto (\mathrm{exp}(s{E}_{ij})\mu ,\mathrm{exp}(s{E}_{ij})\mathrm{\Sigma}\mathrm{exp}(s{E}_{ji}))$, where E_ij is the matrix with a one at position (i, j) and zeros elsewhere. We have: $$\frac{\mathrm{d}{h}_{i,j}^{s}}{\mathrm{d}s}{|}_{s=0}=({E}_{ij}\mu ,{E}_{ij}\mathrm{\Sigma}+\mathrm{\Sigma}{E}_{ji}).$$

The invariants constituting J_Σ in (30) are the (J_{ij}/2).

**Theorem 4** (GIGO-Σ). $t\mapsto \mathcal{N}({\mu}_{t},{\mathrm{\Sigma}}_{t})$ is a geodesic of ${\mathbb{G}}_{d}$ if and only if μ : t ↦ μ_t and Σ : t ↦ Σ_t satisfy the equations:

**Proof.** This is an immediate consequence of Theorem 3.

Notice that the difference between the GIGO update and the other IGO updates is O(δt^2), and the difference between two different implementations of the GIGO algorithm is O(h^2), where h is the Euler step size; it is easier to reduce the latter. Still, without a closed form for the geodesics of ${\mathbb{G}}_{d}$, the GIGO update is rather expensive to compute, but it can be argued that most of the computation time will still be the computation of the objective function f.
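As a sketch of this Euler approach, the first-order form implied by the invariants can be integrated directly: μ̇ = ΣJ_μ and Σ̇ = Σ(J_Σ − J_μμᵀ), with J_μ = Σ₀⁻¹μ̇₀ and J_Σ = Σ₀⁻¹(Σ̇₀ + μ̇₀μ₀ᵀ) computed once from the initial conditions (this first-order form is reconstructed from the proof of Theorem 5; step count and initial data below are arbitrary):

```python
import numpy as np

# Euler integration of the geodesic ODE of G_d in first-order form,
# using the conserved quantities J_μ and J_Σ of Theorem 3.

def gigo_geodesic_euler(mu0, sig0, dmu0, dsig0, t, n):
    """Integrate the geodesic of G_d up to time t with n Euler steps."""
    mu, sig = mu0.copy(), sig0.copy()
    j_mu = np.linalg.solve(sig0, dmu0)                       # Σ₀⁻¹ μ̇₀
    j_sig = np.linalg.solve(sig0, dsig0 + np.outer(dmu0, mu0))
    h = t / n
    for _ in range(n):
        dmu = sig @ j_mu
        dsig = sig @ (j_sig - np.outer(j_mu, mu))
        mu, sig = mu + h * dmu, sig + h * dsig
    return mu, sig

# One-dimensional sanity check with a fixed mean (μ̇₀ = 0): the geodesic
# reduces to Σ(t) = Σ₀ exp(t Σ₀⁻¹ Σ̇₀), cf. Proposition 13.
mu0 = np.array([0.0]); sig0 = np.array([[2.0]])
dmu0 = np.array([0.0]); dsig0 = np.array([[1.0]])
_, sig_t = gigo_geodesic_euler(mu0, sig0, dmu0, dsig0, t=1.0, n=20000)
assert abs(sig_t[0, 0] - 2.0 * np.exp(0.5)) < 1e-3
```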

The GIGO update can also be computed by maintaining a square root A of the covariance matrix (Σ = AA^T).

**Theorem 5** (GIGO-A). If μ : t ↦ μ_t and A : t ↦ A_t satisfy the equations:

**Proof.** This is a simple rewriting of Theorem 4: if we write Σ := AA^T, we find that J_μ and J_Σ are the same as in Theorem 4, and we have:

Moreover, Σ(J_Σ − J_μμ^T) is symmetric (since $\dot{\mathrm{\Sigma}}$ has to be symmetric). Therefore, we have $\dot{\mathrm{\Sigma}}=\mathrm{\Sigma}\left({J}_{\mathrm{\Sigma}}-{J}_{\mu}{\mu}^{T}\right)$, and the result follows. □

#### 5.2. Explicit Form of the Geodesics of ${\mathbb{G}}_{d}$ (from [5])

**Theorem 6.** Let $\left({\dot{\mu}}_{0},{\dot{\mathrm{\Sigma}}}_{0}\right)\in {T}_{\mathcal{N}(0,I)}{\mathbb{G}}_{d}$. The geodesic of ${\mathbb{G}}_{d}$ starting from $\mathcal{N}(0,I)$ with initial speed $\left({\dot{\mu}}_{0},{\dot{\mathrm{\Sigma}}}_{0}\right)$ is given by:

where G^− is a pseudo-inverse of G. Moreover, for all A ∈ GL_d(ℝ) and all μ_0 ∈ ℝ^d, the application:

**Corollary 1.** Let μ_0 ∈ ℝ^d, A ∈ GL_d(ℝ) and $({\dot{\mu}}_{0},{\dot{\mathrm{\Sigma}}}_{0})\in {T}_{\mathcal{N}({\mu}_{0},A{A}^{T})}{\mathbb{G}}_{d}$. The geodesic of ${\mathbb{G}}_{d}$ starting from $\mathcal{N}({\mu}_{0},A{A}^{T})$ with initial speed $({\dot{\mu}}_{0},{\dot{\mathrm{\Sigma}}}_{0})$ is given by:

where G^− is a pseudo-inverse of G. Notice that ch(G) is a function of G^2, and so are sh(G)G^− and G^−sh(G).

## 6. Comparing GIGO, xNES and Pure Rank-μ CMA-ES

#### 6.1. Definitions

#### 6.1.1. xNES

**Definition 9** (xNES algorithm). The xNES algorithm with sample size N, weights w_i and learning rates η_μ and η_Σ updates the parameters μ ∈ ℝ^d, A ∈ M_d(ℝ) with the following rule: At each step, N points x_1, …, x_N are sampled from the distribution $\mathcal{N}(\mu ,A{A}^{T})$. Without loss of generality, we assume f(x_1) < … < f(x_N). The parameter is updated according to:

where, writing z_i = A^{−1}(x_i − μ):

**Proposition 6** (xNES as IGO). The xNES algorithm with sample size N, weights w_i and learning rates η_μ = η_Σ = δt coincides with the IGO algorithm with sample size N, weights w_i, step size δt and in which, given the current position (μ_t, A_t), the set of Gaussians is parametrized by:

with δ ∈ ℝ^m and M ∈ Sym(ℝ^m), and where the x_i are sampled from $\mathcal{N}(\mu ,A{A}^{T})$.

**Proof.** Let us compute the IGO update in the parametrization ${\varphi}_{{\mu}_{t},{A}_{t}}$: we have δ^t = 0, M^t = 0, and by using Proposition 1, we can see that for this parametrization, the Fisher information matrix at (0, 0) is the identity matrix. The IGO update is therefore:

#### 6.1.2. Using a Square Root of the Covariance Matrix


Any update rule maintaining a square root A of the covariance matrix (Σ = AA^T: since we do not force A to be symmetric, the decomposition is not unique) has to satisfy the following condition: for a given initial speed, the covariance matrix Σ^{t+δt} after one step must depend only on Σ^t and not on the square root A^t chosen for Σ^t.

The xNES update satisfies this condition: if two square roots ${A}_{1}^{t}$ and ${A}_{2}^{t}$ of Σ^t are chosen (${A}_{1}^{t}{({A}_{1}^{t})}^{T}={A}_{2}^{t}{({A}_{2}^{t})}^{T}={\mathrm{\Sigma}}^{t}$), using the same samples x_i to compute the natural gradient update, then we will have ${\mathrm{\Sigma}}_{1}^{t+\delta t}={\mathrm{\Sigma}}_{2}^{t+\delta t}$. Using the definitions of Section 6.3, we have just shown that what we will call the "xNES trajectory" is well defined.

The xNES and GIGO-A updates are expressed in the space of parameters (μ, A) ∈ ℝ^n × GL_n(ℝ), which is too large: there exist infinitely many applications t ↦ A_t, such that a given curve $\gamma :t\mapsto \mathcal{N}({\mu}_{t},{\mathrm{\Sigma}}_{t})$ can be written $\gamma (t)=\mathcal{N}\left({\mu}_{t},{A}_{t}{A}_{t}^{T}\right)$. This is why Theorem 5 is simply an implication, whereas Theorem 4 is an equivalence.

Let A ∈ GL_d(ℝ) and v_A, ${v}_{A}^{\prime}$ be two infinitesimal updates of A. Since Σ = AA^T, the infinitesimal update of Σ corresponding to v_A (resp. ${v}_{A}^{\prime}$) is ${v}_{\mathrm{\Sigma}}=A{v}_{A}^{T}+{v}_{A}{A}^{T}$ (resp. ${v}_{\mathrm{\Sigma}}^{\prime}=A{{v}^{\prime}}_{A}^{T}+{v}_{A}^{\prime}{A}^{T}$).

It is easy to see that v_A and ${v}_{A}^{\prime}$ define the same direction for Σ (i.e., ${v}_{\mathrm{\Sigma}}={v}_{\mathrm{\Sigma}}^{\prime}$) if and only if AM^T + MA^T = 0, where $M={v}_{A}-{v}_{A}^{\prime}$. This is equivalent to A^{−1}M being antisymmetric.

For A ∈ GL_d(ℝ), let us denote by T_A the space of the matrices M such that A^{−1}M is antisymmetric or, in other words, T_A := {u ∈ M_d(ℝ), Au^T + uA^T = 0}. Having a subspace S_A in direct sum with T_A for all A is sufficient (but not necessary) to have a well-defined update rule. Namely, consider the (linear) application:

The kernel of this application is T_A, and therefore, if we have, for some U_A, U_A ⊕ T_A = M_d(ℝ), then its restriction to U_A is an isomorphism. Let v_Σ be an infinitesimal update of Σ. We choose the following update of A corresponding to v_Σ:

Any U_A, such that U_A ⊕ T_A = M_d(ℝ), is a reasonable choice to pick v_A for a given v_Σ. The choice S_A = {u ∈ M_d(ℝ), Au^T − uA^T = 0} has an interesting additional property; it is the orthogonal of T_A for the norm:

Indeed, we have T_A = {M ∈ M_d(ℝ), A^{−1}M antisymmetric} and S_A = {M ∈ M_d(ℝ), A^{−1}M symmetric}, and if M is symmetric and N is antisymmetric, then

**Proposition 7.** Let A ∈ M_n(ℝ). The v_A given by the xNES and GIGO-A algorithms lies in S_A = {u ∈ M_d(ℝ), Au^T − uA^T = 0}.

**Proof.** For xNES, let us write $\dot{\gamma}(0)=({\upsilon}_{\mu},{\upsilon}_{\mathrm{\Sigma}})$ and ${\upsilon}_{A}:=\frac{1}{2}A{G}_{M}$. We have ${A}^{-1}{\upsilon}_{A}=\frac{1}{2}{G}_{M}$, and therefore, forcing M (and G_M) to be symmetric in xNES is equivalent to ${A}^{-1}{\upsilon}_{A}={({A}^{-1}{\upsilon}_{A})}^{T}$, which can be rewritten as $A{\upsilon}_{A}^{T}={\upsilon}_{A}{A}^{T}$. For GIGO-A, Equation (40) shows that ${\mathrm{\Sigma}}_{t}({J}_{\mathrm{\Sigma}}-{J}_{\mu}{\mu}_{t}^{T})$ is symmetric, and with this fact in mind, Equation (42) shows that we have $A{\upsilon}_{A}^{T}={\upsilon}_{A}{A}^{T}$ (${\upsilon}_{A}$ being ${\dot{A}}_{t}$). □

#### 6.1.3. Pure Rank-μ CMA-ES

**Definition 10** (Pure rank-μ CMA-ES algorithm). The pure rank-μ CMA-ES algorithm with sample size N, weights w_i and learning rates η_μ and η_Σ is defined by the following update rule: At each step, N points x_1, …, x_N are sampled from the distribution $N(\mu ,\mathrm{\Sigma})$. Without loss of generality, we assume f(x_1) < … < f(x_N). The parameter is updated according to:

**Proposition 8** (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights w_i and learning rates η_μ = η_Σ = δt coincides with the IGO algorithm with sample size N, weights w_i, step size δt and the parametrization (μ, Σ).
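Definition 10's update equations are not reproduced above; the standard form of the pure rank-μ CMA-ES update in the (μ, Σ) parametrization is sketched below (sample values, weights and learning rates are arbitrary):

```python
import numpy as np

# Sketch of the pure rank-μ CMA-ES update (standard form, assumed here):
#   μ ← μ + η_μ Σ_i w_i (x_i − μ)
#   Σ ← Σ + η_Σ Σ_i w_i ((x_i − μ)(x_i − μ)ᵀ − Σ)

def cma_pure_rank_mu_step(mu, cov, xs, ws, eta_mu=1.0, eta_cov=0.5):
    diffs = xs - mu
    mu_new = mu + eta_mu * sum(w * d for w, d in zip(ws, diffs))
    cov_new = cov + eta_cov * sum(
        w * (np.outer(d, d) - cov) for w, d in zip(ws, diffs))
    return mu_new, cov_new

mu = np.zeros(2)
cov = np.eye(2)
xs = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])  # sorted by f-value
ws = [0.6, 0.4, 0.0]
mu_new, cov_new = cma_pure_rank_mu_step(mu, cov, xs, ws)

assert np.allclose(mu_new, [0.6, 0.4])      # mean pulled toward best samples
assert np.allclose(cov_new, cov_new.T)      # Σ stays symmetric
```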

#### 6.2. Twisting the Metric

**Definition 11** (Twisted Fisher metric). Let η_μ, η_Σ ∈ ℝ, and let ${({P}_{\theta})}_{\theta \in \mathrm{\Theta}}$ be a family of normal probability distributions: P_θ = N(μ(θ), Σ(θ)), with μ and Σ C^1. We call the "(η_μ, η_Σ)-twisted Fisher metric" the metric defined by:

The (η_μ, η_Σ)-twisted IGO flow reads:

**Definition 12.** The (η_μ, η_Σ)-twisted IGO algorithm associated with parametrization θ, sample size N, step size δt and selection scheme w is given by the following update rule:

**Definition 13.** The (η_μ, η_Σ)-twisted geodesic IGO algorithm associated with sample size N, step size δt and selection scheme w is given by the following update rule:

(the geodesics used here are those of the twisted metric, which depends on η_μ and η_Σ).

There is some redundancy between δt and the learning rates η_μ and η_Σ: the only values actually appearing in the equations are δtη_μ and δtη_Σ. More formally:

**Proposition 9.** Let k, d, N ∈ ℕ, η_μ, η_Σ, δt, λ_1, λ_2 ∈ ℝ and w : [0, 1] → ℝ. The (η_μ, η_Σ)-twisted IGO algorithm with sample size N, step size δt and selection scheme w coincides with the (λ_1η_μ, λ_1η_Σ)-twisted IGO algorithm with sample size N, step size λ_2δt and selection scheme $\frac{1}{{\lambda}_{1}{\lambda}_{2}}w$. The same is true for geodesic IGO.

In practice, we will rather make η_μ and η_Σ appear by multiplying the increments of μ and Σ by η_μ and η_Σ.

**Proposition 10** (xNES as IGO). The xNES algorithm with sample size N, weights w_i and learning rates η_μ, η_σ = η_B = η_Σ coincides with the $\left(\frac{{\eta}_{\mu}}{\delta t},\frac{{\eta}_{\mathrm{\Sigma}}}{\delta t}\right)$-twisted IGO algorithm with sample size N, weights w_i, step size δt and in which, given the current position (μ_t, A_t), the set of Gaussians is parametrized by:

with δ ∈ ℝ^m and M ∈ Sym(ℝ^m), and where the x_i are sampled from $N(\mu ,A{A}^{T})$.

**Proposition 11** (Pure rank-μ CMA-ES as IGO). The pure rank-μ CMA-ES algorithm with sample size N, weights w_i and learning rates η_μ and η_Σ coincides with the $\left(\frac{{\eta}_{\mu}}{\delta t},\frac{{\eta}_{\mathrm{\Sigma}}}{\delta t}\right)$-twisted IGO algorithm with sample size N, weights w_i, step size δt and the parametrization (μ, Σ).

Notice that the twisted Fisher metric can be reduced to the standard Fisher metric (up to an η_Σ factor) by changing μ to $\frac{\sqrt{{\eta}_{\sigma}}}{\sqrt{{\eta}_{\mu}}}\mu $.

#### 6.3. Trajectories of Different IGO Steps

**Definition 14.** (1) We call the GIGO update trajectory the application:

where A is any matrix, such that AA^T = Σ. The application above does not depend on the choice of the square root A.

We also define μ_GIGO := ϕ_μ ○ T_GIGO, μ_xNES := ϕ_μ ○ T_xNES, μ_CMA := ϕ_μ ○ T_CMA, Σ_GIGO := ϕ_Σ ○ T_GIGO, Σ_xNES := ϕ_Σ ○ T_xNES and Σ_CMA := ϕ_Σ ○ T_CMA, where ϕ_μ (resp. ϕ_Σ) extracts the μ-component (resp. the Σ-component) of a curve.

In particular, Im(ϕ_μ) ⊂ ℝ^d and Im(ϕ_Σ) ⊂ P_d, where P_d (the set of real symmetric positive-definite matrices of dimension d) is seen as a subset of ℝ^{d²}.

In other words, T_GIGO(μ, Σ, v_μ, v_Σ)(δt) gives the position (mean and covariance matrix) of the GIGO algorithm after a step of size δt, while μ_GIGO and Σ_GIGO give, respectively, the mean component and the covariance component of this position.

**Proposition 12**(Second derivatives of the trajectories). We have:

**Proof.** We can immediately see that the second derivatives of μ_xNES, μ_CMA and Σ_CMA are zero. Next, we have:

and the expression of Σ_xNES(μ, Σ, v_μ, v_Σ)″(0) follows.

Let us now consider the geodesic starting at (μ_0, Σ_0) with initial speed (η_μv_μ, η_Σv_Σ). By writing J_μ(0) = J_μ(t), we find $\dot{\mu}(t)=\mathrm{\Sigma}(t){\mathrm{\Sigma}}_{0}^{-1}{\dot{\mu}}_{0}$. We then easily have $\ddot{\mu}(0)={\dot{\mathrm{\Sigma}}}_{0}{\mathrm{\Sigma}}_{0}^{-1}{\dot{\mu}}_{0}$. In other words:

- In [19], it has been noted that xNES converges to quadratic minima slower than CMA-ES and that it is less subject to premature convergence. That fact can be explained by observing that the mean update is exactly the same for CMA-ES and xNES, whereas xNES tends to have a higher variance (Proposition 12 shows this at order two, and it is easy to see that in dimension one, for any μ, Σ, v_μ, v_Σ, we have Σ_xNES(μ, Σ, v_μ, v_Σ) > Σ_CMA(μ, Σ, v_μ, v_Σ)).
- At order two, GIGO moves the mean faster than xNES and CMA-ES if the standard deviation is increasing and more slowly if it is decreasing. This seems to be a reasonable behavior (if the covariance is decreasing, then the algorithm is presumably close to a minimum, and it should not leave the area too quickly). This remark holds only for isolated steps, because we do not take into account the evolution of the variance.
- The geodesics of ${\mathbb{G}}_{1}$ are half-circles (see Figure 2 below; we recall that ${\mathbb{G}}_{1}$ is the Poincaré half-plane). Consequently, if the mean is supposed to move (which always happens), then σ → 0 when δt → ∞. For example, a step whose initial speed has no component on the standard deviation will always decrease it. See also Proposition 15, about the optimization of a linear function.
- For the same reason, for a given initial speed, the update of μ always stays bounded as a function of δt: it is not possible to make one step of the GIGO algorithm go further than a fixed point by increasing δt. Still, the geodesic followed by GIGO changes at each step, so the mean of the overall algorithm is not bounded.
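The one-dimensional comparison in the first remark can be made concrete: along the trajectories of Section 6.3, the xNES variance update is multiplicative while the pure rank-μ CMA-ES one is additive, and exp(x) > 1 + x for x ≠ 0 (a one-dimensional sketch under these assumed forms):

```python
import math

# In dimension one, for the same IGO speed v_Σ:
#   Σ_xNES(δt) = Σ exp(δt v_Σ / Σ)   (multiplicative update)
#   Σ_CMA(δt)  = Σ + δt v_Σ          (additive update)
# Since exp(x) > 1 + x for x ≠ 0, xNES always keeps the larger variance.

def sigma_xnes(sigma2, v, dt):
    return sigma2 * math.exp(dt * v / sigma2)

def sigma_cma(sigma2, v, dt):
    return sigma2 + dt * v

for v in (-0.5, 0.3, 1.0):          # decreasing and increasing variance
    for dt in (0.1, 0.5, 1.0):
        assert sigma_xnes(1.0, v, dt) > sigma_cma(1.0, v, dt)
```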

**Proposition 13** (xNES is not GIGO in the general case). Let μ, v_μ ∈ ℝ^d, A ∈ GL_d, v_Σ ∈ M_d. The GIGO and xNES updates with initial speeds v_μ and v_Σ follow the same trajectory if and only if the mean remains constant. In other words:

**Proof.** If v_μ = 0, then we can compute the GIGO update by using Theorem 4: since J_μ = 0, $\dot{\mu}=0$, and μ remains constant. Now, we have ${J}_{\mathrm{\Sigma}}={\mathrm{\Sigma}}^{-1}\dot{\mathrm{\Sigma}}$; this is enough information to compute the update. Since this quantity is also preserved by the xNES algorithm (see, for example, the proof of Proposition 14), the two updates coincide.

**Figure 2.**One step of the geodesic IGO (GIGO) update.

If v_μ ≠ 0, then ${\mathrm{\Sigma}}_{\mathrm{xNES}}{(\mu ,\mathrm{\Sigma},{v}_{\mu},{v}_{\mathrm{\Sigma}})}^{″}(0)-{\mathrm{\Sigma}}_{\mathrm{GIGO}}{(\mu ,\mathrm{\Sigma},{v}_{\mu},{v}_{\mathrm{\Sigma}})}^{″}(0)={\eta}_{\mu}{\eta}_{\mathrm{\Sigma}}{v}_{\mu}{v}_{\mu}^{T}\ne 0$ and, in particular, T_GIGO(μ, Σ, v_μ, v_Σ) ≠ T_xNES(μ, Σ, v_μ, v_Σ). □

#### 6.4. Blockwise GIGO

**Definition 15** (Splitting). Let Θ be a Riemannian manifold. A splitting of Θ is n manifolds Θ_1, …, Θ_n and a diffeomorphism Θ ≅ Θ_1 × … × Θ_n. If for all x ∈ Θ, for all 1 ≤ i < j ≤ n, we also have T_{i,x}M ⊥ T_{j,x}M as subspaces of T_xM (see Notation 2), then the splitting is said to be compatible with the Riemannian structure. If the Riemannian manifold is not ambiguous, we will simply write a "compatible splitting".

**Notation 2.** Let Θ be a Riemannian manifold, Θ_1, …, Θ_n a splitting of Θ, θ = (θ_1, …, θ_n) ∈ Θ, Y ∈ T_θΘ and 1 ≤ i ≤ n.

- We denote by Θ_{θ,i} the Riemannian manifold $$\left\{{\theta}_{1}\right\}\times \dots \times \left\{{\theta}_{i-1}\right\}\times {\mathrm{\Theta}}_{i}\times \left\{{\theta}_{i+1}\right\}\times \dots \times \left\{{\theta}_{n}\right\},$$ with the metric induced from Θ.
- We denote by Φ_{θ,i} the exponential at θ of the manifold Θ_{θ,i}.

**Definition 16** (Blockwise GIGO update). Let Θ_1, …, Θ_n be a compatible splitting. The blockwise GIGO algorithm in Θ with splitting Θ_1, …, Θ_n associated with sample size N, step sizes δt_1, …, δt_n and selection scheme w is given by the following update rule:

where Y_k is the TΘ_{θ,k}-component of Y. This update only depends on the splitting (and not on the parametrization inside each Θ_k).

Since the splitting is compatible, the metric is block diagonal along the Θ_k. A practical consequence is that the Y_k in Equation (62) can be computed simply by taking the natural gradient in Θ_k:

where I_k is the metric of Θ_k.

Blockwise GIGO and twisted GIGO are genuinely different algorithms: even if η_μ = η_Σ and δt = 1, the twisted GIGO is the regular GIGO algorithm, whereas blockwise GIGO is not (actually, we will prove that it is the xNES algorithm). The only thing blockwise GIGO and twisted GIGO have in common is that they are compatible with the (η_μ, η_Σ)-twisted IGO flow Equation (57): a parameter θ^t following these updates with δt → 0 and N → ∞ is a solution of Equation (57).

**Proposition 14** (xNES is a blockwise GIGO algorithm). The blockwise GIGO algorithm in ${\mathbb{G}}_{d}$ with splitting $\mathrm{\Phi}:\mathcal{N}\left(\mu ,\mathrm{\Sigma}\right)\mapsto \left(\mu ,\mathrm{\Sigma}\right)$, sample size N, step sizes δt_μ, δt_Σ and selection scheme w coincides with the xNES algorithm with sample size N, weights w_i and learning rates η_μ = δt_μ, η_σ = η_B = δt_Σ.

**Proof.**Firstly, notice that the splitting (μ, Σ) is compatible, by Proposition 1.

We write ${\mathbb{G}}_{d}\cong {\mathbb{R}}^{d}\times {P}_{d}$, where P_d is the space of real positive-definite matrices of dimension d. We have ${\mathrm{\Theta}}_{{\theta}^{t},1}=({\mathbb{R}}^{d}\times \{{\mathrm{\Sigma}}^{t}\})\subset {\mathbb{G}}_{d}$, ${\mathrm{\Theta}}_{{\theta}^{t},2}=(\{{\mu}^{t}\}\times {P}_{d})\subset {\mathbb{G}}_{d}$. The induced metric on ${\mathrm{\Theta}}_{{\theta}^{t},1}$ is the Euclidean metric, so we have:

Recalling that v_μ = AG_μ (in the proof of Proposition 6), we find:

On ${\mathrm{\Theta}}_{{\theta}^{t},2}$, we have the following Lagrangian for the geodesics:

Any update preserving ${J}_{\mathrm{\Sigma}}={\mathrm{\Sigma}}^{-1}\dot{\mathrm{\Sigma}}$ will satisfy this first-order differential equation and follow the geodesics of ${\mathrm{\Theta}}_{{\theta}^{t},2}$. The xNES update for the covariance matrix is given by A(t) = A_0exp(tG_M/2). Therefore, we have $\mathrm{\Sigma}\left(t\right)={A}_{0}\mathrm{exp}\left(t{G}_{M}\right){A}_{0}^{T}$, ${\mathrm{\Sigma}}^{-1}(t)={({A}_{0}^{-1})}^{T}\mathrm{exp}\left(-t{G}_{M}\right){A}_{0}^{-1}$, $\dot{\mathrm{\Sigma}}(t)={A}_{0}\mathrm{exp}(t{G}_{M}){G}_{M}{A}_{0}^{T}$ and, finally, ${\mathrm{\Sigma}}^{-1}(t)\dot{\mathrm{\Sigma}}(t)={({A}_{0}^{-1})}^{T}{G}_{M}{A}_{0}^{T}={\mathrm{\Sigma}}_{0}^{-1}{\dot{\mathrm{\Sigma}}}_{0}$. Therefore, xNES preserves J_Σ, and therefore, xNES follows the geodesics of ${\mathrm{\Theta}}_{{\theta}^{t},2}$ (notice that we had already proven this in Proposition 13, since we are looking at the geodesics of ${\mathbb{G}}_{d}$ with a fixed mean). □
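The computation above can be checked numerically: along Σ(t) = A₀exp(tG_M)A₀ᵀ with G_M symmetric, the quantity Σ⁻¹Σ̇ stays constant in t (the matrices A₀ and G_M below are arbitrary examples):

```python
import numpy as np

# Check that the xNES covariance trajectory preserves J_Σ = Σ⁻¹Σ̇,
# using a central finite difference for Σ̇.

def expm_sym(m):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.exp(vals)) @ vecs.T

a0 = np.array([[1.0, 0.5], [0.0, 2.0]])     # an invertible square root A₀
g = np.array([[0.3, -0.1], [-0.1, 0.2]])    # a symmetric G_M

def j_sigma(t, eps=1e-6):
    """Σ⁻¹(t)Σ̇(t), with Σ̇ approximated by a central difference."""
    sig = lambda s: a0 @ expm_sym(s * g) @ a0.T
    dsig = (sig(t + eps) - sig(t - eps)) / (2 * eps)
    return np.linalg.solve(sig(t), dsig)

# J_Σ is constant along the trajectory (and equals A₀⁻ᵀ G_M A₀ᵀ).
assert np.allclose(j_sigma(0.0), j_sigma(1.0), atol=1e-4)
assert np.allclose(j_sigma(0.5), np.linalg.inv(a0).T @ g @ a0.T, atol=1e-4)
```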

## 7. Numerical Experiments

#### 7.1. Benchmarking

- Varying dimension.
- Sample size: $\lfloor 4+3\mathrm{log}(d)\rfloor$.
- Weights: ${w}_{i}=\frac{\mathrm{max}\left(0,\mathrm{log}\left(\frac{N}{2}+1\right)-\mathrm{log}(i)\right)}{{\sum}_{j=1}^{N}\mathrm{max}\left(0,\mathrm{log}\left(\frac{N}{2}+1\right)-\mathrm{log}(j)\right)}-\frac{1}{N}$.
- IGO step size and learning rates: δt = 1, η_μ = 1, ${\eta}_{\mathrm{\Sigma}}=\frac{3}{5}\frac{3+\mathrm{log}(d)}{d\sqrt{d}}$.
- Initial position: ${\theta}^{0}=\mathcal{N}\left({x}_{0},I\right)$, where x_0 is a random point of the circle with center zero and radius 10.
- Euler method for GIGO: number of steps: 100. We used the GIGO-A variant of the algorithm. No significant difference was noticed with GIGO-Σ or with the exact GIGO algorithm; the only advantage of having an explicit solution of the geodesic equations is that the update is quicker to compute. We chose not to use the exact expression of the geodesics for this benchmarking to show that having to use the Euler method is fine. However, we did run the tests with the exact update, and the results are basically the same as with GIGO-A.
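The sample size and weights above can be sketched as follows (assuming the n in the weight formula denotes the sample size N):

```python
import math

# Benchmark settings as functions of the dimension d:
# N = ⌊4 + 3 log d⌋, and CMA-type log-weights shifted by 1/N.

def sample_size(d):
    return math.floor(4 + 3 * math.log(d))

def weights(n_samples):
    raw = [max(0.0, math.log(n_samples / 2 + 1) - math.log(i))
           for i in range(1, n_samples + 1)]
    total = sum(raw)
    return [r / total - 1 / n_samples for r in raw]

N = sample_size(8)
w = weights(N)
assert N == 10
assert abs(sum(w)) < 1e-12          # the weights sum to zero
assert w[0] > 0 > w[-1]             # best samples pulled, worst pushed
```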

The stopping criterion was reaching the optimum with precision 10^{−8}. Each algorithm has been tested in dimension 2, 4, 8, 16, 32 and 64: a missing point means that all runs converged prematurely.

#### 7.1.1. Failed Runs

- Only one run reached the optimum for the cigar-tablet function with CMA-ES in dimension eight.
- Seven runs (out of 24) reached the optimum for the Rosenbrock function with CMA-ES in dimension 16.
- About half of the runs reached the optimum for the sphere function with CMA-ES in dimension four.

- GIGO did not find the optimum of the Rosenbrock function in any dimension.
- CMA-ES did not find the optimum of the Rosenbrock function in dimension 2, 4, 32 and 64.
- All of the runs converged prematurely for the cigar-tablet function in dimension two with CMA-ES, for the sphere function in dimension two for all algorithms and for the Rosenbrock function in dimension two and four for all algorithms.

#### 7.1.2. Discussion

_{Σ}, whereas the covariance matrix maintained by CMA-ES (not only pure rank-μ CMA-ES) can stop being positive definite if η

_{Σ}δt > 1. However, in that case, the GIGO algorithm is prone to premature convergence (remember Figure 2 and see Proposition 15 below), and in practice, the learning rates are much smaller.

#### 7.2. Plotting Trajectories in ${\mathbb{G}}_{1}$

- Sample size: λ = 5000.
- Dimension one only.
- Weights: $w=4\cdot {1}_{q\le 1/4}$ (i.e., ${\widehat{w}}_{i}=4\cdot {1}_{i\le 1250}$).
- IGO step size and learning rates: η_μ = 1, ${\eta}_{\mathrm{\Sigma}}=\frac{3}{5}\frac{3+\mathrm{log}(d)}{d\sqrt{d}}=1.8$, varying δt.
- Initial position: ${\theta}^{0}=\mathcal{N}\left(10,1\right)$.
- Dots are placed at t = 0, 1, 2, … (except for the graph δt = 1.5, for which there is a dot for each step).

It can be proven that there exists a critical step size δt_cr (depending on the learning rates η_μ, η_σ and on the weights w_i), above which GIGO will converge, and we can compute its value when the weights are of the form ${1}_{q\le {q}_{0}}$ (for q_0 ≥ 0.5, the discussion is not relevant, because in that case, even the IGO flow converges prematurely; compare with the critical δt of the smoothed cross-entropy method and IGO-ML in [1]).

**Proposition 15**. Let d ∈ ℕ, k, η_μ, ${\eta}_{\sigma}\in {\mathbb{R}}_{+}^{*}$; let $w=k\cdot {1}_{q\le {q}_{0}}$; and let g be the linear function defined by:

Let μ_n be the first coordinate of the mean, and let ${\sigma}_{n}^{2}$ be the variance (at step n) maintained by the (η_μ, η_σ)-twisted geodesic IGO algorithm in ${\tilde{\mathbb{G}}}_{d}$ associated with selection scheme w, sample size ∞ and step size δt, when optimizing g ("sample size ∞" meaning the limit of the update when the sample size tends to infinity, which is deterministic [1]).

Then, there exists δt_cr, such that:

- if δt > δt_cr, (σ_n) converges to zero with exponential speed and (μ_n) converges;
- if δt = δt_cr, (σ_n) remains constant and (μ_n) tends to ∞ with linear speed;
- if 0 < δt < δt_cr, both (σ_n) and (μ_n) tend to ∞ with exponential speed.

For d = 1, q_0 = 1/4, η_μ = 1, η_σ = 1.8, we find:

## 8. Conclusions

## Conflicts of Interest

#### Proof of Proposition 15

The (η_μ, η_σ)-twisted IGO flow is given by:

The assertions about (σ_n) immediately imply the assertions about the convergence of (μ_n).

At each step, the IGO speed is of the form (η_μβσ_0, η_σασ_0), with ασ_0 > 0 (i.e., the variance should be increased: this is where we need q_0 < 0.5).

Consequently, the variance after one step is of the form σ_1 = f(δt)σ_0, for some function f that does not depend on (μ_0, σ_0). In other words, f(δt) will be the same at each step of the algorithm. The existence of δt_cr easily follows (furthermore, recall Figure 1 in Section 4.1), and δt_cr is the positive solution of f(x) = 1.
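The mechanism of the proof can be illustrated numerically: since the variance is multiplied by the same factor f(δt) at each step, (σ_n) is a geometric sequence whose behavior is decided by the position of f(δt) relative to 1. The sketch below uses hypothetical values for f(δt); the actual f depends on the weights and the learning rates and has no simple closed form.

```python
# sigma_{n+1} = f(dt) * sigma_n, so (sigma_n) is a geometric sequence.
def sigma_after(f_dt, steps, sigma0=1.0):
    sigma = sigma0
    for _ in range(steps):
        sigma *= f_dt
    return sigma

# Hypothetical values of f(dt):
print(sigma_after(0.9, 50))   # tends to 0: premature convergence (dt > dt_cr)
print(sigma_after(1.0, 50))   # constant (dt = dt_cr)
print(sigma_after(1.1, 50))   # tends to infinity (dt < dt_cr)
```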

#### A1. Generalization of the Twisted Fisher Metric

**Definition 17**. Let (Θ, g) be a Riemannian manifold, $({\mathrm{\Theta}}_{1},g{|}_{{\mathrm{\Theta}}_{1}}),\dots ,({\mathrm{\Theta}}_{n},g{|}_{{\mathrm{\Theta}}_{n}})$ a splitting (as defined in Section 6.4) of Θ compatible with the metric g. We call (η_1, …, η_n)-twisted metric on (Θ, g) for the splitting Θ_1, …, Θ_n the metric g′ on Θ defined by ${g}^{\prime}{|}_{{\mathrm{\Theta}}_{i}}=\frac{1}{{\eta}_{i}}g{|}_{{\mathrm{\Theta}}_{i}}$ for 1 ≤ i ≤ n, and Θ_i ⊥ Θ_j for i ≠ j.
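Concretely, for such a splitting, the twisted metric is obtained by scaling each diagonal block of the metric by 1/η_i and making distinct factors orthogonal. A minimal sketch (the block values below are the one-dimensional Gaussian Fisher metric blocks 1/σ² and 2/σ² at σ = 1, as an illustrative assumption):

```python
import numpy as np

def twisted_metric(blocks, etas):
    """Assemble the (eta_1, ..., eta_n)-twisted metric from the metric
    restricted to each factor of the splitting: each block g|_{Theta_i}
    is scaled by 1/eta_i, and distinct factors are made orthogonal
    (block-diagonal assembly)."""
    dims = [b.shape[0] for b in blocks]
    g = np.zeros((sum(dims), sum(dims)))
    i = 0
    for b, eta in zip(blocks, etas):
        d = b.shape[0]
        g[i:i + d, i:i + d] = b / eta  # scale block by 1/eta_i
        i += d
    return g

# One-dimensional Gaussian example: Fisher metric blocks for mu and sigma
g_mu = np.array([[1.0]])      # 1/sigma^2 at sigma = 1
g_sigma = np.array([[2.0]])   # 2/sigma^2 at sigma = 1
g_twisted = twisted_metric([g_mu, g_sigma], [1.0, 1.8])
```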

**Proposition 16**. The (η_μ, η_Σ)-twisted metric on ${\mathbb{G}}_{d}$ with the Fisher metric for the splitting $\mathcal{N}(\mu ,\mathrm{\Sigma})\mapsto (\mu ,\mathrm{\Sigma})$ coincides with the (η_μ, η_Σ)-twisted Fisher metric from Definition 11.

**Proof**. It is easy to see that the (η_μ, η_Σ)-twisted Fisher metric satisfies the condition in Definition 17. □

#### A2. Twisted Geodesics

Let η_μ, η_Σ ∈ ℝ, μ_0 ∈ ℝ^d, A_0 ∈ GL_d(ℝ), and $({\dot{\mu}}_{0},{\dot{\sum}}_{0})\in {T}_{\mathcal{N}({\mu}_{0},{A}_{0}{A}_{0}^{T})}{\mathbb{G}}_{d}$. Let γ (resp. γ̃) be the geodesic of ${\mathbb{G}}_{d}$ with the Fisher metric (resp. the (η_μ, η_Σ)-twisted Fisher metric) at $\mathcal{N}(\sqrt{\frac{{\eta}_{\mu}}{{\eta}_{\mathrm{\Sigma}}}}{\mu}_{0},{A}_{0}{A}_{0}^{T})$ (resp. $\mathcal{N}({\mu}_{0},{A}_{0}{A}_{0}^{T})$). We have:

**Proof.** Let us denote by $\left(\begin{array}{rr}\hfill {I}_{\mu}& \hfill 0\\ \hfill 0& \hfill {I}_{\mathrm{\Sigma}}\end{array}\right)$ the Fisher metric in the parametrization (μ, Σ), and consider the following parametrization of ${\mathbb{G}}_{d}:(\tilde{\mu},\mathrm{\Sigma})\mapsto \mathcal{N}(\frac{\sqrt{{\eta}_{\mathrm{\Sigma}}}}{\sqrt{{\eta}_{\mu}}}\tilde{\mu},\mathrm{\Sigma})$. In this parametrization, the metric is the (η_μ, η_Σ)-twisted Fisher metric up to a factor $\frac{1}{{\eta}_{\mathrm{\Sigma}}}$. Consequently, the Christoffel symbols are the same as the Christoffel symbols of the (η_μ, η_Σ)-twisted Fisher metric, and so are the geodesics. Therefore, we have:

In the following, we fix the learning rates η_μ and η_Σ; ${\mathbb{G}}_{d}$ is endowed with the (η_μ, η_Σ)-twisted Fisher metric, and ${\tilde{\mathbb{G}}}_{d}$ is endowed with the induced metric. The proofs of the propositions below are a simple rewriting of their non-twisted counterparts, which can be found in Sections 4 and 5.1, and can be seen as corollaries of Theorem 7.

**Theorem 8**. If $\gamma :t\mapsto \mathcal{N}(\mu (t),\sigma {(t)}^{2}I)$ is a twisted geodesic of ${\tilde{\mathbb{G}}}_{d}$, then there exist a, b, c, d ∈ ℝ, such that ad − bc = 1, and v > 0, such that $\mu (t)=\mu (0)+\sqrt{\frac{2d{\eta}_{\mu}}{{\eta}_{\sigma}}}\frac{{\dot{\mu}}_{0}}{\Vert {\dot{\mu}}_{0}\Vert}\tilde{r}(t)$, σ(t) = Im(γ_ℂ(t)), with $\tilde{r}(t)=\mathrm{Re}({\gamma}_{\u2102}(t))$ and:

**Proposition 17**. Let n ∈ ℕ; v_μ ∈ ℝ^n; v_σ, η_μ, η_σ, σ_0 ∈ ℝ, with σ_0 > 0. Let ${v}_{r}:=\Vert {v}_{\mu}\Vert$, $\lambda :=\sqrt{\frac{2n{\eta}_{\mu}}{{\eta}_{\sigma}}}$, $v:=\sqrt{\frac{\frac{1}{{\lambda}^{2}}{v}_{r}^{2}+{v}_{\sigma}^{2}}{{\sigma}_{0}^{2}}}$, ${M}_{0}:=\frac{1}{\lambda}\frac{{v}_{r}}{v{\sigma}_{0}^{2}}$ and ${S}_{0}:=\frac{{v}_{\sigma}}{v{\sigma}_{0}^{2}}$.

Then, the twisted geodesic γ of ${\tilde{\mathbb{G}}}_{n}$ satisfying γ(0) = (μ_0, σ_0) and $\dot{\gamma}(0)=({v}_{\mu},{v}_{\sigma})$ is given by:

The regular geodesics of ${\tilde{\mathbb{G}}}_{n}$ are obtained with η_μ = η_σ = 1.
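The constants of Proposition 17 are straightforward to compute. A sketch with hypothetical input values (`v_mu`, `v_sigma`, `sigma0` are arbitrary choices for illustration):

```python
import numpy as np

def twisted_geodesic_constants(v_mu, v_sigma, eta_mu, eta_sigma, sigma0):
    """Constants from Proposition 17 for a twisted geodesic of the
    manifold of Gaussians with covariance proportional to identity."""
    n = len(v_mu)
    v_r = np.linalg.norm(v_mu)                          # v_r = ||v_mu||
    lam = np.sqrt(2 * n * eta_mu / eta_sigma)           # lambda
    v = np.sqrt((v_r**2 / lam**2 + v_sigma**2) / sigma0**2)
    m0 = v_r / (lam * v * sigma0**2)                    # M_0
    s0 = v_sigma / (v * sigma0**2)                      # S_0
    return lam, v, m0, s0

lam, v, m0, s0 = twisted_geodesic_constants(
    v_mu=np.array([3.0, 4.0]), v_sigma=1.0,
    eta_mu=1.0, eta_sigma=1.8, sigma0=1.0)
```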

**Theorem 9**. Let$\gamma :t\mapsto \mathcal{N}({\mu}_{t},{\mathrm{\Sigma}}_{t})$ be a twisted geodesic of${\mathbb{G}}_{d}$. Then, the following quantities are invariant:

**Theorem 10**. If μ : t ⟼ μ_t and Σ : t ⟼ Σ_t satisfy the equations:

**Theorem 11**. If μ : t ⟼ μ_t and A : t ⟼ A_t satisfy the equations:

#### A3. Pseudocodes

#### A3.1. For All Algorithms

All of the algorithms maintain the parameters of a Gaussian distribution: the mean μ and the covariance Σ (or a square root A of Σ, with Σ = AA^T).

The inputs are: f: ℝ^d → ℝ, step size δt, learning rates η_μ, η_Σ, sample size λ, weights (w_i)_{i∈[1,λ]}, the number N of steps for the Euler method, and the Euler step size reduction factor r (for GIGO-Σ only).

The updates below are expressed in terms of the samples x_i, but the decomposition Σ = AA^T is not unique. Two different decompositions will give two algorithms, such that one is a modification of the other as a stochastic process: same law (the x_i are abstractly sampled from $\mathcal{N}(\mu ,\mathrm{\Sigma})$), but different trajectories (for given z_i, different choices of the square root will give different x_i). For GIGO-Σ, since we have to invert the covariance matrix, we used the Cholesky decomposition (A lower triangular; the other implementation directly maintains a square root of Σ). Usually, in CMA-ES, the symmetric square root of Σ (Σ = AA^T, A symmetric) is used.
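This difference between decompositions can be observed directly. The sketch below (with a hypothetical 2 × 2 covariance) compares the Cholesky factor with the symmetric square root: both satisfy Σ = AA^T, but they map a given z to different samples.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = np.array([[2.0, 1.0], [1.0, 2.0]])

# Two different square roots of the same covariance matrix:
a_chol = np.linalg.cholesky(sigma)       # lower triangular (used for GIGO-Sigma)
w, q = np.linalg.eigh(sigma)
a_sym = q @ np.diag(np.sqrt(w)) @ q.T    # symmetric (usual in CMA-ES)

# Both are valid decompositions Sigma = A A^T ...
assert np.allclose(a_chol @ a_chol.T, sigma)
assert np.allclose(a_sym @ a_sym.T, sigma)

# ... but for a given z_i, they give different samples x_i = mu + A z_i:
z = rng.standard_normal(2)
x1, x2 = a_chol @ z, a_sym @ z
print(np.allclose(x1, x2))  # False: same law, different trajectories
```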

#### A3.2. Updates

In the updates below, the x_i and the z_i are those defined in Algorithm 1. For Algorithm 2 (GIGO-Σ), when the covariance matrix after one step is not positive-definite, we compute the update again, with the Euler method step size divided by r (we have no reason to recommend any particular value of r; the only constraint is r > 1).
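The retry logic for GIGO-Σ can be sketched as follows. Here `euler_geodesic` is a hypothetical toy stand-in for the actual geodesic-equation integrator, chosen only so that a too-coarse Euler step loses positive definiteness; the safeguard loop itself is the point of the sketch.

```python
import numpy as np

def is_pd(m):
    try:
        np.linalg.cholesky(m)
        return True
    except np.linalg.LinAlgError:
        return False

def euler_geodesic(mu, sigma, vel, dt, h):
    """Toy stand-in for the geodesic-equation Euler integrator
    (hypothetical): its discretization error grows with the step h."""
    return mu, sigma + dt * vel - h * np.eye(len(mu))

def safeguarded_step(mu, sigma, vel, dt, n_steps, r=2.0, max_tries=20):
    """GIGO-Sigma safeguard: if the covariance after one step is not
    positive definite, redo the Euler integration with its step size
    divided by r (any r > 1 works; no particular value is recommended)."""
    h = dt / n_steps
    for _ in range(max_tries):
        new_mu, new_sigma = euler_geodesic(mu, sigma, vel, dt, h)
        if is_pd(new_sigma):
            return new_mu, new_sigma
        h /= r  # finer Euler discretization, same total step size dt
    raise RuntimeError("covariance stayed non-positive-definite")

mu0, sigma0 = np.zeros(2), np.eye(2)
vel = -0.4 * np.eye(2)  # shrink the covariance
_, sigma1 = safeguarded_step(mu0, sigma0, vel, dt=1.0, n_steps=1)
```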

**Algorithm 4.**Exact GIGO, one step. Not exactly our implementation; see the discussion after Corollary 1.

## Acknowledgments

## References

- Ollivier, Y.; Arnold, L.; Auger, A.; Hansen, N. Information-geometric optimization algorithms: A unifying picture via invariance principles. arXiv **2011**, arXiv:1106.3708.
- Amari, S.-I.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs); American Mathematical Society: Providence, RI, USA, 2007.
- Malagò, L.; Pistone, G. Combinatorial optimization with information geometry: The Newton method. Entropy **2014**, 16, 4260–4289.
- Eriksen, P. Geodesics Connected with the Fisher Metric on the Multivariate Normal Manifold; Technical Report 86-13; Institute of Electronic Systems, Aalborg University: Aalborg, Denmark, 1986.
- Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Decis. **1991**, 9, 119–138.
- Imai, T.; Takaesu, A.; Wakayama, M. Remarks on geodesics for multivariate normal models. J. Math-for-Industry **2011**, 3, 125–130.
- Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. **1981**, 11, 211–223.
- Porat, B.; Friedlander, B. Computation of the exact information matrix of Gaussian time series with stationary random components. IEEE Trans. Acoust. Speech Signal Process. **1986**, 34, 118–130.
- Baluja, S.; Caruana, R. Removing the Genetics from the Standard Genetic Algorithm; Technical Report CMU-CS-95-141; Morgan Kaufmann Publishers: Burlington, MA, USA, 1995; pp. 38–46.
- Malagò, L.; Matteucci, M.; Pistone, G. Towards the geometry of estimation of distribution algorithms based on the exponential family. In Proceedings of the 11th Workshop on Foundations of Genetic Algorithms, Schwarzenberg, Austria, 5–9 January 2011; pp. 230–242.
- Kern, S.; Müller, S.D.; Hansen, N.; Büche, D.; Ocenasek, J.; Koumoutsakos, P. Learning probability distributions in continuous evolutionary algorithms—A comparative review. Nat. Comput. **2003**, 3, 77–112.
- Wierstra, D.; Schaul, T.; Glasmachers, T.; Sun, Y.; Peters, J.; Schmidhuber, J. Natural evolution strategies. J. Mach. Learn. Res. **2014**, 15, 949–980.
- Huang, W. Optimization Algorithms on Riemannian Manifolds with Applications. Ph.D. Thesis, Florida State University, Tallahassee, FL, USA, 2013.
- Absil, P.A.; Mahony, R.; Sepulchre, R. Optimization Algorithms on Matrix Manifolds; Princeton University Press: Princeton, NJ, USA, 2008.
- Arnold, V.; Vogtmann, K.; Weinstein, A. Mathematical Methods of Classical Mechanics (Graduate Texts in Mathematics); Springer: New York, NY, USA, 1989.
- Bourguignon, J. Calcul Variationnel; École Polytechnique: Palaiseau, France, 2007. (In French)
- Jost, J.; Li-Jost, X. Calculus of Variations (Cambridge Studies in Advanced Mathematics); Cambridge University Press: Cambridge, UK, 1998.
- Gallot, S.; Hulin, D.; LaFontaine, J. Riemannian Geometry (Universitext), 3rd ed.; Springer: Berlin/Heidelberg, Germany, 2004.
- Glasmachers, T.; Schaul, T.; Yi, S.; Wierstra, D.; Schmidhuber, J. Exponential natural evolution strategies. In Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation, Portland, OR, USA, 7–11 July 2010.
- Akimoto, Y.; Nagata, Y.; Ono, I.; Kobayashi, S. Bidirectional relation between CMA evolution strategies and natural evolution strategies. In Parallel Problem Solving from Nature, PPSN XI; Schaefer, R., Cotta, C., Kołodziej, J., Rudolph, G., Eds.; Springer: New York, NY, USA, 2010.
- Hansen, N. The CMA Evolution Strategy: A Tutorial. Available online: https://www.lri.fr/∼hansen/cmatutorial.pdf (accessed on 1 January 2015).
- Bensadon, J. Source Code. Available online: https://www.lri.fr/~bensadon/ (accessed on 13 January 2015).
- Akimoto, Y.; Ollivier, Y. Objective improvement in information-geometric optimization. In Proceedings of the Twelfth Workshop on Foundations of Genetic Algorithms XII, Adelaide, Australia, 16–20 January 2013.

**Figure 3.** Median number of function calls to reach 10^−8 fitness on 24 runs for: sphere function, cigar-tablet function and Rosenbrock function. Initial position θ^0 = N(x_0, I), with x_0 uniformly distributed on the circle of center zero and radius 10. We recall that the “CMA-ES” algorithm here uses the so-called pure rank-μ CMA-ES update.

**Figure 4.** Trajectories of GIGO, CMA and xNES optimizing x ↦ x² in dimension one with δt = 0.01, sample size 5000, weights w_i = 4·1_{i⩽1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot every 100 steps. All algorithms exhibit a similar behavior.

**Figure 5.** Trajectories of GIGO, CMA and xNES optimizing x ↦ x² in dimension one with δt = 0.5, sample size 5000, weights w_i = 4·1_{i⩽1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot every two steps. Stronger differences appear. Notice that after one step, the lowest mean is still GIGO's (∼8.5, whereas xNES is around 8.75), but from the second step on, GIGO has the highest mean, because of its lower variance.

**Figure 6.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ x² in dimension one with δt = 0.1, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot every 10 steps. All algorithms exhibit a similar behavior, and differences start to appear. It cannot be seen on the graph, but the algorithm closest to zero after 400 steps is CMA (∼1·10^−16), followed by xNES (∼6·10^−16) and GIGO (∼2·10^−15).

**Figure 7.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ x² in dimension one with δt = 1, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot per step. The CMA-ES algorithm fails here, because at the fourth step, the covariance matrix is not positive definite anymore (it is easy to see that the CMA-ES update is always defined if δtη_Σ < 1, but this is not the case here). Furthermore, notice (see also Proposition 15) that at the first step, GIGO decreases the variance, whereas the σ-component of the IGO speed is positive.

**Figure 8.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ x² in dimension one with δt = 1.5, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot per step. Same as δt = 1 for CMA. GIGO converges prematurely.

**Figure 9.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ −x in dimension one with δt = 0.01, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot every 100 steps. Almost the same behavior for all algorithms.

**Figure 10.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ −x in dimension one with δt = 0.1, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot every 10 steps. It is not obvious on the graph, but xNES is faster than CMA, which is faster than GIGO.

**Figure 11.** Trajectories of GIGO, CMA and xNES optimizing x ⟼ −x in dimension one with δt = 1, sample size 5000, weights w_i = 4·1_{i≤1250} and learning rates η_μ = 1, η_Σ = 1.8. One dot per step. GIGO converges, for the reasons discussed earlier.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bensadon, J.
Black-Box Optimization Using Geodesics in Statistical Manifolds. *Entropy* **2015**, *17*, 304-345.
https://doi.org/10.3390/e17010304
