Article

Accelerating Extreme Search of Multidimensional Functions Based on Natural Gradient Descent with Dirichlet Distributions

by Ruslan Abdulkadirov 1,*, Pavel Lyakhov 1,2 and Nikolay Nagornov 2
1 North-Caucasus Center for Mathematical Research, North-Caucasus Federal University, 355009 Stavropol, Russia
2 Department of Mathematical Modeling, North-Caucasus Federal University, 355009 Stavropol, Russia
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(19), 3556; https://doi.org/10.3390/math10193556
Submission received: 5 September 2022 / Revised: 21 September 2022 / Accepted: 26 September 2022 / Published: 29 September 2022
(This article belongs to the Special Issue Mathematical Modeling, Optimization and Machine Learning)

Abstract:
Attaining high accuracy with less complex neural network architectures remains one of the most important problems in machine learning. In many studies, the quality of recognition and prediction is improved by extending neural networks with ordinary or specialized neurons, which significantly increases the training time. However, employing an optimization algorithm that brings the loss function into the neighborhood of the global minimum can reduce the number of layers and epochs. In this work, we explore the extreme search of multidimensional functions by the proposed natural gradient descent based on Dirichlet and generalized Dirichlet distributions. The natural gradient describes a multidimensional surface by means of probability distributions, which allows us to improve the accuracy without reducing the step size or the numerical value of the gradient. The proposed algorithm is equipped with step-size adaptation, which allows it to obtain higher accuracy within a small number of iterations of the minimization process, compared with ordinary gradient descent and adaptive moment estimation. We provide experiments on test functions in four- and three-dimensional spaces, where natural gradient descent proves its ability to converge in the neighborhood of the global minimum. Such an approach can find application in minimizing the loss function in various types of neural networks, such as convolutional, recurrent, spiking and quantum networks.

1. Introduction

Optimization methods remain one of the most important and critical problems in artificial neural networks, since they significantly impact the process of recognition. They allow us to solve many problems approximately, without a complex analytical approach. The most widely used optimization algorithm in machine learning is stochastic gradient descent together with its modified versions, which include momentum and the Nesterov condition. To increase accuracy, AdaGrad [1,2], RMSprop [3], ADADELTA [4] and the Adam algorithm [5] were subsequently proposed. However, they are not fast enough and usually converge to a local extreme. Even the step-size adaptation from [6] cannot achieve the desired accuracy while reducing the number of iterations. The most appropriate solution in this case is to study the optimization of the loss function by means of Riemannian geometry.
Riemannian geometry describes metric properties on arbitrary n-dimensional smooth manifolds with local coordinates. Depending on the type of manifold, we can construct a gradient flow that improves the quality of the optimization process. The gradient flow from [7] is the product of the inverse metric tensor and the gradient of the function being optimized. Such an approach accelerates convergence and reduces the number of iterations (epochs). In this manuscript, however, we provide the extreme search with manifolds of probability distributions.
Manifolds of probability distributions are mostly studied in information geometry, where the analog of the gradient flow is the natural gradient. The analog of the metric tensor is the Fisher information matrix, which depends on some probability distribution. The Fisher matrix is derived from the Kullback–Leibler divergence (K–L divergence [8,9,10]). Taking into account the convexity of the function being optimized, it is possible to increase the accuracy without changing the initial step size and the numerical value of the gradient.
Natural gradient descent (NGD) is an alternative to stochastic gradient descent and its modifications, as noted in [11]. Unfortunately, for models with many parameters, such as convolutional neural networks, computing the natural gradient at every iteration is inefficient because of the extremely large size of the Fisher matrix. This problem can be solved using various approximations of the Fisher matrix, as in [12,13], which facilitate the computation, storage and, finally, inversion of the exact Fisher matrix. Moreover, it is possible to reduce the computations by choosing a proper probability distribution.
In this article, we propose a natural gradient descent algorithm with step-size adaptation based on Dirichlet and generalized Dirichlet distributions. We demonstrate that it achieves higher accuracy and does not require a large number of iterations to minimize test functions, compared with gradient descent and Adam. This approach continues the works [14,15], whose final results did not demonstrate the ability of natural gradient descent to converge in the neighborhood of the global minimum.
The remainder of the paper is organized as follows. Section 2 presents the background of gradient descent, Adam and the gradient flow. Section 3 contains calculations of the Fisher information matrices of the Dirichlet and generalized Dirichlet distributions and proposes the algorithm of natural gradient descent based on these distributions. Section 4 consists of experiments on the minimization of test functions with corresponding graphs and tables. In Section 5, we report the conclusions drawn from the obtained results and suggestions for improving natural gradient descent and its further use in neural networks.

2. Preliminaries

2.1. Gradient Descent with Step-Size Adaptation

Let us consider a continuous function $f: \Omega \to \mathbb{R}$ defined on a closed convex domain $\Omega \subset \mathbb{R}^n$. The main goal of minimization is finding $\min_{x \in \Omega} f(x)$.
This is a necessary process for achieving high prediction accuracy in every artificial neural network, since it provides more rapid and exact minimization of the loss function. Gradient descent with an appropriate step-size adaptation [6], presented in Algorithm 1, has advantages in rate and accuracy over stochastic gradient descent.
Note that, in general, gradient descent cannot reach the minimum because of its constant step and because the gradient becomes unreliable in the presence of several local extremes. Step-size adaptation does not guarantee descent into the global minimum when there are many local minima; however, it allows us to increase the accuracy. This problem led researchers to replace this method with Adam.
Algorithm 1 Gradient Descent with Step-Size Adaptation
Input: $x_0 \in \mathbb{R}^n$ (starting point), $f$ (scalar function), $\nabla f$ (gradient), $a_0$ (initial step size), $n$ (number of iterations)
Output: some $x_n$ minimizing $f$
1: initialize $f_0 = f(x_0) \in \mathbb{R}$, $g_0 = \nabla f(x_0)^T \in \mathbb{R}^n$
2: for $i$ from 1 to $n$ do
3:   $x_i \leftarrow x_{i-1} - a_{i-1}\, g_{i-1} / |g_{i-1}|$
4:   $f_i \leftarrow f(x_i)$
5:   if $f_i < f_{i-1}$ then
6:     $f_i \leftarrow f_{i-1}$
7:     $a_i \leftarrow 1.2\, a_{i-1}$
8:   else
9:     $a_i \leftarrow 0.5\, a_{i-1}$
10:  end if
11:  $g_i \leftarrow \nabla f(x_i)^T$
12: end for
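A minimal Python/NumPy sketch of Algorithm 1 is given below (not the authors' original implementation); the helper names gd_step_adaptation, f and grad_f are ours, and the bookkeeping of the accepted function value follows the common variant of the scheme.

import numpy as np

def gd_step_adaptation(f, grad_f, x0, a0=0.1, n=100):
    # Gradient descent with multiplicative step-size adaptation (Algorithm 1).
    x = np.asarray(x0, dtype=float)
    f_prev = f(x)
    g = grad_f(x)
    a = a0
    for _ in range(n):
        x = x - a * g / (np.linalg.norm(g) + 1e-12)  # normalized gradient step
        f_cur = f(x)
        if f_cur < f_prev:
            a *= 1.2   # successful step: enlarge the step size
        else:
            a *= 0.5   # unsuccessful step: shrink the step size
        f_prev = f_cur
        g = grad_f(x)
    return x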

2.2. Adam Algorithm

The Adam algorithm [5] is an attempt to improve stochastic gradient descent; it updates exponential moving averages of the gradient $m_t$ and of the squared gradient $v_t$, with hyper-parameters $\beta_1, \beta_2 \in [0, 1)$ controlling the exponential decay rates of these moving averages. The moving averages themselves are estimates of the first moment (the mean) and the second raw moment (the uncentered variance) of the gradient. However, these moving averages are initialized as (vectors of) zeros, leading to moment estimates that are biased towards zero, especially during the initial time steps and especially when the decay rates are small. The advantage is that this initialization bias can be easily counteracted, resulting in bias-corrected estimates $\hat{m}_t$ and $\hat{v}_t$. Step-size adaptation improves the quality of the Adam algorithm, that is, it accelerates the minimization and increases the accuracy. These amendments can be helpful in deep neural networks because the optimizer gives more accurate results in less time.
Let us present the pseudo-code of the Adam method in Algorithm 2.
Algorithm 2 Adam algorithm
Input: $x_0 \in \mathbb{R}^n$ (starting point), $f$ (scalar function), $\nabla f$ (gradient), $a_0$ (initial step size), $n$ (number of iterations), $\beta_1, \beta_2$ (exponential decay rates)
Output: some $x_n$ minimizing $f$
1: initialize $f_0 = f(x_0) \in \mathbb{R}$, $g_0 = \nabla f(x_0)^T \in \mathbb{R}^n$, $m_0 = 0$, $v_0 = 0$
2: for $i$ from 1 to $n$ do
3:   $m_i \leftarrow \beta_1 m_{i-1} + (1 - \beta_1) g_{i-1}$
4:   $v_i \leftarrow \beta_2 v_{i-1} + (1 - \beta_2) g_{i-1}^2$
5:   $\hat{m}_i \leftarrow m_i / (1 - \beta_1^i)$
6:   $\hat{v}_i \leftarrow v_i / (1 - \beta_2^i)$
7:   $x_i \leftarrow x_{i-1} - a_{i-1} \cdot \hat{m}_i / (\sqrt{\hat{v}_i} + \epsilon)$
8:   $f_i \leftarrow f(x_i)$
9:   if $f_i < f_{i-1}$ then
10:    $f_i \leftarrow f_{i-1}$
11:    $a_i \leftarrow 1.2\, a_{i-1}$
12:  else
13:    $a_i \leftarrow 0.5\, a_{i-1}$
14:  end if
15:  $g_i \leftarrow \nabla f(x_i)$
16: end for
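A corresponding sketch of Algorithm 2 in the same style is shown below, assuming the standard Adam bias corrections with beta2 for the second moment and the usual square root in the denominator; the function name adam_step_adaptation is ours.

import numpy as np

def adam_step_adaptation(f, grad_f, x0, a0=0.1, n=100,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    # Adam combined with the same multiplicative step-size adaptation as Algorithm 1.
    x = np.asarray(x0, dtype=float)
    f_prev = f(x)
    g = grad_f(x)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    a = a0
    for i in range(1, n + 1):
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g**2       # second raw-moment estimate
        m_hat = m / (1 - beta1**i)               # bias-corrected estimates
        v_hat = v / (1 - beta2**i)
        x = x - a * m_hat / (np.sqrt(v_hat) + eps)
        f_cur = f(x)
        if f_cur < f_prev:
            a *= 1.2
        else:
            a *= 0.5
        f_prev = f_cur
        g = grad_f(x)
    return x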
In the process of training neural networks, the Adam algorithm is the most preferred optimization method because it converges faster and gives the required accuracy. However, this algorithm does not take into account the curvature of the $(n-1)$-dimensional surfaces of the function for $n \geq 2$. Therefore, it does not reach the global minimum for small steps in the case of functions with many local minima, such as the Rastrigin function, as shown in Section 4.2.

2.3. Background on Riemannian Gradient Flow

The main idea of natural gradient descent initially comes from Riemannian geometry, where the definitions of derivative, flow and curvature are generally described.
We denote by $(M, g)$ a Riemannian manifold, where $M = \mathbb{R}^n$ is the topological space and $g: M \times M \to \mathbb{R}$ is the metric tensor. For the manifold $M$ we can take the tangent spaces $T_x M = \mathbb{R}^n$ and the metric tensors $g(x) \in S_{++}^n$, where $S_{++}^n$ is the cone of real symmetric positive definite matrices [16]. The tensor $g(x, x+\delta x)$ defines the local distance at $x$ as $d(x, x+\delta x)^2 = \delta x^T g(x, \delta x)\,\delta x$ for small $\delta x$. The gradient vector field of the minimized function $f$ restricted to $\Omega$ [16] is denoted as
$$\nabla^{g} f\big|_{\Omega} = g(x, x+\delta x)^{-1}\,\nabla f(x). \qquad (1)$$
Information geometry [17] studies manifolds of probability distributions, e.g., a parametric family $\{p(x; \theta) : \theta \in \mathbb{R}^n\}$, endowed with a metric derived from the Kullback–Leibler divergence formula. The natural gradient is defined on such manifolds.

2.4. Natural Gradient Descent and K–L Divergence

NGD in [11] is obtained as the forward Euler discretization, with step size $\eta_k$, of the gradient flow (1):
$$x^{(k+1)} = x^{(k)} - \eta_k\, F(x^{(k)})^{-1} \nabla f(x^{(k)}), \qquad (2)$$
where $x^{(0)} = x_0$.
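As a minimal sketch of the update (2), assuming a user-supplied routine fisher(x) that returns $F(x)$ (a hypothetical helper), one natural-gradient step can be written in NumPy as follows.

import numpy as np

def ngd_step(x, grad_f, fisher, eta):
    # One forward Euler step of the natural gradient flow:
    # x_{k+1} = x_k - eta * F(x_k)^{-1} * grad f(x_k).
    g = grad_f(x)
    F = fisher(x)
    return x - eta * np.linalg.solve(F, g)  # solve F d = g instead of inverting F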
The Fisher information matrix $F(x^{(k)})$ from [11,14,17] can be calculated on a manifold of probability distributions, whose curvature we use to minimize the continuous function $f(\theta)$. Suppose that $p(x; \theta)$ is a family of probability distributions over the space variable $x$ with a parameter vector $\theta \in \mathbb{R}^n$. Let us introduce the K–L divergence:
$$KL\big(p(x;\theta_t)\,\|\,p(x;\theta_t+\delta\theta)\big) = \int p(x;\theta_t)\,\log\frac{p(x;\theta_t)}{p(x;\theta_t+\delta\theta)}\,dx = \int p(x;\theta_t)\log p(x;\theta_t)\,dx - \int p(x;\theta_t)\log p(x;\theta_t+\delta\theta)\,dx. \qquad (3)$$
Next, we write the second-order Taylor expansion of the function $f$ as
$$f(\theta) \approx f(\theta_t) + \nabla f(\theta_t)^T \delta\theta + \frac{1}{2}\,\delta\theta^T \nabla^2 f(\theta_t)\,\delta\theta, \qquad (4)$$
where $\theta = \theta_t + \delta\theta$ and $\nabla^2 f$ is the Hessian matrix.
Substituting the expansion (4), written for $\log p(x;\theta_t+\delta\theta)$, into (3), we obtain
$$KL\big(p(x;\theta_t)\,\|\,p(x;\theta_t+\delta\theta)\big) \approx \int p(x;\theta_t)\log\frac{p(x;\theta_t)}{p(x;\theta_t)}\,dx - \Big(\int \nabla p(x;\theta_t)\,dx\Big)^{T}\delta\theta - \frac{1}{2}\,\delta\theta^{T}\Big(\int p(x;\theta_t)\,\nabla^2\log p(x;\theta_t)\,dx\Big)\delta\theta.$$
The first two integrals are equal to zero, because $\log\frac{p(x;\theta_t)}{p(x;\theta_t)} = \log 1 = 0$ and
$$\int \nabla p(x;\theta_t)\,dx = \nabla\!\int p(x;\theta_t)\,dx = \nabla 1 = 0.$$
Therefore, we obtain the K–L divergence for the probability distribution $p(x;\theta_t)$:
$$KL\big(p(x;\theta_t)\,\|\,p(x;\theta_t+\delta\theta)\big) \approx -\frac{1}{2}\,\delta\theta^{T}\Big(\int p(x;\theta_t)\,\nabla^2\log p(x;\theta_t)\,dx\Big)\delta\theta.$$
Let us consider the Hessian $\nabla^2\log p(x;\theta_t)$, which, under the expectation with respect to $p(x;\theta_t)$, can be written as the negative outer product of the score:
$$\mathbb{E}\big[\nabla^2\log p(x;\theta_t)\big] = -\,\mathbb{E}\big[\nabla\log p(x;\theta_t)\,\nabla\log p(x;\theta_t)^{T}\big].$$
Then, the K–L divergence takes the following form:
$$KL\big(p(x;\theta_t)\,\|\,p(x;\theta_t+\delta\theta)\big) \approx \frac{1}{2}\,\delta\theta^{T}\,\mathbb{E}\big[\nabla\log p(x;\theta_t)\,\nabla\log p(x;\theta_t)^{T}\big]\,\delta\theta = \frac{1}{2}\,\delta\theta^{T} F(\theta_t)\,\delta\theta,$$
where $F(\theta_t)$ is the Fisher information matrix, which is a Riemannian structure on a manifold of probability distributions.
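The following illustrative sketch checks this definition numerically by estimating $F(\theta)$ as the Monte Carlo average of outer products of score vectors, taking the Dirichlet family (used in Section 3) as an example; the helper names dirichlet_score and empirical_fisher are ours.

import numpy as np
from scipy.special import digamma

def dirichlet_score(x, alpha):
    # Gradient of log p(x; alpha) with respect to alpha for the Dirichlet density;
    # broadcasts over the rows of x.
    return digamma(alpha.sum()) - digamma(alpha) + np.log(x)

def empirical_fisher(alpha, n_samples=100_000, seed=0):
    # Monte Carlo estimate of F(alpha) = E[score score^T] under p(x; alpha).
    alpha = np.asarray(alpha, dtype=float)
    rng = np.random.default_rng(seed)
    X = rng.dirichlet(alpha, size=n_samples)
    S = dirichlet_score(X, alpha)
    return S.T @ S / n_samples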

3. Theoretical Calculations

Fisher Matrix for Dirichlet and Generalized Dirichlet Distributions

The Dirichlet distribution of order $K \geq 2$ with parameters $\alpha_1, \ldots, \alpha_K > 0$ [18] has a probability density function, with respect to the Lebesgue measure on the Euclidean space $\mathbb{R}^{K-1}$, given by
$$p(x_1,\ldots,x_K;\alpha_1,\ldots,\alpha_K) = \frac{1}{B(\alpha)}\prod_{i=1}^{K} x_i^{\alpha_i-1}, \qquad B(\alpha) = \frac{\prod_{i}\Gamma(\alpha_i)}{\Gamma\big(\sum_{i}\alpha_i\big)},$$
where $\{x_i\}_{i=1}^{K}$ belongs to the $(K-1)$-simplex.
Let us calculate the logarithm of the Dirichlet density:
$$\log p(x_1,\ldots,x_K;\alpha_1,\ldots,\alpha_K) = \log\!\left[\frac{\Gamma\big(\sum_{i}\alpha_i\big)}{\prod_{i}\Gamma(\alpha_i)}\prod_{i=1}^{K} x_i^{\alpha_i-1}\right] = \log\Gamma\Big(\sum_{i=1}^{K}\alpha_i\Big) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) + \sum_{i=1}^{K}(\alpha_i-1)\log x_i.$$
Afterwards, we take the second-order partial derivatives of $\log p$ with respect to $\alpha$ and obtain
$$\frac{\partial^2}{\partial\alpha_j\,\partial\alpha_k}\log p = \psi'\Big(\sum_{i=1}^{K}\alpha_i\Big),\quad j\neq k, \qquad \frac{\partial^2}{\partial\alpha_j^{2}}\log p = \psi'\Big(\sum_{i=1}^{K}\alpha_i\Big) - \psi'(\alpha_j),$$
where $\psi'$ denotes the trigamma function.
Hence, we have the Fisher information matrix of the Dirichlet distribution:
$$F_{Dir}(\alpha) = \begin{pmatrix} \psi'(\alpha_1) - \psi'\big(\sum_i\alpha_i\big) & \cdots & -\psi'\big(\sum_i\alpha_i\big) \\ \vdots & \ddots & \vdots \\ -\psi'\big(\sum_i\alpha_i\big) & \cdots & \psi'(\alpha_K) - \psi'\big(\sum_i\alpha_i\big) \end{pmatrix}.$$
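A minimal sketch of this matrix in Python, using the trigamma function polygamma(1, ·) from SciPy; the helper name fisher_dirichlet is ours.

import numpy as np
from scipy.special import polygamma

def fisher_dirichlet(alpha):
    # Fisher information matrix of the Dirichlet distribution:
    # diagonal psi'(alpha_j) - psi'(sum alpha), off-diagonal -psi'(sum alpha).
    alpha = np.asarray(alpha, dtype=float)
    trigamma_sum = polygamma(1, alpha.sum())
    return np.diag(polygamma(1, alpha)) - trigamma_sum * np.ones((alpha.size, alpha.size))

# Example with illustrative parameters: F = fisher_dirichlet([2.0, 3.0, 4.0, 5.0])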
The generalized Dirichlet distribution [18], defined for $x_1 + \cdots + x_K \leq 1$ and $\alpha_i > 0$, $\beta_i > 0$, $i = 1,\ldots,K$, has a probability density function given by
$$p(x_1,\ldots,x_K;\alpha_1,\ldots,\alpha_K,\beta_1,\ldots,\beta_K) = \prod_{i=1}^{K}\frac{1}{B(\alpha_i,\beta_i)}\,x_i^{\alpha_i-1}\Big(1-\sum_{j=1}^{i}x_j\Big)^{\gamma_i}, \qquad (7)$$
where $\gamma_i = \beta_i - \alpha_{i+1} - \beta_{i+1}$ for $i = 1,\ldots,K-1$ and $\gamma_K = \beta_K - 1$.
The logarithm of (7) is
$$\log p = \log\prod_{i=1}^{K}\frac{\Gamma(\alpha_i+\beta_i)}{\Gamma(\alpha_i)\Gamma(\beta_i)}\,x_i^{\alpha_i-1}\Big(1-\sum_{j=1}^{i}x_j\Big)^{\gamma_i} = \sum_{i=1}^{K}\log\Gamma(\alpha_i+\beta_i) - \sum_{i=1}^{K}\log\Gamma(\alpha_i) - \sum_{i=1}^{K}\log\Gamma(\beta_i) + \sum_{i=1}^{K}(\alpha_i-1)\log x_i + \sum_{i=1}^{K}\gamma_i\log\Big(1-\sum_{j=1}^{i}x_j\Big).$$
The second-order partial derivatives of $\log p(x;\alpha,\beta)$ are
$$(1)\ \frac{\partial^2}{\partial\alpha_j\,\partial\alpha_l}\log p = \frac{\partial^2}{\partial\beta_j\,\partial\beta_l}\log p = \frac{\partial^2}{\partial\alpha_j\,\partial\beta_l}\log p = 0,\quad j\neq l;$$
$$(2)\ \frac{\partial^2}{\partial\alpha_j^{2}}\log p = \psi'(\alpha_j+\beta_j) - \psi'(\alpha_j), \qquad \frac{\partial^2}{\partial\beta_j^{2}}\log p = \psi'(\alpha_j+\beta_j) - \psi'(\beta_j);$$
$$(3)\ \frac{\partial^2}{\partial\alpha_j\,\partial\beta_j}\log p = \frac{\partial^2}{\partial\beta_j\,\partial\alpha_j}\log p = \psi'(\alpha_j+\beta_j).$$
Then, the Fisher matrix of the generalized Dirichlet distribution has the following block-diagonal form:
$$F_{GenDir}(\alpha,\beta) = \begin{pmatrix} \Psi_1 & O & \cdots & O \\ O & \Psi_2 & \cdots & O \\ \vdots & \vdots & \ddots & \vdots \\ O & O & \cdots & \Psi_K \end{pmatrix},$$
where
$$\Psi_i = \begin{pmatrix} \psi'(\alpha_i) - \psi'(\alpha_i+\beta_i) & -\psi'(\alpha_i+\beta_i) \\ -\psi'(\alpha_i+\beta_i) & \psi'(\beta_i) - \psi'(\alpha_i+\beta_i) \end{pmatrix}$$
and $O$ is the $2\times 2$ zero matrix.
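A corresponding sketch of the block-diagonal matrix, with one 2x2 block per pair $(\alpha_i, \beta_i)$; the helper name fisher_generalized_dirichlet is ours.

import numpy as np
from scipy.special import polygamma
from scipy.linalg import block_diag

def fisher_generalized_dirichlet(alpha, beta):
    # Block-diagonal Fisher matrix of the generalized Dirichlet distribution,
    # one 2x2 block Psi_i per parameter pair (alpha_i, beta_i).
    blocks = []
    for a, b in zip(alpha, beta):
        t_ab = polygamma(1, a + b)
        blocks.append(np.array([[polygamma(1, a) - t_ab, -t_ab],
                                [-t_ab, polygamma(1, b) - t_ab]]))
    return block_diag(*blocks)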
Using the Fisher information matrices of the Dirichlet and generalized Dirichlet distributions and adding step-size adaptation, we propose Algorithm 3.
Algorithm 3 Natural Gradient Descent with Dirichlet and Generalized Dirichlet Distribution
Input: $x_0 \in \mathbb{R}^n$ (starting point), $f$ (scalar function), $\nabla f$ (gradient), $a_0$ (initial step size), $n$ (number of iterations)
Output: some $x_n$ minimizing $f$
1: initialize $f_0 = f(x_0) \in \mathbb{R}$, $g_0 = \nabla f(x_0)^T \in \mathbb{R}^n$ and the Fisher matrix $F$
2: for $i$ from 1 to $n$ do
3:   $x_i \leftarrow x_{i-1} - a_{i-1}\, F^{-1} g_{i-1} / |g_{i-1}|$
4:   $f_i \leftarrow f(x_i)$
5:   if $f_i < f_{i-1}$ then
6:     $f_i \leftarrow f_{i-1}$
7:     $a_i \leftarrow 1.2\, a_{i-1}$
8:   else
9:     $a_i \leftarrow 0.5\, a_{i-1}$
10:  end if
11:  $g_i \leftarrow \nabla f(x_i)^T$
12: end for
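A minimal Python sketch of Algorithm 3 is given below (not the authors' original code); it reuses the step-size adaptation of Algorithm 1 and assumes that the Fisher matrix F, e.g., fisher_dirichlet from above, stays constant inside the loop, as in the pseudocode.

import numpy as np

def ngd_step_adaptation(f, grad_f, F, x0, a0=0.1, n=100):
    # Natural gradient descent with step-size adaptation (Algorithm 3);
    # F is the Fisher matrix of the chosen distribution.
    x = np.asarray(x0, dtype=float)
    f_prev = f(x)
    g = grad_f(x)
    a = a0
    F_inv = np.linalg.inv(F)   # F does not depend on x, so it is inverted once
    for _ in range(n):
        x = x - a * F_inv @ g / (np.linalg.norm(g) + 1e-12)
        f_cur = f(x)
        if f_cur < f_prev:
            a *= 1.2
        else:
            a *= 0.5
        f_prev = f_cur
        g = grad_f(x)
    return x

For example, x_min = ngd_step_adaptation(f, grad_f, fisher_dirichlet([2.0, 2.0, 2.0, 2.0]), x0=np.full(4, 3.0)) runs the method on a four-dimensional problem; the parameter values here are purely illustrative.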
Note that, in Algorithm 3, it is unnecessary to reduce the step length or the numerical value of the gradient to improve the final values of the extremes. The Fisher matrix contains only the distribution parameters and no components of the vector $x$, which allows us to avoid additional computations in the loop. By including curvature properties through the Fisher matrix, the natural gradient reaches extremes faster. Finally, the Fisher information matrix of the generalized Dirichlet distribution is useful only in the case of $2n$-dimensional surfaces, where $n \in \mathbb{N}$.

4. Experimental Part

4.1. Four-Dimensional Case

In the experiments, we observe the behavior of gradient descent with step-size adaptation, the Adam algorithm and natural gradient descent with Dirichlet and generalized Dirichlet distributions, implemented in Python 3.8.10. We choose convex and smooth functions for solving the optimization problem.
Initial points and parameters are defined for every function. This is intended to determine the proper distribution for every experimental function.
In the first experiment, we minimize the Rayden function, which is defined as
$$f(x) = \sum_{i=1}^{4}\big(\exp(x_i) - x_i\big),$$
with the global minimum at $x = (0, 0, 0, 0)$, where $f(x) = 4$.
Figure 1 shows that NGD with the generalized Dirichlet distribution has the fastest convergence and achieves a minimal value equal to 4 + 6e−9. For the Dirichlet distribution, the optimization is fast enough and gives the least value 4 + 1e−9. For the Adam algorithm, the minimum value is 4 + 2e−9. GD with step-size adaptation gives 4 + 2e−8.
The second minimization is implemented on the generalized Rosenbrock function, which has the form
$$f(x) = \sum_{i=1}^{3}\Big[100\big(x_{i+1} - x_i^2\big)^2 + (1 - x_i)^2\Big],$$
where the global minimum is equal to 0 at $x = (1, 1, 1, 1)$.
Figure 2 shows that NGD with the Dirichlet and generalized Dirichlet distributions has the fastest convergence and achieves minimal values equal to 0.01095 and 0.22170, respectively. For the Adam algorithm, the minimum value is 0.01391. GD with step-size adaptation reaches 0.14325. We can see that the proposed Algorithm 3 converges toward the global minimum much earlier and gives the required accuracy compared with the well-known analogs.
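For reference, the Rayden and generalized Rosenbrock functions and their gradients can be coded as follows (a sketch; the gradient expressions are derived by us from the formulas above).

import numpy as np

def rayden(x):
    # f(x) = sum_i (exp(x_i) - x_i); global minimum 4 at x = 0.
    return np.sum(np.exp(x) - x)

def rayden_grad(x):
    return np.exp(x) - 1.0

def gen_rosenbrock(x):
    # f(x) = sum_i [100 (x_{i+1} - x_i^2)^2 + (1 - x_i)^2]; global minimum 0 at x = 1.
    return np.sum(100.0 * (x[1:] - x[:-1]**2)**2 + (1.0 - x[:-1])**2)

def gen_rosenbrock_grad(x):
    g = np.zeros_like(x)
    g[:-1] = -400.0 * x[:-1] * (x[1:] - x[:-1]**2) - 2.0 * (1.0 - x[:-1])
    g[1:] += 200.0 * (x[1:] - x[:-1]**2)
    return g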
The extended trigonometric function is
$$f(x) = \sum_{i=1}^{4}\Big[4 - \sum_{j=1}^{4}\cos x_j + i\cos x_i - \sin x_i\Big]^2,$$
which has the global minimum at $x = (\pi/2, \pi/2, \pi/2, \pi/2)$, where $f(x) = 0$.
Figure 3 demonstrates that NGD with the Dirichlet distribution reaches the minimum value 1.57389e−10. For Adam, the minimal value is 2.24709e−9. GD with step-size adaptation shows 5.70724e−9. NGD with the generalized Dirichlet distribution descends to 1.50210e−9. Algorithm 3 rapidly approaches the global minimum but takes 6–9 iterations for full convergence. However, compared with Algorithms 1 and 2, our approach gives the highest accuracy.
As we can see, the Fisher information matrix accelerates the convergence, which allows the optimizer in neural networks to work faster.

4.2. Three-Dimensional Case

In three-dimensional space, we can provide the convergence graph and the descent trajectory for each optimization method. The gradient descent and Adam algorithms with step-size adaptation remain unchanged, except for the dimension of the variable $x$. However, the Dirichlet and generalized Dirichlet distributions reduce to the beta distribution
$$p(x; a, b) = \frac{1}{B(a, b)}\,x^{a-1}(1-x)^{b-1},$$
where
$$B(a, b) = \int_{0}^{1} t^{a-1}(1-t)^{b-1}\,dt = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)},$$
for $0 < x < 1$ and $a > 0$, $b > 0$.
The beta distribution is the form that the Dirichlet and generalized Dirichlet distributions take in three-dimensional Euclidean space. Hence, the Fisher matrix of the beta distribution [15] is
$$F_{Beta}(a, b) = \begin{pmatrix} \psi'(a) - \psi'(a+b) & -\psi'(a+b) \\ -\psi'(a+b) & \psi'(b) - \psi'(a+b) \end{pmatrix}.$$
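A minimal sketch of this matrix, analogous to fisher_dirichlet above; the helper name fisher_beta is ours.

import numpy as np
from scipy.special import polygamma

def fisher_beta(a, b):
    # Fisher matrix of the beta distribution with parameters a and b.
    t_ab = polygamma(1, a + b)
    return np.array([[polygamma(1, a) - t_ab, -t_ab],
                     [-t_ab, polygamma(1, b) - t_ab]])

The resulting 2x2 matrix can be passed directly to the ngd_step_adaptation sketch from Section 3 for the two-variable experiments below.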
In the case of two-dimensional surfaces, we can observe the graphs and descent trajectories of each optimization method. This allows us to understand how every algorithm works and to estimate its efficiency.
The surface described as
$$f(x, y) = \sin\Big(\frac{1}{2}x^2 - \frac{1}{4}y^2 + 3\Big)\cos\big(2x + 1 - e^{y}\big)$$
is a Sine–Cosine function with initial point $(x, y) = (5, 5)$.
The best result is given by the beta distribution, which achieves −1 + 2e−8. Adam shows −1 + 6e−8, but it converges more slowly. The gradient descent moves in the wrong direction and achieves −0.04198.
The second simulation was implemented on the Rastrigin function
$$f(x) = A\,n + \sum_{i=1}^{n}\big[x_i^2 - A\cos(2\pi x_i)\big],$$
where $A = 10$ and $x_i \in [-5.12, 5.12]$. It has the global minimum at $x = (0, 0)$, where $f(x) = 0$.
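For reference, the Rastrigin function and its gradient can be coded as follows (a sketch; the gradient is derived by us from the formula above).

import numpy as np

def rastrigin(x, A=10.0):
    # f(x) = A n + sum_i [x_i^2 - A cos(2 pi x_i)]; global minimum 0 at x = 0.
    return A * x.size + np.sum(x**2 - A * np.cos(2.0 * np.pi * x))

def rastrigin_grad(x, A=10.0):
    return 2.0 * x + 2.0 * np.pi * A * np.sin(2.0 * np.pi * x)

Plugging rastrigin and rastrigin_grad into the ngd_step_adaptation sketch together with fisher_beta is one possible way to reproduce this experiment; the distribution parameters and the starting point in such a run are illustrative and not taken from the paper.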
This function contains many local minima, and a method such as gradient descent will not achieve the global minimum with a small step size. For Adam, the step size needs to be greater than 1.6 . However, for natural gradient descent with beta distribution, the step size can be less than 0.5 .
The NGD with the beta distribution reached the neighborhood of the global minimum and gave 0.69984. Adam and GD reached local minima with values 12.93451 and 12.93446, respectively, which is not sufficient for the minimization. Taking into account the convexity of the minimized function, Algorithm 3 could descend to the global minimum, unlike its analogs.
The third simulation was implemented on the Rosenbrock function
$$f(x) = \sum_{i=1}^{n-1}\big[100(x_{i+1} - x_i^2)^2 + (1 - x_i)^2\big].$$
This function has a global minimum at $x = (1, 1)$, where $f(x) = 0$.
In this case, we start the descent in the area of local minima, from which each method can, in principle, approach the global minimum.
The NGD with the beta distribution achieved the minimum value 0.00082 in the smallest number of iterations. The GD moves along the area of local minima but, because of the insufficient number of iterations, stops at the value 4.36387. Adam shows a behavior similar to GD and reaches the value 0.02243, which is not an improvement compared with NGD. As a result, we can see that the proposed algorithm gave the smallest value while taking a significantly smaller number of iterations.
Let us summarize the results in Table 1 and Table 2 below.
According to the graphs in Figure 4b, Figure 5b and Figure 6b, we can also report the number of iterations in Table 2. We can conclude that natural gradient descent with the Dirichlet (beta) distribution works better than the known analogs. It is simple to implement in software and gives the best results of the optimization process in the smallest number of iterations.

5. Discussion

According to the results of the experiments, we can conclude that the proposed natural gradient descent with Dirichlet distributions is able to descend into the neighborhood of the global minimum. Unlike gradient descent [6] and Adam [5], our approach takes into account not only the gradient direction but also the convexity of the minimized function, which allows it to avoid local extremes. Moreover, the accuracy is higher compared with the known optimization methods.
In [14], which is a continuation of [15], we explored natural gradient descent based on the Dirichlet distribution. In this research, we added and calculated the Fisher information matrix of the generalized Dirichlet distribution. Moreover, we presented the descent trajectories of the proposed algorithm in the projected 2D space, where the Dirichlet distributions reduce to the beta distribution. We verified the capability of the proposed algorithm to converge in the neighborhood of the global minimum for the Rastrigin and Rosenbrock functions, where the known algorithms do not reach the global minimum. Such experiments differ from the experiments in [14,15].
The developed method allows us to minimize loss functions of various types, which increases the accuracy of neural networks. Such an approach finds its application in convolutional and recurrent neural networks. Moreover, applying natural gradient descent in spiking and quantum neural networks can extend the class of problems from recognition and prediction to temporal dynamics [19] and quantum computing [20]. There is also the possibility of applying natural gradient descent with Dirichlet distributions in physics-informed neural networks [21], where the accuracy of the solutions fully depends on the quality of the minimization of the loss function.
In further research, we can examine the behavior of the natural gradient descent algorithm with the Fisher matrix of other distributions, such as the gamma [22], Gompertz [23] or Gumbel [24] distributions. Moreover, we can reduce the Riemannian gradient flow to another smooth manifold besides the manifold of probability distributions, which can potentially facilitate the optimization performed by the method.

Author Contributions

Conceptualization, R.A.; Formal analysis, R.A.; Funding acquisition, P.L., N.N.; Investigation, R.A.; Methodology, P.L.; Project administration, P.L., N.N.; Resources, R.A.; Supervision, P.L.; Writing—original draft, R.A.; Writing—review and editing, P.L. All authors have read and agreed to the published version of the manuscript.

Funding

The research in Section 3 was supported by the Russian Science Foundation (Project No. 21-71-00017). The research in Section 4 was supported by the Russian Science Foundation (Project No. 22-71-00009). The remainder of the work was supported by the North Caucasus Center for Mathematical Research with the Ministry of Science and Higher Education of the Russian Federation.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to thank the North Caucasus Federal University for their support in the project competitions of scientific groups and individual scientists of the North Caucasus Federal University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Ward, R.; Wu, X.; Bottou, L. AdaGrad Stepsizes: Sharp Convergence Over Nonconvex Landscapes. J. Mach. Learn. Res. 2020, 21, 1–30. [Google Scholar]
  2. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
  3. Xu, D.; Zhang, S.; Zhang, H.; Mandic, D.P. Convergence of the RMSProp deep learning method with penalty for nonconvex optimization. Neural Netw. 2021, 139, 17–23. [Google Scholar] [CrossRef] [PubMed]
  4. Qu, Z.; Yuan, S.; Chi, R.; Chang, L.; Zhao, L. Genetic Optimization Method of Pantograph and Catenary Comprehensive Monitor Status Prediction Model Based on Adadelta Deep Neural Network. IEEE Access 2019, 7, 23210–23221. [Google Scholar] [CrossRef]
  5. Wu, H.-P.; Li, L. The BP Neural Network with Adam Optimizer for Predicting Audit Opinions of Listed Companies. IAENG Int. J. Comput. Sci. 2021, 48, 364–368. [Google Scholar]
  6. Toussaint, M. Lecture Notes: Some Notes on Gradient Descent; Machine Learning & Robotics Lab, FU Berlin: Berlin, Germany, 2012. [Google Scholar]
  7. Wang, S.; Teng, Y.; Perdikaris, P. Understanding and Mitigating Gradient Flow Pathologies in Physics-Informed Neural Networks. SIAM J. Sci. Comput. 2021, 43, 3055–3081. [Google Scholar] [CrossRef]
  8. Martens, J. New Insights and Perspectives on the Natural Gradient Method. J. Mach. Learn. Res. 2020, 21, 1–76. [Google Scholar]
  9. Huang, Y.; Zhang, Y.; Chambers, J.A. A Novel Kullback–Leibler Divergence Minimization-Based Adaptive Student’s t-Filter. IEEE Trans. Signal Process. 2019, 67, 5417–5432. [Google Scholar] [CrossRef]
  10. Asperti, A.; Trentin, M. Balancing Reconstruction Error and Kullback-Leibler Divergence in Variational Autoencoders. IEEE Access 2019, 8, 199440–199448. [Google Scholar] [CrossRef]
  11. Heck, D.W.; Moshagen, M.; Erdfelder, E. Model selection by minimum description length: Lower-bound sample sizes for the Fisher information approximation. J. Math. Psychol. 2014, 60, 29–34. [Google Scholar] [CrossRef]
  12. Spall, J.C. Monte Carlo Computation of the Fisher Information Matrix in Nonstandard Settings. J. Comput. Graph. Stat. 2012, 14, 889–909. [Google Scholar] [CrossRef]
  13. Alvarez, F.; Bolte, J.; Brahic, O. Hessian Riemannian Gradient Flows in Convex Programming. Soc. Ind. Appl. Math. 2004, 43, 68–73. [Google Scholar] [CrossRef]
  14. Abdulkadirov, R.; Lyakhov, P. Improving Extreme Search with Natural Gradient Descent Using Dirichlet Distribution. In Mathematics and Its Applications in New Computer Systems; MANCS 2021; Lecture Notes in Networks and Systems; Springer: Cham, Switzerland, 2022; Volume 424, pp. 19–28. [Google Scholar]
  15. Lyakhov, P.; Abdulkadirov, R. Accelerating Extreme Search Based on Natural Gradient Descent with Beta Distribution. In Proceedings of the 2021 International Conference Engineering and Telecommunication (En&T), Online, 24–25 November 2021; pp. 1–5. [Google Scholar]
  16. Celledoni, E.; Eidnes, S.; Owren, B.; Ringholm, T. Dissipative Numerical Schemes on Riemannian Manifolds with Applications to Gradient Flows. SIAM J. Sci. Comput. 2018, 40, A3789–A3806. [Google Scholar] [CrossRef]
  17. Liao, Z.; Drummond, T.; Reid, I.; Carneiro, G. Approximate Fisher Information Matrix to Characterize the Training of Deep Neural Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 15–26. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  18. Wong, T.-T. Generalized Dirichlet distribution in Bayesian analysis. Appl. Math. Comput. 1998, 97, 165–181. [Google Scholar] [CrossRef]
  19. Wang, X.; Lin, X.; Dang, X. Supervised learning in spiking neural networks: A review of algorithms and evaluations. Neural Netw. 2020, 125, 258–280. [Google Scholar] [CrossRef] [PubMed]
  20. Abbas, A.; Sutter, D.; Zoufal, C.; Figalli, A. The power of quantum neural networks. Nat. Comput. Sci. 2021, 1, 403–409. [Google Scholar] [CrossRef]
  21. Guo, Y.; Cao, X.; Liu, B.; Gao, M. Solving Partial Differential Equations Using Deep Learning and Physical Constraints. Appl. Sci. 2020, 10, 5917. [Google Scholar] [CrossRef]
  22. Klakattawi, H.S. The Weibull-Gamma Distribution: Properties and Applications. Entropy 2019, 21, 438. [Google Scholar] [CrossRef] [PubMed]
  23. Bantan, R.A.R.; Jamal, F.; Chesneau, C.; Elgarhy, M. Theory and Applications of the Unit Gamma/Gompertz Distribution. Mathematics 2021, 9, 1850. [Google Scholar]
  24. Gómez, M.Y.; Bolfarine, H.; Gómez, H.W. Gumbel distribution with heavy tails and applications to environmental data. Math. Comput. Simul. 2019, 157, 115–129. [Google Scholar] [CrossRef]
Figure 1. The rate of convergence on Rayden function using various algorithms.
Figure 2. The rate of convergence on generalized Rosenbrock function using various algorithms.
Figure 3. The rate of convergence on extended trigonometric function using various algorithms.
Figure 4. Experiment on Sine–Cosine function: (a) the appearance of the function; (b) the trajectory of movement to a minimum using various algorithms.
Figure 5. Experiment on Rastrigin function: (a) the appearance of the function; (b) the trajectory of movement to a minimum using various algorithms.
Figure 6. Experiment on Rosenbrock function: (a) the appearance of the function; (b) the trajectory of movement to a minimum using various algorithms.
Table 1. Minimum values achieved by various algorithms.

Function | GD [6] | Adam [5] | Proposed
Sine–Cosine | −0.04198 | −1 + 6e−8 | −1 + 2e−8
Rastrigin | 12.93446 | 12.93451 | 0.69984
Rosenbrock | 4.36387 | 0.02243 | 0.00082
Table 2. Number of iterations by various algorithms.

Function | GD [6] | Adam [5] | Proposed
Sine–Cosine | 19 | 100 | 26
Rastrigin | 5 | 12 | 32
Rosenbrock | >500 | 200 | 20