Article

Training Deep Neural Networks Using Conjugate Gradient-like Methods

Department of Computer Science, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan
*
Author to whom correspondence should be addressed.
Electronics 2020, 9(11), 1809; https://doi.org/10.3390/electronics9111809
Submission received: 1 October 2020 / Revised: 24 October 2020 / Accepted: 28 October 2020 / Published: 2 November 2020
(This article belongs to the Section Artificial Intelligence)

Abstract

The goal of this article is to accelerate useful adaptive learning rate optimization algorithms, such as AdaGrad, RMSProp, Adam, and AMSGrad, for training deep neural networks. To reach this goal, we devise an iterative algorithm that combines the existing adaptive learning rate optimization algorithms with conjugate gradient-like methods, which are useful for constrained optimization. Convergence analyses show that the proposed algorithm with a small constant learning rate approximates a stationary point of a nonconvex optimization problem in deep learning. Furthermore, it is shown that the proposed algorithm with diminishing learning rates converges to a stationary point of the nonconvex optimization problem. The convergence and performance of the algorithm are demonstrated through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks.

1. Introduction

Deep neural networks are used for many tasks, such as natural language processing, computer vision, and text and image classification (see also [1,2,3] for applications of neural networks), and a number of algorithms have been presented to tune the model parameters of such networks. The appropriate parameters are found by solving nonconvex stochastic optimization problems. In particular, the algorithms solve these problems in order to adapt the learning rates of the model parameters. Accordingly, they are called adaptive learning rate optimization algorithms [4] (Subchapter 8.5), and they include AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7].
Recently, reference [8] performed convergence analyses on adaptive learning rate optimization algorithms for constant learning rates and diminishing learning rates. The convergence analyses indicated that the algorithms with sufficiently small constant learning rates approximate stationary points of the problems [8] (Theorem 3.1). This implies that useful algorithms, such as Adam and AMSGrad, can use constant learning rates to solve the nonconvex stochastic optimization problems in deep learning, in contrast to the analyses in [6,7], which assumed convex objective functions and diminishing learning rates. The analyses also indicated that the algorithms with diminishing learning rates converge to stationary points of the problems and achieve a certain convergence rate [8] (Theorem 3.2). Numerical comparisons showed that the algorithms with constant learning rates perform better than the ones with diminishing learning rates.
Meanwhile, conjugate gradient methods are useful for unconstrained nonconvex deterministic optimization (see [9] for details on conjugate gradient methods). These methods use the conjugate gradient direction (see also (2) for the definition of the conjugate gradient direction with the Fletcher-Reeves formula), and they accelerate the steepest descent method. Conjugate gradient methods converge globally and generate descent directions. In particular, the Hager-Zhang, Polak-Ribière-Polyak, and Hestenes-Stiefel methods have efficient numerical performance [9]. It would be appealing to apply conjugate gradient methods to constrained optimization, because they might accelerate the existing methods for constrained optimization. However, conjugate gradient methods may fail to converge to solutions of constrained optimization problems [10] (Proposition 3.2), so we cannot apply them directly. Indeed, the numerical results in [10] showed that conjugate gradient methods with conventional formulas, such as the Fletcher-Reeves, Polak-Ribière-Polyak, and Hestenes-Stiefel formulas, do not always converge to solutions of constrained optimization problems.
The conjugate gradient direction has been modified so that it can be applied to constrained optimization. The modified direction is called the conjugate gradient-like direction [10,11,12,13,14], and it is obtained by replacing the formula used for finding the conventional conjugate gradient direction with a positive real sequence depending on the number of iterations (see (1) for the definition of the conjugate gradient-like direction). The conjugate gradient-like method with the conjugate gradient-like direction can be applied to constrained convex deterministic optimization. In particular, the conjugate gradient-like method converges to solutions to constrained convex deterministic optimization problems when the step sizes (which are called learning rates) are diminishing [10] (Theorem 3.1). Moreover, the numerical results in [10] showed that it converges faster than the existing steepest descent method.
Roughly speaking, the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) are first-order methods using the steepest descent direction of an observed function at each iteration. Accordingly, using the conjugate gradient-like method would be useful to accelerate these algorithms. Hence, in this article, we propose an iterative method combining the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) with the conjugate gradient-like method [10,11,12,13,14].
This article provides two convergence analyses. The first analysis shows that with a small constant learning rate, the proposed algorithm approximates a stationary point of a nonconvex optimization problem in deep learning (Theorem 1). The second analysis shows that with diminishing learning rates, it converges to a stationary point of the nonconvex optimization problem (Theorem 2). The convergence and performance of the proposed algorithm are examined through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks, while the one with diminishing learning rates is not good for training neural networks.
This article is organized as follows. Section 2 gives the mathematical preliminaries and states the main problem. Section 3 presents the proposed algorithm for solving the main problem and analyzes its convergence. Section 4 numerically compares the behaviors of the proposed learning algorithms with those of the existing ones. Section 5 discusses the relationship between the previously reported results and the results in Section 3 and Section 4. Section 6 concludes the paper with a brief summary.

2. Mathematical Preliminaries

2.1. Notation and Definitions

$\mathbb{N}$ denotes the set of all positive integers and zero. $\mathbb{R}^d$ denotes a $d$-dimensional Euclidean space with inner product $\langle \cdot, \cdot \rangle$, which induces the norm $\|\cdot\|$. $\mathbb{S}^d$ denotes the set of $d \times d$ symmetric matrices, i.e., $\mathbb{S}^d = \{X \in \mathbb{R}^{d \times d} : X = X^\top\}$. $\mathbb{S}^d_{++}$ denotes the set of $d \times d$ symmetric positive-definite matrices, i.e., $\mathbb{S}^d_{++} = \{X \in \mathbb{S}^d : X \succ O\}$. $\mathbb{D}^d$ denotes the set of $d \times d$ diagonal matrices, i.e., $\mathbb{D}^d = \{X \in \mathbb{R}^{d \times d} : X = \mathrm{diag}(x_i),\ x_i \in \mathbb{R}\ (i = 1, 2, \ldots, d)\}$. $A \odot B$ denotes the Hadamard product of matrices $A$ and $B$. For all $x := (x_i) \in \mathbb{R}^d$, we have $x \odot x := (x_i^2) \in \mathbb{R}^d$.

Given $H \in \mathbb{S}^d_{++}$, the $H$-inner product of $\mathbb{R}^d$ and the $H$-norm are defined for all $x, y \in \mathbb{R}^d$ by $\langle x, y \rangle_H := \langle x, Hy \rangle$ and $\|x\|_H^2 := \langle x, Hx \rangle$.
The metric projection [15] (Subchapter 4.2, Chapter 28) onto a nonempty, closed convex set $X (\subset \mathbb{R}^d)$, denoted by $P_X$, is defined for all $x \in \mathbb{R}^d$ by $P_X(x) \in X$ and $\|x - P_X(x)\| = \inf_{y \in X}\|x - y\|$. $P_X$ satisfies the nonexpansivity condition, i.e., $\|P_X(x) - P_X(y)\| \le \|x - y\|$ ($x, y \in \mathbb{R}^d$), and satisfies $\mathrm{Fix}(P_X) := \{x \in \mathbb{R}^d : x = P_X(x)\} = X$ [15] (Proposition 4.8, (4.8)). The metric projection onto $X$ under the $H$-norm is denoted by $P_{X,H}$. When $X$ is an affine subspace, a half-space, or a hyperslab, the projection onto $X$ can be computed within a finite number of arithmetic operations [15] (Chapter 28).
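Because Algorithm 1 in Section 3 evaluates such a projection at every iteration, the following minimal NumPy sketch may help make this concrete; it is our own illustration (not code from the paper) of the closed-form projections onto a half-space and a hyperslab:

```python
import numpy as np

def project_halfspace(x, a, b):
    """Projection of x onto the half-space {y : <a, y> <= b}."""
    t = a @ x
    if t <= b:
        return x  # x is already feasible
    return x - ((t - b) / (a @ a)) * a

def project_hyperslab(x, a, lo, hi):
    """Projection of x onto the hyperslab {y : lo <= <a, y> <= hi}."""
    t = a @ x
    return x - ((t - np.clip(t, lo, hi)) / (a @ a)) * a

# Example: project a random point onto {y : -1 <= <a, y> <= 1}.
rng = np.random.default_rng(0)
a, x = rng.standard_normal(5), rng.standard_normal(5)
print(project_hyperslab(x, a, -1.0, 1.0))
```

Both computations finish in a fixed number of arithmetic operations, which is exactly the property used above.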
$\mathbb{E}[X]$ denotes the expectation of a random variable $X$. The history of the process $\xi_0, \xi_1, \ldots$ up to time $n$ is denoted by $\xi_{[n]} = (\xi_0, \xi_1, \ldots, \xi_n)$. For a random process $\xi_0, \xi_1, \ldots$, $\mathbb{E}[X \mid \xi_{[n]}]$ denotes the conditional expectation of $X$ given $\xi_{[n]} = (\xi_0, \xi_1, \ldots, \xi_n)$. Unless stated otherwise, all relations between random variables hold almost surely.

2.2. Stationary Point Problem Associated with Nonconvex Optimization Problem

Let us consider the following problem [8] (see, e.g., Subchapter 1.3.1 in [16] for details on stationary point problems):
Problem 1. 
Assume that
(A1)
$X \subset \mathbb{R}^d$ is a nonempty, closed convex set onto which the projection can be easily computed;
(A2)
$f : \mathbb{R}^d \to \mathbb{R}$, which is defined for all $x \in \mathbb{R}^d$ by $f(x) := \mathbb{E}[F(x,\xi)]$, is well defined, where $F(\cdot,\xi)$ is continuously differentiable for almost every $\xi \in \Xi$, and $\xi$ is a random vector whose probability distribution $P$ is supported on a set $\Xi \subset \mathbb{R}^{d_1}$.
Then, we would like to find a stationary point $x^\star$ of the problem of minimizing $f$ over $X$, i.e.,
$$x^\star \in X^\star := \left\{x^\star \in X : \langle x - x^\star, \nabla f(x^\star)\rangle \ge 0\ (x \in X)\right\},$$
where $\nabla f$ denotes the gradient of $f$.
We can see that, if $X = \mathbb{R}^d$, then $X^\star = \{x^\star \in \mathbb{R}^d : \nabla f(x^\star) = 0\}$, and that, if $f$ is convex, then $x^\star \in X^\star$ is a global minimizer of $f$ over $X$ [16] (Subchapter 1.3.1).
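A standard equivalent characterization (see, e.g., [16]) is that $x^\star \in X^\star$ if and only if $x^\star = P_X(x^\star - \alpha\nabla f(x^\star))$ for any $\alpha > 0$. The following Python sketch, our own illustration with an assumed box constraint and quadratic objective, uses this fixed-point residual to test stationarity:

```python
import numpy as np

def stationarity_residual(x, grad, project, alpha=1.0):
    """||x - P_X(x - alpha * grad(x))||: zero exactly at points of X*."""
    return np.linalg.norm(x - project(x - alpha * grad(x)))

# Example with X = [0, 1]^2 and f(x) = 0.5 * ||x - (2, 0.5)||^2; the minimizer
# of f over X is (1, 0.5), where the residual vanishes.
project = lambda y: np.clip(y, 0.0, 1.0)    # projection onto the box [0, 1]^2
grad = lambda x: x - np.array([2.0, 0.5])   # gradient of f
print(stationarity_residual(np.array([1.0, 0.5]), grad, project))  # 0.0
```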
Problem 1 is examined under the following conditions [8].
(C1)
There is an independent and identically distributed sample $\xi_0, \xi_1, \ldots$ of realizations of the random vector $\xi$;
(C2)
There is an oracle which, for a given input point $(x, \xi) \in \mathbb{R}^d \times \Xi$, returns a stochastic gradient $G(x, \xi)$ such that $\mathbb{E}[G(x,\xi)] = \nabla f(x)$;
(C3)
There exists a positive number $M$ such that, for all $x \in X$, $\mathbb{E}[\|G(x,\xi)\|^2] \le M^2$.
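For intuition, (C1)-(C3) hold, for example, for a least-squares loss with a bounded set $X$. The following Python sketch is our own illustrative oracle; the distribution, loss, and names are assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])

def sample_xi():
    """(C1): draw an i.i.d. realization xi = (a, b) of the random vector."""
    a = rng.standard_normal(3)
    b = a @ w_true + 0.1 * rng.standard_normal()
    return a, b

def G(x, xi):
    """(C2): stochastic gradient of F(x, xi) = 0.5 * (<a, x> - b)^2; it is an
    unbiased estimate of the gradient of f(x) = E[F(x, xi)]. On a bounded set
    X, E[||G(x, xi)||^2] is bounded as well, which gives (C3)."""
    a, b = xi
    return (a @ x - b) * a
```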

3. Conjugate Gradient-Like Method

Algorithm 1 is a method for solving Problem 1 under (C1)–(C3).
First, we would like to emphasize that Algorithm 1 uses a conjugate gradient-like direction [10,11,12,13] (see step 3 in Algorithm 1) defined by
$$\gamma_n = \gamma \in \left(0, \frac{1}{2}\right]\ \text{or}\ \frac{1}{n},\qquad G_n = G(x_n, \xi_n) - \gamma_n G_{n-1}. \tag{1}$$
The direction (1) differs from a conventional conjugate gradient direction using, for example, the Fletcher-Reeves formula,
$$\gamma_n^{\mathrm{FR}} = \frac{\|G(x_n,\xi_n)\|^2}{\|G(x_{n-1},\xi_{n-1})\|^2},\qquad G_n = G(x_n,\xi_n) - \gamma_n^{\mathrm{FR}} G_{n-1}. \tag{2}$$
Although conventional conjugate gradient methods are powerful tools for solving unconstrained smooth nonconvex optimization problems (see, e.g., [9] for details on conjugate gradient methods), iterative methods with the conjugate gradient-like directions are useful for solving constrained smooth optimization problems [10,11,12,13] (see also Section 1 for details). Since Problem 1 is a constrained optimization problem, we will focus on using conjugate gradient-like directions.
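The difference between the two directions is easiest to see in code. The following Python sketch is our own illustration of (1) and (2); the function names are ours:

```python
import numpy as np

def cg_like_direction(g, G_prev, gamma):
    """Conjugate gradient-like direction (1): G_n = G(x_n, xi_n) - gamma_n * G_{n-1},
    where gamma_n is a prescribed scalar (a constant in (0, 1/2] or 1/n)."""
    return g - gamma * G_prev

def fletcher_reeves_direction(g, g_prev, G_prev):
    """Conventional conjugate gradient direction (2): gamma_n^FR is computed
    from the stochastic gradients themselves via the Fletcher-Reeves formula."""
    gamma_fr = (g @ g) / (g_prev @ g_prev)
    return g - gamma_fr * G_prev
```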
Algorithm 1 Conjugate gradient-like method for solving Problem 1

Require: $(\alpha_n)_{n\in\mathbb{N}} \subset (0,1)$, $(\beta_n)_{n\in\mathbb{N}} \subset [0,1)$, $(\gamma_n)_{n\in\mathbb{N}} \subset [0,1/2]$, $\delta \in [0,1)$
 1: $n \leftarrow 0$, $x_0 \in \mathbb{R}^d$, $G_{-1}, m_{-1} \in \mathbb{R}^d$, $H_0 \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$
 2: loop
 3:  $G_n := G(x_n, \xi_n) - \gamma_n G_{n-1}$
 4:  $m_n := \beta_n m_{n-1} + (1-\beta_n)G_n$
 5:  $\hat{m}_n := (1-\delta^{n+1})^{-1} m_n$
 6:  $H_n \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$
 7:  Find $\mathsf{d}_n \in \mathbb{R}^d$ that solves $H_n \mathsf{d} = -\hat{m}_n$
 8:  $x_{n+1} := P_{X,H_n}(x_n + \alpha_n \mathsf{d}_n)$
 9:  $n \leftarrow n+1$
10: end loop
We can see that Algorithm 1 with $\gamma_n = 0$ ($n \in \mathbb{N}$) coincides with the existing algorithm in [8], defined by
$$G_n := G(x_n,\xi_n),\quad m_n := \beta_n m_{n-1} + (1-\beta_n)G_n,\quad \hat{m}_n := (1-\delta^{n+1})^{-1}m_n,\quad x_{n+1} := P_{X,H_n}\left(x_n - \alpha_n H_n^{-1}\hat{m}_n\right), \tag{3}$$
where $H_n \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$. We can also show that algorithm (3) (i.e., Algorithm 1 with $\gamma_n = 0$) includes AMSGrad [7] and Adam [6] by referring to [8] (Section 3). For example, consider $H_n$ and $v_n$ ($n \in \mathbb{N}$) defined for all $n \in \mathbb{N}$ by
$$v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, v_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right), \tag{4}$$
where $v_{-1} = \hat{v}_{-1} = 0 \in \mathbb{R}^d$ and $\zeta \in [0,1)$. Then, algorithm (3) with (4) and $\delta = 0$ is the AMSGrad algorithm. When $H_n$ and $v_n$ ($n \in \mathbb{N}$) are defined for all $n \in \mathbb{N}$ by
$$v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \bar{v}_n := (1-\zeta^{n+1})^{-1}v_n,\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, \bar{v}_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right), \tag{5}$$
algorithm (3) with (5) resembles the Adam algorithm. (The original Adam uses $H_n := \mathrm{diag}(\sqrt{\bar{v}_{n,i}})$ and does not always converge [7] (Theorems 1-3); we use $H_n := \mathrm{diag}(\sqrt{\hat{v}_{n,i}})$ to guarantee convergence. See Theorems 1 and 2 for the convergence of Algorithm 1.)
For example, let us consider Algorithm 1 with (4) and $\delta = 0$, i.e.,
$$G_n := G(x_n,\xi_n) - \gamma_n G_{n-1},\quad m_n := \beta_n m_{n-1} + (1-\beta_n)G_n,\quad v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, v_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\quad x_{n+1} := P_{X,H_n}\left(x_n - \alpha_n H_n^{-1}m_n\right). \tag{6}$$
From the above discussion, algorithm (6) with $\gamma_n = 0$ coincides with AMSGrad. We can see that algorithm (6) uses a conjugate gradient-like direction $G_n = G(x_n,\xi_n) - \gamma_n G_{n-1}$, while AMSGrad (algorithm (3) with (4)) uses a gradient direction $G_n = G(x_n,\xi_n)$.
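To make algorithm (6) concrete, the following minimal NumPy sketch is our own illustration of it (the paper's actual implementations are linked in Section 4). The small constant eps added to the diagonal of $H_n$ is a standard numerical-stability device in Adam-type implementations and is our own addition; since $\delta = 0$ here, $\hat{m}_n = m_n$.

```python
import numpy as np

def cg_like_amsgrad(x0, oracle, steps, alpha=1e-3, beta=1e-3, gamma=1e-3,
                    zeta=0.999, project=lambda x: x, eps=1e-8):
    """Sketch of algorithm (6): AMSGrad driven by the conjugate gradient-like
    direction. `oracle(x, n)` returns a stochastic gradient G(x_n, xi_n);
    `project` stands in for the metric projection (identity when X = R^d)."""
    x = x0.astype(float).copy()
    G_prev = np.zeros_like(x)   # G_{-1}
    m = np.zeros_like(x)        # m_{-1}
    v = np.zeros_like(x)        # v_{-1}
    v_hat = np.zeros_like(x)    # v^_{-1}
    for n in range(steps):
        g = oracle(x, n)
        G = g - gamma * G_prev               # step 3: CG-like direction (1)
        m = beta * m + (1.0 - beta) * G      # step 4
        v = zeta * v + (1.0 - zeta) * g * g  # (4): second-moment estimate
        v_hat = np.maximum(v_hat, v)         # (4): monotone maximum
        h = np.sqrt(v_hat) + eps             # diagonal of H_n
        x = project(x - alpha * m / h)       # step 8 with d_n = -H_n^{-1} m_n
        G_prev = G
    return x

# Toy usage: minimize f(x) = E[0.5 * (<a, x> - b)^2] with a linear oracle.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
def oracle(x, n):
    a = rng.standard_normal(3)
    return (a @ x - a @ w_true) * a
print(cg_like_amsgrad(np.zeros(3), oracle, steps=20000))  # approaches w_true
```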
The convergence analyses of Algorithm 1 assume the following conditions.
Assumption 1. 
The sequence $(H_n)_{n\in\mathbb{N}} \subset \mathbb{S}^d_{++} \cap \mathbb{D}^d$, denoted by $H_n := \mathrm{diag}(h_{n,i})$, in Algorithm 1 satisfies the following conditions:
(A3)
$h_{n+1,i} \ge h_{n,i}$ almost surely for all $n \in \mathbb{N}$ and all $i = 1, 2, \ldots, d$;
(A4)
For all $i = 1, 2, \ldots, d$, a positive number $B_i$ exists such that $\sup\{\mathbb{E}[h_{n,i}] : n \in \mathbb{N}\} \le B_i$.
Moreover,
(A5)
$D := \max_{i=1,2,\ldots,d}\sup\{(x_i - y_i)^2 : (x_i), (y_i) \in X\} < +\infty$.
Assumption (A5) holds under the boundedness condition of $X$, which is assumed in [17] (p. 1574) and [7] (p. 2). In [8] (Section 3), it is shown that $H_n$ and $v_n$ defined by (4) or (5) satisfy (A3) and (A4).

3.1. Constant Learning Rate Rule

The following is the convergence analysis of Algorithm 1 with a constant learning rate. Theorem 1 can be inferred by referring to the proof of Theorem 3.1 in [8]. The proof of Theorem 1 is given in Appendix A.
Theorem 1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n := \alpha$, $\beta_n := \beta$, and $\gamma_n := \gamma$ ($n \in \mathbb{N}$). Then, for all $x \in X$,
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge -\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha - \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta - \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
where $\tilde{\delta} := 1-\delta$, $\tilde{b} := 1-\beta$, $M$ is defined as in (C3), $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\}$, $\tilde{M}^2 := \max\{\|m_{-1}\|^2, 4\hat{M}^2\}$, $D$ is defined as in (A5), and $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} < +\infty$.
Theorem 1 shows that Algorithm 1 with a small constant learning rate approximates a solution to Problem 1. The result for $\gamma := 0$ coincides with Theorem 3.1 in [8].
We have the following proposition for convex stochastic optimization.
Proposition 1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold, $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n := \alpha$, $\beta_n := \beta$, and $\gamma_n := \gamma$ ($n \in \mathbb{N}$). Then,
$$\liminf_{n\to+\infty}\mathbb{E}\left[f(x_n) - f^\star\right] \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
where $f^\star$ denotes the optimal value of the problem of minimizing $f$ over $X$, and $\tilde{\delta}$, $\tilde{b}$, $M$, $\hat{M}$, $\tilde{M}$, $D$, and $\tilde{B}$ are defined as in Theorem 1.
The previously reported results in [7] showed that AMSGrad, which is an example of Algorithm 1 (see algorithm (3) with (4) and $\delta = 0$), ensures that there exists a positive real number $B$ such that
$$\frac{R(T)}{T} = \frac{1}{T}\sum_{t=1}^{T}\left(F(x_t,\xi_t) - f^\star\right) \le B\frac{\sqrt{1+\ln T}}{\sqrt{T}}, \tag{7}$$
where $T$ is the number of training examples and $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$. Inequality (7) indicates that the value $R(T)/T$ generated by AMSGrad has an upper bound; however, it is not guaranteed that AMSGrad solves Problem 1. Meanwhile, Proposition 1 shows that Algorithm 1, which includes Adam and AMSGrad, can approximate a global minimizer of $f$ by using a small constant learning rate.

3.2. Diminishing Learning Rate Rule

The following is the convergence analysis of Algorithm 1 with diminishing learning rates. Theorem 2 can be proven by referring to the proof of Theorem 3.2 in [8]. The proof of Theorem 2 is given in Appendix A.
Theorem 2. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n$, $\beta_n$, and $\gamma_n$ ($n \in \mathbb{N}$) satisfying $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Then, for all $x \in X$,
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge 0. \tag{8}$$
Moreover, suppose that $\alpha_n := 1/n^\eta$, $\beta_n := \beta^n$, and $\gamma_n := \gamma^n$ or $1/n^\kappa$, where $\eta \in [1/2, 1)$, $\kappa > 1-\eta$, and $\beta, \gamma \in (0,1)$. Then, Algorithm 1 achieves the following convergence rate:
$$\frac{1}{n}\sum_{k=1}^{n}\mathbb{E}\left[\langle x - x_k, \nabla f(x_k)\rangle\right] \ge \begin{cases} -O\left(\sqrt{\dfrac{1+\ln n}{n}}\right) & \text{if}\ \eta = \dfrac{1}{2},\\ -O\left(\dfrac{1}{n^{1-\eta}}\right) & \text{if}\ \eta \in \left(\dfrac{1}{2}, 1\right). \end{cases}$$
Inequality (8) implies that there exists a subsequence $(x_{n_j})_{j\in\mathbb{N}}$ of $(x_n)_{n\in\mathbb{N}}$ such that $(x_{n_j})_{j\in\mathbb{N}}$ converges to $x^\star$ and, for all $x \in X$,
$$\lim_{j\to+\infty}\mathbb{E}\left[\langle x - x_{n_j}, \nabla f(x_{n_j})\rangle\right] = \limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge 0,$$
which implies that $x^\star$ satisfies $\langle x - x^\star, \nabla f(x^\star)\rangle \ge 0$ ($x \in X$); i.e., $x^\star$ is a solution to Problem 1.
Theorem 2 leads to the following proposition, which indicates that Algorithm 1 converges to a global minimizer of f when F ( · , ξ ) is convex for almost every ξ Ξ .
Proposition 2. 
Suppose that (A1)-(A5) and (C1)-(C3) hold, $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n$, $\beta_n$, and $\gamma_n$ satisfying $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Then,
$$\liminf_{n\to+\infty}\mathbb{E}\left[f(x_n) - f^\star\right] = 0,$$
where $f^\star$ denotes the optimal value of the problem of minimizing $f$ over $X$. Moreover, suppose that $\alpha_n := 1/n^\eta$, $\beta_n := \beta^n$, and $\gamma_n := \gamma^n$ or $1/n^\kappa$, where $\eta \in [1/2, 1)$, $\kappa > 1-\eta$, and $\beta, \gamma \in (0,1)$. Then, any accumulation point of $(\tilde{x}_n)_{n\in\mathbb{N}}$ defined by $\tilde{x}_n := (1/n)\sum_{k=1}^{n}x_k$ almost surely belongs to the solution set $X^\star$, and Algorithm 1 achieves the following convergence rate:
$$\mathbb{E}\left[f(\tilde{x}_n) - f^\star\right] = \begin{cases} O\left(\sqrt{\dfrac{1+\ln n}{n}}\right) & \text{if}\ \eta = \dfrac{1}{2},\\ O\left(\dfrac{1}{n^{1-\eta}}\right) & \text{if}\ \eta \in \left(\dfrac{1}{2}, 1\right). \end{cases}$$
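The diminishing learning rates appearing in Theorem 2 and Proposition 2 are easy to implement; the following Python sketch, our own illustration, also inspects the summability conditions numerically for one admissible setting:

```python
import numpy as np

def alpha(n, eta=0.75):
    """alpha_n = 1 / n^eta with eta in [1/2, 1)."""
    return 1.0 / n ** eta

def beta(n, b=0.5):
    """beta_n = b^n with b in (0, 1)."""
    return b ** n

def gamma(n, g=0.5, kappa=None):
    """gamma_n = g^n, or 1/n^kappa with kappa > 1 - eta when kappa is given."""
    return 1.0 / n ** kappa if kappa is not None else g ** n

# Partial sums over 10^6 iterations for eta = 3/4 and kappa = 1: the first sum
# keeps growing (sum alpha_n = +infinity), while sum alpha_n^2,
# sum alpha_n*beta_n, and sum alpha_n*gamma_n stay bounded, as Theorem 2 requires.
ns = np.arange(1, 10**6)
a = alpha(ns)
print(a.sum(), (a**2).sum(), (a * beta(ns)).sum(), (a * gamma(ns, kappa=1.0)).sum())
```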

4. Numerical Experiments

The experiments used a fast scalar computation server (https://www.meiji.ac.jp/isys/hpc/ia.html) at Meiji University. The environment has two Intel(R) Xeon(R) Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16 GB, 900 GB/s) GPU, and the Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.8.2 with the NumPy 1.19.1 and PyTorch 1.5.0 packages.
We compared the existing algorithms, such as the momentum method [18] (9), [19] (Section 2), AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7] in torch.optim (https://pytorch.org/docs/stable/optim.html) using the default values and learning rate $10^{-3}$, with Algorithm 1 defined as follows:
Algorithm 1 with a constant learning rate (Algorithm 1 with $\gamma_n = 0$, such as Momentum-Ci, Adam-Ci, and AMSGrad-Ci ($i = 1, 2, 3$), is Algorithm 1 in [8]):
  • Momentum-C1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$ (the identity matrix), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • Momentum-C2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • Momentum-C3: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • MomentumCG-C1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • MomentumCG-C2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • MomentumCG-C3: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
  • Adam-C1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • Adam-C2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • Adam-C3: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • AdamCG-C1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • AdamCG-C2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • AdamCG-C3: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
  • AMSGrad-C1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • AMSGrad-C2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • AMSGrad-C3: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • AMSGradCG-C1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • AMSGradCG-C2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • AMSGradCG-C3: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
Algorithm 1 with diminishing learning rates $\alpha_n = 1/\sqrt{n}$ and $\beta_n = 1/2^n$ based on [7] (Theorem 4 and Corollary 1) (Algorithm 1 with $\gamma_n = 0$, such as Momentum-D1, Adam-D1, and AMSGrad-D1, is Algorithm 1 in [8]):
  • Momentum-D1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 0$.
  • MomentumCG-D1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 1/2^n$.
  • MomentumCG-D2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 1/n$.
  • Adam-D1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 0$.
  • AdamCG-D1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 1/2^n$.
  • AdamCG-D2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 1/n$.
  • AMSGrad-D1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 0$.
  • AMSGradCG-D1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 1/2^n$.
  • AMSGradCG-D2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 1/n$.
Python implementations of the algorithms are available at https://github.com/iiduka-researches/202008-cg-like.

4.1. Image Classification

This experiment used the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), a benchmark for image classification. The dataset consists of 60,000 color images ($32 \times 32$) in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images, and the test batch contains exactly 1000 randomly selected images from each class. We trained a 44-layer ResNet (ResNet-44) [20] comprising 43 convolutional layers with $3 \times 3$ filters followed by a 10-way fully connected layer with a softmax function. We used the cross entropy as the loss function for fitting ResNet, in accordance with the commonly used strategy in image classification.
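A minimal PyTorch sketch of this setup follows; it is our own illustration, with an assumed transform, batch size, stand-in network, and the Adam baseline as the optimizer (the paper trains ResNet-44, and its actual code is available at the GitHub link above):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# CIFAR-10 loaders (50,000 training images); ToTensor and batch size 128 are
# our own assumptions, not settings reported in the paper.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Stand-in network: the paper trains ResNet-44; any nn.Module with a 10-way
# output fits the loop below.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
criterion = nn.CrossEntropyLoss()  # cross-entropy loss, as in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam baseline

for epoch in range(1):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```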
Figure 1, Figure 2 and Figure 3 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad using the default values in torch.optim (i.e., $\alpha_n = 10^{-3}$, $\beta_n = 0.9$). Figure 1 shows that Momentum-C1, MomentumCG-C1, and AMSGrad-C2 minimized the training loss function faster than the existing algorithms, and Figure 2 shows that they decreased the training error rate faster as well. Moreover, AdamCG-Ci (resp. AMSGradCG-Ci) ($i = 2, 3$) outperformed AdamCG-C1 (resp. AMSGradCG-C1); this implies that AdamCG and AMSGradCG require fewer iterations when smaller learning rates are used. Figure 3 shows that Adam-C2, AdamCG-C2, AMSGrad-C2, and AMSGradCG-C2 decreased the test error rate faster than the other algorithms. A similar trend was observed in the numerical results in [21].
Figure 4, Figure 5 and Figure 6 plot the behaviors of the proposed algorithms with diminishing learning rates. These algorithms did not work; thus, using diminishing learning rates is clearly not good for training neural networks (see Section 5 for the details). A similar problem was observed in the numerical results in [8].
Table 1 shows the mean and variance of the elapsed time per epoch for the existing algorithms and Algorithm 1 with a constant learning rate. The table indicates that the elapsed time of Momentum was almost the same as those of the corresponding proposed algorithms, i.e., Momentum-Ci and MomentumCG-Ci ($i = 1, 2, 3$). Adam and AMSGrad showed the same trend.
Table 2 compares the training error rates of the existing algorithms with those of Algorithm 1 by using the scipy.stats.ttest_ind function in Python. The p-value is the probability associated with a t-test, and the significance level is set at 5%; if the p-value is less than 0.05, then there is a significant difference between the existing algorithm and the proposed algorithms. Table 2 and Figure 2 indicate that Momentum-C1 and MomentumCG-C1 outperformed Momentum and that the performance of the existing algorithm (Momentum) was significantly different from the performances of the proposed algorithms (Momentum-C1 and MomentumCG-C1). Adam-Ci and AdamCG-Ci ($i = 1, 2, 3$) had almost the same performance as Adam, and the performance of AMSGrad was not significantly different from that of AMSGrad-Ci and AMSGradCG-Ci ($i = 1, 2, 3$).

4.2. Text Classification

This experiment used the IMDb dataset (https://datasets.imdbws.com/) for text classification tasks. The dataset contains 50,000 movie reviews along with their associated binary sentiment polarity labels and is split into 25,000 training and 25,000 test reviews. We used an embedding layer that generated 50-dimensional embedding vectors, two bidirectional long short-term memory (LSTM) layers, and an affine output layer with a sigmoid activation function. To train the model, we used the binary cross entropy (BCE) as the loss function minimized by the existing and proposed algorithms.
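A minimal PyTorch sketch of this classifier follows; it is our own illustration, in which the vocabulary size and hidden width are assumptions, and BCEWithLogitsLoss folds the output sigmoid into the binary cross entropy for numerical stability:

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """50-dimensional embedding, two bidirectional LSTM layers, and an affine
    output layer, mirroring the architecture described above."""
    def __init__(self, vocab_size=20000, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 50)
        self.lstm = nn.LSTM(50, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)  # affine layer; sigmoid is in the loss

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.fc(out[:, -1, :]).squeeze(-1)  # one logit per review

model = SentimentLSTM()
criterion = nn.BCEWithLogitsLoss()  # binary cross entropy with built-in sigmoid
logits = model(torch.randint(0, 20000, (4, 60)))  # batch of 4 token sequences
loss = criterion(logits, torch.ones(4))
```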
Figure 7, Figure 8 and Figure 9 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad, using the default values in torch.optim (i.e., $\alpha_n = 10^{-3}$, $\beta_n = 0.9$). These figures show that Adam-C3, AdamCG-C3, AMSGrad-C3, RMSProp, Adam, and AMSGrad all performed well. In particular, Figure 8 shows that AdamCG-C3 (resp. AMSGradCG-C3) performed better than Adam-C3 (resp. AMSGrad-C3), which implies that using conjugate gradient-like directions would be good for training neural networks.
Figure 10, Figure 11 and Figure 12 indicate the behaviors of the proposed algorithms with diminishing learning rates. These figures show that the algorithms did not work, as was the case in Figure 4, Figure 5 and Figure 6 (see Section 5 for the details).
Table 3 indicates that the elapsed times for the existing algorithms were almost the same as those for the proposed algorithms, as seen in Table 1. Table 4 and Figure 8 show that, although Momentum, Momentum-Ci, and MomentumCG-Ci did not perform better than the existing algorithms such as Adam and AMSGrad, the performance of Momentum was significantly different from that of almost all of the proposed algorithms. It can be seen that Adam, Adam-C3, and AdamCG-C3 performed well and that, although AMSGrad, AMSGrad-C3, and AMSGradCG-C3 did not perform better than Adam, AMSGrad-C3 and AMSGradCG-C3 had almost the same performance as AMSGrad.

5. Discussion

Let us first discuss the relationship between the momentum method [18] (9), [19] (Section 2) and MomentumCG used in Section 4. The momentum method [18] (9), [19] (Section 2) is defined by
$$m_n := -\epsilon G(x_n,\xi_n) + \mu m_{n-1},\quad x_{n+1} := P_X(x_n + m_n),\ \text{i.e.,}$$
$$x_{n+1} := P_X\left(x_n - \epsilon G(x_n,\xi_n) + \mu m_{n-1}\right), \tag{9}$$
where $\epsilon > 0$ is the learning rate and $\mu \in [0,1]$ is the momentum coefficient. We can see that $m_n$ defined by (9) is the conjugate gradient-like direction of $-\epsilon G(x_n,\xi_n)$. Meanwhile, MomentumCG used in Section 4 is as follows:
$$G_n = G(x_n,\xi_n) - \gamma_n G_{n-1},$$
$$m_n := (1-\beta_n)G_n + \beta_n m_{n-1},$$
$$x_{n+1} := P_X(x_n - \alpha_n m_n). \tag{10}$$
Algorithm (10) uses the conjugate gradient-like direction $G_n$ of $G(x_n,\xi_n)$. For simplicity, algorithm (10) with $\beta_n = 0$ is such that
$$x_{n+1} := P_X\left(x_n - \alpha_n G(x_n,\xi_n) + \alpha_n\gamma_n m_{n-1}\right), \tag{11}$$
which implies that algorithm (11) is the momentum method with a learning rate $\alpha_n$ and momentum coefficient $\alpha_n\gamma_n$.
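This correspondence is easy to check numerically. The following NumPy sketch, our own verification with illustrative constants, confirms that updates (10) (with $\beta_n = 0$) and (11) generate identical iterates when $X = \mathbb{R}^d$:

```python
import numpy as np

rng = np.random.default_rng(2)
grads = [rng.standard_normal(3) for _ in range(50)]
alpha, gamma = 0.1, 0.05
x_cg, x_mom = np.zeros(3), np.zeros(3)
m_cg, m_mom = np.zeros(3), np.zeros(3)

for g in grads:
    # Update (10) with beta_n = 0: m_n = G_n = g - gamma * m_{n-1}.
    m_cg = g - gamma * m_cg
    x_cg = x_cg - alpha * m_cg
    # Momentum form (11): x_{n+1} = x_n - alpha * g + alpha * gamma * m_{n-1}.
    x_mom = x_mom - alpha * g + alpha * gamma * m_mom
    m_mom = g - gamma * m_mom

print(np.allclose(x_cg, x_mom))  # True: the two update rules coincide
```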
The numerical comparisons in Section 4 show that Algorithm 1 with a constant learning rate performed better than Algorithm 1 with diminishing learning rates. For example, let us consider the text classification in Section 4.2 and compare AdamCG-C3, defined by
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 10^{-3}G_{n-1},\\ m_n &:= 10^{-3}m_{n-1} + (1-10^{-3})G_n,\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - 10^{-3}H_n^{-1}\hat{m}_n\right), \end{aligned} \tag{12}$$
with AdamCG-D1, defined by
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 2^{-n}G_{n-1},\\ m_n &:= 2^{-n}m_{n-1} + (1-2^{-n})G_n,\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - n^{-1/2}H_n^{-1}\hat{m}_n\right). \end{aligned} \tag{13}$$
AdamCG-C3 (algorithm (12)) works well for all $n \in \mathbb{N}$, since it uses a constant learning rate. Meanwhile, AdamCG-D1 (algorithm (13)) may fail to make progress after a large number of iterations, because it uses diminishing learning rates. In fact, for a large $n$, AdamCG-D1 (algorithm (13)) behaves as follows:
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 2^{-n}G_{n-1} \approx G(x_n,\xi_n),\\ m_n &:= 2^{-n}m_{n-1} + (1-2^{-n})G_n \approx G_n \approx G(x_n,\xi_n),\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - n^{-1/2}H_n^{-1}\hat{m}_n\right) \approx P_{X,H_n}(x_n) = x_n, \end{aligned} \tag{14}$$
which implies that algorithm (14) barely updates the iterates. As can be seen in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, Algorithm 1 with diminishing learning rates would not be good for training neural networks.
Finally, let us compare the existing algorithm with Algorithm 1; in particular, AMSGrad in torch.optim using $\alpha_n = 10^{-3}$, $\beta_n = 0.9$, and $\zeta = 0.999$ with AMSGrad-C3 using $\alpha_n = 10^{-3}$, $\beta_n = 10^{-3}$, and $\zeta = 0.999$. The difference between AMSGrad and AMSGrad-C3 is the setting of $\beta_n$. According to Figure 7, Figure 8 and Figure 9, AMSGrad-C3 performs comparably to AMSGrad, a useful algorithm. These results are guaranteed by Theorem 1, which indicates that Algorithm 1 with a small constant learning rate approximates a stationary point of the minimization problem in deep neural networks; more specifically, the sequence $(x_n)_{n\in\mathbb{N}}$ generated by AMSGrad-C3 (Algorithm 1 with $\delta = 0$) satisfies
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge -\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}}\cdot 10^{-3} - \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}}\cdot 10^{-3}\quad (x \in X)$$
and hence approximates $x^\star \in X^\star := \{x^\star \in X : \langle x - x^\star, \nabla f(x^\star)\rangle \ge 0\ (x \in X)\}$.

6. Conclusions

We proposed an iterative algorithm with conjugate gradient-like directions for nonconvex optimization in deep neural networks in order to accelerate conventional adaptive learning rate optimization algorithms. We presented two convergence analyses of the algorithm. The first showed that the algorithm with a constant learning rate approximates a stationary point of a nonconvex optimization problem. The second showed that the algorithm with diminishing learning rates converges to a stationary point of the nonconvex optimization problem. We gave numerical results for concrete neural networks. The results showed that the proposed algorithm with a constant learning rate is superior for training neural networks from the viewpoints of theory and practice, while the proposed algorithm with diminishing learning rates is not good for training neural networks. The reason is that a constant learning rate keeps the algorithm updating throughout training, whereas a diminishing learning rate becomes approximately zero after a large number of iterations, so the iterates are barely updated.

Author Contributions

Conceptualization, H.I.; methodology, H.I.; software, Y.K.; validation, H.I. and Y.K.; formal analysis, H.I.; investigation, H.I. and Y.K.; resources, H.I.; data curation, Y.K.; writing—original draft preparation, H.I.; writing—review and editing, H.I.; visualization, H.I. and Y.K.; supervision, H.I.; project administration, H.I.; funding acquisition, H.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, Grant Number JP18K11184.

Acknowledgments

The authors would like to thank Michelle Zhou for giving us a chance to submit our paper to this journal. We are sincerely grateful to Assistant Editor Elliot Guo and the two referees for helping us improve the original manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Theorems 1 and 2 and Propositions 1 and 2

This section refers to [8]. Let us first prove the following lemma.
Lemma A1. 
Suppose that (A1)-(A2) and (C1)-(C2) hold. Then, for all $x \in X$ and all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|x_{n+1}-x\|_{H_n}^2\right] \le \mathbb{E}\left[\|x_n-x\|_{H_n}^2\right] + \frac{2\alpha_n}{1-\delta^{n+1}}\left\{(1-\beta_n)\mathbb{E}\left[\langle x-x_n,\nabla f(x_n)\rangle\right] + \beta_n\mathbb{E}\left[\langle x-x_n,m_{n-1}\rangle\right] - (1-\beta_n)\gamma_n\mathbb{E}\left[\langle x-x_n,G_{n-1}\rangle\right]\right\} + \alpha_n^2\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right].$$
Proof. 
Choose $x \in X$ and $n \in \mathbb{N}$. The definition of $x_{n+1}$ and the nonexpansivity of $P_{X,H_n}$ imply that, almost surely,
$$\|x_{n+1}-x\|_{H_n}^2 \le \|(x_n-x)+\alpha_n\mathsf{d}_n\|_{H_n}^2 = \|x_n-x\|_{H_n}^2 + 2\alpha_n\langle x_n-x,\mathsf{d}_n\rangle_{H_n} + \alpha_n^2\|\mathsf{d}_n\|_{H_n}^2. \tag{A1}$$
The definitions of $\mathsf{d}_n$, $m_n$, and $\hat{m}_n$ ensure that
$$\langle x_n-x,\mathsf{d}_n\rangle_{H_n} = \frac{1}{\tilde{\delta}_n}\langle x-x_n,m_n\rangle = \frac{\beta_n}{\tilde{\delta}_n}\langle x-x_n,m_{n-1}\rangle + \frac{1-\beta_n}{\tilde{\delta}_n}\langle x-x_n,G_n\rangle,$$
where $\tilde{\delta}_n := 1-\delta^{n+1}$. Moreover, the definition of $G_n$ implies that
$$\langle x-x_n,G_n\rangle = \langle x-x_n,G(x_n,\xi_n)\rangle - \gamma_n\langle x-x_n,G_{n-1}\rangle.$$
Hence, almost surely,
$$\|x_{n+1}-x\|_{H_n}^2 \le \|x_n-x\|_{H_n}^2 + 2\alpha_n\left\{\frac{\beta_n}{\tilde{\delta}_n}\langle x-x_n,m_{n-1}\rangle + \frac{1-\beta_n}{\tilde{\delta}_n}\langle x-x_n,G(x_n,\xi_n)\rangle - \frac{(1-\beta_n)\gamma_n}{\tilde{\delta}_n}\langle x-x_n,G_{n-1}\rangle\right\} + \alpha_n^2\|\mathsf{d}_n\|_{H_n}^2.$$
The conditions $x_n = x_n(\xi_{[n-1]})$ ($n \in \mathbb{N}$), (C1), and (C2) imply that
$$\mathbb{E}\left[\langle x-x_n,G(x_n,\xi_n)\rangle\right] = \mathbb{E}\left[\mathbb{E}\left[\langle x-x_n,G(x_n,\xi_n)\rangle \mid \xi_{[n-1]}\right]\right] = \mathbb{E}\left[\langle x-x_n,\mathbb{E}\left[G(x_n,\xi_n)\mid\xi_{[n-1]}\right]\rangle\right] = \mathbb{E}\left[\langle x-x_n,\nabla f(x_n)\rangle\right].$$
Taking the expectation of (A1) leads to the assertion of Lemma A1. □
Lemma A2. 
If (C3) holds, then, for all $n \in \mathbb{N}$, $\mathbb{E}[\|G_n\|^2] \le 4\hat{M}^2$ and $\mathbb{E}[\|m_n\|^2] \le \tilde{M}^2$, where $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\}$ and $\tilde{M}^2 := \max\{\|m_{-1}\|^2, 4\hat{M}^2\}$. Moreover, if (A3) holds, then, for all $n \in \mathbb{N}$, $\mathbb{E}[\|\mathsf{d}_n\|_{H_n}^2] \le \tilde{B}^2\tilde{M}^2/(1-\delta)^2$, where $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} < +\infty$.
Proof. 
Let us define $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\} < +\infty$, where $M$ is defined as in (C3). Let us consider the case where $n = 0$. The inequality $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ ($x, y \in \mathbb{R}^d$) ensures that
$$\|G_0\|^2 \le 2\|G(x_0,\xi_0)\|^2 + 2\gamma_0^2\|G_{-1}\|^2, \tag{A2}$$
which, together with $\gamma_n \le 1/2$ ($n \in \mathbb{N}$) and the definition of $\hat{M}$, implies that
$$\mathbb{E}\left[\|G_0\|^2\right] \le 2M^2 + 2\cdot\frac{1}{4}\cdot 4\hat{M}^2 \le 4\hat{M}^2.$$
Assume that $\mathbb{E}[\|G_n\|^2] \le 4\hat{M}^2$ for some $n \in \mathbb{N}$. The same discussion as for (A2) ensures that
$$\mathbb{E}\left[\|G_{n+1}\|^2\right] \le 2\mathbb{E}\left[\|G(x_{n+1},\xi_{n+1})\|^2\right] + 2\gamma_{n+1}^2\mathbb{E}\left[\|G_n\|^2\right] \le 2M^2 + 2\cdot\frac{1}{4}\cdot 4\hat{M}^2 \le 4\hat{M}^2.$$
Accordingly, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|G_n\|^2\right] \le 4\hat{M}^2. \tag{A3}$$
From the definition of $m_n$, the convexity of $\|\cdot\|^2$, and (A3), for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_n\|^2\right] \le \beta_n\mathbb{E}\left[\|m_{n-1}\|^2\right] + (1-\beta_n)\mathbb{E}\left[\|G_n\|^2\right] \le \beta_n\mathbb{E}\left[\|m_{n-1}\|^2\right] + 4\hat{M}^2(1-\beta_n).$$
Hence, induction leads to, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_n\|^2\right] \le \tilde{M}^2 := \max\left\{\|m_{-1}\|^2, 4\hat{M}^2\right\} < +\infty. \tag{A4}$$
Given $n \in \mathbb{N}$, $H_n \succ O$ ensures that there exists a unique matrix $\bar{H}_n \succ O$ such that $H_n = \bar{H}_n^2$ [22] (Theorem 7.2.6). From $\|x\|_{H_n}^2 = \|\bar{H}_nx\|^2$ ($x \in \mathbb{R}^d$) and the definitions of $\mathsf{d}_n$ and $\hat{m}_n$, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right] = \mathbb{E}\left[\|\bar{H}_n^{-1}H_n\mathsf{d}_n\|^2\right] \le \frac{1}{\tilde{\delta}_n^2}\mathbb{E}\left[\|\bar{H}_n^{-1}\|^2\|m_n\|^2\right],$$
where $\tilde{\delta}_n := 1-\delta^{n+1} \ge 1-\delta$ and $\|\bar{H}_n^{-1}\| = \|\mathrm{diag}(h_{n,i}^{-1/2})\| = \max_{i=1,2,\ldots,d}h_{n,i}^{-1/2}$ ($n \in \mathbb{N}$). From (A4) and $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} \le \max_{i=1,2,\ldots,d}h_{0,i}^{-1/2} < +\infty$ (by (A3)), we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right] \le \frac{\tilde{B}^2\tilde{M}^2}{(1-\delta)^2},$$
which completes the proof. □
The convergence rate analysis of Algorithm 1 is as follows.
Theorem A1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and that $(\theta_n)_{n\in\mathbb{N}}$ defined by $\theta_n := \alpha_n(1-\beta_n)/(1-\delta^{n+1})$ and $(\beta_n)_{n\in\mathbb{N}}$ satisfy $\theta_{n+1} \le \theta_n$ ($n \in \mathbb{N}$) and $\limsup_{n\to+\infty}\beta_n < 1$. Let $V_n(x) := \mathbb{E}[\langle x_n - x, \nabla f(x_n)\rangle]$ for all $x \in X$ and all $n \in \mathbb{N}$. Then, for all $x \in X$ and all $n \ge 1$,
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le \frac{D\sum_{i=1}^{d}B_i}{2\tilde{b}n\alpha_n} + \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2n}\sum_{k=1}^{n}\alpha_k + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}n}\sum_{k=1}^{n}\beta_k + \frac{2\sqrt{Dd}\,\hat{M}}{n}\sum_{k=1}^{n}\gamma_k,$$
where $(\beta_n)_{n\in\mathbb{N}} \subset (0,b] \subset (0,1)$, $\tilde{b} := 1-b$, $\tilde{\delta} := 1-\delta$, $\hat{M}$, $\tilde{M}$, and $\tilde{B}$ are defined as in Lemma A2, and $D$ and $B_i$ are defined as in Assumption 1.
Proof. 
Let $x \in X$ be fixed arbitrarily. Lemma A1 guarantees that, for all $k \in \mathbb{N}$,
$$V_k(x) \le \frac{1}{2\theta_k}\left(\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right] - \mathbb{E}\left[\|x_{k+1}-x\|_{H_k}^2\right]\right) + \frac{\beta_k}{1-\beta_k}\mathbb{E}\left[\langle x-x_k,m_{k-1}\rangle\right] + \gamma_k\mathbb{E}\left[\langle x_k-x,G_{k-1}\rangle\right] + \frac{\alpha_k\tilde{\delta}_k}{2(1-\beta_k)}\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right],$$
where $\tilde{\delta}_n := 1-\delta^{n+1} \le 1$ ($n \in \mathbb{N}$). The condition $\limsup_{n\to+\infty}\beta_n < 1$ ensures the existence of $b > 0$ such that, for all $n \in \mathbb{N}$, $\beta_n \le b < 1$. Let $\tilde{b} := 1-b$. Then, for all $n \ge 1$, we have
$$\sum_{k=1}^{n}V_k(x) \le \frac{1}{2}\underbrace{\sum_{k=1}^{n}\frac{1}{\theta_k}\left(\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right] - \mathbb{E}\left[\|x_{k+1}-x\|_{H_k}^2\right]\right)}_{\Theta_n} + \underbrace{\sum_{k=1}^{n}\frac{\beta_k}{1-\beta_k}\mathbb{E}\left[\langle x-x_k,m_{k-1}\rangle\right]}_{B_n} + \underbrace{\sum_{k=1}^{n}\gamma_k\mathbb{E}\left[\langle x_k-x,G_{k-1}\rangle\right]}_{\Gamma_n} + \frac{1}{2\tilde{b}}\underbrace{\sum_{k=1}^{n}\alpha_k\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right]}_{A_n}. \tag{A5}$$
The definition of $\Theta_n$ and $\mathbb{E}[\|x_{n+1}-x\|_{H_n}^2]/\theta_n \ge 0$ imply that
$$\Theta_n \le \frac{\mathbb{E}\left[\|x_1-x\|_{H_1}^2\right]}{\theta_1} + \underbrace{\sum_{k=2}^{n}\left(\frac{\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right]}{\theta_k} - \frac{\mathbb{E}\left[\|x_k-x\|_{H_{k-1}}^2\right]}{\theta_{k-1}}\right)}_{\tilde{\Theta}_n}. \tag{A6}$$
Accordingly,
$$\tilde{\Theta}_n = \mathbb{E}\left[\sum_{k=2}^{n}\left(\frac{\|\bar{H}_k(x_k-x)\|^2}{\theta_k} - \frac{\|\bar{H}_{k-1}(x_k-x)\|^2}{\theta_{k-1}}\right)\right],$$
where, for all $k \in \mathbb{N}$ and all $x := (x_i) \in \mathbb{R}^d$,
$$\bar{H}_k = \mathrm{diag}\left(\sqrt{h_{k,i}}\right)\quad\text{and}\quad \|\bar{H}_kx\|^2 = \sum_{i=1}^{d}h_{k,i}x_i^2. \tag{A7}$$
Thus, for all $n \ge 2$,
$$\tilde{\Theta}_n = \mathbb{E}\left[\sum_{k=2}^{n}\sum_{i=1}^{d}\left(\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}}\right)(x_{k,i}-x_i)^2\right].$$
The condition $\theta_k \le \theta_{k-1}$ ($k \ge 1$) and (A3) imply that, for all $k \ge 1$ and all $i = 1, 2, \ldots, d$,
$$\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}} \ge 0.$$
Hence, for all $n \ge 2$,
$$\tilde{\Theta}_n \le D\,\mathbb{E}\left[\sum_{k=2}^{n}\sum_{i=1}^{d}\left(\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}}\right)\right] = D\,\mathbb{E}\left[\sum_{i=1}^{d}\left(\frac{h_{n,i}}{\theta_n} - \frac{h_{1,i}}{\theta_1}\right)\right],$$
where $\max_{i=1,2,\ldots,d}\sup\{(x_{n,i}-x_i)^2 : n \in \mathbb{N}\} \le D < +\infty$ (by (A5)). Therefore, (A6), $\mathbb{E}[\|x_1-x\|_{H_1}^2]/\theta_1 \le D\,\mathbb{E}[\sum_{i=1}^{d}h_{1,i}/\theta_1]$, and (A4) imply, for all $n \in \mathbb{N}$,
$$\Theta_n \le D\,\mathbb{E}\left[\sum_{i=1}^{d}\frac{h_{1,i}}{\theta_1}\right] + D\,\mathbb{E}\left[\sum_{i=1}^{d}\left(\frac{h_{n,i}}{\theta_n} - \frac{h_{1,i}}{\theta_1}\right)\right] = \frac{D}{\theta_n}\mathbb{E}\left[\sum_{i=1}^{d}h_{n,i}\right] \le \frac{D}{\theta_n}\sum_{i=1}^{d}B_i,$$
which, together with $\theta_n := \alpha_n(1-\beta_n)/(1-\delta^{n+1}) \ge \tilde{b}\alpha_n$, implies
$$\Theta_n \le \frac{D\sum_{i=1}^{d}B_i}{\tilde{b}\alpha_n}. \tag{A8}$$
The Cauchy-Schwarz inequality, together with $\max_{i=1,2,\ldots,d}\sup\{(x_{n,i}-x_i)^2 : n \in \mathbb{N}\} \le D < +\infty$ (by (A5)) and $\mathbb{E}[\|m_n\|] \le \tilde{M}$ ($n \in \mathbb{N}$) (by Lemma A2), guarantees that, for all $n \in \mathbb{N}$,
$$B_n \le \frac{\sqrt{Dd}}{\tilde{b}}\sum_{k=1}^{n}\beta_k\mathbb{E}\left[\|m_{k-1}\|\right] \le \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}}\sum_{k=1}^{n}\beta_k. \tag{A9}$$
A discussion similar to the one for obtaining (A9), together with $\mathbb{E}[\|G_n\|] \le 2\hat{M}$ ($n \in \mathbb{N}$) (by Lemma A2), implies that
$$\Gamma_n \le \sqrt{Dd}\sum_{k=1}^{n}\gamma_k\mathbb{E}\left[\|G_{k-1}\|\right] \le 2\sqrt{Dd}\,\hat{M}\sum_{k=1}^{n}\gamma_k. \tag{A10}$$
Since $\mathbb{E}[\|\mathsf{d}_n\|_{H_n}^2] \le \tilde{B}^2\tilde{M}^2/(1-\delta)^2$ ($n \in \mathbb{N}$) holds (by Lemma A2), we have, for all $n \in \mathbb{N}$,
$$A_n := \sum_{k=1}^{n}\alpha_k\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right] \le \frac{\tilde{B}^2\tilde{M}^2}{(1-\delta)^2}\sum_{k=1}^{n}\alpha_k. \tag{A11}$$
Therefore, (A5) and (A8)-(A11) lead to the assertion in Theorem A1. This completes the proof. □
Proof of Theorem 1. 
Let $\alpha_n := \alpha \in (0,1)$, $\beta_n := \beta = b \in (0,1)$, and $\gamma_n := \gamma \in [0,1/2]$. We show that, for all $\epsilon > 0$ and all $x \in X$,
$$\liminf_{n\to+\infty}V_n(x) \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon}{2\tilde{b}} + \epsilon. \tag{A12}$$
If (A12) does not hold for all $\epsilon > 0$ and all $x \in X$, then there exist $\epsilon_0 > 0$ and $\hat{x} \in X$ such that
$$\liminf_{n\to+\infty}V_n(\hat{x}) > \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \epsilon_0. \tag{A13}$$
Assumptions (A3) and (A4) ensure that there exists $n_0 \in \mathbb{N}$ such that, for all $n \in \mathbb{N}$, $n \ge n_0$ implies that
$$\mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{n,i})\right] \le \frac{d\alpha\epsilon_0}{2}. \tag{A14}$$
Assumptions (A4) and (A5) and (A7) also imply that, for all $n \in \mathbb{N}$,
$$X_n := \mathbb{E}\left[\|x_n-\hat{x}\|_{H_n}^2\right] = \mathbb{E}\left[\sum_{i=1}^{d}h_{n,i}(x_{n,i}-\hat{x}_i)^2\right] \le D\sum_{i=1}^{d}B_i < +\infty. \tag{A15}$$
Moreover, Assumptions (A3) and (A5), (A7), and (A14) ensure that, for all $n \ge n_0$,
$$X_{n+1} - \mathbb{E}\left[\|x_{n+1}-\hat{x}\|_{H_n}^2\right] = \mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{n,i})(x_{n+1,i}-\hat{x}_i)^2\right] \le \frac{Dd\alpha\epsilon_0}{2}. \tag{A16}$$
The condition $\delta \in [0,1)$ and $X_{n+1} < +\infty$ (by (A15)) ensure that there exists $n_1 \in \mathbb{N}$ such that, for all $n \in \mathbb{N}$, $n \ge n_1$ implies that
$$X_{n+1}\delta^{n+1} \le \frac{Dd\alpha\epsilon_0}{2}. \tag{A17}$$
The definition of the limit inferior of $(V_n(\hat{x}))_{n\in\mathbb{N}}$ guarantees that there exists $n_2 \in \mathbb{N}$ such that, for all $n \ge n_2$,
$$\liminf_{n\to+\infty}V_n(\hat{x}) - \frac{1}{2}\epsilon_0 \le V_n(\hat{x}),$$
which, together with (A13), implies that, for all $n \ge n_2$,
$$V_n(\hat{x}) > \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \frac{1}{2}\epsilon_0. \tag{A18}$$
Thus, Lemmas A1 and A2 and (A16) lead to the finding that, for all $n \ge n_3 := \max\{n_0, n_1, n_2\}$,
$$X_{n+1} \le X_n + \frac{Dd\alpha\epsilon_0}{2} - \frac{2\alpha\tilde{b}}{1-\delta^{n+1}}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2, \tag{A19}$$
where $\tilde{b} := 1-b$ and $\tilde{\delta} := 1-\delta$. Hence, from (A17), $1-\delta^{n+1} \le 1$, and $(X_{n+1}-X_n)\delta^{n+1} \le X_{n+1}\delta^{n+1}$ ($n \in \mathbb{N}$), we have, for all $n \ge n_3$,
$$\begin{aligned} X_{n+1} &\le X_n + \frac{Dd\alpha\epsilon_0}{2} - 2\alpha\tilde{b}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2 + X_{n+1}\delta^{n+1}\\ &\le X_n + Dd\alpha\epsilon_0 - 2\alpha\tilde{b}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2. \end{aligned}$$
Therefore, (A18) ensures that, for all $n \ge n_3$,
$$\begin{aligned} X_{n+1} &< X_n + Dd\alpha\epsilon_0 - 2\alpha\tilde{b}\left(\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \frac{1}{2}\epsilon_0\right) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2\\ &= X_n - \alpha\tilde{b}\epsilon_0\\ &< X_{n_3} - \alpha\tilde{b}\epsilon_0(n+1-n_3). \end{aligned}$$
Since the right-hand side of the above inequality approaches minus infinity as $n$ diverges, we have a contradiction. Hence, (A12) holds for all $\epsilon > 0$ and all $x \in X$. Since $\epsilon$ is arbitrary, we have, for all $x \in X$,
$$\liminf_{n\to+\infty}V_n(x) \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
which completes the proof. □
Proof of Theorem 2. 
Let $x \in X$. Lemmas A1 and A2 and (A15), together with a discussion similar to the one for obtaining (A19), ensure that, for all $k \in \mathbb{N}$,
$$X_{k+1} \le X_k + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{k+1,i}-h_{k,i})\right] - 2\alpha_k(1-\beta_k)V_k(x) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha_k\beta_k + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha_k^2 + D\sum_{i=1}^{d}B_i\delta^{k+1},$$
which implies that
$$2\alpha_kV_k(x) \le X_k - X_{k+1} + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{k+1,i}-h_{k,i})\right] + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha_k^2 + \left(\frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}} + 2F\right)\alpha_k\beta_k + D\sum_{i=1}^{d}B_i\delta^{k+1},$$
where $F := \sup\{|V_n(x)| : n \in \mathbb{N}\} < +\infty$ holds from Assumptions (A2) and (A5). Summing the above inequality from $k = 0$ to $k = n$ ensures that
$$2\sum_{k=0}^{n}\alpha_kV_k(x) \le X_0 + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{0,i})\right] + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\sum_{k=0}^{n}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\sum_{k=0}^{n}\alpha_k^2 + \left(\frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}} + 2F\right)\sum_{k=0}^{n}\alpha_k\beta_k + D\hat{B}\sum_{k=0}^{n}\delta^{k+1},$$
where $\hat{B} := \sum_{i=1}^{d}B_i$. Let $(\alpha_n)_{n\in\mathbb{N}}$, $(\beta_n)_{n\in\mathbb{N}}$, and $(\gamma_n)_{n\in\mathbb{N}}$ satisfy $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Assumption (A4) and $\delta \in [0,1)$ imply that
$$\sum_{k=0}^{+\infty}\alpha_kV_k(x) < +\infty. \tag{A20}$$
We prove that, for all $x \in X$, $\liminf_{n\to+\infty}V_n(x) \le 0$. Assume that $\liminf_{n\to+\infty}V_n(x) \le 0$ does not hold for all $x \in X$. Then there exist $\hat{x} \in X$, $\zeta > 0$, and $m_0 \in \mathbb{N}$ such that, for all $n \ge m_0$, $V_n(\hat{x}) \ge \zeta$. Accordingly, (A20) and $\sum_{n=0}^{+\infty}\alpha_n = +\infty$ guarantee that
$$+\infty = \zeta\sum_{k=m_0}^{+\infty}\alpha_k \le \sum_{k=m_0}^{+\infty}\alpha_kV_k(\hat{x}) < +\infty,$$
which is a contradiction. Hence, $\liminf_{n\to+\infty}V_n(x) \le 0$ holds for all $x \in X$.
Let $\alpha_n := 1/n^\eta$ ($\eta \in [1/2, 1)$) and $\beta_n := \beta^n$ ($\beta \in (0,1)$). First, we consider the case where $\gamma_n := \gamma^n$ ($\gamma \in (0,1)$). Then, $\theta_{n+1} \le \theta_n$ ($n \in \mathbb{N}$) and $\limsup_{n\to+\infty}\beta_n < 1$. When $\eta = 1/2$, we have
$$\frac{1}{n\alpha_n} = \frac{1}{\sqrt{n}}$$
and
$$\frac{1}{n}\sum_{k=1}^{n}\alpha_k \le \frac{1}{n}\left(\sum_{k=1}^{n}1^2\right)^{1/2}\left(\sum_{k=1}^{n}\frac{1}{k}\right)^{1/2} \le \sqrt{\frac{1+\ln n}{n}}, \tag{A21}$$
where the first inequality comes from the Cauchy-Schwarz inequality and the second inequality comes from $\sum_{k=1}^{n}(1/k) \le 1+\ln n$. We also have
$$\frac{1}{n}\sum_{k=1}^{n}\beta^k \le \frac{1}{n}\sum_{k=1}^{+\infty}\beta^k = \frac{\beta}{(1-\beta)n}\quad\text{and}\quad \frac{1}{n}\sum_{k=1}^{n}\gamma^k \le \frac{1}{n}\sum_{k=1}^{+\infty}\gamma^k = \frac{\gamma}{(1-\gamma)n}.$$
Therefore, Theorem A1 implies that
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le O\left(\sqrt{\frac{1+\ln n}{n}}\right).$$
In the case where $\eta \in (1/2, 1)$, we have
$$\frac{1}{n\alpha_n} = \frac{1}{n^{1-\eta}}\quad\text{and}\quad \frac{1}{n}\sum_{k=1}^{n}\alpha_k \le \frac{1}{n}\left(\sum_{k=1}^{n}1^2\right)^{1/2}\left(\sum_{k=1}^{n}\frac{1}{k^{2\eta}}\right)^{1/2} \le \sqrt{\frac{B}{n}}, \tag{A22}$$
where $B := \sum_{k=1}^{+\infty}(1/k^{2\eta}) < +\infty$. Therefore, Theorem A1, together with (A22), ensures that
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le O\left(\frac{1}{n^{1-\eta}}\right).$$
Next, we consider the case where $\gamma_n := 1/n^\kappa$ ($\kappa > 1-\eta$). Since $\kappa > 1/2$ holds, an argument similar to the one for obtaining (A22) implies that
$$\frac{1}{n}\sum_{k=1}^{n}\gamma_k = O\left(\frac{1}{\sqrt{n}}\right).$$
The discussion in the above paragraph and Theorem A1 lead to the same convergence rate of $(1/n)\sum_{k=1}^{n}V_k(x)$ as the one for $\gamma_n := \gamma^n$ ($\gamma \in (0,1)$). This completes the proof. □
Proof of Proposition 1. 
Since $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[f(x_n)-f^\star\right] \le V_n(x^\star)\quad\text{and}\quad \mathbb{E}\left[f(\tilde{x}_n)-f^\star\right] \le \frac{1}{n}\sum_{k=1}^{n}\mathbb{E}\left[f(x_k)-f^\star\right] \le \frac{1}{n}\sum_{k=1}^{n}V_k(x^\star),$$
which, together with Theorem 1, leads to Proposition 1. □
Proof of Proposition 2. 
Theorem 2 and the proof of Proposition 1 lead to the finding that $\liminf_{n\to+\infty}\mathbb{E}[f(x_n)-f^\star] = 0$ and $\lim_{n\to+\infty}\mathbb{E}[f(\tilde{x}_n)-f^\star] = 0$. Let $\hat{x} \in X$ be an arbitrary accumulation point of $(\tilde{x}_n)_{n\in\mathbb{N}} \subset X$. Since there exists $(\tilde{x}_{n_i})_{i\in\mathbb{N}} \subset (\tilde{x}_n)_{n\in\mathbb{N}}$ such that $(\tilde{x}_{n_i})_{i\in\mathbb{N}}$ converges almost surely to $\hat{x}$, the continuity of $f$ and $\lim_{n\to+\infty}\mathbb{E}[f(\tilde{x}_n)-f^\star] = 0$ imply that $\mathbb{E}[f(\hat{x})-f^\star] = 0$, and hence, $\hat{x} \in X^\star$. The convergence rate of $\mathbb{E}[f(\tilde{x}_n)-f^\star]$ follows from Theorem A1. □

References

1. Caciotta, M.; Giarnetti, S.; Leccese, F. Hybrid neural network system for electric load forecasting of telecomunication station. In Proceedings of the 19th IMEKO World Congress 2009, Lisbon, Portugal, 6–11 September 2009; Volume 1, pp. 657–661.
2. Caciotta, M.; Giarnetti, S.; Leccese, F.; Orioni, B.; Oreggia, M.; Pucci, C.; Rametta, S. Flavors mapping by Kohonen network classification of Panel Tests of Extra Virgin Olive Oil. Measurement 2016, 78, 366–372.
3. Proietti, A.; Liparulo, L.; Leccese, F.; Panella, M. Shapes classification of dust deposition using fuzzy kernel-based approaches. Measurement 2016, 77, 344–350.
4. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
5. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
6. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
7. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–23.
8. Iiduka, H. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. arXiv 2020, arXiv:2002.09647.
9. Hager, W.H.; Zhang, H. A survey of nonlinear conjugate gradient methods. Pac. J. Optim. 2006, 2, 35–58.
10. Iiduka, H. Acceleration method for convex optimization over the fixed point set of a nonexpansive mapping. Math. Program. 2015, 149, 131–165.
11. Iiduka, H. Hybrid conjugate gradient method for a convex optimization problem over the fixed-point set of a nonexpansive mapping. J. Optim. Theory Appl. 2009, 140, 463–475.
12. Iiduka, H.; Yamada, I. A use of conjugate gradient direction for the convex optimization problem over the fixed point set of a nonexpansive mapping. SIAM J. Optim. 2009, 19, 1881–1893.
13. Iiduka, H. Three-term conjugate gradient method for the convex optimization problem over the fixed point set of a nonexpansive mapping. Appl. Math. Comput. 2011, 217, 6315–6327.
14. Kobayashi, Y.; Iiduka, H. Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning. arXiv 2020, arXiv:2003.00231.
15. Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2011.
16. Facchinei, F.; Pang, J.S. Finite-Dimensional Variational Inequalities and Complementarity Problems I; Springer: New York, NY, USA, 2003.
17. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 2009, 19, 1574–1609.
18. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
19. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1–14.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Iiduka, H. Stochastic fixed point optimization algorithm for classifier ensemble. IEEE Trans. Cybern. 2020, 50, 4370–4380.
22. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
Figure 1. Loss function value versus number of epochs on the CIFAR-10 dataset for training (constant).
Figure 2. Classification error rate versus number of epochs on the CIFAR-10 dataset for training (constant).
Figure 3. Classification error rate versus number of epochs on the CIFAR-10 dataset for testing (constant).
Figure 4. Loss function value versus number of epochs on the CIFAR-10 dataset for training (diminishing).
Figure 5. Classification error rate versus number of epochs on the CIFAR-10 dataset for training (diminishing).
Figure 6. Classification error rate versus number of epochs on the CIFAR-10 dataset for testing (diminishing).
Figure 7. Loss function value versus number of epochs on the IMDb dataset for training (constant).
Figure 8. Classification error rate versus number of epochs on the IMDb dataset for training (constant).
Figure 9. Classification error rate versus number of epochs on the IMDb dataset for testing (constant).
Figure 10. Loss function value versus number of epochs on the IMDb dataset for training (diminishing).
Figure 11. Classification error rate versus number of epochs on the IMDb dataset for training (diminishing).
Figure 12. Classification error rate versus number of epochs on the IMDb dataset for testing (diminishing).
Table 1. Mean and variance of elapsed time per epoch for the existing algorithms and Algorithm 1 on the CIFAR-10 dataset.

                      Existing   C1         C2         C3         CG-C1      CG-C2      CG-C3
Momentum  mean        14.815106  14.766352  14.643343  14.191675  14.370240  14.536258  13.732973
          variance     0.268979   1.144346   0.268576   0.363746   0.180754   0.872769   0.314055
Adam      mean        17.621361  17.388947  18.511805  18.084771  18.106918  18.108820  17.127479
          variance     0.149553   0.044539   1.392942   0.056606   0.213341   0.063594   1.317213
AMSGrad   mean        18.122551  17.650377  17.796328  19.335775  18.855297  18.272888  16.328777
          variance     1.245563   0.313088   0.289944   4.738541   2.820650   1.671705   1.754373
Table 2. Results of t-test on the training error rates of the existing algorithms (Momentum, Adam, and AMSGrad) and Algorithm 1 (Ci and CG-Ci (i = 1, 2, 3)) on the CIFAR-10 dataset (significance level is 5%; the p-values for the proposed algorithms with significantly low error rates are indicated in bold).

                         C1             C2            C3             CG-C1          CG-C2         CG-C3
Momentum    t-statistic  3.70879        0.23783       −13.65314      3.34063        −0.17214      −12.77890
(Existing)  p-value      2.38 × 10^−4   8.12 × 10^−1  4.44 × 10^−35  9.15 × 10^−4   8.63 × 10^−1  1.43 × 10^−31
Adam        t-statistic  −10.46006      0.03599       0.20774        −6.70248       0.37493       0.04342
(Existing)  p-value      8.73 × 10^−23  9.71 × 10^−1  8.36 × 10^−1   7.03 × 10^−11  7.08 × 10^−1  9.65 × 10^−1
AMSGrad     t-statistic  −157.96917     −1.59278      −0.16230       −157.97057     −1.59440      −0.00869
(Existing)  p-value      0.00 × 10^0    1.12 × 10^−1  8.71 × 10^−1   0.00 × 10^0    1.12 × 10^−1  9.93 × 10^−1
Table 3. Mean and variance of elapsed time per epoch for the existing algorithms and Algorithm 1 on the IMDb dataset.

                      Existing   C1         C2         C3         CG-C1      CG-C2      CG-C3
Momentum  mean        19.029660  18.999186  18.957496  19.098836  19.241769  19.286854  18.671163
          variance     0.095132   0.074935   0.107259   0.196841   0.035649   0.058319   0.003906
Adam      mean        20.256827  20.194220  20.193260  20.260705  20.231550  20.388470  19.536741
          variance     0.061552   0.023485   0.041777   0.060461   0.039103   0.174818   0.165803
AMSGrad   mean        20.109489  20.092463  20.102763  20.025613  20.146646  20.136673  19.335856
          variance     0.075432   0.059149   0.059561   0.089540   0.113563   0.098914   0.003543
Table 4. Results of t-test on the training error rates of the existing algorithms (Momentum, Adam, and AMSGrad) and Algorithm 1 (Ci and CG-Ci (i = 1, 2, 3)) on the IMDb dataset (significance level is 5%; the p-values for the proposed algorithms with significantly low error rates are indicated in bold).

                         C1              C2             C3            CG-C1           CG-C2          CG-C3
Momentum    t-statistic  13.87142        0.63115        −4.59306      13.22951        1.71477        −4.59306
(Existing)  p-value      5.17 × 10^−31   5.29 × 10^−1   7.76 × 10^−6  4.82 × 10^−29   8.80 × 10^−2   7.76 × 10^−6
Adam        t-statistic  −63.39972       −11.01275      −0.00287      −63.33435       −9.41552       0.11707
(Existing)  p-value      1.79 × 10^−133  2.61 × 10^−22  9.98 × 10^−1  2.17 × 10^−133  1.24 × 10^−17  9.07 × 10^−1
AMSGrad     t-statistic  −63.53084       −5.68706       −0.63240      −63.42279       −7.93451       −0.06863
(Existing)  p-value      1.21 × 10^−133  4.59 × 10^−8   5.28 × 10^−1  1.67 × 10^−133  1.53 × 10^−13  9.45 × 10^−1
