Article

Training Deep Neural Networks Using Conjugate Gradient-like Methods

Department of Computer Science, Meiji University, 1-1-1 Higashimita, Tama-ku, Kawasaki-shi, Kanagawa 214-8571, Japan
*
Author to whom correspondence should be addressed.
Electronics 2020, 9(11), 1809; https://doi.org/10.3390/electronics9111809
Submission received: 1 October 2020 / Revised: 24 October 2020 / Accepted: 28 October 2020 / Published: 2 November 2020
(This article belongs to the Section Artificial Intelligence)

Abstract

The goal of this article is to accelerate useful adaptive learning rate optimization algorithms, such as AdaGrad, RMSProp, Adam, and AMSGrad, for training deep neural networks. To reach this goal, we devise an iterative algorithm that combines the existing adaptive learning rate optimization algorithms with conjugate gradient-like methods, which are useful for constrained optimization. Convergence analyses show that the proposed algorithm with a small constant learning rate approximates a stationary point of a nonconvex optimization problem in deep learning. Furthermore, it is shown that the proposed algorithm with diminishing learning rates converges to a stationary point of the nonconvex optimization problem. The convergence and performance of the algorithm are demonstrated through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks.

1. Introduction

Deep neural networks are used for many tasks, such as natural language processing, computer vision, and text and image classification (see also [1,2,3] for applications of neural networks), and a number of algorithms have been presented to tune the model parameters of such networks. The appropriate parameters are found by solving nonconvex stochastic optimization problems. In particular, the algorithms solve these problems in order to adapt the learning rates of the model parameters. Accordingly, they are called adaptive learning rate optimization algorithms [4] (Subchapter 8.5), and they include AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7].
Recently, reference [8] performed convergence analyses on adaptive learning rate optimization algorithms for constant learning rates and diminishing learning rates. The convergence analyses indicated that the algorithms with sufficiently small constant learning rates approximate stationary points of the problems [8] (Theorem 3.1). This implies that useful algorithms, such as Adam and AMSGrad, can use constant learning rates to solve the nonconvex stochastic optimization problems in deep learning, in contrast to the analyses in [6,7], which assumed convex objective functions and diminishing learning rates. The analyses also indicated that the algorithms with diminishing learning rates converge to stationary points of the problems and achieve a certain convergence rate [8] (Theorem 3.2). Numerical comparisons showed that the algorithms with constant learning rates perform better than the ones with diminishing learning rates.
Meanwhile, conjugate gradient methods are useful for unconstrained nonconvex deterministic optimization (see [9] for details on conjugate gradient methods). These methods use the conjugate gradient direction (see also (2) for the definition of the conjugate gradient direction with the Fletcher-Reeves formula), and they accelerate the steepest descent method. Conjugate gradient methods converge globally and generate descent directions. In particular, the Hager-Zhang, Polak-Ribière-Polyak, and Hestenes-Stiefel methods have efficient numerical performance [9]. It would be appealing to apply conjugate gradient methods to constrained optimization, because they might accelerate the existing methods for constrained optimization. However, conjugate gradient methods may fail to converge to solutions of constrained optimization problems [10] (Proposition 3.2), so we cannot apply them directly. Indeed, the numerical results in [10] showed that conjugate gradient methods with conventional formulas, such as the Fletcher-Reeves, Polak-Ribière-Polyak, and Hestenes-Stiefel formulas, do not always converge to solutions of constrained optimization problems.
The conjugate gradient direction has been modified so that it can be applied to constrained optimization. The modified direction is called the conjugate gradient-like direction [10,11,12,13,14], and it is obtained by replacing the formula used for finding the conventional conjugate gradient direction with a positive real sequence depending on the number of iterations (see (1) for the definition of the conjugate gradient-like direction). The conjugate gradient-like method with the conjugate gradient-like direction can be applied to constrained convex deterministic optimization. In particular, the conjugate gradient-like method converges to solutions to constrained convex deterministic optimization problems when the step sizes (which are called learning rates) are diminishing [10] (Theorem 3.1). Moreover, the numerical results in [10] showed that it converges faster than the existing steepest descent method.
Roughly speaking, the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) are first-order methods using the steepest descent direction of an observed function at each iteration. Accordingly, using the conjugate gradient-like method would be useful to accelerate these algorithms. Hence, in this article, we propose an iterative method combining the existing adaptive learning rate optimization algorithms [4] (Subchapter 8.5) with the conjugate gradient-like method [10,11,12,13,14].
This article provides two convergence analyses. The first analysis shows that with a small constant learning rate, the proposed algorithm approximates a stationary point of a nonconvex optimization problem in deep learning (Theorem 1). The second analysis shows that with diminishing learning rates, it converges to a stationary point of the nonconvex optimization problem (Theorem 2). The convergence and performance of the proposed algorithm are examined through numerical comparisons with the existing adaptive learning rate optimization algorithms for image and text classification. The numerical results show that the proposed algorithm with a constant learning rate is superior for training neural networks, while the one with diminishing learning rates is not good for training neural networks.
This article is organized as follows. Section 2 gives the mathematical preliminaries and states the main problem. Section 3 presents the proposed algorithm for solving the main problem and analyzes its convergence. Section 4 numerically compares the behaviors of the proposed learning algorithms with those of the existing ones. Section 5 discusses the relationship between the previously reported results and the results in Section 3 and Section 4. Section 6 concludes the paper with a brief summary.

2. Mathematical Preliminaries

2.1. Notation and Definitions

$\mathbb{N}$ denotes the set of all positive integers and zero. $\mathbb{R}^d$ denotes a $d$-dimensional Euclidean space with inner product $\langle \cdot, \cdot \rangle$, which induces the norm $\|\cdot\|$. $\mathbb{S}^d$ denotes the set of $d \times d$ symmetric matrices, i.e., $\mathbb{S}^d = \{X \in \mathbb{R}^{d \times d} : X = X^\top\}$. $\mathbb{S}^d_{++}$ denotes the set of $d \times d$ symmetric positive-definite matrices, i.e., $\mathbb{S}^d_{++} = \{X \in \mathbb{S}^d : X \succ O\}$. $\mathbb{D}^d$ denotes the set of $d \times d$ diagonal matrices, i.e., $\mathbb{D}^d = \{X \in \mathbb{R}^{d \times d} : X = \mathrm{diag}(x_i),\ x_i \in \mathbb{R}\ (i = 1, 2, \ldots, d)\}$. $A \odot B$ denotes the Hadamard product of matrices $A$ and $B$. For all $x := (x_i) \in \mathbb{R}^d$, we have $x \odot x := (x_i^2) \in \mathbb{R}^d$.

Given $H \in \mathbb{S}^d_{++}$, the $H$-inner product of $\mathbb{R}^d$ and the $H$-norm are defined for all $x, y \in \mathbb{R}^d$ by $\langle x, y \rangle_H := \langle x, Hy \rangle$ and $\|x\|_H^2 := \langle x, Hx \rangle$.
The metric projection [15] (Subchapter 4.2, Chapter 28) onto a nonempty, closed convex set $X (\subset \mathbb{R}^d)$, denoted by $P_X$, is defined for all $x \in \mathbb{R}^d$ by $P_X(x) \in X$ and $\|x - P_X(x)\| = \inf_{y \in X}\|x - y\|$. $P_X$ satisfies the nonexpansivity condition, i.e., $\|P_X(x) - P_X(y)\| \le \|x - y\|$ ($x, y \in \mathbb{R}^d$), and satisfies $\mathrm{Fix}(P_X) := \{x \in \mathbb{R}^d : x = P_X(x)\} = X$ [15] (Proposition 4.8, (4.8)). The metric projection onto $X$ under the $H$-norm is denoted by $P_{X,H}$. When $X$ is an affine subspace, a half-space, or a hyperslab, the projection onto $X$ can be computed within a finite number of arithmetic operations [15] (Chapter 28).
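Because Algorithm 1 in Section 3 evaluates such a projection at every iteration, the following minimal NumPy sketch may help make this concrete; it is our own illustration (not code from the paper) of the closed-form projections onto a half-space and a hyperslab:

```python
import numpy as np

def project_halfspace(x, a, b):
    """Projection of x onto the half-space {y : <a, y> <= b}."""
    t = a @ x
    if t <= b:
        return x  # x is already feasible
    return x - ((t - b) / (a @ a)) * a

def project_hyperslab(x, a, lo, hi):
    """Projection of x onto the hyperslab {y : lo <= <a, y> <= hi}."""
    t = a @ x
    return x - ((t - np.clip(t, lo, hi)) / (a @ a)) * a

# Example: project a random point onto {y : -1 <= <a, y> <= 1}.
rng = np.random.default_rng(0)
a, x = rng.standard_normal(5), rng.standard_normal(5)
print(project_hyperslab(x, a, -1.0, 1.0))
```

Both computations finish in a fixed number of arithmetic operations, which is exactly the property used above.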
$\mathbb{E}[X]$ denotes the expectation of a random variable $X$. The history of the process $\xi_0, \xi_1, \ldots$ up to time $n$ is denoted by $\xi_{[n]} = (\xi_0, \xi_1, \ldots, \xi_n)$. For a random process $\xi_0, \xi_1, \ldots$, $\mathbb{E}[X \mid \xi_{[n]}]$ denotes the conditional expectation of $X$ given $\xi_{[n]} = (\xi_0, \xi_1, \ldots, \xi_n)$. Unless stated otherwise, all relations between random variables hold almost surely.

2.2. Stationary Point Problem Associated with Nonconvex Optimization Problem

Let us consider the following problem [8] (see, e.g., Subchapter 1.3.1 in [16] for details on stationary point problems):
Problem 1. 
Assume that
(A1)
$X \subset \mathbb{R}^d$ is a nonempty, closed convex set onto which the projection can be easily computed;
(A2)
$f : \mathbb{R}^d \to \mathbb{R}$, which is defined for all $x \in \mathbb{R}^d$ by $f(x) := \mathbb{E}[F(x,\xi)]$, is well defined, where $F(\cdot,\xi)$ is continuously differentiable for almost every $\xi \in \Xi$, and $\xi$ is a random vector whose probability distribution $P$ is supported on a set $\Xi \subset \mathbb{R}^{d_1}$.
Then, we would like to find a stationary point $x^\star$ of the problem of minimizing $f$ over $X$, i.e.,
$$x^\star \in X^\star := \left\{x^\star \in X : \langle x - x^\star, \nabla f(x^\star)\rangle \ge 0\ (x \in X)\right\},$$
where $\nabla f$ denotes the gradient of $f$.
We can see that, if $X = \mathbb{R}^d$, then $X^\star = \{x^\star \in \mathbb{R}^d : \nabla f(x^\star) = 0\}$, and that, if $f$ is convex, then $x^\star \in X^\star$ is a global minimizer of $f$ over $X$ [16] (Subchapter 1.3.1).
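A standard equivalent characterization (see, e.g., [16]) is that $x^\star \in X^\star$ if and only if $x^\star = P_X(x^\star - \alpha\nabla f(x^\star))$ for any $\alpha > 0$. The following Python sketch, our own illustration with an assumed box constraint and quadratic objective, uses this fixed-point residual to test stationarity:

```python
import numpy as np

def stationarity_residual(x, grad, project, alpha=1.0):
    """||x - P_X(x - alpha * grad(x))||: zero exactly at points of X*."""
    return np.linalg.norm(x - project(x - alpha * grad(x)))

# Example with X = [0, 1]^2 and f(x) = 0.5 * ||x - (2, 0.5)||^2; the minimizer
# of f over X is (1, 0.5), where the residual vanishes.
project = lambda y: np.clip(y, 0.0, 1.0)    # projection onto the box [0, 1]^2
grad = lambda x: x - np.array([2.0, 0.5])   # gradient of f
print(stationarity_residual(np.array([1.0, 0.5]), grad, project))  # 0.0
```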
Problem 1 is examined under the following conditions [8].
(C1)
There is an independent and identically distributed sample $\xi_0, \xi_1, \ldots$ of realizations of the random vector $\xi$;
(C2)
There is an oracle which, for a given input point $(x, \xi) \in \mathbb{R}^d \times \Xi$, returns a stochastic gradient $G(x, \xi)$ such that $\mathbb{E}[G(x,\xi)] = \nabla f(x)$;
(C3)
There exists a positive number $M$ such that, for all $x \in X$, $\mathbb{E}[\|G(x,\xi)\|^2] \le M^2$.
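For intuition, (C1)-(C3) hold, for example, for a least-squares loss with a bounded set $X$. The following Python sketch is our own illustrative oracle; the distribution, loss, and names are assumptions, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([1.0, -2.0, 0.5])

def sample_xi():
    """(C1): draw an i.i.d. realization xi = (a, b) of the random vector."""
    a = rng.standard_normal(3)
    b = a @ w_true + 0.1 * rng.standard_normal()
    return a, b

def G(x, xi):
    """(C2): stochastic gradient of F(x, xi) = 0.5 * (<a, x> - b)^2; it is an
    unbiased estimate of the gradient of f(x) = E[F(x, xi)]. On a bounded set
    X, E[||G(x, xi)||^2] is bounded as well, which gives (C3)."""
    a, b = xi
    return (a @ x - b) * a
```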

3. Conjugate Gradient-Like Method

Algorithm 1 is a method for solving Problem 1 under (C1)–(C3).
First, we would like to emphasize that Algorithm 1 uses a conjugate gradient-like direction [10,11,12,13] (see step 3 in Algorithm 1) defined by
$$\gamma_n = \gamma \in \left(0, \frac{1}{2}\right]\ \text{or}\ \frac{1}{n},\qquad G_n = G(x_n, \xi_n) - \gamma_n G_{n-1}. \tag{1}$$
The direction (1) differs from a conventional conjugate gradient direction using, for example, the Fletcher-Reeves formula,
$$\gamma_n^{\mathrm{FR}} = \frac{\|G(x_n,\xi_n)\|^2}{\|G(x_{n-1},\xi_{n-1})\|^2},\qquad G_n = G(x_n,\xi_n) - \gamma_n^{\mathrm{FR}} G_{n-1}. \tag{2}$$
Although conventional conjugate gradient methods are powerful tools for solving unconstrained smooth nonconvex optimization problems (see, e.g., [9] for details on conjugate gradient methods), iterative methods with the conjugate gradient-like directions are useful for solving constrained smooth optimization problems [10,11,12,13] (see also Section 1 for details). Since Problem 1 is a constrained optimization problem, we will focus on using conjugate gradient-like directions.
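The difference between the two directions is easiest to see in code. The following Python sketch is our own illustration of (1) and (2); the function names are ours:

```python
import numpy as np

def cg_like_direction(g, G_prev, gamma):
    """Conjugate gradient-like direction (1): G_n = G(x_n, xi_n) - gamma_n * G_{n-1},
    where gamma_n is a prescribed scalar (a constant in (0, 1/2] or 1/n)."""
    return g - gamma * G_prev

def fletcher_reeves_direction(g, g_prev, G_prev):
    """Conventional conjugate gradient direction (2): gamma_n^FR is computed
    from the stochastic gradients themselves via the Fletcher-Reeves formula."""
    gamma_fr = (g @ g) / (g_prev @ g_prev)
    return g - gamma_fr * G_prev
```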
Algorithm 1 Conjugate gradient-like method for solving Problem 1

Require: $(\alpha_n)_{n\in\mathbb{N}} \subset (0,1)$, $(\beta_n)_{n\in\mathbb{N}} \subset [0,1)$, $(\gamma_n)_{n\in\mathbb{N}} \subset [0,1/2]$, $\delta \in [0,1)$
 1: $n \leftarrow 0$, $x_0 \in \mathbb{R}^d$, $G_{-1}, m_{-1} \in \mathbb{R}^d$, $H_0 \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$
 2: loop
 3:  $G_n := G(x_n, \xi_n) - \gamma_n G_{n-1}$
 4:  $m_n := \beta_n m_{n-1} + (1-\beta_n)G_n$
 5:  $\hat{m}_n := (1-\delta^{n+1})^{-1} m_n$
 6:  $H_n \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$
 7:  Find $\mathsf{d}_n \in \mathbb{R}^d$ that solves $H_n \mathsf{d} = -\hat{m}_n$
 8:  $x_{n+1} := P_{X,H_n}(x_n + \alpha_n \mathsf{d}_n)$
 9:  $n \leftarrow n+1$
10: end loop
We can see that Algorithm 1 with $\gamma_n = 0$ ($n \in \mathbb{N}$) coincides with the existing algorithm in [8], defined by
$$G_n := G(x_n,\xi_n),\quad m_n := \beta_n m_{n-1} + (1-\beta_n)G_n,\quad \hat{m}_n := (1-\delta^{n+1})^{-1}m_n,\quad x_{n+1} := P_{X,H_n}\left(x_n - \alpha_n H_n^{-1}\hat{m}_n\right), \tag{3}$$
where $H_n \in \mathbb{S}^d_{++} \cap \mathbb{D}^d$. We can also show that algorithm (3) (i.e., Algorithm 1 with $\gamma_n = 0$) includes AMSGrad [7] and Adam [6] by referring to [8] (Section 3). For example, consider $H_n$ and $v_n$ ($n \in \mathbb{N}$) defined for all $n \in \mathbb{N}$ by
$$v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, v_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right), \tag{4}$$
where $v_{-1} = \hat{v}_{-1} = 0 \in \mathbb{R}^d$ and $\zeta \in [0,1)$. Then, algorithm (3) with (4) and $\delta = 0$ is the AMSGrad algorithm. When $H_n$ and $v_n$ ($n \in \mathbb{N}$) are defined for all $n \in \mathbb{N}$ by
$$v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \bar{v}_n := (1-\zeta^{n+1})^{-1}v_n,\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, \bar{v}_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right), \tag{5}$$
algorithm (3) with (5) resembles the Adam algorithm. (The original Adam uses $H_n := \mathrm{diag}(\sqrt{\bar{v}_{n,i}})$ and does not always converge [7] (Theorems 1-3); we use $H_n := \mathrm{diag}(\sqrt{\hat{v}_{n,i}})$ to guarantee convergence. See Theorems 1 and 2 for the convergence of Algorithm 1.)
For example, let us consider Algorithm 1 with (4) and $\delta = 0$, i.e.,
$$G_n := G(x_n,\xi_n) - \gamma_n G_{n-1},\quad m_n := \beta_n m_{n-1} + (1-\beta_n)G_n,\quad v_n := \zeta v_{n-1} + (1-\zeta)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\quad \hat{v}_n = (\hat{v}_{n,i}) := \left(\max\{\hat{v}_{n-1,i}, v_{n,i}\}\right),\quad H_n := \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\quad x_{n+1} := P_{X,H_n}\left(x_n - \alpha_n H_n^{-1}m_n\right). \tag{6}$$
From the above discussion, algorithm (6) with $\gamma_n = 0$ coincides with AMSGrad. We can see that algorithm (6) uses a conjugate gradient-like direction $G_n = G(x_n,\xi_n) - \gamma_n G_{n-1}$, while AMSGrad (algorithm (3) with (4)) uses a gradient direction $G_n = G(x_n,\xi_n)$.
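To make algorithm (6) concrete, the following minimal NumPy sketch is our own illustration of it (the paper's actual implementations are linked in Section 4). The small constant eps added to the diagonal of $H_n$ is a standard numerical-stability device in Adam-type implementations and is our own addition; since $\delta = 0$ here, $\hat{m}_n = m_n$.

```python
import numpy as np

def cg_like_amsgrad(x0, oracle, steps, alpha=1e-3, beta=1e-3, gamma=1e-3,
                    zeta=0.999, project=lambda x: x, eps=1e-8):
    """Sketch of algorithm (6): AMSGrad driven by the conjugate gradient-like
    direction. `oracle(x, n)` returns a stochastic gradient G(x_n, xi_n);
    `project` stands in for the metric projection (identity when X = R^d)."""
    x = x0.astype(float).copy()
    G_prev = np.zeros_like(x)   # G_{-1}
    m = np.zeros_like(x)        # m_{-1}
    v = np.zeros_like(x)        # v_{-1}
    v_hat = np.zeros_like(x)    # v^_{-1}
    for n in range(steps):
        g = oracle(x, n)
        G = g - gamma * G_prev               # step 3: CG-like direction (1)
        m = beta * m + (1.0 - beta) * G      # step 4
        v = zeta * v + (1.0 - zeta) * g * g  # (4): second-moment estimate
        v_hat = np.maximum(v_hat, v)         # (4): monotone maximum
        h = np.sqrt(v_hat) + eps             # diagonal of H_n
        x = project(x - alpha * m / h)       # step 8 with d_n = -H_n^{-1} m_n
        G_prev = G
    return x

# Toy usage: minimize f(x) = E[0.5 * (<a, x> - b)^2] with a linear oracle.
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
def oracle(x, n):
    a = rng.standard_normal(3)
    return (a @ x - a @ w_true) * a
print(cg_like_amsgrad(np.zeros(3), oracle, steps=20000))  # approaches w_true
```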
The convergence analyses of Algorithm 1 assume the following conditions.
Assumption 1. 
The sequence $(H_n)_{n\in\mathbb{N}} \subset \mathbb{S}^d_{++} \cap \mathbb{D}^d$, denoted by $H_n := \mathrm{diag}(h_{n,i})$, in Algorithm 1 satisfies the following conditions:
(A3)
$h_{n+1,i} \ge h_{n,i}$ almost surely for all $n \in \mathbb{N}$ and all $i = 1, 2, \ldots, d$;
(A4)
For all $i = 1, 2, \ldots, d$, a positive number $B_i$ exists such that $\sup\{\mathbb{E}[h_{n,i}] : n \in \mathbb{N}\} \le B_i$.
Moreover,
(A5)
$D := \max_{i=1,2,\ldots,d}\sup\{(x_i - y_i)^2 : (x_i), (y_i) \in X\} < +\infty$.
Assumption (A5) holds under the boundedness condition of $X$, which is assumed in [17] (p. 1574) and [7] (p. 2). In [8] (Section 3), it is shown that $H_n$ and $v_n$ defined by (4) or (5) satisfy (A3) and (A4).

3.1. Constant Learning Rate Rule

The following is the convergence analysis of Algorithm 1 with a constant learning rate. Theorem 1 can be inferred by referring to the proof of Theorem 3.1 in [8]. The proof of Theorem 1 is given in Appendix A.
Theorem 1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n := \alpha$, $\beta_n := \beta$, and $\gamma_n := \gamma$ ($n \in \mathbb{N}$). Then, for all $x \in X$,
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge -\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha - \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta - \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
where $\tilde{\delta} := 1-\delta$, $\tilde{b} := 1-\beta$, $M$ is defined as in (C3), $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\}$, $\tilde{M}^2 := \max\{\|m_{-1}\|^2, 4\hat{M}^2\}$, $D$ is defined as in (A5), and $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} < +\infty$.
Theorem 1 shows that Algorithm 1 with a small constant learning rate approximates a solution to Problem 1. The result for $\gamma := 0$ coincides with Theorem 3.1 in [8].
We have the following proposition for convex stochastic optimization.
Proposition 1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold, $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n := \alpha$, $\beta_n := \beta$, and $\gamma_n := \gamma$ ($n \in \mathbb{N}$). Then,
$$\liminf_{n\to+\infty}\mathbb{E}\left[f(x_n) - f^\star\right] \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
where $f^\star$ denotes the optimal value of the problem of minimizing $f$ over $X$, and $\tilde{\delta}$, $\tilde{b}$, $M$, $\hat{M}$, $\tilde{M}$, $D$, and $\tilde{B}$ are defined as in Theorem 1.
The previously reported results in [7] showed that AMSGrad, which is an example of Algorithm 1 (see algorithm (3) with (4) and $\delta = 0$), ensures that there exists a positive real number $B$ such that
$$\frac{R(T)}{T} = \frac{1}{T}\sum_{t=1}^{T}\left(F(x_t,\xi_t) - f^\star\right) \le B\frac{\sqrt{1+\ln T}}{\sqrt{T}}, \tag{7}$$
where $T$ is the number of training examples and $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$. Inequality (7) indicates that the value $R(T)/T$ generated by AMSGrad has an upper bound; however, it is not guaranteed that AMSGrad solves Problem 1. Meanwhile, Proposition 1 shows that Algorithm 1, which includes Adam and AMSGrad, can approximate a global minimizer of $f$ by using a small constant learning rate.

3.2. Diminishing Learning Rate Rule

The following is the convergence analysis of Algorithm 1 with diminishing learning rates. Theorem 2 can be proven by referring to the proof of Theorem 3.2 in [8]. The proof of Theorem 2 is given in Appendix A.
Theorem 2. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n$, $\beta_n$, and $\gamma_n$ ($n \in \mathbb{N}$) satisfying $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Then, for all $x \in X$,
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge 0. \tag{8}$$
Moreover, suppose that $\alpha_n := 1/n^\eta$, $\beta_n := \beta^n$, and $\gamma_n := \gamma^n$ or $1/n^\kappa$, where $\eta \in [1/2, 1)$, $\kappa > 1-\eta$, and $\beta, \gamma \in (0,1)$. Then, Algorithm 1 achieves the following convergence rate:
$$\frac{1}{n}\sum_{k=1}^{n}\mathbb{E}\left[\langle x - x_k, \nabla f(x_k)\rangle\right] \ge \begin{cases} -O\left(\sqrt{\dfrac{1+\ln n}{n}}\right) & \text{if}\ \eta = \dfrac{1}{2},\\ -O\left(\dfrac{1}{n^{1-\eta}}\right) & \text{if}\ \eta \in \left(\dfrac{1}{2}, 1\right). \end{cases}$$
Inequality (8) implies that there exists a subsequence $(x_{n_j})_{j\in\mathbb{N}}$ of $(x_n)_{n\in\mathbb{N}}$ such that $(x_{n_j})_{j\in\mathbb{N}}$ converges to $x^\star$ and, for all $x \in X$,
$$\lim_{j\to+\infty}\mathbb{E}\left[\langle x - x_{n_j}, \nabla f(x_{n_j})\rangle\right] = \limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge 0,$$
which implies that $x^\star$ satisfies $\langle x - x^\star, \nabla f(x^\star)\rangle \ge 0$ ($x \in X$); i.e., $x^\star$ is a solution to Problem 1.
Theorem 2 leads to the following proposition, which indicates that Algorithm 1 converges to a global minimizer of f when F ( · , ξ ) is convex for almost every ξ Ξ .
Proposition 2. 
Suppose that (A1)-(A5) and (C1)-(C3) hold, $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, and $(x_n)_{n\in\mathbb{N}}$ is the sequence generated by Algorithm 1 with $\alpha_n$, $\beta_n$, and $\gamma_n$ satisfying $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Then,
$$\liminf_{n\to+\infty}\mathbb{E}\left[f(x_n) - f^\star\right] = 0,$$
where $f^\star$ denotes the optimal value of the problem of minimizing $f$ over $X$. Moreover, suppose that $\alpha_n := 1/n^\eta$, $\beta_n := \beta^n$, and $\gamma_n := \gamma^n$ or $1/n^\kappa$, where $\eta \in [1/2, 1)$, $\kappa > 1-\eta$, and $\beta, \gamma \in (0,1)$. Then, any accumulation point of $(\tilde{x}_n)_{n\in\mathbb{N}}$ defined by $\tilde{x}_n := (1/n)\sum_{k=1}^{n}x_k$ almost surely belongs to the solution set $X^\star$, and Algorithm 1 achieves the following convergence rate:
$$\mathbb{E}\left[f(\tilde{x}_n) - f^\star\right] = \begin{cases} O\left(\sqrt{\dfrac{1+\ln n}{n}}\right) & \text{if}\ \eta = \dfrac{1}{2},\\ O\left(\dfrac{1}{n^{1-\eta}}\right) & \text{if}\ \eta \in \left(\dfrac{1}{2}, 1\right). \end{cases}$$
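The diminishing learning rates appearing in Theorem 2 and Proposition 2 are easy to implement; the following Python sketch, our own illustration, also inspects the summability conditions numerically for one admissible setting:

```python
import numpy as np

def alpha(n, eta=0.75):
    """alpha_n = 1 / n^eta with eta in [1/2, 1)."""
    return 1.0 / n ** eta

def beta(n, b=0.5):
    """beta_n = b^n with b in (0, 1)."""
    return b ** n

def gamma(n, g=0.5, kappa=None):
    """gamma_n = g^n, or 1/n^kappa with kappa > 1 - eta when kappa is given."""
    return 1.0 / n ** kappa if kappa is not None else g ** n

# Partial sums over 10^6 iterations for eta = 3/4 and kappa = 1: the first sum
# keeps growing (sum alpha_n = +infinity), while sum alpha_n^2,
# sum alpha_n*beta_n, and sum alpha_n*gamma_n stay bounded, as Theorem 2 requires.
ns = np.arange(1, 10**6)
a = alpha(ns)
print(a.sum(), (a**2).sum(), (a * beta(ns)).sum(), (a * gamma(ns, kappa=1.0)).sum())
```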

4. Numerical Experiments

The experiments used a fast scalar computation server (https://www.meiji.ac.jp/isys/hpc/ia.html) at Meiji University. The environment has two Intel(R) Xeon(R) Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16 GB, 900 GB/s) GPU, and the Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.8.2 with the NumPy 1.19.1 and PyTorch 1.5.0 packages.
We compared the existing algorithms, such as the momentum method [18] (9), [19] (Section 2), AdaGrad [5], RMSProp [4] (Algorithm 8.5), Adam [6], and AMSGrad [7] in torch.optim (https://pytorch.org/docs/stable/optim.html) using the default values and learning rate $10^{-3}$, with Algorithm 1 defined as follows:
Algorithm 1 with a constant learning rate (Algorithm 1 with $\gamma_n = 0$, such as Momentum-Ci, Adam-Ci, and AMSGrad-Ci ($i = 1, 2, 3$), is Algorithm 1 in [8]):
  • Momentum-C1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$ (the identity matrix), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • Momentum-C2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • Momentum-C3: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • MomentumCG-C1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • MomentumCG-C2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • MomentumCG-C3: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
  • Adam-C1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • Adam-C2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • Adam-C3: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • AdamCG-C1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • AdamCG-C2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • AdamCG-C3: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
  • AMSGrad-C1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-1}$, and $\gamma_n = 0$.
  • AMSGrad-C2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-2}$, and $\gamma_n = 0$.
  • AMSGrad-C3: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), $\alpha_n = \beta_n = 10^{-3}$, and $\gamma_n = 0$.
  • AMSGradCG-C1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-1}$.
  • AMSGradCG-C2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-2}$.
  • AMSGradCG-C3: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\alpha_n = \beta_n = \gamma_n = 10^{-3}$.
Algorithm 1 with diminishing learning rates $\alpha_n = 1/\sqrt{n}$ and $\beta_n = 1/2^n$ based on [7] (Theorem 4 and Corollary 1) (Algorithm 1 with $\gamma_n = 0$, such as Momentum-D1, Adam-D1, and AMSGrad-D1, is Algorithm 1 in [8]):
  • Momentum-D1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 0$.
  • MomentumCG-D1: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 1/2^n$.
  • MomentumCG-D2: Algorithm 1 with $\delta = 0$, $H_n = \mathrm{diag}(1)$, and $\gamma_n = 1/n$.
  • Adam-D1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 0$.
  • AdamCG-D1: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 1/2^n$.
  • AdamCG-D2: Algorithm 1 with $\delta = 0.9$, $\zeta = 0.999$, $H_n$ defined by (5), and $\gamma_n = 1/n$.
  • AMSGrad-D1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 0$.
  • AMSGradCG-D1: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 1/2^n$.
  • AMSGradCG-D2: Algorithm 1 with $\delta = 0$, $\zeta = 0.999$, $H_n$ defined by (4), and $\gamma_n = 1/n$.
Python implementations of the algorithms are available at https://github.com/iiduka-researches/202008-cg-like.

4.1. Image Classification

This experiment used the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html), a benchmark for image classification. The dataset consists of 60,000 color images ($32 \times 32$) in 10 classes, with 6000 images per class. There are 50,000 training images and 10,000 test images, and the test batch contains exactly 1000 randomly selected images from each class. We trained a 44-layer ResNet (ResNet-44) [20] comprising 43 convolutional layers with $3 \times 3$ filters followed by a 10-way fully connected layer with a softmax function. We used the cross entropy as the loss function for fitting ResNet, in accordance with the commonly used strategy in image classification.
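A minimal PyTorch sketch of this setup follows; it is our own illustration, with an assumed transform, batch size, stand-in network, and the Adam baseline as the optimizer (the paper trains ResNet-44, and its actual code is available at the GitHub link above):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# CIFAR-10 loaders (50,000 training images); ToTensor and batch size 128 are
# our own assumptions, not settings reported in the paper.
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Stand-in network: the paper trains ResNet-44; any nn.Module with a 10-way
# output fits the loop below.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
criterion = nn.CrossEntropyLoss()  # cross-entropy loss, as in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam baseline

for epoch in range(1):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```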
Figure 1, Figure 2 and Figure 3 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad using the default values in torch.optim (i.e., $\alpha_n = 10^{-3}$, $\beta_n = 0.9$). Figure 1 shows that Momentum-C1, MomentumCG-C1, and AMSGrad-C2 minimized the training loss function faster than the existing algorithms, and Figure 2 shows that they decreased the training error rate faster as well. Moreover, AdamCG-Ci (resp. AMSGradCG-Ci) ($i = 2, 3$) outperformed AdamCG-C1 (resp. AMSGradCG-C1); this implies that AdamCG and AMSGradCG require fewer iterations when smaller learning rates are used. Figure 3 shows that Adam-C2, AdamCG-C2, AMSGrad-C2, and AMSGradCG-C2 decreased the test error rate faster than the other algorithms. A similar trend was observed in the numerical results in [21].
Figure 4, Figure 5 and Figure 6 plot the behaviors of the proposed algorithms with diminishing learning rates. These algorithms did not work; thus, using diminishing learning rates is clearly not good for training neural networks (see Section 5 for the details). A similar problem was observed in the numerical results in [8].
Table 1 shows the mean and variance of the elapsed time per epoch for the existing algorithms and Algorithm 1 with a constant learning rate. The table indicates that the elapsed time of Momentum was almost the same as those of the corresponding proposed algorithms, i.e., Momentum-Ci and MomentumCG-Ci ($i = 1, 2, 3$). Adam and AMSGrad showed the same trend.
Table 2 compares the training error rates of the existing algorithms with those of Algorithm 1 by using the scipy.stats.ttest_ind function in Python. The p-value is the probability associated with a t-test, and the significance level is set at 5%; if the p-value is less than 0.05, then there is a significant difference between the existing algorithm and the proposed algorithms. Table 2 and Figure 2 indicate that Momentum-C1 and MomentumCG-C1 outperformed Momentum and that the performance of the existing algorithm (Momentum) was significantly different from the performances of the proposed algorithms (Momentum-C1 and MomentumCG-C1). Adam-Ci and AdamCG-Ci ($i = 1, 2, 3$) had almost the same performance as Adam, and the performance of AMSGrad was not significantly different from that of AMSGrad-Ci and AMSGradCG-Ci ($i = 1, 2, 3$).

4.2. Text Classification

This experiment used the IMDb dataset (https://datasets.imdbws.com/) for text classification tasks. The dataset contains 50,000 movie reviews along with their associated binary sentiment polarity labels and is split into 25,000 training and 25,000 test reviews. We used an embedding layer that generated 50-dimensional embedding vectors, two bidirectional long short-term memory (LSTM) layers, and an affine output layer with a sigmoid activation function. To train the model, we used the binary cross entropy (BCE) as the loss function minimized by the existing and proposed algorithms.
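A minimal PyTorch sketch of this classifier follows; it is our own illustration, in which the vocabulary size and hidden width are assumptions, and BCEWithLogitsLoss folds the output sigmoid into the binary cross entropy for numerical stability:

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    """50-dimensional embedding, two bidirectional LSTM layers, and an affine
    output layer, mirroring the architecture described above."""
    def __init__(self, vocab_size=20000, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 50)
        self.lstm = nn.LSTM(50, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 1)  # affine layer; sigmoid is in the loss

    def forward(self, tokens):
        out, _ = self.lstm(self.embed(tokens))
        return self.fc(out[:, -1, :]).squeeze(-1)  # one logit per review

model = SentimentLSTM()
criterion = nn.BCEWithLogitsLoss()  # binary cross entropy with built-in sigmoid
logits = model(torch.randint(0, 20000, (4, 60)))  # batch of 4 token sequences
loss = criterion(logits, torch.ones(4))
```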
Figure 7, Figure 8 and Figure 9 compare the behaviors of the proposed algorithm with a constant learning rate with those of Momentum, AdaGrad, RMSProp, Adam, and AMSGrad, using the default values in torch.optim (i.e., $\alpha_n = 10^{-3}$, $\beta_n = 0.9$). These figures show that Adam-C3, AdamCG-C3, AMSGrad-C3, RMSProp, Adam, and AMSGrad all performed well. In particular, Figure 8 shows that AdamCG-C3 (resp. AMSGradCG-C3) performed better than Adam-C3 (resp. AMSGrad-C3), which implies that using conjugate gradient-like directions would be good for training neural networks.
Figure 10, Figure 11 and Figure 12 indicate the behaviors of the proposed algorithms with diminishing learning rates. These figures show that the algorithms did not work, as was the case in Figure 4, Figure 5 and Figure 6 (see Section 5 for the details).
Table 3 indicates that the elapsed times for the existing algorithms were almost the same as those for the proposed algorithms, as seen in Table 1. Table 4 and Figure 8 show that, although Momentum, Momentum-Ci, and MomentumCG-Ci did not perform better than the existing algorithms such as Adam and AMSGrad, the performance of Momentum was significantly different from that of almost all of the proposed algorithms. It can be seen that Adam, Adam-C3, and AdamCG-C3 performed well and that, although AMSGrad, AMSGrad-C3, and AMSGradCG-C3 did not perform better than Adam, AMSGrad-C3 and AMSGradCG-C3 had almost the same performance as AMSGrad.

5. Discussion

Let us first discuss the relationship between the momentum method [18] (9), [19] (Section 2) and MomentumCG used in Section 4. The momentum method [18] (9), [19] (Section 2) is defined by
$$m_n := -\epsilon G(x_n,\xi_n) + \mu m_{n-1},\quad x_{n+1} := P_X(x_n + m_n),\ \text{i.e.,}$$
$$x_{n+1} := P_X\left(x_n - \epsilon G(x_n,\xi_n) + \mu m_{n-1}\right), \tag{9}$$
where $\epsilon > 0$ is the learning rate and $\mu \in [0,1]$ is the momentum coefficient. We can see that $m_n$ defined by (9) is the conjugate gradient-like direction of $-\epsilon G(x_n,\xi_n)$. Meanwhile, MomentumCG used in Section 4 is as follows:
$$G_n = G(x_n,\xi_n) - \gamma_n G_{n-1},$$
$$m_n := (1-\beta_n)G_n + \beta_n m_{n-1},$$
$$x_{n+1} := P_X(x_n - \alpha_n m_n). \tag{10}$$
Algorithm (10) uses the conjugate gradient-like direction $G_n$ of $G(x_n,\xi_n)$. For simplicity, algorithm (10) with $\beta_n = 0$ is such that
$$x_{n+1} := P_X\left(x_n - \alpha_n G(x_n,\xi_n) + \alpha_n\gamma_n m_{n-1}\right), \tag{11}$$
which implies that algorithm (11) is the momentum method with a learning rate $\alpha_n$ and momentum coefficient $\alpha_n\gamma_n$.
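This correspondence is easy to check numerically. The following NumPy sketch, our own verification with illustrative constants, confirms that updates (10) (with $\beta_n = 0$) and (11) generate identical iterates when $X = \mathbb{R}^d$:

```python
import numpy as np

rng = np.random.default_rng(2)
grads = [rng.standard_normal(3) for _ in range(50)]
alpha, gamma = 0.1, 0.05
x_cg, x_mom = np.zeros(3), np.zeros(3)
m_cg, m_mom = np.zeros(3), np.zeros(3)

for g in grads:
    # Update (10) with beta_n = 0: m_n = G_n = g - gamma * m_{n-1}.
    m_cg = g - gamma * m_cg
    x_cg = x_cg - alpha * m_cg
    # Momentum form (11): x_{n+1} = x_n - alpha * g + alpha * gamma * m_{n-1}.
    x_mom = x_mom - alpha * g + alpha * gamma * m_mom
    m_mom = g - gamma * m_mom

print(np.allclose(x_cg, x_mom))  # True: the two update rules coincide
```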
The numerical comparisons in Section 4 show that Algorithm 1 with a constant learning rate performed better than Algorithm 1 with diminishing learning rates. For example, let us consider the text classification in Section 4.2 and compare AdamCG-C3, defined by
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 10^{-3}G_{n-1},\\ m_n &:= 10^{-3}m_{n-1} + (1-10^{-3})G_n,\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - 10^{-3}H_n^{-1}\hat{m}_n\right), \end{aligned} \tag{12}$$
with AdamCG-D1, defined by
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 2^{-n}G_{n-1},\\ m_n &:= 2^{-n}m_{n-1} + (1-2^{-n})G_n,\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - n^{-1/2}H_n^{-1}\hat{m}_n\right). \end{aligned} \tag{13}$$
AdamCG-C3 (algorithm (12)) works well for all $n \in \mathbb{N}$, since it uses a constant learning rate. Meanwhile, AdamCG-D1 (algorithm (13)) may fail to make progress after a large number of iterations, because it uses diminishing learning rates. In fact, for a large $n$, AdamCG-D1 (algorithm (13)) behaves as follows:
$$\begin{aligned} G_n &:= G(x_n,\xi_n) - 2^{-n}G_{n-1} \approx G(x_n,\xi_n),\\ m_n &:= 2^{-n}m_{n-1} + (1-2^{-n})G_n \approx G_n \approx G(x_n,\xi_n),\\ \hat{m}_n &:= (1-0.9^{n+1})^{-1}m_n,\\ v_n &:= 0.999\,v_{n-1} + (1-0.999)\,G(x_n,\xi_n)\odot G(x_n,\xi_n),\\ \bar{v}_n &:= (1-0.999^{n+1})^{-1}v_n,\\ \hat{v}_n = (\hat{v}_{n,i}) &:= \left(\max\{\hat{v}_{n-1,i},\bar{v}_{n,i}\}\right),\\ H_n &:= \mathrm{diag}\left(\sqrt{\hat{v}_{n,i}}\right),\\ x_{n+1} &:= P_{X,H_n}\left(x_n - n^{-1/2}H_n^{-1}\hat{m}_n\right) \approx P_{X,H_n}(x_n) = x_n, \end{aligned} \tag{14}$$
which implies that algorithm (14) barely updates the iterates. As can be seen in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12, Algorithm 1 with diminishing learning rates would not be good for training neural networks.
Finally, let us compare the existing algorithm with Algorithm 1; in particular, AMSGrad in torch.optim using $\alpha_n = 10^{-3}$, $\beta_n = 0.9$, and $\zeta = 0.999$ with AMSGrad-C3 using $\alpha_n = 10^{-3}$, $\beta_n = 10^{-3}$, and $\zeta = 0.999$. The difference between AMSGrad and AMSGrad-C3 is the setting of $\beta_n$. According to Figure 7, Figure 8 and Figure 9, AMSGrad-C3 performs comparably to AMSGrad, a useful algorithm. These results are guaranteed by Theorem 1, which indicates that Algorithm 1 with a small constant learning rate approximates a stationary point of the minimization problem in deep neural networks; more specifically, the sequence $(x_n)_{n\in\mathbb{N}}$ generated by AMSGrad-C3 (Algorithm 1 with $\delta = 0$) satisfies
$$\limsup_{n\to+\infty}\mathbb{E}\left[\langle x - x_n, \nabla f(x_n)\rangle\right] \ge -\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}}\cdot 10^{-3} - \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}}\cdot 10^{-3}\quad (x \in X)$$
and hence approximates $x^\star \in X^\star := \{x^\star \in X : \langle x - x^\star, \nabla f(x^\star)\rangle \ge 0\ (x \in X)\}$.

6. Conclusions

We proposed an iterative algorithm with conjugate gradient-like directions for nonconvex optimization in deep neural networks in order to accelerate conventional adaptive learning rate optimization algorithms. We presented two convergence analyses of the algorithm. The first showed that the algorithm with a constant learning rate approximates a stationary point of a nonconvex optimization problem. The second showed that the algorithm with diminishing learning rates converges to a stationary point of the nonconvex optimization problem. We gave numerical results for concrete neural networks. The results showed that the proposed algorithm with a constant learning rate is superior for training neural networks from the viewpoints of theory and practice, while the proposed algorithm with diminishing learning rates is not good for training neural networks. The reason is that a constant learning rate keeps the algorithm updating throughout training, whereas a diminishing learning rate becomes approximately zero after a large number of iterations, so the iterates are barely updated.

Author Contributions

Conceptualization, H.I.; methodology, H.I.; software, Y.K.; validation, H.I. and Y.K.; formal analysis, H.I.; investigation, H.I. and Y.K.; resources, H.I.; data curation, Y.K.; writing—original draft preparation, H.I.; writing—review and editing, H.I.; visualization, H.I. and Y.K.; supervision, H.I.; project administration, H.I.; funding acquisition, H.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JSPS KAKENHI, Grant Number JP18K11184.

Acknowledgments

The authors would like to thank Michelle Zhou for giving us a chance to submit our paper to this journal. We are sincerely grateful to Assistant Editor Elliot Guo and the two referees for helping us improve the original manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs of Theorems 1 and 2 and Propositions 1 and 2

This section refers to [8]. Let us first prove the following lemma.
Lemma A1. 
Suppose that (A1)-(A2) and (C1)-(C2) hold. Then, for all $x \in X$ and all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|x_{n+1}-x\|_{H_n}^2\right] \le \mathbb{E}\left[\|x_n-x\|_{H_n}^2\right] + \frac{2\alpha_n}{1-\delta^{n+1}}\left\{(1-\beta_n)\mathbb{E}\left[\langle x-x_n,\nabla f(x_n)\rangle\right] + \beta_n\mathbb{E}\left[\langle x-x_n,m_{n-1}\rangle\right] - (1-\beta_n)\gamma_n\mathbb{E}\left[\langle x-x_n,G_{n-1}\rangle\right]\right\} + \alpha_n^2\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right].$$
Proof. 
Choose $x \in X$ and $n \in \mathbb{N}$. The definition of $x_{n+1}$ and the nonexpansivity of $P_{X,H_n}$ imply that, almost surely,
$$\|x_{n+1}-x\|_{H_n}^2 \le \|(x_n-x)+\alpha_n\mathsf{d}_n\|_{H_n}^2 = \|x_n-x\|_{H_n}^2 + 2\alpha_n\langle x_n-x,\mathsf{d}_n\rangle_{H_n} + \alpha_n^2\|\mathsf{d}_n\|_{H_n}^2. \tag{A1}$$
The definitions of $\mathsf{d}_n$, $m_n$, and $\hat{m}_n$ ensure that
$$\langle x_n-x,\mathsf{d}_n\rangle_{H_n} = \frac{1}{\tilde{\delta}_n}\langle x-x_n,m_n\rangle = \frac{\beta_n}{\tilde{\delta}_n}\langle x-x_n,m_{n-1}\rangle + \frac{1-\beta_n}{\tilde{\delta}_n}\langle x-x_n,G_n\rangle,$$
where $\tilde{\delta}_n := 1-\delta^{n+1}$. Moreover, the definition of $G_n$ implies that
$$\langle x-x_n,G_n\rangle = \langle x-x_n,G(x_n,\xi_n)\rangle - \gamma_n\langle x-x_n,G_{n-1}\rangle.$$
Hence, almost surely,
$$\|x_{n+1}-x\|_{H_n}^2 \le \|x_n-x\|_{H_n}^2 + 2\alpha_n\left\{\frac{\beta_n}{\tilde{\delta}_n}\langle x-x_n,m_{n-1}\rangle + \frac{1-\beta_n}{\tilde{\delta}_n}\langle x-x_n,G(x_n,\xi_n)\rangle - \frac{(1-\beta_n)\gamma_n}{\tilde{\delta}_n}\langle x-x_n,G_{n-1}\rangle\right\} + \alpha_n^2\|\mathsf{d}_n\|_{H_n}^2.$$
The conditions $x_n = x_n(\xi_{[n-1]})$ ($n \in \mathbb{N}$), (C1), and (C2) imply that
$$\mathbb{E}\left[\langle x-x_n,G(x_n,\xi_n)\rangle\right] = \mathbb{E}\left[\mathbb{E}\left[\langle x-x_n,G(x_n,\xi_n)\rangle \mid \xi_{[n-1]}\right]\right] = \mathbb{E}\left[\langle x-x_n,\mathbb{E}\left[G(x_n,\xi_n)\mid\xi_{[n-1]}\right]\rangle\right] = \mathbb{E}\left[\langle x-x_n,\nabla f(x_n)\rangle\right].$$
Taking the expectation of (A1) leads to the assertion of Lemma A1. □
Lemma A2. 
If (C3) holds, then, for all $n \in \mathbb{N}$, $\mathbb{E}[\|G_n\|^2] \le 4\hat{M}^2$ and $\mathbb{E}[\|m_n\|^2] \le \tilde{M}^2$, where $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\}$ and $\tilde{M}^2 := \max\{\|m_{-1}\|^2, 4\hat{M}^2\}$. Moreover, if (A3) holds, then, for all $n \in \mathbb{N}$, $\mathbb{E}[\|\mathsf{d}_n\|_{H_n}^2] \le \tilde{B}^2\tilde{M}^2/(1-\delta)^2$, where $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} < +\infty$.
Proof. 
Let us define $\hat{M}^2 := \max\{M^2, \|G_{-1}\|^2\} < +\infty$, where $M$ is defined as in (C3). Let us consider the case where $n = 0$. The inequality $\|x+y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ ($x, y \in \mathbb{R}^d$) ensures that
$$\|G_0\|^2 \le 2\|G(x_0,\xi_0)\|^2 + 2\gamma_0^2\|G_{-1}\|^2, \tag{A2}$$
which, together with $\gamma_n \le 1/2$ ($n \in \mathbb{N}$) and the definition of $\hat{M}$, implies that
$$\mathbb{E}\left[\|G_0\|^2\right] \le 2M^2 + 2\cdot\frac{1}{4}\cdot 4\hat{M}^2 \le 4\hat{M}^2.$$
Assume that $\mathbb{E}[\|G_n\|^2] \le 4\hat{M}^2$ for some $n \in \mathbb{N}$. The same discussion as for (A2) ensures that
$$\mathbb{E}\left[\|G_{n+1}\|^2\right] \le 2\mathbb{E}\left[\|G(x_{n+1},\xi_{n+1})\|^2\right] + 2\gamma_{n+1}^2\mathbb{E}\left[\|G_n\|^2\right] \le 2M^2 + 2\cdot\frac{1}{4}\cdot 4\hat{M}^2 \le 4\hat{M}^2.$$
Accordingly, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|G_n\|^2\right] \le 4\hat{M}^2. \tag{A3}$$
From the definition of $m_n$, the convexity of $\|\cdot\|^2$, and (A3), for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_n\|^2\right] \le \beta_n\mathbb{E}\left[\|m_{n-1}\|^2\right] + (1-\beta_n)\mathbb{E}\left[\|G_n\|^2\right] \le \beta_n\mathbb{E}\left[\|m_{n-1}\|^2\right] + 4\hat{M}^2(1-\beta_n).$$
Hence, induction leads to, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|m_n\|^2\right] \le \tilde{M}^2 := \max\left\{\|m_{-1}\|^2, 4\hat{M}^2\right\} < +\infty. \tag{A4}$$
Given $n \in \mathbb{N}$, $H_n \succ O$ ensures that there exists a unique matrix $\bar{H}_n \succ O$ such that $H_n = \bar{H}_n^2$ [22] (Theorem 7.2.6). From $\|x\|_{H_n}^2 = \|\bar{H}_nx\|^2$ ($x \in \mathbb{R}^d$) and the definitions of $\mathsf{d}_n$ and $\hat{m}_n$, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right] = \mathbb{E}\left[\|\bar{H}_n^{-1}H_n\mathsf{d}_n\|^2\right] \le \frac{1}{\tilde{\delta}_n^2}\mathbb{E}\left[\|\bar{H}_n^{-1}\|^2\|m_n\|^2\right],$$
where $\tilde{\delta}_n := 1-\delta^{n+1} \ge 1-\delta$ and $\|\bar{H}_n^{-1}\| = \|\mathrm{diag}(h_{n,i}^{-1/2})\| = \max_{i=1,2,\ldots,d}h_{n,i}^{-1/2}$ ($n \in \mathbb{N}$). From (A4) and $\tilde{B} := \sup\{\max_{i=1,2,\ldots,d}h_{n,i}^{-1/2} : n \in \mathbb{N}\} \le \max_{i=1,2,\ldots,d}h_{0,i}^{-1/2} < +\infty$ (by (A3)), we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[\|\mathsf{d}_n\|_{H_n}^2\right] \le \frac{\tilde{B}^2\tilde{M}^2}{(1-\delta)^2},$$
which completes the proof. □
The convergence rate analysis of Algorithm 1 is as follows.
Theorem A1. 
Suppose that (A1)-(A5) and (C1)-(C3) hold and that $(\theta_n)_{n\in\mathbb{N}}$ defined by $\theta_n := \alpha_n(1-\beta_n)/(1-\delta^{n+1})$ and $(\beta_n)_{n\in\mathbb{N}}$ satisfy $\theta_{n+1} \le \theta_n$ ($n \in \mathbb{N}$) and $\limsup_{n\to+\infty}\beta_n < 1$. Let $V_n(x) := \mathbb{E}[\langle x_n - x, \nabla f(x_n)\rangle]$ for all $x \in X$ and all $n \in \mathbb{N}$. Then, for all $x \in X$ and all $n \ge 1$,
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le \frac{D\sum_{i=1}^{d}B_i}{2\tilde{b}n\alpha_n} + \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2n}\sum_{k=1}^{n}\alpha_k + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}n}\sum_{k=1}^{n}\beta_k + \frac{2\sqrt{Dd}\,\hat{M}}{n}\sum_{k=1}^{n}\gamma_k,$$
where $(\beta_n)_{n\in\mathbb{N}} \subset (0,b] \subset (0,1)$, $\tilde{b} := 1-b$, $\tilde{\delta} := 1-\delta$, $\hat{M}$, $\tilde{M}$, and $\tilde{B}$ are defined as in Lemma A2, and $D$ and $B_i$ are defined as in Assumption 1.
Proof. 
Let $x \in X$ be fixed arbitrarily. Lemma A1 guarantees that, for all $k \in \mathbb{N}$,
$$V_k(x) \le \frac{1}{2\theta_k}\left(\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right] - \mathbb{E}\left[\|x_{k+1}-x\|_{H_k}^2\right]\right) + \frac{\beta_k}{1-\beta_k}\mathbb{E}\left[\langle x-x_k,m_{k-1}\rangle\right] + \gamma_k\mathbb{E}\left[\langle x_k-x,G_{k-1}\rangle\right] + \frac{\alpha_k\tilde{\delta}_k}{2(1-\beta_k)}\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right],$$
where $\tilde{\delta}_n := 1-\delta^{n+1} \le 1$ ($n \in \mathbb{N}$). The condition $\limsup_{n\to+\infty}\beta_n < 1$ ensures the existence of $b > 0$ such that, for all $n \in \mathbb{N}$, $\beta_n \le b < 1$. Let $\tilde{b} := 1-b$. Then, for all $n \ge 1$, we have
$$\sum_{k=1}^{n}V_k(x) \le \frac{1}{2}\underbrace{\sum_{k=1}^{n}\frac{1}{\theta_k}\left(\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right] - \mathbb{E}\left[\|x_{k+1}-x\|_{H_k}^2\right]\right)}_{\Theta_n} + \underbrace{\sum_{k=1}^{n}\frac{\beta_k}{1-\beta_k}\mathbb{E}\left[\langle x-x_k,m_{k-1}\rangle\right]}_{B_n} + \underbrace{\sum_{k=1}^{n}\gamma_k\mathbb{E}\left[\langle x_k-x,G_{k-1}\rangle\right]}_{\Gamma_n} + \frac{1}{2\tilde{b}}\underbrace{\sum_{k=1}^{n}\alpha_k\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right]}_{A_n}. \tag{A5}$$
The definition of $\Theta_n$ and $\mathbb{E}[\|x_{n+1}-x\|_{H_n}^2]/\theta_n \ge 0$ imply that
$$\Theta_n \le \frac{\mathbb{E}\left[\|x_1-x\|_{H_1}^2\right]}{\theta_1} + \underbrace{\sum_{k=2}^{n}\left(\frac{\mathbb{E}\left[\|x_k-x\|_{H_k}^2\right]}{\theta_k} - \frac{\mathbb{E}\left[\|x_k-x\|_{H_{k-1}}^2\right]}{\theta_{k-1}}\right)}_{\tilde{\Theta}_n}. \tag{A6}$$
Accordingly,
$$\tilde{\Theta}_n = \mathbb{E}\left[\sum_{k=2}^{n}\left(\frac{\|\bar{H}_k(x_k-x)\|^2}{\theta_k} - \frac{\|\bar{H}_{k-1}(x_k-x)\|^2}{\theta_{k-1}}\right)\right],$$
where, for all $k \in \mathbb{N}$ and all $x := (x_i) \in \mathbb{R}^d$,
$$\bar{H}_k = \mathrm{diag}\left(\sqrt{h_{k,i}}\right)\quad\text{and}\quad \|\bar{H}_kx\|^2 = \sum_{i=1}^{d}h_{k,i}x_i^2. \tag{A7}$$
Thus, for all $n \ge 2$,
$$\tilde{\Theta}_n = \mathbb{E}\left[\sum_{k=2}^{n}\sum_{i=1}^{d}\left(\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}}\right)(x_{k,i}-x_i)^2\right].$$
The condition $\theta_k \le \theta_{k-1}$ ($k \ge 1$) and (A3) imply that, for all $k \ge 1$ and all $i = 1, 2, \ldots, d$,
$$\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}} \ge 0.$$
Hence, for all $n \ge 2$,
$$\tilde{\Theta}_n \le D\,\mathbb{E}\left[\sum_{k=2}^{n}\sum_{i=1}^{d}\left(\frac{h_{k,i}}{\theta_k} - \frac{h_{k-1,i}}{\theta_{k-1}}\right)\right] = D\,\mathbb{E}\left[\sum_{i=1}^{d}\left(\frac{h_{n,i}}{\theta_n} - \frac{h_{1,i}}{\theta_1}\right)\right],$$
where $\max_{i=1,2,\ldots,d}\sup\{(x_{n,i}-x_i)^2 : n \in \mathbb{N}\} \le D < +\infty$ (by (A5)). Therefore, (A6), $\mathbb{E}[\|x_1-x\|_{H_1}^2]/\theta_1 \le D\,\mathbb{E}[\sum_{i=1}^{d}h_{1,i}/\theta_1]$, and (A4) imply, for all $n \in \mathbb{N}$,
$$\Theta_n \le D\,\mathbb{E}\left[\sum_{i=1}^{d}\frac{h_{1,i}}{\theta_1}\right] + D\,\mathbb{E}\left[\sum_{i=1}^{d}\left(\frac{h_{n,i}}{\theta_n} - \frac{h_{1,i}}{\theta_1}\right)\right] = \frac{D}{\theta_n}\mathbb{E}\left[\sum_{i=1}^{d}h_{n,i}\right] \le \frac{D}{\theta_n}\sum_{i=1}^{d}B_i,$$
which, together with $\theta_n := \alpha_n(1-\beta_n)/(1-\delta^{n+1}) \ge \tilde{b}\alpha_n$, implies
$$\Theta_n \le \frac{D\sum_{i=1}^{d}B_i}{\tilde{b}\alpha_n}. \tag{A8}$$
The Cauchy-Schwarz inequality, together with $\max_{i=1,2,\ldots,d}\sup\{(x_{n,i}-x_i)^2 : n \in \mathbb{N}\} \le D < +\infty$ (by (A5)) and $\mathbb{E}[\|m_n\|] \le \tilde{M}$ ($n \in \mathbb{N}$) (by Lemma A2), guarantees that, for all $n \in \mathbb{N}$,
$$B_n \le \frac{\sqrt{Dd}}{\tilde{b}}\sum_{k=1}^{n}\beta_k\mathbb{E}\left[\|m_{k-1}\|\right] \le \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}}\sum_{k=1}^{n}\beta_k. \tag{A9}$$
A discussion similar to the one for obtaining (A9), together with $\mathbb{E}[\|G_n\|] \le 2\hat{M}$ ($n \in \mathbb{N}$) (by Lemma A2), implies that
$$\Gamma_n \le \sqrt{Dd}\sum_{k=1}^{n}\gamma_k\mathbb{E}\left[\|G_{k-1}\|\right] \le 2\sqrt{Dd}\,\hat{M}\sum_{k=1}^{n}\gamma_k. \tag{A10}$$
Since $\mathbb{E}[\|\mathsf{d}_n\|_{H_n}^2] \le \tilde{B}^2\tilde{M}^2/(1-\delta)^2$ ($n \in \mathbb{N}$) holds (by Lemma A2), we have, for all $n \in \mathbb{N}$,
$$A_n := \sum_{k=1}^{n}\alpha_k\mathbb{E}\left[\|\mathsf{d}_k\|_{H_k}^2\right] \le \frac{\tilde{B}^2\tilde{M}^2}{(1-\delta)^2}\sum_{k=1}^{n}\alpha_k. \tag{A11}$$
Therefore, (A5) and (A8)-(A11) lead to the assertion in Theorem A1. This completes the proof. □
Proof of Theorem 1. 
Let $\alpha_n := \alpha \in (0,1)$, $\beta_n := \beta = b \in (0,1)$, and $\gamma_n := \gamma \in [0,1/2]$. We show that, for all $\epsilon > 0$ and all $x \in X$,
$$\liminf_{n\to+\infty}V_n(x) \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon}{2\tilde{b}} + \epsilon. \tag{A12}$$
If (A12) does not hold for all $\epsilon > 0$ and all $x \in X$, then there exist $\epsilon_0 > 0$ and $\hat{x} \in X$ such that
$$\liminf_{n\to+\infty}V_n(\hat{x}) > \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \epsilon_0. \tag{A13}$$
Assumptions (A3) and (A4) ensure that there exists $n_0 \in \mathbb{N}$ such that, for all $n \in \mathbb{N}$, $n \ge n_0$ implies that
$$\mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{n,i})\right] \le \frac{d\alpha\epsilon_0}{2}. \tag{A14}$$
Assumptions (A4) and (A5) and (A7) also imply that, for all $n \in \mathbb{N}$,
$$X_n := \mathbb{E}\left[\|x_n-\hat{x}\|_{H_n}^2\right] = \mathbb{E}\left[\sum_{i=1}^{d}h_{n,i}(x_{n,i}-\hat{x}_i)^2\right] \le D\sum_{i=1}^{d}B_i < +\infty. \tag{A15}$$
Moreover, Assumptions (A3) and (A5), (A7), and (A14) ensure that, for all $n \ge n_0$,
$$X_{n+1} - \mathbb{E}\left[\|x_{n+1}-\hat{x}\|_{H_n}^2\right] = \mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{n,i})(x_{n+1,i}-\hat{x}_i)^2\right] \le \frac{Dd\alpha\epsilon_0}{2}. \tag{A16}$$
The condition $\delta \in [0,1)$ and $X_{n+1} < +\infty$ (by (A15)) ensure that there exists $n_1 \in \mathbb{N}$ such that, for all $n \in \mathbb{N}$, $n \ge n_1$ implies that
$$X_{n+1}\delta^{n+1} \le \frac{Dd\alpha\epsilon_0}{2}. \tag{A17}$$
The definition of the limit inferior of $(V_n(\hat{x}))_{n\in\mathbb{N}}$ guarantees that there exists $n_2 \in \mathbb{N}$ such that, for all $n \ge n_2$,
$$\liminf_{n\to+\infty}V_n(\hat{x}) - \frac{1}{2}\epsilon_0 \le V_n(\hat{x}),$$
which, together with (A13), implies that, for all $n \ge n_2$,
$$V_n(\hat{x}) > \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \frac{1}{2}\epsilon_0. \tag{A18}$$
Thus, Lemmas A1 and A2 and (A16) lead to the finding that, for all $n \ge n_3 := \max\{n_0, n_1, n_2\}$,
$$X_{n+1} \le X_n + \frac{Dd\alpha\epsilon_0}{2} - \frac{2\alpha\tilde{b}}{1-\delta^{n+1}}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2, \tag{A19}$$
where $\tilde{b} := 1-b$ and $\tilde{\delta} := 1-\delta$. Hence, from (A17), $1-\delta^{n+1} \le 1$, and $(X_{n+1}-X_n)\delta^{n+1} \le X_{n+1}\delta^{n+1}$ ($n \in \mathbb{N}$), we have, for all $n \ge n_3$,
$$\begin{aligned} X_{n+1} &\le X_n + \frac{Dd\alpha\epsilon_0}{2} - 2\alpha\tilde{b}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2 + X_{n+1}\delta^{n+1}\\ &\le X_n + Dd\alpha\epsilon_0 - 2\alpha\tilde{b}V_n(\hat{x}) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2. \end{aligned}$$
Therefore, (A18) ensures that, for all $n \ge n_3$,
$$\begin{aligned} X_{n+1} &< X_n + Dd\alpha\epsilon_0 - 2\alpha\tilde{b}\left(\frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma + \frac{Dd\epsilon_0}{2\tilde{b}} + \frac{1}{2}\epsilon_0\right) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha\beta + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha\gamma + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha^2\\ &= X_n - \alpha\tilde{b}\epsilon_0\\ &< X_{n_3} - \alpha\tilde{b}\epsilon_0(n+1-n_3). \end{aligned}$$
Since the right-hand side of the above inequality approaches minus infinity as $n$ diverges, we have a contradiction. Hence, (A12) holds for all $\epsilon > 0$ and all $x \in X$. Since $\epsilon$ is arbitrary, we have, for all $x \in X$,
$$\liminf_{n\to+\infty}V_n(x) \le \frac{\tilde{B}^2\tilde{M}^2}{2\tilde{b}\tilde{\delta}^2}\alpha + \frac{\sqrt{Dd}\,\tilde{M}}{\tilde{b}\tilde{\delta}}\beta + \frac{2\sqrt{Dd}\,\hat{M}}{\tilde{\delta}}\gamma,$$
which completes the proof. □
Proof of Theorem 2. 
Let $x \in X$. Lemmas A1 and A2 and (A15), together with a discussion similar to the one for obtaining (A19), ensure that, for all $k \in \mathbb{N}$,
$$X_{k+1} \le X_k + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{k+1,i}-h_{k,i})\right] - 2\alpha_k(1-\beta_k)V_k(x) + \frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}}\alpha_k\beta_k + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha_k^2 + D\sum_{i=1}^{d}B_i\delta^{k+1},$$
which implies that
$$2\alpha_kV_k(x) \le X_k - X_{k+1} + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{k+1,i}-h_{k,i})\right] + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\alpha_k^2 + \left(\frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}} + 2F\right)\alpha_k\beta_k + D\sum_{i=1}^{d}B_i\delta^{k+1},$$
where $F := \sup\{|V_n(x)| : n \in \mathbb{N}\} < +\infty$ holds from Assumptions (A2) and (A5). Summing the above inequality from $k = 0$ to $k = n$ ensures that
$$2\sum_{k=0}^{n}\alpha_kV_k(x) \le X_0 + D\,\mathbb{E}\left[\sum_{i=1}^{d}(h_{n+1,i}-h_{0,i})\right] + \frac{4\sqrt{Dd}\,\hat{M}\tilde{b}}{\tilde{\delta}}\sum_{k=0}^{n}\alpha_k\gamma_k + \frac{\tilde{B}^2\tilde{M}^2}{\tilde{\delta}^2}\sum_{k=0}^{n}\alpha_k^2 + \left(\frac{2\sqrt{Dd}\,\tilde{M}}{\tilde{\delta}} + 2F\right)\sum_{k=0}^{n}\alpha_k\beta_k + D\hat{B}\sum_{k=0}^{n}\delta^{k+1},$$
where $\hat{B} := \sum_{i=1}^{d}B_i$. Let $(\alpha_n)_{n\in\mathbb{N}}$, $(\beta_n)_{n\in\mathbb{N}}$, and $(\gamma_n)_{n\in\mathbb{N}}$ satisfy $\sum_{n=0}^{+\infty}\alpha_n = +\infty$, $\sum_{n=0}^{+\infty}\alpha_n^2 < +\infty$, $\sum_{n=0}^{+\infty}\alpha_n\beta_n < +\infty$, and $\sum_{n=0}^{+\infty}\alpha_n\gamma_n < +\infty$. Assumption (A4) and $\delta \in [0,1)$ imply that
$$\sum_{k=0}^{+\infty}\alpha_kV_k(x) < +\infty. \tag{A20}$$
We prove that, for all $x \in X$, $\liminf_{n\to+\infty}V_n(x) \le 0$. Assume that $\liminf_{n\to+\infty}V_n(x) \le 0$ does not hold for all $x \in X$. Then there exist $\hat{x} \in X$, $\zeta > 0$, and $m_0 \in \mathbb{N}$ such that, for all $n \ge m_0$, $V_n(\hat{x}) \ge \zeta$. Accordingly, (A20) and $\sum_{n=0}^{+\infty}\alpha_n = +\infty$ guarantee that
$$+\infty = \zeta\sum_{k=m_0}^{+\infty}\alpha_k \le \sum_{k=m_0}^{+\infty}\alpha_kV_k(\hat{x}) < +\infty,$$
which is a contradiction. Hence, $\liminf_{n\to+\infty}V_n(x) \le 0$ holds for all $x \in X$.
Let $\alpha_n := 1/n^\eta$ ($\eta \in [1/2, 1)$) and $\beta_n := \beta^n$ ($\beta \in (0,1)$). First, we consider the case where $\gamma_n := \gamma^n$ ($\gamma \in (0,1)$). Then, $\theta_{n+1} \le \theta_n$ ($n \in \mathbb{N}$) and $\limsup_{n\to+\infty}\beta_n < 1$. When $\eta = 1/2$, we have
$$\frac{1}{n\alpha_n} = \frac{1}{\sqrt{n}}$$
and
$$\frac{1}{n}\sum_{k=1}^{n}\alpha_k \le \frac{1}{n}\left(\sum_{k=1}^{n}1^2\right)^{1/2}\left(\sum_{k=1}^{n}\frac{1}{k}\right)^{1/2} \le \sqrt{\frac{1+\ln n}{n}}, \tag{A21}$$
where the first inequality comes from the Cauchy-Schwarz inequality and the second inequality comes from $\sum_{k=1}^{n}(1/k) \le 1+\ln n$. We also have
$$\frac{1}{n}\sum_{k=1}^{n}\beta^k \le \frac{1}{n}\sum_{k=1}^{+\infty}\beta^k = \frac{\beta}{(1-\beta)n}\quad\text{and}\quad \frac{1}{n}\sum_{k=1}^{n}\gamma^k \le \frac{1}{n}\sum_{k=1}^{+\infty}\gamma^k = \frac{\gamma}{(1-\gamma)n}.$$
Therefore, Theorem A1 implies that
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le O\left(\sqrt{\frac{1+\ln n}{n}}\right).$$
In the case where $\eta \in (1/2, 1)$, we have
$$\frac{1}{n\alpha_n} = \frac{1}{n^{1-\eta}}\quad\text{and}\quad \frac{1}{n}\sum_{k=1}^{n}\alpha_k \le \frac{1}{n}\left(\sum_{k=1}^{n}1^2\right)^{1/2}\left(\sum_{k=1}^{n}\frac{1}{k^{2\eta}}\right)^{1/2} \le \sqrt{\frac{B}{n}}, \tag{A22}$$
where $B := \sum_{k=1}^{+\infty}(1/k^{2\eta}) < +\infty$. Therefore, Theorem A1, together with (A22), ensures that
$$\frac{1}{n}\sum_{k=1}^{n}V_k(x) \le O\left(\frac{1}{n^{1-\eta}}\right).$$
Next, we consider the case where $\gamma_n := 1/n^\kappa$ ($\kappa > 1-\eta$). Since $\kappa > 1/2$ holds, an argument similar to the one for obtaining (A22) implies that
$$\frac{1}{n}\sum_{k=1}^{n}\gamma_k = O\left(\frac{1}{\sqrt{n}}\right).$$
The discussion in the above paragraph and Theorem A1 lead to the same convergence rate of $(1/n)\sum_{k=1}^{n}V_k(x)$ as the one for $\gamma_n := \gamma^n$ ($\gamma \in (0,1)$). This completes the proof. □
Proof of Proposition 1. 
Since $F(\cdot,\xi)$ is convex for almost every $\xi \in \Xi$, we have, for all $n \in \mathbb{N}$,
$$\mathbb{E}\left[f(x_n)-f^\star\right] \le V_n(x^\star)\quad\text{and}\quad \mathbb{E}\left[f(\tilde{x}_n)-f^\star\right] \le \frac{1}{n}\sum_{k=1}^{n}\mathbb{E}\left[f(x_k)-f^\star\right] \le \frac{1}{n}\sum_{k=1}^{n}V_k(x^\star),$$
which, together with Theorem 1, leads to Proposition 1. □
Proof of Proposition 2. 
Theorem 2 and the proof of Proposition 1 lead to the finding that $\liminf_{n\to+\infty}\mathbb{E}[f(x_n)-f^\star] = 0$ and $\lim_{n\to+\infty}\mathbb{E}[f(\tilde{x}_n)-f^\star] = 0$. Let $\hat{x} \in X$ be an arbitrary accumulation point of $(\tilde{x}_n)_{n\in\mathbb{N}} \subset X$. Since there exists $(\tilde{x}_{n_i})_{i\in\mathbb{N}} \subset (\tilde{x}_n)_{n\in\mathbb{N}}$ such that $(\tilde{x}_{n_i})_{i\in\mathbb{N}}$ converges almost surely to $\hat{x}$, the continuity of $f$ and $\lim_{n\to+\infty}\mathbb{E}[f(\tilde{x}_n)-f^\star] = 0$ imply that $\mathbb{E}[f(\hat{x})-f^\star] = 0$, and hence, $\hat{x} \in X^\star$. The convergence rate of $\mathbb{E}[f(\tilde{x}_n)-f^\star]$ follows from Theorem A1. □

References

1. Caciotta, M.; Giarnetti, S.; Leccese, F. Hybrid neural network system for electric load forecasting of telecomunication station. In Proceedings of the 19th IMEKO World Congress 2009, Lisbon, Portugal, 6–11 September 2009; Volume 1, pp. 657–661.
2. Caciotta, M.; Giarnetti, S.; Leccese, F.; Orioni, B.; Oreggia, M.; Pucci, C.; Rametta, S. Flavors mapping by Kohonen network classification of Panel Tests of Extra Virgin Olive Oil. Measurement 2016, 78, 366–372.
3. Proietti, A.; Liparulo, L.; Leccese, F.; Panella, M. Shapes classification of dust deposition using fuzzy kernel-based approaches. Measurement 2016, 77, 344–350.
4. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
5. Duchi, J.; Hazan, E.; Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159.
6. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15.
7. Reddi, S.J.; Kale, S.; Kumar, S. On the convergence of Adam and beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 1–23.
8. Iiduka, H. Appropriate learning rates of adaptive learning rate optimization algorithms for training deep neural networks. arXiv 2020, arXiv:2002.09647.
9. Hager, W.H.; Zhang, H. A survey of nonlinear conjugate gradient methods. Pac. J. Optim. 2006, 2, 35–58.
10. Iiduka, H. Acceleration method for convex optimization over the fixed point set of a nonexpansive mapping. Math. Program. 2015, 149, 131–165.
11. Iiduka, H. Hybrid conjugate gradient method for a convex optimization problem over the fixed-point set of a nonexpansive mapping. J. Optim. Theory Appl. 2009, 140, 463–475.
12. Iiduka, H.; Yamada, I. A use of conjugate gradient direction for the convex optimization problem over the fixed point set of a nonexpansive mapping. SIAM J. Optim. 2009, 19, 1881–1893.
13. Iiduka, H. Three-term conjugate gradient method for the convex optimization problem over the fixed point set of a nonexpansive mapping. Appl. Math. Comput. 2011, 217, 6315–6327.
14. Kobayashi, Y.; Iiduka, H. Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning. arXiv 2020, arXiv:2003.00231.
15. Bauschke, H.H.; Combettes, P.L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces; Springer: New York, NY, USA, 2011.
16. Facchinei, F.; Pang, J.S. Finite-Dimensional Variational Inequalities and Complementarity Problems I; Springer: New York, NY, USA, 2003.
17. Nemirovski, A.; Juditsky, A.; Lan, G.; Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 2009, 19, 1574–1609.
18. Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 1964, 4, 1–17.
19. Sutskever, I.; Martens, J.; Dahl, G.; Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 16–21 June 2013; pp. 1–14.
20. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
21. Iiduka, H. Stochastic fixed point optimization algorithm for classifier ensemble. IEEE Trans. Cybern. 2020, 50, 4370–4380.
22. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
Figure 1. Loss function value versus number of epochs on the CIFAR-10 dataset for training (constant).
Figure 2. Classification error rate versus number of epochs on the CIFAR-10 dataset for training (constant).
Figure 3. Classification error rate versus number of epochs on the CIFAR-10 dataset for testing (constant).
Figure 4. Loss function value versus number of epochs on the CIFAR-10 dataset for training (diminishing).
Figure 5. Classification error rate versus number of epochs on the CIFAR-10 dataset for training (diminishing).
Figure 6. Classification error rate versus number of epochs on the CIFAR-10 dataset for testing (diminishing).
Figure 7. Loss function value versus number of epochs on the IMDb dataset for training (constant).
Figure 8. Classification error rate versus number of epochs on the IMDb dataset for training (constant).
Figure 9. Classification error rate versus number of epochs on the IMDb dataset for testing (constant).
Figure 10. Loss function value versus number of epochs on the IMDb dataset for training (diminishing).
Figure 11. Classification error rate versus number of epochs on the IMDb dataset for training (diminishing).
Figure 12. Classification error rate versus number of epochs on the IMDb dataset for testing (diminishing).
Table 1. Mean and variance of elapsed time per epoch for the existing algorithms and Algorithm 1 on the CIFAR-10 dataset.

                      Existing   C1         C2         C3         CG-C1      CG-C2      CG-C3
Momentum  mean        14.815106  14.766352  14.643343  14.191675  14.370240  14.536258  13.732973
          variance     0.268979   1.144346   0.268576   0.363746   0.180754   0.872769   0.314055
Adam      mean        17.621361  17.388947  18.511805  18.084771  18.106918  18.108820  17.127479
          variance     0.149553   0.044539   1.392942   0.056606   0.213341   0.063594   1.317213
AMSGrad   mean        18.122551  17.650377  17.796328  19.335775  18.855297  18.272888  16.328777
          variance     1.245563   0.313088   0.289944   4.738541   2.820650   1.671705   1.754373
Table 2. Results of t-test on the training error rates of the existing algorithms (Momentum, Adam, and AMSGrad) and Algorithm 1 (Ci and CG-Ci (i = 1, 2, 3)) on the CIFAR-10 dataset (significance level is 5%; the p-values for the proposed algorithms with significantly low error rates are indicated in bold).

                         C1             C2            C3             CG-C1          CG-C2         CG-C3
Momentum    t-statistic  3.70879        0.23783       −13.65314      3.34063        −0.17214      −12.77890
(Existing)  p-value      2.38 × 10^−4   8.12 × 10^−1  4.44 × 10^−35  9.15 × 10^−4   8.63 × 10^−1  1.43 × 10^−31
Adam        t-statistic  −10.46006      0.03599       0.20774        −6.70248       0.37493       0.04342
(Existing)  p-value      8.73 × 10^−23  9.71 × 10^−1  8.36 × 10^−1   7.03 × 10^−11  7.08 × 10^−1  9.65 × 10^−1
AMSGrad     t-statistic  −157.96917     −1.59278      −0.16230       −157.97057     −1.59440      −0.00869
(Existing)  p-value      0.00 × 10^0    1.12 × 10^−1  8.71 × 10^−1   0.00 × 10^0    1.12 × 10^−1  9.93 × 10^−1
Table 3. Mean and variance of elapsed time per epoch for the existing algorithms and Algorithm 1 on the IMDb dataset.

                      Existing   C1         C2         C3         CG-C1      CG-C2      CG-C3
Momentum  mean        19.029660  18.999186  18.957496  19.098836  19.241769  19.286854  18.671163
          variance     0.095132   0.074935   0.107259   0.196841   0.035649   0.058319   0.003906
Adam      mean        20.256827  20.194220  20.193260  20.260705  20.231550  20.388470  19.536741
          variance     0.061552   0.023485   0.041777   0.060461   0.039103   0.174818   0.165803
AMSGrad   mean        20.109489  20.092463  20.102763  20.025613  20.146646  20.136673  19.335856
          variance     0.075432   0.059149   0.059561   0.089540   0.113563   0.098914   0.003543
Table 4. Results of t-test on the training error rates of the existing algorithms (Momentum, Adam, and AMSGrad) and Algorithm 1 (Ci and CG-Ci (i = 1, 2, 3)) on the IMDb dataset (significance level is 5%; the p-values for the proposed algorithms with significantly low error rates are indicated in bold).

                         C1              C2             C3            CG-C1           CG-C2          CG-C3
Momentum    t-statistic  13.87142        0.63115        −4.59306      13.22951        1.71477        −4.59306
(Existing)  p-value      5.17 × 10^−31   5.29 × 10^−1   7.76 × 10^−6  4.82 × 10^−29   8.80 × 10^−2   7.76 × 10^−6
Adam        t-statistic  −63.39972       −11.01275      −0.00287      −63.33435       −9.41552       0.11707
(Existing)  p-value      1.79 × 10^−133  2.61 × 10^−22  9.98 × 10^−1  2.17 × 10^−133  1.24 × 10^−17  9.07 × 10^−1
AMSGrad     t-statistic  −63.53084       −5.68706       −0.63240      −63.42279       −7.93451       −0.06863
(Existing)  p-value      1.21 × 10^−133  4.59 × 10^−8   5.28 × 10^−1  1.67 × 10^−133  1.53 × 10^−13  9.45 × 10^−1
