
# An Active Set Limited Memory BFGS Algorithm for Machine Learning

1 Center for Applied Mathematics of Guangxi, College of Mathematics and Information Science, Guangxi University, Nanning 530004, China
2 School of Mathematics and Statistics, Baise University, Baise 533000, China
3 School of Business, Suzhou University of Science and Technology, Suzhou 215011, China
* Authors to whom correspondence should be addressed.
Symmetry 2022, 14(2), 378; https://doi.org/10.3390/sym14020378
Submission received: 21 December 2021 / Revised: 8 January 2022 / Accepted: 17 January 2022 / Published: 14 February 2022

## Abstract

In this paper, a stochastic quasi-Newton algorithm for nonconvex stochastic optimization is presented. It is derived from a classical modified BFGS formula, and the update can be extended to a limited memory scheme. Numerical experiments on several machine learning problems are given, and the results show that the proposed algorithm is promising.
MSC:
62L20; 90C30; 90C15; 90C60

## 1. Introduction

Machine learning is an interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and so on. In machine learning, one usually constructs an appropriate model from an extraordinarily large amount of data. Therefore, traditional deterministic optimization algorithms are no longer suitable for machine learning problems, and a stochastic algorithm must be used to solve the model optimization problems we encounter in machine learning.
The following type of problem is considered in machine learning:
$\min_{x \in \mathbb{R}^d} f(x) = \mathbb{E}[T(x, \tau)],$
where $T : \mathbb{R}^{D_x} \times \mathbb{R}^n \to \mathbb{R}$ is a continuously differentiable function, $\mathbb{E}[\cdot]$ denotes the expectation taken with respect to $\tau$, and $\tau$ is a random variable with distribution function $P$. In most practical cases, the function $T(\cdot, \tau)$ is not given explicitly and, even worse, the distribution function $P$ may also be unknown. The objective function (1) is therefore approximated by the empirical mean
$f(x) = \frac{1}{N} \sum_{i=1}^{N} f_i(x),$
where $f_i : \mathbb{R}^{D_x} \to \mathbb{R}$ is the loss function corresponding to the $i$th data sample, and $N$ denotes the number of data samples, which is assumed to be extremely large.
The stochastic approximation (SA) algorithm of Robbins and Monro [1] is usually used to solve such problems. The original SA algorithm is also called stochastic gradient descent (SGD). It is somewhat similar to the classical steepest descent method and adopts the iterative process $x_{k+1} = x_k - \alpha_k g_k$, where the stochastic gradient $g_k$ approximates the full gradient $\nabla f$ of $f$ at $x_k$, and $\alpha_k$ is the step size (learning rate). The SA algorithm has been studied in depth by many scholars [2,3,4].
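The SGD iteration above can be sketched in a few lines. The mini-batch sampling, the decaying learning rate $\alpha_k = \alpha_0 / k$, and the toy least-squares problem below are illustrative choices for the sketch, not part of the original method:

```python
import numpy as np

def sgd(grad_fn, x0, data, alpha0=0.1, batch_size=32, epochs=5, seed=0):
    """Plain SGD sketch: x_{k+1} = x_k - alpha_k * g_k, where g_k is a
    mini-batch estimate of the full gradient.  grad_fn(x, batch) and the
    schedule alpha_k = alpha0 / k are illustrative choices."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    n, k = len(data), 0
    for _ in range(epochs):
        idx = rng.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            k += 1
            batch = data[idx[start:start + batch_size]]
            g = grad_fn(x, batch)                # stochastic gradient g_k
            x = x - (alpha0 / k) * g             # step size alpha_k = alpha0/k
    return x

# Toy least-squares problem: f(x) = (1/2) * mean_i (a_i^T x - b_i)^2
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])
data = np.hstack([A, b[:, None]])
grad = lambda x, batch: batch[:, :2].T @ (batch[:, :2] @ x - batch[:, 2]) / len(batch)
x_star = sgd(grad, np.zeros(2), data, alpha0=0.5, epochs=200)
```

On this tiny problem the whole data set fits into one batch, so the sketch reduces to gradient descent with a diminishing step; with a larger data set each inner step touches only `batch_size` samples.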
In this paper, we mainly study stochastic second-order methods, that is, stochastic quasi-Newton methods (SQN), to solve problem (2). Among traditional optimization methods, quasi-Newton methods have a faster convergence speed and higher convergence accuracy than first-order methods because they use approximate second-order derivative information. A quasi-Newton method is usually updated by the iterative formula
$x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k) \quad \text{or} \quad x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k),$
where $B_k$ is a symmetric positive definite approximation of the Hessian matrix $\nabla^2 f(x_k)$, or $H_k$ is a symmetric positive definite approximation of $[\nabla^2 f(x_k)]^{-1}$. In the traditional BFGS algorithm, the iterative formula for $B_k$ is
$B_k = B_{k-1} + \frac{y_{k-1} y_{k-1}^T}{s_{k-1}^T y_{k-1}} - \frac{B_{k-1} s_{k-1} s_{k-1}^T B_{k-1}}{s_{k-1}^T B_{k-1} s_{k-1}},$
where $s_{k-1} = x_k - x_{k-1} = \alpha_k d_k$ and $y_{k-1} = \nabla f(x_k) - \nabla f(x_{k-1})$. If the Sherman–Morrison–Woodbury formula is applied, the iterative formula for $H_k$ is easily obtained:
$H_k = \left(I - \frac{s_{k-1} y_{k-1}^T}{s_{k-1}^T y_{k-1}}\right) H_{k-1} \left(I - \frac{y_{k-1} s_{k-1}^T}{s_{k-1}^T y_{k-1}}\right) + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_{k-1}}.$
It is very important to use a limited memory variant for large-scale problems. The so-called L-BFGS [5] algorithm has a linear convergence rate. It produces well-scaled and productive search directions that yield an approximate solution in fewer iterations and function evaluations. In stochastic optimization, many stochastic quasi-Newton formulas have been proposed.
The L-BFGS method has the iteration rule $x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k)$ and updates $H_k$ by the following rule:
$\begin{aligned} H_k &= Q_{k-1}^T H_{k-1} Q_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^T \\ &= Q_{k-1}^T \left[ Q_{k-2}^T H_{k-2} Q_{k-2} + \rho_{k-2} s_{k-2} s_{k-2}^T \right] Q_{k-1} + \rho_{k-1} s_{k-1} s_{k-1}^T \\ &= \cdots \\ &= \left[ Q_{k-1}^T \cdots Q_{k-r+1}^T \right] H_{k-r+1} \left[ Q_{k-r+1} \cdots Q_{k-1} \right] \\ &\quad + \rho_{k-r+1} \left[ Q_{k-1}^T \cdots Q_{k-r+2}^T \right] s_{k-r+1} s_{k-r+1}^T \left[ Q_{k-r+2} \cdots Q_{k-1} \right] + \cdots + \rho_{k-1} s_{k-1} s_{k-1}^T, \end{aligned}$
where $Q_{k-1} = I - \rho_{k-1} y_{k-1} s_{k-1}^T$, $\rho_{k-1} = \frac{1}{s_{k-1}^T y_{k-1}}$, and $r$ is the memory size. Bordes, Bottou, and Gallinari studied a quasi-Newton method with a diagonal rescaling matrix based on the secant condition in [6]. In [7], Byrd et al. proposed a stochastic L-BFGS method based on SA and proved its convergence for strongly convex problems. In [8], Gower, Goldfarb, and Richtárik proposed a variance-reduced block L-BFGS method that converges linearly for convex functions. It is worth noting that, in the above quasi-Newton methods, the convergence analysis requires the objective to be convex or strongly convex.
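In practice the recursive expansion of $H_k$ above is never formed explicitly; the standard two-loop recursion of L-BFGS computes the product $H_k \nabla f$ directly from the stored pairs. The sketch below assumes the plain (unmodified) pairs $(s_j, y_j)$ and a scalar initial matrix $\gamma I$:

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list, gamma=1.0):
    """L-BFGS two-loop recursion: computes H_k @ grad implicitly from the
    last r curvature pairs (s_j, y_j), oldest first, never forming H_k.
    gamma scales the initial matrix H_{k-r} = gamma * I."""
    q = grad.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (s @ y)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
        rhos.append(rho)
    q *= gamma                                             # apply initial matrix
    for (s, y), a, rho in zip(zip(s_list, y_list),
                              reversed(alphas), reversed(rhos)):  # oldest first
        beta = rho * (y @ q)
        q += (a - beta) * s
    return q                                               # = H_k @ grad
```

With one stored pair this reproduces exactly the rank-two update $(I - \rho s y^T) \gamma I (I - \rho y s^T) + \rho s s^T$ applied to the gradient, at $O(rd)$ cost instead of $O(d^2)$.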
If the objective function itself does not have the property of convexity, there are several problems that the LBFGS method has difficulty overcoming:
• How can we guarantee the positive definiteness of iterative matrix $H k$ without line search?
• How can we guarantee the convergence of the proposed L-BFGS method?
These problems seem particularly difficult. However, a modified stochastic limited-memory BFGS method (LMLBFGS) is proposed here to overcome them. On this basis, a further improved algorithm (LMLBFGS-VR) is proposed. Note that the presented algorithm can also be adapted to approximate the solution of a nonlinear system of equations as in [9].
This paper is divided into five parts: in Section 2, the LMLBFGS and LMLBFGS-VR are presented and their convergence properties are discussed in Section 3. In Section 4, the numerical experiments of the proposed algorithm are given. A summary is given in the last part.

## 2. Premise Setting and Algorithm

In this part, a new limited-memory BFGS algorithm (LMLBFGS) is proposed, which automatically generates a positive definite matrix $B_k$.

#### 2.1. LMLBFGS Algorithm

To solve this kind of problem, suppose that $E \subset \mathbb{R}^n$ does not depend on $x$ and that the stochastic gradient $g(x, \tau)$ at $x$ is generated by a stochastic first-order oracle (SFO), for which the distribution of $\tau$ is supported on $E \subset \mathbb{R}^n$. It is common to use a mini-batch stochastic gradient at the $k$-th iteration, defined as
$g_k = \frac{1}{z_k} \sum_{i \in Z_k} g(x_k, \tau_{k,i}) = \frac{1}{z_k} \sum_{i \in Z_k} \nabla f_i(x_k),$
and a sub-sampled Hessian defined as
$G_k = \frac{1}{z_k^*} \sum_{i \in Z_k^*} G(x_k, \tau_{k,i}) = \frac{1}{z_k^*} \sum_{i \in Z_k^*} \nabla^2 f_i(x_k).$
Here, $Z_k$ and $Z_k^*$ are the sample subsets, $z_k$ and $z_k^*$ are the cardinalities of $Z_k$ and $Z_k^*$, and $\tau_{k,i}$ is a random variable. From this definition, it is not difficult to see that a stochastic gradient under this setting can be computed much faster than the full gradient. We assume here that the SFO can sample $\tau_k$ independently of $x_k$ and generate the output $g(x_k, \tau_{k,i})$. The stochastic gradient difference and the iterate difference are defined as
$y_k = g_k - g_{k-1} = \frac{1}{z_k} \sum_{i \in Z_k} g(x_k, \tau_{k,i}) - \frac{1}{z_{k-1}} \sum_{i \in Z_{k-1}} g(x_{k-1}, \tau_{k-1,i}),$
$s_k = x_k - x_{k-1}.$
In traditional (deterministic) methods, the authors of [10] proposed a new $\bar{y}_k$ given by
$\bar{y}_k = y_k + \lambda_k s_k,$
where
$\lambda_k = \frac{2[f(x_{k-1}) - f(x_k)] + (g_k + g_{k-1})^T s_k}{(s_k^T y_k)^2} \cdot (y_k y_k^T).$
Inspired by their methods, we introduce the following new definitions:
$y_k^* = y_k + \lambda_k s_k,$
where
$\lambda_k = \frac{2[f(x_{k-1}) - f(x_k)] + (g_k + g_{k-1})^T s_k}{\max\{(s_k^T y_k)^2, \|s_k\|^4\}} \cdot (y_k y_k^T).$
Our $\lambda_k$ is guaranteed to be well defined because $\max\{(s_k^T y_k)^2, \|s_k\|^4\} > 0$.
Hence, our stochastic L-BFGS algorithm updates $B_k$ as
$B_k = B_{k-1} + \frac{y_{k-1}^* (y_{k-1}^*)^T}{s_{k-1}^T y_{k-1}^*} - \frac{B_{k-1} s_{k-1} s_{k-1}^T B_{k-1}}{s_{k-1}^T B_{k-1} s_{k-1}}.$
Using the Sherman–Morrison–Woodbury formula, we can update $H_k = B_k^{-1}$ as
$H_k = \left(I - \frac{s_{k-1} (y_{k-1}^*)^T}{s_{k-1}^T y_{k-1}^*}\right) H_{k-1} \left(I - \frac{y_{k-1}^* s_{k-1}^T}{s_{k-1}^T y_{k-1}^*}\right) + \frac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_{k-1}^*}.$
Through simple observation, we find that, when the function is nonconvex, we cannot guarantee that $s_k^T y_k^* > 0$ holds. Thus, we add an additional safeguard to the algorithm to ensure the nonnegativity of $s_k^T y_k^*$. Define the index set $K$ as
$K = \{k : s_k^T y_k^* \ge m \|s_k\|^2\},$
where $m$ is a positive constant.
Hence, our modified stochastic L-BFGS algorithm updates (18) and (19):
$B_k = \begin{cases} B_{k-1} + \dfrac{y_{k-1}^* (y_{k-1}^*)^T}{s_{k-1}^T y_{k-1}^*} - \dfrac{B_{k-1} s_{k-1} s_{k-1}^T B_{k-1}}{s_{k-1}^T B_{k-1} s_{k-1}}, & \text{if } k \in K, \\ B_{k-1}, & \text{otherwise}, \end{cases}$
$H_k = \begin{cases} \left(I - \dfrac{s_{k-1} (y_{k-1}^*)^T}{s_{k-1}^T y_{k-1}^*}\right) H_{k-1} \left(I - \dfrac{y_{k-1}^* s_{k-1}^T}{s_{k-1}^T y_{k-1}^*}\right) + \dfrac{s_{k-1} s_{k-1}^T}{s_{k-1}^T y_{k-1}^*}, & \text{if } k \in K, \\ H_{k-1}, & \text{otherwise}. \end{cases}$
It is well known that the cost of computing $H_k$ through (19) is very high when $n$ is tremendously large. Hence, the L-BFGS method is usually used instead of the BFGS method to overcome the burden of computation in large-scale optimization problems. The advantage of L-BFGS is that it uses only recent curvature information and does not need to store the full update matrix, which effectively reduces the computational cost: use (6) to iterate
$H_{k,i} = (I - \rho_j s_j (y_j^*)^T) H_{k,i-1} (I - \rho_j y_j^* s_j^T) + \rho_j s_j s_j^T, \quad j = k - (r - i), \; i = 0, \ldots, r-1,$
where $\rho_j = 1/(s_j^T y_j^*)$. The initial matrix is often chosen as $H_{k,0} = \frac{s_{k-1}^T y_{k-1}^*}{(y_{k-1}^*)^T y_{k-1}^*} I$. Because $s_{k-1}^T y_{k-1}^*$ may be exceedingly close to 0, we instead set
$H_{k,0} = \gamma_k^{-1} I,$
where
$\gamma_k = \max\left\{\frac{\|y_{k-1}^*\|^2}{s_{k-1}^T y_{k-1}^*}, \delta\right\} \ge \delta,$
and $\delta$ is a given constant.
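The construction above — the modified difference $y_k^*$, the safeguarded $\lambda_k$, the acceptance test defining $K$, and the scaling $\gamma_k$ — can be sketched as follows. The default values of $m$ and $\delta$ are illustrative, not those of the paper:

```python
import numpy as np

def modified_pair(s, y, f_prev, f_cur, g_prev, g_cur, m=1e-5, delta=1e-2):
    """Sketch of the safeguarded curvature pair y* = y + lambda_k s.
    Returns the pair, the acceptance flag (k in K), and the scaling
    gamma_k used for the initial matrix H_{k,0} = gamma_k^{-1} I."""
    num = 2.0 * (f_prev - f_cur) + (g_cur + g_prev) @ s
    denom = max((s @ y) ** 2, np.linalg.norm(s) ** 4)   # keeps lambda_k finite
    # lambda_k s = c * (y y^T) s = c * (y^T s) * y, with c = num / denom
    y_star = y + (num / denom) * (y @ s) * y
    accept = (s @ y_star) >= m * (s @ s)                # curvature test for K
    gamma = max((y_star @ y_star) / (s @ y_star), delta) if accept else delta
    return y_star, accept, gamma
```

For a convex quadratic $f(x) = \|x\|^2/2$ the correction term vanishes and the pair is always accepted; the safeguard only becomes active on nonconvex segments where $s_k^T y_k$ can be small or negative.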
Therefore, our modified stochastic L-BFGS algorithm is outlined in Algorithm 1.
Algorithm 1: Modified stochastic L-BFGS algorithm (LMLBFGS).
Input: $x_1 \in \mathbb{R}^n$, batch size $z_k$, step sizes $\alpha_k$, memory size $r$, a positive definite matrix $H_1$, and a positive constant $\delta$.
1: for $k = 1, 2, \ldots$ do
2:  Compute $g_k$ by (7) and the matrix $H_k$ by Algorithm 2;
3:  Compute the iteration point $x_{k+1} = x_k - \alpha_k H_k g_k$;
4: end for

#### 2.2. Extension of Our LMLBFGS Algorithm with Variance Reduction

Recently, variance reduction techniques have been used in stochastic optimization methods to give algorithms better properties. Motivated by the development of the SVRG method for nonconvex problems, we present a new modified stochastic L-BFGS algorithm with a variance reduction technique (called LMLBFGS-VR) for a faster convergence speed, as shown in Algorithm 3.
In LMLBFGS-VR, the mini-batch stochastic gradient is defined as
$g(x) = \frac{1}{|Z|} \sum_{i \in Z} \nabla f_i(x), \quad Z \subset \{1, 2, \ldots, n\}.$
Algorithm 2: Hessian matrix updating.
Input: correction pairs $(s_j, y_j^*)$, memory parameter $r$, with $j = k - (r - i)$, $i = 0, \ldots, r-1$.
Output: new $H_k$.
1: $H = \frac{s_k^T y_k^*}{(y_k^*)^T y_k^*} I$
2: for $i = 0, \ldots, r-1$, $j = k - (r - i)$ do
3:  $m_j = s_j^T y_j^* - m \|s_j\|^2$, $\rho_j = \frac{1}{s_j^T y_j^*}$
4:  if $m_j > 0$ then
5:   $H = (I - \rho_j s_j (y_j^*)^T) H (I - \rho_j y_j^* s_j^T) + \rho_j s_j s_j^T$
6:  end if
7: end for
8: return $H_k = H$
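A direct dense-matrix sketch of Algorithm 2 (for clarity, not efficiency; an actual limited-memory implementation avoids forming $H$): the pair-skipping test keeps $H$ positive definite without any convexity assumption. The initial scaling shown is the $\frac{s^T y^*}{(y^*)^T y^*} I$ choice mentioned in the text:

```python
import numpy as np

def update_hessian_inverse(pairs, m=1e-5):
    """Dense sketch of Algorithm 2: rebuild H_k from the last r pairs
    (s_j, y_j*), oldest first.  A pair is skipped whenever the safeguard
    m_j = s_j^T y_j* - m ||s_j||^2 is not positive, which preserves the
    positive definiteness of H."""
    s_last, y_last = pairs[-1]
    n = len(s_last)
    H = ((s_last @ y_last) / (y_last @ y_last)) * np.eye(n)  # H_{k,0} scaling
    I = np.eye(n)
    for s, y in pairs:                                       # oldest pair first
        if s @ y - m * (s @ s) > 0:                          # accept only if m_j > 0
            rho = 1.0 / (s @ y)
            V = I - rho * np.outer(s, y)
            H = V @ H @ V.T + rho * np.outer(s, s)           # BFGS inverse update
    return H
```

Each accepted update costs $O(n^2)$ here; the two-loop recursion sketched earlier delivers the same product $H_k g_k$ in $O(rn)$.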
Algorithm 3: Modified stochastic L-BFGS algorithm with variance reduction (LMLBFGS-VR).
Input: $\bar{x}_0 \in \mathbb{R}^n$, $H_0 = I$, batch size $z_k$, step sizes $\alpha_k$, memory size $r$, and a constant $\delta > 0$.
Output: iterate $x$ chosen uniformly at random from $\{l_u^{k+1} : u = 0, \ldots, q-1;\ k = 0, \ldots, N-1\}$.
1: for $k = 0, 1, \ldots, N-1$ do
2:  $l_0^{k+1} = \bar{x}_k$
3:  Compute $\nabla f(\bar{x}_k)$;
4:  for $u = 0, 1, \ldots, q-1$ do
5:   Sample a mini-batch $Z$ with $|Z| = |Z_k|$;
6:   Calculate $g_u^{k+1} = \nabla f_Z(l_u^{k+1}) - \nabla f_Z(\bar{x}_k) + \nabla f(\bar{x}_k)$, where $\nabla f_Z(l_u^{k+1}) = \frac{1}{|Z|} \sum_{i \in Z} \nabla f_i(l_u^{k+1})$;
7:   Compute $l_{u+1}^{k+1} = l_u^{k+1} - \alpha_k H_{k+1} g_u^{k+1}$;
8:  end for
9:  Generate the updated matrix $H_{k+1}$ by Algorithm 2;
10: $\bar{x}_{k+1} = l_q^{k+1}$
11: end for
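The variance-reduced gradient of step 6 in Algorithm 3 can be sketched independently of the rest of the loop; `grads` is a hypothetical list of per-sample gradient functions, and in the actual algorithm the full gradient at the anchor point is computed once per outer iteration rather than per call:

```python
def vr_gradient(grads, x_inner, x_anchor, batch_idx):
    """SVRG-style variance-reduced gradient, as in step 6 of Algorithm 3:
    g = grad_Z(l) - grad_Z(x_bar) + full_grad(x_bar).
    grads[i](x) returns the i-th component gradient (illustrative API)."""
    n = len(grads)
    full = sum(g(x_anchor) for g in grads) / n            # grad f(x_bar)
    gZ_inner = sum(grads[i](x_inner) for i in batch_idx) / len(batch_idx)
    gZ_anchor = sum(grads[i](x_anchor) for i in batch_idx) / len(batch_idx)
    return gZ_inner - gZ_anchor + full
```

The estimator stays unbiased, and its variance shrinks as the inner iterate approaches the anchor point: at `x_inner == x_anchor` the mini-batch terms cancel and the full gradient is returned exactly.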

## 3. Global Convergence Analysis

In this section, the convergence of Algorithms 1 and 3 will be discussed and analyzed.

#### 3.1. Basic Assumptions

In the algorithm, it is assumed that the step size satisfies
$\sum_{k=1}^{+\infty} \alpha_k = +\infty, \qquad \sum_{k=1}^{+\infty} \alpha_k^2 < +\infty.$
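For example, the diminishing step size used later in Assumption 5 satisfies both conditions:

```latex
\alpha_k = c\,k^{-\beta}, \quad \beta \in (0.5, 1):
\qquad
\sum_{k=1}^{\infty} \alpha_k = c \sum_{k=1}^{\infty} k^{-\beta} = +\infty
\ \ (\text{since } \beta < 1),
\qquad
\sum_{k=1}^{\infty} \alpha_k^2 = c^2 \sum_{k=1}^{\infty} k^{-2\beta} < +\infty
\ \ (\text{since } 2\beta > 1).
```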
Assumption 1.
$f : \mathbb{R}^{D_x} \to \mathbb{R}$ is continuously differentiable, $f(x)$ is bounded below for any $x \in \mathbb{R}^n$, and $\nabla f$ is Lipschitz continuous; that is, there is a constant $L > 0$ such that
$\|\nabla f(l_1) - \nabla f(l_2)\| \le L \|l_1 - l_2\|$
for any $l_1, l_2 \in \mathbb{R}^n$.
Assumption 2.
The gradient estimate is unbiased and its noise level $\sigma$ is bounded; that is,
$E_{\tau_k}[g(x_k, \tau_k)] = \nabla f(x_k),$
$E_{\tau_k}[\|g(x_k, \tau_k) - \nabla f(x_k)\|^2] \le \sigma^2,$
where $\sigma > 0$ and $E_{\tau_k}[\cdot]$ denotes the expectation taken with respect to $\tau_k$.
Assumption 3.
There are positive $h 1$ and $h 2$ such that
$h_1 I \preceq H_k \preceq h_2 I.$
Our random variables are defined as follows: $\tau_k = (\tau_{k,1}, \ldots, \tau_{k,z_k})$ collects the random samples of the $k$-th iteration, and $\tau_{[k]} = (\tau_1, \ldots, \tau_k)$ collects the random samples of the first $k$ iterations.
Assumption 4.
For any $k \ge 2$, the random variable $H_k$ depends only on $\tau_{[k-1]}$.
From Assumptions 2 and 4, we can get
$E[H_k g_k \mid \tau_{[k-1]}] = H_k \nabla f(x_k).$

#### 3.2. Key Propositions, Lemmas, and Theorem

Lemma 1.
If Assumptions 1–4 hold and $\alpha_k \le \frac{h_1}{L h_2^2}$ for all $k$, we have
$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \frac{1}{2} \alpha_k h_1 \|\nabla f(x_k)\|^2 + \frac{L \sigma^2 h_2^2}{2 z_k} \alpha_k^2,$
where the conditional expectation is taken with respect to $\tau_k$.
Proof.
$\begin{aligned} f(x_{k+1}) &\le f(x_k) + \langle \nabla f(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2 \\ &= f(x_k) - \alpha_k \langle \nabla f(x_k), H_k g_k \rangle + \frac{L \alpha_k^2}{2} \|H_k g_k\|^2 \\ &\le f(x_k) - \alpha_k \langle \nabla f(x_k), H_k \nabla f(x_k) \rangle - \alpha_k \langle \nabla f(x_k), H_k (g_k - \nabla f(x_k)) \rangle + \frac{L \alpha_k^2 h_2^2}{2} \|g_k\|^2. \end{aligned}$
Taking the expectation with respect to $\tau_k$ on both sides of (31), conditioned on $x_k$, we obtain
$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \alpha_k \langle \nabla f(x_k), H_k \nabla f(x_k) \rangle + \frac{L \alpha_k^2 h_2^2}{2} E[\|g_k\|^2 \mid x_k],$
where we use the fact that $E[(g_k - \nabla f(x_k)) \mid x_k] = 0$. From Assumption 2, it follows that
$\begin{aligned} E[\|g_k\|^2 \mid x_k] &= E[\|g_k - \nabla f(x_k) + \nabla f(x_k)\|^2 \mid x_k] \\ &= \|\nabla f(x_k)\|^2 + 2 E[\langle \nabla f(x_k), g_k - \nabla f(x_k) \rangle \mid x_k] + E[\|g_k - \nabla f(x_k)\|^2 \mid x_k] \\ &= \|\nabla f(x_k)\|^2 + E[\|g_k - \nabla f(x_k)\|^2 \mid x_k] \\ &\le \|\nabla f(x_k)\|^2 + \frac{\sigma^2}{z_k}. \end{aligned}$
Together with (32), we have
$E[f(x_{k+1}) \mid x_k] \le f(x_k) - \left(\alpha_k h_1 - \frac{L}{2} \alpha_k^2 h_2^2\right) \|\nabla f(x_k)\|^2 + \frac{L \sigma^2 h_2^2}{2 z_k} \alpha_k^2.$
Then, combining this with $\alpha_k \le \frac{h_1}{L h_2^2}$ implies (30). □
Before proceeding further, the definition of supermartingale will be introduced [11].
Definition 1.
Let $\{\mathcal{L}_k\}$ be an increasing sequence of σ-algebras. If $\{W_k\}$ is a stochastic process satisfying
(1)
$E[|W_k|] < \infty$,
(2)
$W_k$ is $\mathcal{L}_k$-measurable and $E[W_{k+1} \mid \mathcal{L}_k] \le W_k$ for all $k$,
then ${ W k }$ is called a supermartingale.
Proposition 1.
If $\{W_k\}$ is a nonnegative supermartingale, then $W_k \to W$ almost surely as $k \to \infty$, and $E[W] \le E[W_0]$.
Lemma 2.
Let ${ x k }$ be generated by Algorithm 1, where the batch size $z k = z ,$ for all k. Then, there is a constant $M 0$ such that
$E [ f ( x k ) ] ≤ M 0$
for all k.
Proof.
For convenience of explanation, we define
$w_k = \frac{1}{2} \alpha_k h_1 \|\nabla f(x_k)\|^2, \qquad \psi_k = f(x_k) + \frac{L \sigma^2 h_2^2}{2z} \sum_{i=k}^{\infty} \alpha_i^2.$
Let $\underline{f}$ be the lower bound of the function and $W_k$ be the σ-algebra generated by $x_k$, $w_k$, and $\psi_k$. From the definition and Lemma 1, we obtain
$E[\psi_{k+1} \mid W_k] = E[f(x_{k+1}) \mid W_k] + \frac{L \sigma^2 h_2^2}{2z} \sum_{i=k+1}^{\infty} \alpha_i^2 \le f(x_k) - \frac{1}{2} \alpha_k h_1 \|\nabla f(x_k)\|^2 + \frac{L \sigma^2 h_2^2}{2z} \sum_{i=k}^{\infty} \alpha_i^2 = \psi_k - w_k.$
Hence, we obtain
$E[\psi_{k+1} - \underline{f} \mid W_k] \le \psi_k - w_k - \underline{f}.$
As a result, we have
$0 \le E[\psi_{k+1} - \underline{f}] \le \psi_1 - \underline{f} < \infty.$
Therefore, $E[f(x_k)] \le E[\psi_k] \le \psi_1 =: M_0$ for all $k$. □
□

#### 3.3. Global Convergence Theorem

In this part, we provide the convergence analysis of the proposed Algorithms 1 and 3.
Theorem 1.
Assume that Assumptions 1–4 hold for $\{x_k\}$ generated by Algorithm 1, where the batch size is $z_k = z$, and the step size satisfies (24) and $\alpha_k \le \frac{h_1}{L h_2^2}$. Then, we have
$\liminf_{k \to \infty} E[\|\nabla f(x_k)\|^2] = 0 \quad \text{with probability } 1.$
Proof.
According to Definition 1, $\psi_k - \underline{f}$ is a nonnegative supermartingale. Hence, there exists a $\psi$ such that $\lim_{k \to \infty} \psi_k = \psi$ with probability 1, and $E[\psi] \le E[\psi_1]$ (Proposition 1). From (36), we have $E[w_k] \le E[\psi_k] - E[\psi_{k+1}]$. Thus,
$E\left[\sum_{k=1}^{\infty} w_k\right] \le \sum_{k=1}^{\infty} \big(E[\psi_k] - E[\psi_{k+1}]\big) < \infty,$
which means that
$\sum_{k=1}^{\infty} w_k = \frac{h_1}{2} \sum_{k=1}^{\infty} \alpha_k \|\nabla f(x_k)\|^2 < +\infty \quad \text{with probability } 1.$
Combining this with (24) yields the desired result. □
Next, the convergence of the algorithm can be given.
Theorem 2.
If Assumptions 1, 2, and 4 hold for $\{x_k\}$ generated by Algorithm 1, where the batch size is $z_k = z$, and the step size satisfies (24) and $\alpha_k \le \frac{h_1}{L h_2^2}$, then we have
$\liminf_{k \to \infty} E[\|\nabla f(x_k)\|^2] = 0 \quad \text{with probability } 1.$
Proof.
The proof is established by bounding the eigenvalues of the limited-memory matrices, and the discussion is as follows.
According to the definition of $y k *$, we have
$\begin{aligned} s_j^T y_j^* &= s_j^T y_j + \frac{2[f(x_{j-1}) - f(x_j)] + (g_j + g_{j-1})^T s_j}{\max\{(s_j^T y_j)^2, \|s_j\|^4\}} \cdot (s_j^T y_j y_j^T s_j) \\ &\le s_j^T y_j + \frac{2[f(x_{j-1}) - f(x_j)] + (g_j + g_{j-1})^T s_j}{(s_j^T y_j)^2} \cdot (s_j^T y_j y_j^T s_j) \\ &= s_j^T y_j + 2[f(x_{j-1}) - f(x_j)] + (g_j + g_{j-1})^T s_j \\ &= s_j^T y_j - 2 g(x_{j-1} + \theta(x_j - x_{j-1}))^T s_j + (g_j + g_{j-1})^T s_j \\ &= 2 s_j^T \big(g_j - g(x_{j-1} + \theta(x_j - x_{j-1}))\big) \\ &\le 2(1-\theta) L \|s_j\| \|x_j - x_{j-1}\| = 2(1-\theta) L \|s_j\|^2, \end{aligned}$
where $\theta \in (0, 1)$. It is easy to see that
$m \|s_j\|^2 \le s_j^T y_j^* \le \Lambda \|s_j\|^2,$
where $\Lambda$ is a positive constant.
According to the definition of $y_k^*$, we have
$\begin{aligned} \|y_j^*\| &= \left\| y_j + \frac{2[f(x_{j-1}) - f(x_j)] + (g_j + g_{j-1})^T s_j}{\max\{(s_j^T y_j)^2, \|s_j\|^4\}} (y_j y_j^T) s_j \right\| \\ &\le \|y_j\| + \frac{\big|2[f(x_{j-1}) - f(x_j)] + (g_j + g_{j-1})^T s_j\big| \, \|y_j y_j^T\|}{\|s_j\|^4} \|s_j\| \\ &= \|y_j\| + \frac{\big|{-2} g(x_{j-1} + \theta(x_j - x_{j-1}))^T s_j + (g_j + g_{j-1})^T s_j\big| \, \|y_j y_j^T\|}{\|s_j\|^3} \\ &\le \|y_j\| + \big(L(1-\theta)\|s_j\|^2 + L\theta\|s_j\|^2\big) \frac{\|y_j\| \, \|y_j\|}{\|s_j\|^3} \\ &= 2L \|s_j\|. \end{aligned}$
From (41) and (42), we have
$λ ≤ ∥ y j * ∥ 2 s j T y j * ≤ ( 2 L ) 2 ∥ s j ∥ 2 λ ∥ s j ∥ 2 = ( 2 L ) 2 λ = M 0 ,$
where the first inequality is derived from the quasi Newton condition. This equation shows that the eigenvalue of our initial matrix $B k 0 = y k * T y k * s k T y k * I$ is bounded, and the eigenvalue is much greater than 0.
Instead of directly analyzing the properties of $H_k$, we obtain the results by analyzing the properties of $B_k$. In this situation, the limited memory quasi-Newton updating formula is as follows:
(i)
$B_k^{(0)} = \frac{(y_k^*)^T y_k^*}{s_k^T y_k^*} I$.
(ii)
for $i = 0, \ldots, r-1$, $j = k - (r - i)$, and
$B_k^{(i+1)} = B_k^{(i)} - \frac{B_k^{(i)} s_j s_j^T B_k^{(i)}}{s_j^T B_k^{(i)} s_j} + \frac{y_j^* (y_j^*)^T}{s_j^T y_j^*}.$
The trace of a matrix $B$ is denoted by $tr(B)$. Then, from (43) and (44), and the boundedness of $\{\|B_k^{(0)}\|\}$, we obtain
$tr(B_{k+1}) \le tr(B_k^{(0)}) + \sum_{i=1}^{r} \frac{\|y_{j_i}^*\|^2}{s_{j_i}^T y_{j_i}^*} \le tr(B_k^{(0)}) + r\Lambda = M_1.$
The determinant of $B_k$ is now considered because it can be used to show that the minimum eigenvalue of $B_k$ is uniformly bounded. From the theory in [12], we can get the following equation for the determinant:
$\det(B_{k+1}) = \det(B_k^{(0)}) \prod_{i=1}^{r} \frac{(y_{j_i}^*)^T s_{j_i}}{s_{j_i}^T B_k^{(i-1)} s_{j_i}} = \det(B_k^{(0)}) \prod_{i=1}^{r} \frac{(y_{j_i}^*)^T s_{j_i}}{s_{j_i}^T s_{j_i}} \cdot \frac{s_{j_i}^T s_{j_i}}{s_{j_i}^T B_k^{(i-1)} s_{j_i}}.$
It follows from (45) that the maximum eigenvalue of $B_j$ is uniformly bounded. Therefore, according to (41) and the fact that the smallest eigenvalue of $B_k^{(0)}$ is bounded away from zero, we obtain
$\det(B_{k+1}) \ge \det(B_k^{(0)}) \left(\frac{\lambda}{M_1}\right)^r \ge M_2.$
In this way, the maximum and minimum eigenvalues of $B_j$ are uniformly bounded and bounded away from zero. Therefore, we can get
$h_1 I \preceq H_k \preceq h_2 I,$
where $h_1$ and $h_2$ are positive constants. According to Theorem 1 proved above, the convergence of the proposed Algorithm 1 is obtained. □
Corollary 1.
If Assumptions 1, 2, and 4 hold for $\{x_k\}$ generated by Algorithm 3, where the batch size is $z_k = z$ and the step size satisfies (24) and $\alpha_k \le \frac{h_1}{L h_2^2}$, then, we have
$\liminf_{k \to \infty} E[\|\nabla f(x_k)\|^2] = 0 \quad \text{with probability } 1.$

## 4. The Complexity of the Proposed Algorithm

The convergence results of the algorithm have been discussed. Now, let us analyze the complexity of Algorithms 1 and 3.
Assumption 5.
For any k, we have
$\alpha_k = \frac{h_1}{L h_2^2} k^{-\beta}, \quad \beta \in (0.5, 1).$
Theorem 3.
Suppose Assumptions 1–5 hold, $t k$ is generated by Algorithm 1, and batch size $z k = z$ for all k. Then, we have
$\frac{1}{N} \sum_{k=1}^{N} E[\|\nabla f(t_k)\|^2] \le \frac{2L(M_0 - \underline{f}) h_2^2}{h_1^2} N^{\beta - 1} + \frac{\sigma^2}{(1-\beta) z} \left(N^{-\beta} - N^{-1} + \frac{1-\beta}{N}\right),$
where N denotes the iteration number.
Moreover, for a given $\epsilon \in (0, 1)$, to guarantee that $\frac{1}{N} \sum_{k=1}^{N} E[\|\nabla f(t_k)\|^2] < \epsilon$, the number of iterations $N$ needed is at most $O(\epsilon^{-\frac{1}{1-\beta}})$.
Proof.
Obviously, (49) satisfies (24) and the condition $α k ≤ h 1 L h 2 2$. Then, taking expectations on both sides of (30) and summing over all k yield
$\begin{aligned} \frac{1}{2} h_1 \sum_{k=1}^{N} E[\|\nabla f(t_k)\|^2] &\le \sum_{k=1}^{N} \frac{1}{\alpha_k} \big(E[f(t_k)] - E[f(t_{k+1})]\big) + \frac{L \sigma^2 h_2^2}{2z} \sum_{k=1}^{N} \alpha_k \\ &= \frac{1}{\alpha_1} f(t_1) + \sum_{k=2}^{N} \left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right) E[f(t_k)] - \frac{E[f(t_{N+1})]}{\alpha_N} + \frac{L \sigma^2 h_2^2}{2z} \sum_{k=1}^{N} \alpha_k \\ &\le \frac{M_0}{\alpha_1} + M_0 \sum_{k=2}^{N} \left(\frac{1}{\alpha_k} - \frac{1}{\alpha_{k-1}}\right) - \frac{\underline{f}}{\alpha_N} + \frac{L \sigma^2 h_2^2}{2z} \sum_{k=1}^{N} \alpha_k \\ &= \frac{M_0 - \underline{f}}{\alpha_N} + \frac{L \sigma^2 h_2^2}{2z} \sum_{k=1}^{N} \alpha_k \\ &\le \frac{L (M_0 - \underline{f}) h_2^2}{h_1} N^{\beta} + \frac{\sigma^2 h_1}{2(1-\beta) z} \left(N^{1-\beta} - \beta\right), \end{aligned}$
which results in (50), where the second inequality is due to Lemma 2, and the last inequality is due to the choice of step size (49).
Next, for a given $ϵ > 0$, in order to obtain $1 N ∑ k = 1 N E [ ∥ ∇ f ( x k ) ∥ 2 ] ≤ ϵ$, we only need the following equation:
$\frac{2(M_0 - \underline{f}) L h_2^2}{h_1^2} N^{\beta-1} + \frac{\sigma^2}{(1-\beta) z} \left(N^{-\beta} - N^{-1} + \frac{1-\beta}{N}\right) < \epsilon.$
Since $β ∈ ( 0.5 , 1 )$, it follows that the number of iterations N needed is at most $O ( ϵ − 1 1 − β ) .$ □
Corollary 2.
Assume that Assumptions 1, 3, and 4 and (27) hold for $\{x_k\}$ generated by Algorithm 3 with batch size $z_k = z$ for all $k$. We also assume that $\alpha_k$ is specifically chosen as
$\alpha_k = \frac{h_1}{L h_2^2} k^{-\beta}$
with $\beta \in (0.5, 1)$. Then,
$\frac{1}{N} \sum_{k=1}^{N} E[\|\nabla f(x_k)\|^2] \le \frac{2L(M_0 - \underline{f}) h_2^2}{h_1^2} N^{\beta-1} + \frac{\sigma^2}{(1-\beta) z} \left(N^{-\beta} - N^{-1} + \frac{1-\beta}{N}\right),$
where $N$ denotes the iteration number. Moreover, for a given $\epsilon \in (0, 1)$, to guarantee that $\frac{1}{N} \sum_{k=1}^{N} E[\|\nabla f(x_k)\|^2] < \epsilon$, the number of iterations $N$ needed is at most $O(\epsilon^{-\frac{1}{1-\beta}})$.

## 5. Numerical Results

In this section, we focus on the numerical performances of the proposed Algorithm 3 for solving nonconvex empirical risk minimization (ERM) problems and nonconvex support vector machine (SVM) problems.

#### 5.1. Experiments with Synthetic Datasets

The models of the nonconvex SVM problems and nonconvex ERM problems are given as follows, where $\lambda > 0$ is a regularization parameter.
Problem 1.
The ERM problem with a nonconvex sigmoid loss function [13,14] is formulated as follows:
$\min_{x \in \mathbb{R}^{D_x}} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \frac{\lambda}{2} \|x\|_2^2, \qquad f_i(x) = \frac{1}{1 + \exp(b_i a_i^T x)},$
where $a_i \in \mathbb{R}^d$ and $b_i \in \{-1, 1\}$ represent the feature vector and the corresponding label, respectively.
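As a sketch, the loss and gradient of Problem 1 can be computed in vectorized form; the function name is illustrative, and the chain-rule derivative $\nabla f_i(x) = -f_i(x)(1 - f_i(x))\, b_i a_i$ follows directly from the sigmoid loss above:

```python
import numpy as np

def erm_sigmoid_loss_grad(x, A, b, lam):
    """Loss and gradient for Problem 1: mean_i 1/(1 + exp(b_i a_i^T x))
    plus (lam/2) ||x||^2.  A is the n-by-d feature matrix, b the +/-1 labels."""
    z = b * (A @ x)                         # b_i a_i^T x
    fi = 1.0 / (1.0 + np.exp(z))            # nonconvex sigmoid loss per sample
    loss = fi.mean() + 0.5 * lam * (x @ x)
    # d f_i / dx = -f_i (1 - f_i) b_i a_i   (derivative of 1/(1+e^z))
    grad = (A * (-fi * (1.0 - fi) * b)[:, None]).mean(axis=0) + lam * x
    return loss, grad
```

A finite-difference check of `grad` against `loss` is a quick way to validate such a hand-derived gradient before plugging it into any of the stochastic algorithms above.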
Problem 2.
The nonconvex support vector machine (SVM) problem with a sigmoid loss function [15,16] is formulated as follows:
$\min_{x \in \mathbb{R}^{D_x}} \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \lambda \|x\|^2, \qquad f_i(x) = 1 - \tanh(b_i \langle x, a_i \rangle).$
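Similarly, a sketch of the loss and gradient for Problem 2, taking the regularizer as $\lambda \|x\|_2^2$ (an assumption on the intended norm):

```python
import numpy as np

def svm_tanh_loss_grad(x, A, b, lam):
    """Loss and gradient for Problem 2: mean_i (1 - tanh(b_i a_i^T x))
    plus lam * ||x||_2^2 (squared-norm regularizer assumed)."""
    z = b * (A @ x)                          # b_i <x, a_i>
    loss = (1.0 - np.tanh(z)).mean() + lam * (x @ x)
    # d/dx [1 - tanh(z)] = -(1 - tanh(z)^2) b_i a_i
    sech2 = 1.0 - np.tanh(z) ** 2
    grad = (A * (-sech2 * b)[:, None]).mean(axis=0) + 2.0 * lam * x
    return loss, grad
```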
We compare the proposed LMLBFGS-VR algorithm with SGD [1], SVRG [17], and SAGA [18], where the LMLBFGS-VR algorithm uses a decreasing step size and the other algorithms use a constant step size $\alpha_k$. The data sets in our experiments include Adult, IJCNN, Mnist, and Covtype. All the codes are written in MATLAB 2018b on a PC with an AMD Ryzen 7 5800H with Radeon Graphics at 3.20 GHz and 16 GB of memory.

#### 5.2. Numerical Results for Problem 1

In this subsection, we present the numerical results of LMLBFGS-VR, SGD, SVRG, and SAGA for solving Problem 1 on the four data sets. For the LMLBFGS-VR algorithm, the step size is $\alpha_k = 0.02 \times k^{-0.6}$, the memory size is $r = 10$, and $m = 1 \times 10^{-5}$. The step size of the other algorithms is chosen as 0.02. The number of inner iterations $q$ is uniformly chosen as $n/V$, where $V$ is the batch size. The batch size is set to 100 for Adult, IJCNN, and Covtype, and for Mnist. To further test the performance of the algorithm, the regularization parameter is set to $10^{-3}$, $10^{-4}$, or $10^{-5}$. Figure 1, Figure 2, Figure 3 and Figure 4 show the convergence performance of all the stochastic algorithms for solving Problem 1 with $\lambda = 1 \times 10^{-3}$, $\lambda = 1 \times 10^{-4}$, or $\lambda = 1 \times 10^{-5}$ on the four data sets. From Figure 1, Figure 2, Figure 3 and Figure 4, we see that all the algorithms solve the problem successfully; however, the proposed LMLBFGS-VR algorithm has a significantly faster convergence speed than the other algorithms. It is clear that the proposed algorithms, especially LMLBFGS-VR, have a great advantage for solving such nonconvex problems.

#### 5.3. Numerical Results for Problem 2

The numerical results of LMLBFGS-VR, SGD, SVRG, and SAGA for solving Problem 2 on the four data sets are presented in this subsection. All parameters are the same as in the previous subsection, and the regularization parameter is again set to $1 \times 10^{-3}$, $1 \times 10^{-4}$, or $1 \times 10^{-5}$. In the following figures, the y-axis is the objective function value, and the x-axis denotes the number of effective passes, where computing a full gradient or evaluating $n$ component gradients is regarded as one effective pass. Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8 demonstrate the convergence performance of our LMLBFGS-VR algorithm on the four data sets and show that it remarkably outperforms the other algorithms. When $\lambda = 1 \times 10^{-3}$, the objective function is almost minimized within two effective passes. In contrast, the SGD, SVRG, and SAGA algorithms, which use only first-order information, converge more slowly. Due to the use of second-order information and the limited memory technique, LMLBFGS-VR requires only a few effective passes to minimize the function value. From Figure 4, Figure 5, Figure 6, Figure 7 and Figure 8, we find that, as $\lambda$ decreases, the function value decreases further; thus, a smaller $\lambda$ can be chosen for practical problems. Combined with the previous discussion, our LMLBFGS-VR algorithm makes great progress in improving computing efficiency for nonconvex machine learning problems.

## 6. Conclusions

In this paper, we proposed an efficient modified stochastic limited-memory BFGS algorithm for solving nonconvex stochastic optimization problems. The proposed algorithm preserves the positive definiteness of $H_k$ without any convexity assumptions. The LMLBFGS-VR method with variance reduction was also presented for nonconvex stochastic optimization problems. Numerical experiments on nonconvex SVM and nonconvex ERM problems were performed to demonstrate the performance of the proposed algorithms, and the results indicate that our algorithms compare favorably with similar methods. In the future, the following points could be considered: (i) whether a proper line search can be used to determine an appropriate step size, which could reduce the complexity and enhance the accuracy of the algorithm; (ii) further experiments on practical problems could be performed to check the performance of the presented algorithms.

## Author Contributions

Writing—original draft preparation, H.L.; writing—review and editing, Y.L. and M.Z. All authors have read and agreed to the published version of the manuscript.

## Funding

This research was funded by the National Natural Science Foundation of China Grant No.11661009, the High Level Innovation Teams and Excellent Scholars Program in Guangxi institutions of higher education Grant No. [2019]52, the Guangxi Natural Science Key Fund No. 2017GXNSFDA198046, the Special Funds for Local Science and Technology Development Guided by the Central Government No. ZY20198003, the special foundation for Guangxi Ba Gui Scholars, and the Basic Ability Improvement Project for Young and Middle-Aged Teachers in Guangxi Colleges and Universities No. 2020KY30018.

## Conflicts of Interest

The authors declare no conflict of interest.

## References

1. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
2. Chung, K.L. On a stochastic approximation method. Ann. Math. Stat. 1954, 25, 463–483. [Google Scholar] [CrossRef]
3. Polyak, B.T.; Juditsky, A.B. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 1992, 30, 838–855. [Google Scholar] [CrossRef]
4. Ruszczyǹski, A.; Syski, W. A method of aggregate stochastic subgradients with online stepsize rules for convex stochastic programming problems. In Stochastic Programming 84 Part II; Springer: Berlin/Heidelberg, Germany, 1986; pp. 113–131. [Google Scholar]
5. Wright, S.; Nocedal, J. Numerical Optimization; Springer: Berlin/Heidelberg, Germany, 1999; Volume 35, p. 7. [Google Scholar]
6. Bordes, A.; Bottou, L. SGD-QN: Careful quasi-Newton stochastic gradient descent. J. Mach. Learn. Res. 2009, 10, 1737–1754. [Google Scholar]
7. Byrd, R.H.; Hansen, S.L.; Nocedal, J.; Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 2016, 26, 1008–1031. [Google Scholar] [CrossRef]
8. Gower, R.; Goldfarb, D.; Richtárik, P. Stochastic block BFGS: Squeezing more curvature out of data. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 1869–1878. [Google Scholar]
9. Covei, D.P.; Pirvu, T.A. A stochastic control problem with regime switching. Carpathian J. Math. 2021, 37, 427–440. [Google Scholar] [CrossRef]
10. Wei, Z.; Li, G.; Qi, L. New quasi-Newton methods for unconstrained optimization problems. Appl. Math. Comput. 2006, 175, 1156–1188. [Google Scholar] [CrossRef]
11. Durrett, R. Probability: Theory and Examples; Cambridge University Press: Cambridge, UK, 2019. [Google Scholar]
12. Deng, N.Y.; Li, Z.F. Some global convergence properties of a conic-variable metric algorithm for minimization with inexact line searches. Numer. Algebra Control Optim. 1995, 5, 105–122. [Google Scholar]
13. Allen-Zhu, Z.; Hazan, E. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; Volume 48, pp. 699–707. [Google Scholar]
14. Shalev-Shwartz, S.; Shamir, O.; Sridharan, K. Learning kernel-based halfspaces with the 0–1 loss. SIAM J. Comput. 2011, 40, 1623–1646. [Google Scholar] [CrossRef] [Green Version]
15. Ghadimi, S.; Lan, G. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 2013, 23, 2341–2368. [Google Scholar] [CrossRef] [Green Version]
16. Mason, L.; Baxter, J.; Bartlett, P.; Frean, M. Boosting algorithms as gradient descent in function space. Adv. Neural Inf. Process. Syst. 1999, 12, 512–518. [Google Scholar]
17. Johnson, R.; Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 2013, 26, 315–323. [Google Scholar]
18. Defazio, A.; Bach, F.; Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 1646–1654. [Google Scholar]
Figure 1. Comparison of all the algorithms for solving Problem 1 on Adult. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 2. Comparison of all the algorithms for solving Problem 1 on Covtype. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 3. Comparison of all the algorithms for solving Problem 1 on IJCNN. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 4. Comparison of all the algorithms for solving Problem 1 on mnist. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 5. Comparison of all the algorithms for solving Problem 2 on Adult. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 6. Comparison of all the algorithms for solving Problem 2 on Covtype. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 7. Comparison of all the algorithms for solving Problem 2 on IJCNN. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Figure 8. Comparison of all the algorithms for solving Problem 2 on mnist. From left to right: $λ = 1 × 10 − 3 , λ = 1 × 10 − 4 , λ = 1 × 10 − 5$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Share and Cite

MDPI and ACS Style

Liu, H.; Li, Y.; Zhang, M. An Active Set Limited Memory BFGS Algorithm for Machine Learning. Symmetry 2022, 14, 378. https://doi.org/10.3390/sym14020378
