# An Active Set Limited Memory BFGS Algorithm for Machine Learning


## Abstract


## 1. Introduction

- How can we guarantee the positive definiteness of the iteration matrix ${H}_{k}$ without a line search?
- How can we guarantee the convergence of the proposed L-BFGS method?

## 2. Premise Setting and Algorithm

#### 2.1. LMLBFGS Algorithm

**Algorithm 1:** Modified stochastic LBFGS algorithm (LMLBFGS).

**Input:** ${x}_{1}\in {\Re}^{n}$, batch size ${z}_{k}$, step size ${\alpha}_{k}$, memory size $r$, a positive definite matrix ${H}_{1}$, and a positive constant $\delta$.

1. **for** $k=1,2,\dots$ **do**
2. Compute ${g}_{k}$ by (7) and the matrix ${H}_{k}$ by Algorithm 2;
3. Compute the iteration point ${x}_{k+1}={x}_{k}-{\alpha}_{k}{H}_{k}{g}_{k}$;
4. **end for**
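The outer loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the callables `stochastic_gradient` (standing in for Equation (7)) and `update_hessian` (standing in for Algorithm 2) are assumed to be supplied by the caller, and their names are ours.

```python
import numpy as np

def lmlbfgs(x, stochastic_gradient, update_hessian, alpha, num_iters):
    """Sketch of Algorithm 1 (LMLBFGS): x_{k+1} = x_k - alpha_k * H_k * g_k."""
    H = np.eye(len(x))                   # H_1: any positive definite matrix
    for k in range(num_iters):
        g = stochastic_gradient(x, k)    # minibatch gradient g_k, Eq. (7)
        H = update_hessian(H, x, g, k)   # Algorithm 2 keeps H_k positive definite
        x = x - alpha[k] * (H @ g)       # step 3
    return x
```

With `H` fixed to the identity and the full gradient used in place of the minibatch one, this reduces to plain gradient descent, which is a convenient sanity check.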

#### 2.2. Extension of Our LMLBFGS Algorithm with Variance Reduction

**Algorithm 2:** Hessian matrix updating.

**Input:** correction pairs $({s}_{j},{y}_{j}^{*})$ for $j=k-(r-i)$, $i=0,\dots ,r-1$, and memory parameter $r$.
**Output:** new ${H}_{k}$.

1. $H=\frac{{s}_{k}^{T}{y}_{k}^{*}}{{y}_{k}^{*T}{y}_{k}^{*}}I$
2. **for** $j=k-(r-i)$, $i=0,\dots ,r-1$ **do**
3. ${m}_{j}={s}_{j}^{T}{y}_{j}^{*}-\delta {\parallel {s}_{j}\parallel}^{2}$, $\quad{\rho}_{j}=\frac{1}{{s}_{j}^{T}{y}_{j}^{*}}$
4. **if** ${m}_{j}>0$ **then**
5. $H=(I-{\rho}_{j}{s}_{j}{y}_{j}^{*T})H(I-{\rho}_{j}{y}_{j}^{*}{s}_{j}^{T})+{\rho}_{j}{s}_{j}{s}_{j}^{T}$
6. **end if**
7. **end for**
8. **return** ${H}_{k}=H$
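The cautious test in steps 3–5 is what preserves positive definiteness without a line search: a correction pair enters the BFGS update only when ${m}_{j}>0$. A minimal sketch, assuming the pairs are stored oldest-first and that $\delta$ is the positive constant from the input of Algorithm 1:

```python
import numpy as np

def update_hessian(s_list, y_list, delta):
    """Sketch of Algorithm 2: rebuild the inverse-Hessian approximation H
    from the r most recent correction pairs (s_j, y_j*), skipping any pair
    with m_j = s_j^T y_j* - delta * ||s_j||^2 <= 0 so H stays positive definite."""
    s_k, y_k = s_list[-1], y_list[-1]          # most recent pair
    n = len(s_k)
    H = (s_k @ y_k) / (y_k @ y_k) * np.eye(n)  # step 1: initial scaling
    I = np.eye(n)
    for s, y in zip(s_list, y_list):           # j = k-r, ..., k-1 (oldest first)
        m = s @ y - delta * (s @ s)
        if m > 0:                              # cautious update (steps 4-6)
            rho = 1.0 / (s @ y)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
    return H
```

Each accepted pair applies the standard inverse BFGS recursion, which maps a positive definite $H$ to a positive definite $H$ whenever ${s}_{j}^{T}{y}_{j}^{*}>0$; the $m_j$ test guarantees this in the stochastic setting.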

**Algorithm 3:** Modified stochastic LBFGS algorithm with variance reduction (LMLBFGS-VR).

**Input:** ${\overline{x}}_{0}\in {\Re}^{n}$, ${H}_{0}=I$, batch size ${z}_{k}$, step size ${\alpha}_{k}$, memory size $r$, and a constant $\delta >0$.
**Output:** an iterate $x$ chosen uniformly at random from $\{{l}_{u}^{k+1}:u=0,\dots ,q-1;\ k=0,\dots ,N-1\}$.

1. **for** $k=0,1,\dots ,N-1$ **do**
2. ${l}_{0}^{k+1}={\overline{x}}_{k}$
3. Compute $\nabla f({\overline{x}}_{k})$;
4. **for** $u=0,1,\dots ,q-1$ **do**
5. Sample a minibatch $\mathcal{Z}$ with $\left|\mathcal{Z}\right|={z}_{k}$;
6. Compute ${g}_{u}^{k+1}=\nabla {f}_{\mathcal{Z}}({l}_{u}^{k+1})-\nabla {f}_{\mathcal{Z}}({\overline{x}}_{k})+\nabla f({\overline{x}}_{k})$, where $\nabla {f}_{\mathcal{Z}}({l}_{u}^{k+1})=\frac{1}{\left|\mathcal{Z}\right|}{\sum}_{i\in \mathcal{Z}}\nabla {f}_{i}({l}_{u}^{k+1})$;
7. Compute ${l}_{u+1}^{k+1}={l}_{u}^{k+1}-{\alpha}_{k}{H}^{k+1}{g}_{u}^{k+1}$;
8. **end for**
9. Generate the updated matrix ${H}^{k+1}$ by Algorithm 2;
10. ${\overline{x}}_{k+1}={l}_{q}^{k+1}$
11. **end for**
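Step 6 is an SVRG-type estimator: conditioned on ${l}_{u}^{k+1}$, its expectation over the minibatch equals the full gradient $\nabla f({l}_{u}^{k+1})$, because the two $\nabla {f}_{\mathcal{Z}}$ terms are evaluated on the same sample. A sketch of the estimator and the outer/inner loops, with ${H}^{k+1}$ fixed to the identity for brevity (the paper obtains it from Algorithm 2); all function names here are illustrative:

```python
import numpy as np

def vr_gradient(grad_i, l, x_bar, full_grad, batch):
    """Variance-reduced estimator from step 6 of Algorithm 3:
    g = grad_Z(l) - grad_Z(x_bar) + full_grad(x_bar),
    where grad_Z averages per-sample gradients over the minibatch Z."""
    g_l = np.mean([grad_i(i, l) for i in batch], axis=0)
    g_x = np.mean([grad_i(i, x_bar) for i in batch], axis=0)
    return g_l - g_x + full_grad

def lmlbfgs_vr(x_bar, grad_i, n_samples, alpha, q, N, batch_size, rng):
    """Outer/inner loops of Algorithm 3 with H = I for brevity."""
    for k in range(N):
        # step 3: full gradient at the snapshot point
        full_grad = np.mean([grad_i(i, x_bar) for i in range(n_samples)], axis=0)
        l = x_bar                                            # step 2
        for u in range(q):
            batch = rng.integers(0, n_samples, size=batch_size)  # step 5
            g = vr_gradient(grad_i, l, x_bar, full_grad, batch)  # step 6
            l = l - alpha * g                                # step 7 (H omitted)
        x_bar = l                                            # step 10
    return x_bar
```

For a quadratic objective $f_i(x)=\frac{1}{2}\|x-a_i\|^2$ the estimator is exactly the full gradient for every minibatch, so the sketch converges deterministically to the mean of the $a_i$; this is a convenient check that the three terms in step 6 are wired correctly.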

## 3. Global Convergence Analysis

#### 3.1. Basic Assumptions

**Assumption 1.**

**Assumption 2.**

**Assumption 3.**

**Assumption 4.**

#### 3.2. Key Propositions, Lemmas, and Theorem

**Lemma 1.**

**Proof.**

**Definition 1.**

- (1) $\mathbb{E}\left[\left|{W}_{k}\right|\right]<\infty $,
- (2) ${W}_{k}\in {\mathcal{L}}_{k}$ and $\mathbb{E}\left[{W}_{k+1}\mid {\mathcal{L}}_{k}\right]\le {W}_{k}$, for all $k$,
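Conditions (1) and (2) say that $({W}_{k})$ is a supermartingale with respect to the filtration $({\mathcal{L}}_{k})$. The standard convergence result this definition supports (stated here as general background, not as the paper's exact lemma) is:

```latex
% Supermartingale convergence theorem (standard form): if (W_k) satisfies
% conditions (1)-(2) above and, in addition, W_k >= 0 for all k, then
W_k \longrightarrow W_\infty \quad \text{almost surely, with } \mathbb{E}[W_\infty] < \infty .
```

In stochastic quasi-Newton analyses this is the typical tool for showing that a suitable nonnegative potential built from $f({x}_{k})$ converges almost surely.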

**Proposition 1.**

**Lemma 2.**

**Proof.**

#### 3.3. Global Convergence Theorem

**Theorem 1.**

**Proof.**

**Theorem 2.**

**Proof.**

- (i) ${B}_{k}^{(0)}=\frac{{y}_{k}^{*T}{y}_{k}^{*}}{{s}_{k}^{T}{y}_{k}^{*}}I$;
- (ii) for $i=0,\dots ,r-1$, $j=k-(r-i)$, and

**Corollary 1.**

## 4. The Complexity of the Proposed Algorithm

**Assumption 5.**

**Theorem 3.**

**Proof.**

**Corollary 2.**

## 5. Numerical Results

#### 5.1. Experiments with Synthetic Datasets

**Problem 1.**

**Problem 2.**

#### 5.2. Numerical Results for Problem 1

#### 5.3. Numerical Results for Problem 2

## 6. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References


**Figure 1.** Comparison of all the algorithms for solving Problem 1 on Adult. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 2.** Comparison of all the algorithms for solving Problem 1 on Covtype. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 3.** Comparison of all the algorithms for solving Problem 1 on IJCNN. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 4.** Comparison of all the algorithms for solving Problem 1 on mnist. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 5.** Comparison of all the algorithms for solving Problem 2 on Adult. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 6.** Comparison of all the algorithms for solving Problem 2 on Covtype. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 7.** Comparison of all the algorithms for solving Problem 2 on IJCNN. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

**Figure 8.** Comparison of all the algorithms for solving Problem 2 on mnist. From left to right: $\lambda =1\times {10}^{-3},\lambda =1\times {10}^{-4},\lambda =1\times {10}^{-5}$.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Liu, H.; Li, Y.; Zhang, M.
An Active Set Limited Memory BFGS Algorithm for Machine Learning. *Symmetry* **2022**, *14*, 378.
https://doi.org/10.3390/sym14020378

**AMA Style**

Liu H, Li Y, Zhang M.
An Active Set Limited Memory BFGS Algorithm for Machine Learning. *Symmetry*. 2022; 14(2):378.
https://doi.org/10.3390/sym14020378

**Chicago/Turabian Style**

Liu, Hanger, Yan Li, and Maojun Zhang.
2022. "An Active Set Limited Memory BFGS Algorithm for Machine Learning" *Symmetry* 14, no. 2: 378.
https://doi.org/10.3390/sym14020378