Article

On the Convergence Properties of a Stochastic Trust-Region Method with Inexact Restoration

by Stefania Bellavia 1,2,†, Benedetta Morini 1,2,† and Simone Rebegoldi 1,2,*,†
1 Dipartimento di Ingegneria Industriale, Università degli studi di Firenze, Viale G.B. Morgagni 40, 50134 Firenze, Italy
2 INdAM-GNCS Research Group, P.le Aldo Moro 5, 00185 Roma, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Axioms 2023, 12(1), 38; https://doi.org/10.3390/axioms12010038
Submission received: 29 October 2022 / Revised: 19 December 2022 / Accepted: 24 December 2022 / Published: 28 December 2022
(This article belongs to the Section Mathematical Analysis)

Abstract:
We study the convergence properties of SIRTR, a stochastic inexact restoration trust-region method suited for the minimization of a finite sum of continuously differentiable functions. This method combines the trust-region methodology with random function and gradient estimates formed by subsampling. Unlike other existing schemes, it forces the decrease of a merit function that combines the function approximation with an infeasibility term, the latter measuring the distance of the current sample size from its maximum value. In previous work, a bound on the expected number of iterations required to satisfy an approximate first-order optimality condition was provided. Here, we elaborate on the convergence analysis of SIRTR and prove its convergence in probability under suitable accuracy requirements on the random function and gradient estimates. Furthermore, we report numerical results obtained on some nonconvex classification test problems, discussing the impact of the probabilistic requirements on the selection of the sample sizes.
MSC:
65K05; 90C30; 90C15

1. Introduction

The solution of large-scale finite-sum optimization problems has become essential in several machine learning tasks, including binary or multinomial classification, regression, clustering, and anomaly detection [1,2]. Indeed, the training of the models employed in such tasks is often performed by solving the optimization problem
$$\min_{x\in\mathbb{R}^n} f_N(x) = \frac{1}{N}\sum_{i=1}^{N}\phi_i(x), \tag{1}$$
where $N$ is the size of the available data set and the functions $\phi_i:\mathbb{R}^n\to\mathbb{R}$ are continuously differentiable for all $i=1,\dots,N$. As a result, the solution of a machine learning problem calls for efficient numerical algorithms for (1).
When the data set is extremely large, evaluating the objective function $f_N$ and its derivatives may be computationally demanding, making deterministic optimization methods inadequate for solving (1). A common strategy consists of approximating both the function and its derivatives using a small number of loss functions $\phi_i$ sampled randomly, making stochastic optimization methods the preferred choice [3,4,5]. A major issue is the sensitivity of most stochastic algorithms to their parameters, such as the learning rate or the sample sizes used for building the function and gradient approximations, which usually need to be tuned by trial and error before the algorithm exhibits acceptable performance. A possible remedy to burdensome tuning is to employ adaptive optimization methods, which compute the parameters according to appropriate globalization strategies [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Most of these methods impose probabilistic accuracy requirements on the function and gradient estimates in order to ensure either iteration complexity bounds in expectation [7,8,9,10,11,13,15,19,21,22] or the convergence in probability of the iterates [6,12,16,17,18,20]. In turn, these requirements are reflected in the choice of the sample sizes, which need to grow progressively as the iterations proceed, resulting in an increasing computational cost per iteration.
In [11], the authors proposed the so-called stochastic inexact restoration trust-region (SIRTR) method for solving (1). SIRTR employs subsampled function and gradient estimates and combines the classical trust-region scheme with the inexact restoration method for constrained optimization problems [23,24,25]. This combined strategy involves the reformulation of (1) as an optimization problem with two unknown variables $(x, M)$, where $x$ is the object to be recovered and $M$ is the sample size of the function estimate, upon which the constraint $M = N$ is imposed. Based on this reformulation of (1), the method acts on the two variables in a modular way: first, it selects the sample size with a deterministic rule aimed at improving feasibility with respect to the constraint $M = N$; then, it accepts or rejects the inexact trust-region step by improving optimality with respect to a suitable merit function. SIRTR has shown good numerical performance on a series of classification and regression test problems, as its inexact restoration strategy drastically reduces the computational burden due to the selection of the algorithmic parameters. From a theoretical viewpoint, the authors in [11] provided an upper bound on the expected number of iterations needed to reach a near-stationary point under appropriate probabilistic accuracy requirements on the random estimators; remarkably, such requirements are less stringent than others employed in the literature. However, the convergence in probability of SIRTR remains unproved, thus leaving open the question of whether the gradient of the objective function in (1) converges to zero with a probability of one. A positive answer to this question would be an important theoretical confirmation of the numerical stability of the method.
In this paper, we improve on the existing theoretical analysis of SIRTR by showing that its iterates drive the gradient of the objective function to zero with a probability of one. The results are obtained by combining the theoretical properties of SIRTR with tools from martingale theory, as is typically done in the convergence analysis of adaptive stochastic methods [6,16,18]. Furthermore, we show the numerical results obtained by applying SIRTR to nonconvex binary classification problems, discussing the impact of the probabilistic accuracy requirements on the performance of the method.
The paper is structured as follows. In Section 2, we briefly outline the method and its main steps. In Section 3, we perform the convergence analysis of the method, showing that the gradient norm at its iterates converges to zero with a probability of one. In Section 4, we provide a numerical illustration on a binary classification test problem. Finally, we report the conclusions and future work in Section 5.
Notations: Throughout the paper, $\mathbb{R}$ is the set of real numbers, whereas the symbol $\|\cdot\|$ denotes the standard Euclidean norm on $\mathbb{R}^n$. We denote with $(\Omega, \mathcal{A}, P)$ a probability space, where $\Omega$ is the sample space, $\mathcal{A}\subseteq\mathcal{P}(\Omega)$ is the $\sigma$-algebra of events, and $P:\mathcal{A}\to[0,1]$ is the probability function. Given an event $A\in\mathcal{A}$, the symbol $P(A)$ stands for the probability of the event $A$, and $\mathbb{1}_A:\Omega\to\{0,1\}$ denotes the indicator function of the event $A$, i.e., the function such that $\mathbb{1}_A(\omega) = 1$ if $\omega\in A$, and $\mathbb{1}_A(\omega) = 0$ otherwise. Given a random variable $X:\Omega\to\mathbb{R}$, we denote with $\mathbb{E}(X)$ the expected value of $X$. Given $n$ random variables $X_1,\dots,X_n$, the notation $\sigma(X_1,\dots,X_n)$ stands for the $\sigma$-algebra generated by $X_1,\dots,X_n$.

2. The SIRTR Method

Here, we present the stochastic inexact restoration trust-region (SIRTR) method, first proposed in [11]. SIRTR is a trust-region method with subsampled function and gradient estimates, which combines the first-order trust-region methodology with the inexact restoration method for constrained optimization [25]. In order to provide a detailed description of SIRTR, we reformulate (1) as the constrained problem
$$\min_{x\in\mathbb{R}^n} f_M(x) = \frac{1}{M}\sum_{i\in I_M}\phi_i(x), \quad \text{s.t.}\ M = N, \tag{2}$$
where $I_M\subseteq\{1,\dots,N\}$ is a sample set with cardinality $|I_M| = M$. To measure the infeasibility of $M$ with respect to the constraint $M = N$, we introduce a function $h$ that measures the distance of $M\in\{1,\dots,N\}$ from $N$. The function $h$ is supposed to satisfy the following properties.
Assumption 1.
The function $h:\{1,2,\dots,N\}\to\mathbb{R}$ is monotonically decreasing and satisfies $h(1) > 0$, $h(N) = 0$.
From Assumption 1, it follows that there exist positive constants $\underline{h}$ and $\bar h$ such that
$$\underline{h} \le h(M) \le \bar h, \quad \text{for } 1\le M\le N-1. \tag{3}$$
An example of a function $h:\{1,\dots,N\}\to\mathbb{R}$ satisfying Assumption 1 is $h(M) = \frac{N-M}{N}$.
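For concreteness, the following minimal Python sketch (the paper's software is in MATLAB; this is only an illustration under the choice $h(M) = (N-M)/N$) checks Assumption 1 and the derived bounds (3):

```python
import numpy as np

def h(M: int, N: int) -> float:
    """Infeasibility measure from the text: h(M) = (N - M) / N."""
    return (N - M) / N

N = 1000
values = np.array([h(M, N) for M in range(1, N + 1)])
# Assumption 1: h is monotonically decreasing, h(1) > 0 and h(N) = 0
assert np.all(np.diff(values) < 0) and values[0] > 0 and values[-1] == 0
# Bounds (3) on {1, ..., N-1}: h_low = h(N-1) = 1/N, h_up = h(1) = (N-1)/N
h_low, h_up = h(N - 1, N), h(1, N)
```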
SIRTR is a stochastic variant of the classical first-order trust-region method, which accepts the trial point according to the decrease of a convex combination of the function estimate $f_M$ and the function $h$. We report SIRTR in Algorithm 1.
At the beginning of each iteration $k\ge 0$, we have at our disposal the iterate $x_k$, the trust-region radius $\delta_k$, the sample size $N_k\in\{1,\dots,N\}$, the penalty parameter $\theta_k$, and the flag iflag, where iflag = succ if the previous iteration was successful, in the sense specified below, and iflag = unsucc otherwise. Then, Steps 1–5 perform the following tasks.
  • In Step 1, if iflag = succ, we reduce the current value $h(N_k)$ of the infeasibility measure by finding some $\tilde N_{k+1}\in\{1,\dots,N\}$ satisfying $h(\tilde N_{k+1}) \le r\,h(N_k)$ with $r\in(0,1)$. On the other hand, if iflag = unsucc, $\tilde N_{k+1}$ remains the same from one iteration to the next, i.e., we set $\tilde N_{k+1} = \tilde N_k$. Note that $\tilde N_{k+1} = N$ if $N_k = N$.
  • Step 2 determines a trial sample size $N_{k+1}^t$ that satisfies $h(N_{k+1}^t) \le h(\tilde N_{k+1}) + \mu\,\delta_k^2$ with $\mu > 0$ and is used to form the random model. In principle, we could fix $N_{k+1}^t = \tilde N_{k+1}$, but selecting a smaller sample size, when possible, yields a computational saving in the successive step. The relation between $N_{k+1}^t$ and $\tilde N_{k+1}$ depends on $\delta_k$: small values of $\delta_k$ give $N_{k+1}^t = \tilde N_{k+1}$; otherwise, $N_{k+1}^t$ is allowed to be smaller than $\tilde N_{k+1}$.
  • Step 3 forms the random model $m_k(p)$ and the trial step $p_k$. The linear model is given by $m_k(p) = f_{N_{k+1}^t}(x_k) + g_k^T p$, where
$$f_{N_{k+1}^t}(x_k) = \frac{1}{N_{k+1}^t}\sum_{i\in I_{N_{k+1}^t}}\phi_i(x_k), \tag{4}$$
and
$$g_k = \frac{1}{N_{k+1,g}}\sum_{i\in I_{N_{k+1,g}}}\nabla\phi_i(x_k), \tag{5}$$
with $I_{N_{k+1,g}}\subseteq\{1,\dots,N\}$ of cardinality $|I_{N_{k+1,g}}| = N_{k+1,g}$.
Minimizing $m_k$ over the ball centered at $0$ with radius $\delta_k$ gives the trial step
$$p_k = \operatorname*{argmin}_{\|p\|\le\delta_k} m_k(p) = -\delta_k\,\frac{g_k}{\|g_k\|}.$$
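The closed form of the trial step is the standard minimizer of a linear model over a Euclidean ball; for completeness, the one-line argument (not spelled out in the text) is the Cauchy–Schwarz inequality:
$$g_k^T p \ \ge\ -\|g_k\|\,\|p\| \ \ge\ -\delta_k\,\|g_k\| \quad \text{for all } \|p\|\le\delta_k,$$
with the lower bound attained at $p = -\delta_k\, g_k/\|g_k\|$.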
Algorithm 1 Stochastic Inexact Restoration Trust-Region (SIRTR)
Given $x_0\in\mathbb{R}^n$, $N_0\in\{1,\dots,N\}$, $\eta_1\in(0,1)$, $\theta_0\in(0,1)$, $r\in(0,1)$, $\gamma > 1$, $\mu > 0$, $\eta_2 > 0$, $0 < \delta_0 < \delta_{\max}$.
0. Set $k = 0$, iflag = succ.
1. Reference sample size.
If iflag = succ, find $\tilde N_{k+1}$ such that $N_k \le \tilde N_{k+1} \le N$ and
$$h(\tilde N_{k+1}) \le r\,h(N_k); \tag{6}$$
else, set $\tilde N_{k+1} = \tilde N_k$.
2. Trial sample size.
If $N_k = N$, set $N_{k+1}^t = N$; else, find $N_{k+1}^t$ such that
$$h(N_{k+1}^t) \le h(\tilde N_{k+1}) + \mu\,\delta_k^2. \tag{7}$$
3. Trust-region model.
Choose $I_{N_{k+1}^t}\subseteq\{1,\dots,N\}$ such that $|I_{N_{k+1}^t}| = N_{k+1}^t$.
Choose $N_{k+1,g}$ and $I_{N_{k+1,g}}\subseteq\{1,\dots,N\}$ such that $|I_{N_{k+1,g}}| = N_{k+1,g}$.
Compute $g_k$ as in (5), and set $p_k = -\delta_k\, g_k/\|g_k\|$.
Compute $f_{N_{k+1}^t}(x_k)$ as in (4), and set $m_k(p_k) = f_{N_{k+1}^t}(x_k) + g_k^T p_k$.
4. Penalty parameter.
If $\mathrm{Pred}_k(\theta_k) \ge \eta_1\,(h(N_k) - h(\tilde N_{k+1}))$, set
$$\theta_{k+1} = \theta_k;$$
else, set
$$\theta_{k+1} = \frac{(1-\eta_1)\,(h(N_k) - h(\tilde N_{k+1}))}{m_k(p_k) - f_{N_k}(x_k) + h(N_k) - h(\tilde N_{k+1})}. \tag{8}$$
5. Acceptance test.
If $\mathrm{Ared}_k(x_k + p_k, \theta_{k+1}) \ge \eta_1\,\mathrm{Pred}_k(\theta_{k+1})$ and $\|g_k\| \ge \eta_2\,\delta_k$ (success), define
$$x_{k+1} = x_k + p_k, \qquad \delta_{k+1} = \min\{\gamma\delta_k,\ \delta_{\max}\}, \tag{9}$$
set $N_{k+1} = N_{k+1}^t$, $k = k+1$, iflag = succ, and go to Step 1.
Else (unsuccess), define
$$x_{k+1} = x_k, \qquad \delta_{k+1} = \frac{\delta_k}{\gamma}, \tag{10}$$
set $N_{k+1} = N_k$, $k = k+1$, iflag = unsucc, and go to Step 1.
  • In Step 4, we compute the penalty parameter $\theta_{k+1}\in(0,1)$ that governs the predicted reduction $\mathrm{Pred}_k$ in the function and infeasibility measure, which we define as
$$\mathrm{Pred}_k(\theta) = \theta\,(f_{N_k}(x_k) - m_k(p_k)) + (1-\theta)\,(h(N_k) - h(\tilde N_{k+1})). \tag{11}$$
If $\theta = \theta_k$ satisfies
$$\mathrm{Pred}_k(\theta) \ge \eta_1\,(h(N_k) - h(\tilde N_{k+1})), \tag{12}$$
then we set $\theta_{k+1} = \theta_k$; otherwise, we compute $\theta_{k+1}$ as the largest value for which Inequality (12) is satisfied, which takes the explicit form given in (8).
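To see how (8) arises (a short derivation added here for completeness), impose equality in (12) and solve for $\theta$:
$$\theta\,(f_{N_k}(x_k) - m_k(p_k)) + (1-\theta)\,(h(N_k) - h(\tilde N_{k+1})) = \eta_1\,(h(N_k) - h(\tilde N_{k+1}))$$
$$\Longleftrightarrow\quad \theta\,\big[(h(N_k) - h(\tilde N_{k+1})) - (f_{N_k}(x_k) - m_k(p_k))\big] = (1-\eta_1)\,(h(N_k) - h(\tilde N_{k+1}))$$
$$\Longleftrightarrow\quad \theta = \frac{(1-\eta_1)\,(h(N_k) - h(\tilde N_{k+1}))}{m_k(p_k) - f_{N_k}(x_k) + h(N_k) - h(\tilde N_{k+1})},$$
which is exactly (8); since $\mathrm{Pred}_k(\theta)$ is affine in $\theta$, every smaller positive $\theta$ also satisfies (12) whenever the denominator above is positive.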
  • In Step 5, we establish whether to accept (success) or reject (unsuccess) the trial point $x_k + p_k$. The actual reduction $\mathrm{Ared}_k$ at a point $\hat x$ is defined as
$$\mathrm{Ared}_k(\hat x, \theta) = \theta\,(f_{N_k}(x_k) - f_{N_{k+1}^t}(\hat x)) + (1-\theta)\,(h(N_k) - h(N_{k+1}^t)), \tag{13}$$
and we declare a successful iteration whenever the following conditions are both met:
$$\mathrm{Ared}_k(x_k + p_k, \theta_{k+1}) \ge \eta_1\,\mathrm{Pred}_k(\theta_{k+1}), \tag{14}$$
$$\|g_k\| \ge \eta_2\,\delta_k. \tag{15}$$
Condition (14) reduces to the standard acceptance criterion of deterministic trust-region methods when $N_k = \tilde N_{k+1} = N_{k+1}^t = N$. If both conditions are satisfied, we accept the step $p_k$ by setting $x_{k+1} = x_k + p_k$, increase the trust-region radius according to the update rule (9), and set $N_{k+1} = N_{k+1}^t$, iflag = succ; otherwise, we retain the previous iterate, i.e., $x_{k+1} = x_k$, reduce the trust-region radius according to (10), and set $N_{k+1} = N_k$, iflag = unsucc.
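To make the flow of Steps 1–5 concrete, the following Python sketch implements one possible realization of Algorithm 1. The authors' reference implementation is in MATLAB; the sampling rules below follow $h(M) = (N-M)/N$ and simplified versions of the Section 4 choices, so the function names, defaults, and bookkeeping details are illustrative assumptions rather than the paper's exact code.

```python
import numpy as np

def sirtr(phi, grad_phi, x0, N, *, n_iters=100, N0=10, eta1=0.1, eta2=1e-6,
          theta0=0.9, r=0.95, gamma=2.0, mu=None, delta0=1.0, delta_max=1.0,
          seed=0):
    """Minimal sketch of Algorithm 1 (SIRTR). phi(i, x) and grad_phi(i, x)
    evaluate phi_i and its gradient for sample index i."""
    rng = np.random.default_rng(seed)
    mu = 100.0 / N if mu is None else mu     # choice used in Section 4
    h = lambda M: (N - M) / N                # infeasibility measure
    f_sub = lambda idx, y: np.mean([phi(i, y) for i in idx])
    x, delta, Nk, theta, Nref, success = np.asarray(x0, float), delta0, N0, theta0, N0, True
    for _ in range(n_iters):
        # Step 1: reference size with h(Nref) <= r h(Nk), i.e., Nref >= N - r (N - Nk)
        if success:
            Nref = min(N, max(Nk, int(np.ceil(N - r * (N - Nk)))))
        # Step 2: trial size with h(Nt) <= h(Nref) + mu delta^2, i.e., Nt >= Nref - mu N delta^2
        Nt = N if Nk == N else min(N, max(1, int(np.ceil(Nref - mu * N * delta ** 2))))
        # Step 3: subsampled estimates (4)-(5) and trial step p = -delta g/||g||
        I_t = rng.choice(N, size=Nt, replace=False)
        I_g = rng.choice(I_t, size=max(1, Nt // 10), replace=False)  # I_g inside I_t
        g = np.mean([grad_phi(i, x) for i in I_g], axis=0)
        p = -delta * g / max(np.linalg.norm(g), 1e-16)
        f_k = f_sub(rng.choice(N, size=Nk, replace=False), x)        # estimate of f_{N_k}(x_k)
        m_k = f_sub(I_t, x) + g @ p                                  # linear model at p_k
        dh = h(Nk) - h(Nref)
        # Step 4: keep theta if Pred_k(theta) >= eta1 dh, else use the closed form (8)
        pred = lambda t: t * (f_k - m_k) + (1.0 - t) * dh
        if pred(theta) < eta1 * dh:
            theta = (1.0 - eta1) * dh / (m_k - f_k + dh)  # theory keeps this in (0, theta_k]
        # Step 5: acceptance test (14)-(15)
        ared = theta * (f_k - f_sub(I_t, x + p)) + (1.0 - theta) * (h(Nk) - h(Nt))
        if ared >= eta1 * pred(theta) and np.linalg.norm(g) >= eta2 * delta:
            x, delta, Nk, success = x + p, min(gamma * delta, delta_max), Nt, True
        else:
            delta, success = delta / gamma, False
    return x
```

Here, `phi` and `grad_phi` may be taken, for instance, as the classification loss and its gradient from Section 4.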

3. Convergence Analysis

In this section, we are interested in the convergence properties of Algorithm 1. To this aim, we note that the function estimates $f_{N_{k+1}^t}(x_k)$ in (4) and the gradient estimates $g_k$ in (5) are random quantities. Consequently, Algorithm 1 generates a random process: the iterates $X_k$, the trust-region radii $\Delta_k$, the gradient estimates $G_k$, the function estimates $f_{N_{k+1}^t}(X_k)$, and the values $\Psi_k$ of the Lyapunov function $\Psi$ in (21) at iteration $k$ are to be considered random variables, with their realizations denoted by $x_k$, $\delta_k$, $g_k$, $f_{N_{k+1}^t}(x_k)$, and $\psi_k$.
Our aim is to show the convergence in probability of the iterates generated by Algorithm 1, in the sense that
$$P\left(\lim_{k\to\infty}\|\nabla f_N(X_k)\| = 0\right) = 1, \tag{16}$$
i.e., the event $\lim_{k\to\infty}\|\nabla f_N(X_k)\| = 0$ holds almost surely. We note that the authors in [11] derived a bound on the expected number of iterations required by Algorithm 1 to reach a desired accuracy in the gradient norm, but did not show convergence results of Type (16).

3.1. Preliminary Results

We recall some technical preliminary results that were obtained for Algorithm 1 in [11]. First, we impose some basic assumptions on the functions in Problem (1).
Assumption 2.
(i) Each function $\phi_i:\mathbb{R}^n\to\mathbb{R}$ is continuously differentiable for $i=1,\dots,N$.
(ii) The functions $f_M:\mathbb{R}^n\to\mathbb{R}$, $M=1,\dots,N$, are bounded from below on $\mathbb{R}^n$, i.e., there exists $f_{\mathrm{low}}\in\mathbb{R}$ such that
$$f_M(x) \ge f_{\mathrm{low}}, \quad 1\le M\le N,\ x\in\mathbb{R}^n.$$
(iii) The functions $f_M:\mathbb{R}^n\to\mathbb{R}$, $M=1,\dots,N$, are bounded from above on a subset $\Omega\subseteq\mathbb{R}^n$, i.e., there exists $f_{\mathrm{up}}\in\mathbb{R}$ such that
$$f_M(x) \le f_{\mathrm{up}}, \quad 1\le M\le N,\ x\in\Omega.$$
Furthermore, the iterates $\{x_k\}_{k\in\mathbb{N}}$ generated by Algorithm 1 are contained in $\Omega$.
Combining Step 4 of Algorithm 1 with Bound (3) and Assumption 2(iii), it is possible to prove that, for any realization of the algorithm, the sequence $\{\theta_k\}_{k\in\mathbb{N}}$ is bounded away from zero.
Lemma 1 ([11], Lemma 2). Let Assumptions 1 and 2 hold and consider a particular realization of Algorithm 1. Let $\kappa_\phi > 0$ be defined as follows:
$$\kappa_\phi = \max\{|f_{\mathrm{low}}|,\ |f_{\mathrm{up}}|\}. \tag{17}$$
Then, $\{\theta_k\}_{k\in\mathbb{N}}$ is a positive, non-increasing sequence such that
$$\theta_k \ge \underline{\theta} = \min\left\{\theta_0,\ \frac{(1-\eta_1)(1-r)\,\underline{h}}{2\kappa_\phi + \bar h}\right\}, \quad \forall k\ge 0. \tag{18}$$
Furthermore, Condition (12) holds with $\theta = \theta_{k+1}$.
Since the acceptance test in Algorithm 1 employs function and gradient estimates, we cannot expect the objective function $f_N$ to decrease from one iteration to the next; however, the authors in [11] showed that an appropriate Lyapunov function $\Psi$ is reduced at each iteration. This Lyapunov function is defined as
$$\Psi(x, M, \theta, \delta) = v\,\big[\theta f_M(x) + (1-\theta)\,h(M) + \theta\Sigma\big] + (1-v)\,\delta^2, \tag{19}$$
where $v\in(0,1)$ and $\Sigma\in\mathbb{R}$ are constants that satisfy
$$f_{N_k}(x) - h(N_k) + \Sigma \ge 0, \quad \forall x\in\Omega,\ k\ge 0; \tag{20}$$
such a constant exists thanks to Bound (3) and Assumption 2(ii). For all $k\ge 0$, we denote the values of $\Psi$ along the iterates of Algorithm 1 as follows:
$$\psi_k = \Psi(x_k, N_k, \theta_k, \delta_k), \quad k\ge 0. \tag{21}$$
Thanks to (20) and the non-negativity of $h$ (see Assumption 1), we can easily deduce that the sequence $\{\psi_k\}_{k\in\mathbb{N}}$ is non-negative; indeed,
$$\psi_k \ge v\,\big[\theta_k f_{N_k}(x_k) + (1-\theta_k)\,h(N_k) + \theta_k(-f_{N_k}(x_k) + h(N_k))\big] = v\,h(N_k) \ge 0. \tag{22}$$
Furthermore, the difference between two successive values $\psi_{k+1}$ and $\psi_k$ can easily be rewritten as
$$\psi_{k+1} - \psi_k = v\,\big[\theta_{k+1}(f_{N_{k+1}}(x_{k+1}) - f_{N_k}(x_k)) + (1-\theta_{k+1})(h(N_{k+1}) - h(N_k))\big] + v\,(\theta_{k+1} - \theta_k)\,(f_{N_k}(x_k) - h(N_k) + \Sigma) + (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \tag{23}$$
If $k$ is a successful iteration, then $N_{k+1} = N_{k+1}^t$. Recalling (20) and the fact that the sequence $\{\theta_k\}_{k\in\mathbb{N}}$ is monotone non-increasing (see Lemma 1), Equality (23) yields
$$\psi_{k+1} - \psi_k \le -v\,\mathrm{Ared}_k(x_{k+1}, \theta_{k+1}) + (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \tag{24}$$
Otherwise, Algorithm 1 sets $x_{k+1} = x_k$ and $N_{k+1} = N_k$. Inserting these updates into (23), together with (20) and the fact that $\{\theta_k\}_{k\in\mathbb{N}}$ is non-increasing, we obtain
$$\psi_{k+1} - \psi_k \le (1-v)\,(\delta_{k+1}^2 - \delta_k^2). \tag{25}$$
Using (24) and (25) in combination with Step 5 of Algorithm 1, we can prove the following result.
Theorem 1 ([11], Theorem 1). Let Assumptions 1–2 hold and consider a particular realization of Algorithm 1. In (19), choose $v\in(\bar v, 1)$, where $\bar v$ is defined by
$$\bar v = \max\left\{\frac{(\gamma^2-1)\,\delta_{\max}^2}{\eta_1^2(1-r)\,\underline{h} + (\gamma^2-1)\,\delta_{\max}^2},\ \frac{\gamma^2-1}{\eta_1\eta_2\underline{\theta} + \gamma^2-1}\right\}. \tag{26}$$
Then, there exists a constant $\sigma = \sigma(v) > 0$ such that
$$\psi_{k+1} - \psi_k \le -\sigma\,\delta_k^2, \quad \text{for all } k\ge 0; \tag{27}$$
hence, the sequence $\{\delta_k\}$ generated by Algorithm 1 satisfies
$$\lim_{k\to\infty}\delta_k = 0. \tag{28}$$
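The limit (28) follows from (27) by a telescoping argument, which we spell out for clarity: summing (27) for $k = 0,\dots,K$ and using $\psi_{K+1} \ge 0$ (see (22)) gives
$$\sigma\sum_{k=0}^{K}\delta_k^2 \ \le\ \sum_{k=0}^{K}(\psi_k - \psi_{k+1}) \ =\ \psi_0 - \psi_{K+1} \ \le\ \psi_0,$$
so that $\sum_{k=0}^{\infty}\delta_k^2 < \infty$ and, in particular, $\delta_k \to 0$.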
We now introduce a Lipschitz continuity assumption on the gradients of the functions $\phi_i$ appearing in (1).
Assumption 3.
Each gradient $\nabla\phi_i$ is $L_i$-Lipschitz continuous for $i=1,\dots,N$. We use the notation $L = \frac{1}{2}\max_{1\le i\le N} L_i$.
The gradient estimates are bounded under Assumptions 2 and 3, as stated in the following lemma.
Lemma 2 ([11], Lemma 5). Let Assumptions 2 and 3 hold. Then, there exists $g_{\max} > 0$ such that, for any realization of Algorithm 1,
$$\|g_k\| \le g_{\max}, \quad \forall k\ge 0, \tag{29}$$
where $g_{\max} = \sqrt{8L\kappa_\phi}$ and $\kappa_\phi$ is given in (17).
Let us introduce the following events:
$$G_{k,1} = \{\|\nabla f_N(X_k) - G_k\| \le \nu\,\Delta_k\}, \qquad G_{k,2} = \{|f_N(X_k) - f_{N_{k+1}^t}(X_k)| \le \nu\,\Delta_k\}, \tag{30}$$
where $\nu$ is a positive parameter. Using a terminology similar to that employed in [22], iteration $k$ is said to be true if the events $G_{k,1}$ and $G_{k,2}$ both occur.
The next lemma shows that iteration $k$ is successful whenever it is true and the trust-region radius $\delta_k$ is sufficiently small. This result is crucial for the analysis in the next section.
Lemma 3 ([11], Lemma 6). Let Assumptions 1–3 hold and set
$$\eta_3 = \frac{\delta_{\max}\,g_{\max}\,\big(\theta_0(2\nu + L) + (1-\underline{\theta})\,\mu\big)}{\eta_1(1-\eta_1)(1-r)\,\underline{h}}.$$
Suppose that, for a particular realization of Algorithm 1, iteration $k$ is true and the following condition holds:
$$\delta_k < \min\left\{\frac{\|g_k\|}{\eta_2},\ \frac{\|g_k\|}{\eta_3},\ \frac{(1-\eta_1)\|g_k\|}{2\nu + L}\right\}. \tag{31}$$
Then, iteration $k$ is successful.

3.2. Novel Convergence Results

Here, we derive two novel convergence results in probability for Algorithm 1. The results are provided under the assumption that the random variables $G_k$ and $f_{N_{k+1}^t}(X_k)$ are sufficiently accurate estimators of the true gradient and function values at $X_k$, in the probabilistic sense specified below.
Assumption 4.
Let $\mathcal{F}_{k-1} = \sigma(G_0,\dots,G_{k-1},\, f_{N_1^t}(X_0),\dots,f_{N_k^t}(X_{k-1}))$. Then, the events $G_{k,1}$, $G_{k,2}$ hold with sufficiently high probability conditioned on $\mathcal{F}_{k-1}$, and the estimators $G_k$ and $f_{N_{k+1}^t}(X_k)$ are conditionally independent random variables given $\mathcal{F}_{k-1}$; i.e.,
$$P(G_{k,1}\,|\,\mathcal{F}_{k-1}) = \pi_1, \qquad P(G_{k,2}\,|\,\mathcal{F}_{k-1}) = \pi_2, \qquad \text{with } \pi_3 = \pi_1\pi_2 > \frac{1}{2}.$$
First, we provide a liminf-type convergence result for SIRTR, which shows that the gradient of the objective function converges to zero in probability along a subsequence of the iterates.
Theorem 2.
Suppose that Assumptions 1–4 hold. Then, it holds that
$$P\left(\liminf_{k\to\infty}\|\nabla f_N(X_k)\| = 0\right) = 1.$$
Proof.
The proof parallels that of ([16], Theorem 4.16). By contradiction, assume that there exists $\epsilon > 0$ such that the event
$$\|\nabla f_N(X_k)\| \ge \epsilon, \quad \forall k\ge 0, \tag{33}$$
holds with positive probability. Then, let $\{x_k\}_{k\in\mathbb{N}}$ be a realization of $\{X_k\}_{k\in\mathbb{N}}$ such that $\|\nabla f_N(x_k)\| \ge \epsilon$ for all $k\ge 0$, and let $\{\delta_k\}_{k\in\mathbb{N}}$ be the corresponding realization of $\{\Delta_k\}_{k\in\mathbb{N}}$. From Theorem 1, we know that $\lim_{k\to\infty}\delta_k = 0$; therefore, there exists $\bar k$ such that
$$\delta_k < b = \min\left\{\frac{\epsilon}{2\nu},\ \frac{\epsilon}{2\eta_2},\ \frac{\epsilon}{2\eta_3},\ \frac{\epsilon(1-\eta_1)}{2(2\nu+L)},\ \frac{\delta_{\max}}{\gamma}\right\}, \quad \forall k\ge\bar k. \tag{34}$$
Consider the random variable $R_k$ with realizations given by
$$r_k = \log_\gamma\left(\frac{\delta_k}{b}\right), \quad k\ge 0.$$
Note that $r_k$ satisfies the following properties.
(i) If $k\ge\bar k$, then $r_k < 0$; this is a consequence of (34).
(ii) If $k$ is a true iteration and $k\ge\bar k$, then $r_{k+1} = r_k + 1$; indeed, since $G_{k,1}$ occurs and $\delta_k < \epsilon/(2\nu)$, it follows that
$$\|g_k - \nabla f_N(x_k)\| \le \nu\,\delta_k < \frac{\epsilon}{2}. \tag{36}$$
Then, $\|\nabla f_N(x_k)\| \ge \epsilon$ yields
$$\|g_k\| \ge \frac{\epsilon}{2}, \tag{37}$$
which, combined with (34), implies that $\delta_k$ satisfies Inequality (31). Thus, Lemma 3 implies that iteration $k$ is successful. Since $\delta_k < \delta_{\max}/\gamma$ and the $k$-th iteration is successful, it follows from (9) that $\delta_{k+1} = \gamma\delta_k$. Hence, $r_{k+1} = r_k + 1$.
(iii) If $k$ is not a true iteration and $k\ge\bar k$, then $r_{k+1} \ge r_k - 1$; indeed, since we cannot apply Lemma 3 ($k$ is not true), all we can say about the trust-region radius is that $\delta_{k+1} \ge \delta_k/\gamma$.
Then, defining the $\sigma$-algebra $\mathcal{F}_{k-1}^G = \sigma(\mathbb{1}_{G_{0,1}}\cdot\mathbb{1}_{G_{0,2}},\dots,\mathbb{1}_{G_{k-1,1}}\cdot\mathbb{1}_{G_{k-1,2}})$, which is contained in $\mathcal{F}_{k-1}$, it follows from properties (ii)–(iii) and Assumption 4 that
$$E(R_{k+1}\,|\,\mathcal{F}_{k-1}^G) \ge \pi_1\pi_2\,(R_k + 1) + (1 - \pi_1\pi_2)\,(R_k - 1) \ge R_k,$$
where the second inequality follows from $\pi_1\pi_2 > \frac{1}{2}$. Hence, $\{R_k\}_{k\in\mathbb{N}}$ is a submartingale. We also define the random variable
$$W_k = \sum_{i=0}^{k}\left(2\cdot\mathbb{1}_{G_{i,1}}\cdot\mathbb{1}_{G_{i,2}} - 1\right), \quad k\ge 0.$$
$\{W_k\}_{k\in\mathbb{N}}$ is also a submartingale, as
$$E(W_{k+1}\,|\,\mathcal{F}_{k-1}^G) = E(W_k\,|\,\mathcal{F}_{k-1}^G) + 2\,E(\mathbb{1}_{G_{k+1,1}}\cdot\mathbb{1}_{G_{k+1,2}}\,|\,\mathcal{F}_{k-1}^G) - 1 = W_k + 2\,P(G_{k+1,1}\cap G_{k+1,2}\,|\,\mathcal{F}_{k-1}^G) - 1 \ge W_k,$$
where, again, the last inequality is due to the fact that $\pi_1\pi_2 > \frac{1}{2}$. Since $W_k$ cannot have a finite limit, it follows from ([16], Theorem 4.4) that the event $\limsup_{k\to\infty} W_k = \infty$ holds almost surely. Since $r_k - r_{k_0} \ge w_k - w_{k_0}$ by the definitions of $\{R_k\}_{k\in\mathbb{N}}$ and $\{W_k\}_{k\in\mathbb{N}}$, it follows that $R_k$ has to be positive infinitely often with a probability of one. However, this contradicts property (i) listed above, which allows us to conclude that (33) cannot occur.    □
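To build intuition for the final step of the proof (an illustration only, not part of the argument), note that $W_k$ is a random walk whose increments equal $+1$ with probability $\pi_1\pi_2 > 1/2$; such a walk has positive drift and therefore exceeds any fixed level infinitely often. A quick Monte Carlo check in Python:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.6                      # plays the role of pi_1 * pi_2 > 1/2
n_steps, n_paths = 10_000, 5

for _ in range(n_paths):
    steps = np.where(rng.random(n_steps) < p, 1, -1)  # increments 2*1_{k true} - 1
    walk = np.cumsum(steps)                           # realization of W_k
    # positive drift: W_k / k -> 2p - 1 > 0, so limsup W_k = +infinity a.s.
    print(f"final value: {walk[-1]:6d}, slope: {walk[-1] / n_steps:.3f}")
```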
In the following, we show that SIRTR generates iterates such that the corresponding gradients of the objective function converge to zero with a probability of one. The next lemma is similar to ([6], Lemma 4.2); however, some crucial modifications are needed here: unlike in [6], we must take into account the fact that SIRTR enforces the decrease of the Lyapunov function $\Psi$ defined in (19), rather than of the objective function.
Lemma 4.
Suppose that Assumptions 1–4 hold. Let $\{X_k\}$ and $\{\Delta_k\}$ be the random sequences generated by Algorithm 1. For a fixed $\epsilon > 0$, define the following subset of iteration indices:
$$\{K_i\} = \{k\ge 0:\ \|\nabla f_N(X_k)\| > \epsilon\}.$$
Then,
$$\sum_{k\in\{K_i\}}\Delta_k < \infty$$
holds almost surely.
Proof. 
Let us consider generic realizations $\{x_k\}_{k\in\mathbb{N}}$, $\{g_k\}_{k\in\mathbb{N}}$, $\{\delta_k\}_{k\in\mathbb{N}}$, $\{\theta_k\}_{k\in\mathbb{N}}$, and $\{k_i\}_{i\in\mathbb{N}}$ of Algorithm 1. Furthermore, we let $\{p_i\}$ be the subsequence of $\{k_i\}$ at which the iteration is true, whereas $\{n_i\}$ denotes the complementary subsequence, so that $\{k_i\} = \{p_i\}\cup\{n_i\}$. First, we show that $\sum_{k\in\{p_i\}}\delta_k < \infty$. If $\{p_i\}$ is finite, then there is nothing to prove. Otherwise, since $\lim_{k\to\infty}\delta_k = 0$, there exists $\tilde k$ such that $\delta_k < b$ for all $k\ge\tilde k$, where $b$ is given in (34). Let us consider any $p_i\ge\tilde k$. Since $G_{p_i,1}$ occurs, $\delta_{p_i} < \epsilon/(2\nu)$, and $\|\nabla f_N(x_{p_i})\| > \epsilon$, we can reason as in (36) and (37) to conclude that $\|g_{p_i}\| \ge \epsilon/2$. Combining this lower bound with $\delta_{p_i} < b$, we have that Inequality (31) is satisfied with $k = p_i$. Hence, iteration $p_i$ is successful by Lemma 3, and we have
$$\mathrm{Ared}_{p_i}(x_{p_i+1}, \theta_{p_i+1}) \ge \eta_1\,\mathrm{Pred}_{p_i}(\theta_{p_i+1}) \ge \eta_1^2\,(h(N_{p_i}) - h(\tilde N_{p_i+1})) \ge \eta_1^2(1-r)\,h(N_{p_i}) \ge \frac{\eta_1^2(1-r)\,\underline{h}}{g_{\max}\,\delta_{\max}}\,\delta_{p_i}\|g_{p_i}\|, \tag{39}$$
where the first inequality is the acceptance test (14), the second follows from Step 4 of the SIRTR algorithm, the third follows from (6), and the last follows from (3) and Lemma 2. Now, starting from Inequality (24) (which holds for successful iterations), we can derive the following chain of inequalities:
$$\psi_{p_i} - \psi_{p_i+1} \ge v\,\mathrm{Ared}_{p_i}(x_{p_i+1},\theta_{p_i+1}) - (1-v)\,(\delta_{p_i+1}^2 - \delta_{p_i}^2) \ge v\,\frac{\eta_1^2(1-r)\,\underline{h}}{g_{\max}\,\delta_{\max}}\,\delta_{p_i}\|g_{p_i}\| - (1-v)\,\frac{\gamma^2-1}{\eta_2}\,\delta_{p_i}\|g_{p_i}\| = \underbrace{\left[v\left(\frac{\eta_1^2(1-r)\,\underline{h}}{g_{\max}\,\delta_{\max}} + \frac{\gamma^2-1}{\eta_2}\right) - \frac{\gamma^2-1}{\eta_2}\right]}_{=:\,c}\,\delta_{p_i}\|g_{p_i}\|, \tag{40}$$
where the second inequality follows from (39), (9), and (15). Now, recalling the definition of $\bar v$ given in (26), we choose $v$ in (19) such that
$$\max\left\{\bar v,\ \frac{(\gamma^2-1)\,g_{\max}\,\delta_{\max}}{\eta_1^2\,\eta_2\,(1-r)\,\underline{h} + (\gamma^2-1)\,g_{\max}\,\delta_{\max}}\right\} < v < 1;$$
consequently, $c$ in (40) is positive, while Theorem 1 remains applicable. Then, plugging $\|g_{p_i}\| \ge \frac{\epsilon}{2}$ into (40) yields
$$\psi_{p_i} - \psi_{p_i+1} \ge \frac{\epsilon\,c}{2}\,\delta_{p_i}.$$
Summing the previous inequality over $k\in\{p_i\}$, $k\ge\tilde k$, and noting that $\psi_k - \psi_{k+1} > 0$ for any $k$ (due to (27)), we obtain
$$\sum_{k\in\{p_i\},\,k\ge\tilde k}\delta_k \le \frac{2}{\epsilon c}\sum_{k\in\{p_i\},\,k\ge\tilde k}(\psi_k - \psi_{k+1}) \le \lim_{K\to\infty}\frac{2}{\epsilon c}\sum_{k=\tilde k}^{K}(\psi_k - \psi_{k+1}) = \lim_{K\to\infty}\frac{2}{\epsilon c}\,(\psi_{\tilde k} - \psi_{K+1}) \le \frac{2}{\epsilon c}\,\psi_{\tilde k},$$
where the last inequality follows from (22). Then, we have shown that
$$\sum_{k\in\{p_i\}}\delta_k < \infty.$$
Furthermore, let us introduce the Bernoulli variable $B_k = 2\cdot\mathbb{1}_{G_{k,1}}\cdot\mathbb{1}_{G_{k,2}} - 1$, which takes the value $1$ when iteration $k$ is true and the value $-1$ otherwise. Note that, due to Assumption 4,
$$P(B_k = 1\,|\,\mathcal{F}_{k-1}) > \frac{1}{2}.$$
Moreover, $\{\Delta_k\}$ is a sequence of non-negative, uniformly bounded random variables. Then, we can proceed as in the proof of ([6], Lemma 4.2) and, using ([6], Lemma 4.1), obtain
$$P\left(\sum_{k\in\{P_i\}}\Delta_k < \infty,\ \sum_{k\in\{N_i\}}\Delta_k = \infty\right) = 0.$$
This implies that, almost surely,
$$\sum_{k\in\{N_i\}}\Delta_k < \infty;$$
hence, the thesis follows.    □
As a byproduct of the previous lemma, we obtain the desired convergence result in probability in the exact same way as in [6].
Theorem 3.
Suppose that Assumptions 1–4 hold. Let $\{X_k\}$ be the sequence of random iterates generated by Algorithm 1. Then, it holds that
$$P\left(\lim_{k\to\infty}\|\nabla f_N(X_k)\| = 0\right) = 1.$$
Proof.
The proof follows exactly as in ([6], Theorem 4.3).    □

4. Numerical Illustration

In this section, we evaluate the numerical performance of Algorithm 1 equipped with the probabilistic accuracy requirements imposed in Assumption 4. Algorithm 1 was implemented using MATLAB R2019a, and the numerical experiments were performed on an 8 GB RAM laptop with an Intel Core i7-4510U CPU 2.00-2.60 GHz processor. The related software can be downloaded from sites.google.com/view/optml-italy-serbia/home/software (accessed on 1 September 2022).
We perform our numerical experiments on a binary classification problem. Denoting with $\{(a_i, b_i)\}_{i=1}^{N}$ the training set, where $a_i\in\mathbb{R}^n$ is the $i$-th feature vector and $b_i\in\{0,1\}$ is the associated label, we address the following nonconvex optimization problem:
$$\min_{x\in\mathbb{R}^n} f_N(x) = \frac{1}{N}\sum_{i=1}^{N}\left(b_i - \frac{1}{1+e^{-a_i^T x}}\right)^2. \tag{42}$$
Note that (42) can be framed as Problem (1) by setting $\phi_i(x) = \big(b_i - 1/(1+e^{-a_i^T x})\big)^2$, $i=1,\dots,N$, namely the composition of the least-squares loss with the sigmoid function. Furthermore, it is easy to see that the objective function $f_N$ satisfies Assumption 2, since each $\phi_i$ is continuously differentiable and $f_N$ is bounded from below and above.
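The gradient of $\phi_i$ used by the algorithm follows from the chain rule: with $s = 1/(1+e^{-a_i^T x})$, one has $\nabla\phi_i(x) = 2(s - b_i)\,s\,(1-s)\,a_i$. A minimal Python transcription of the loss, its gradient, and the classification-error metric used below (illustrative; the authors' experiments use MATLAB):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def phi(a, b, x):
    """Loss of one sample, as in (42): (b - sigmoid(a^T x))^2."""
    return (b - sigmoid(a @ x)) ** 2

def grad_phi(a, b, x):
    """Gradient of phi: 2 (s - b) s (1 - s) a, with s = sigmoid(a^T x)."""
    s = sigmoid(a @ x)
    return 2.0 * (s - b) * s * (1.0 - s) * a

def classification_error(A_test, b_test, x):
    """Misclassification rate with predicted labels max{sign(a^T x), 0}."""
    b_pred = np.maximum(np.sign(A_test @ x), 0.0)
    return float(np.mean(np.abs(b_test - b_pred)))
```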
In Table 1, we report the four data sets used in our experiments. For each data set, we specify the number $N$ of feature vectors, the number $n$ of components of each feature vector, and the size $N_T$ of the testing set $I_{N_T}$.
We implement two different versions of Algorithm 1, which differ from one another in the way the two sample sizes $N_{k+1}^t$ and $N_{k+1,g}$ for the estimators in (4) and (5) are selected.
  • SIRTR$_{\mathrm{nop}}$: this is Algorithm 1 implemented as in [11]. In particular, the infeasibility measure $h$ and the initial penalty parameter $\theta_0$ are chosen as follows:
$$h(M) = \frac{N-M}{N}, \qquad \theta_0 = 0.9.$$
In Step 1, the reference sample size $\tilde N_{k+1}$ is computed as follows:
$$\tilde N_{k+1} = \min\{N,\ \lceil\tilde c\,N_k\rceil\}, \tag{43}$$
where $\tilde c = 1.05$. It is easy to see that Rule (43) complies with Condition (6) by setting $r = (N - (\tilde c - 1))/N$. In Step 2, the trial sample size $N_{k+1}^t$ is chosen in compliance with Condition (7) as
$$N_{k+1}^t = \begin{cases}\lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil\in[N_0,\ 0.95N],\\ \tilde N_{k+1}, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil < N_0,\\ N, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil > 0.95N.\end{cases} \tag{44}$$
In Step 3, the sample size $N_{k+1,g}$ is fixed as
$$N_{k+1,g} = \lceil c\,N_{k+1}^t\rceil, \tag{45}$$
where $c = 0.1$. Furthermore, the set $I_{N_{k+1}^t}$ for computing $f_{N_{k+1}^t}(x_k)$ and $f_{N_{k+1}^t}(x_k + p_k)$ is sampled uniformly at random using the MATLAB command randsample, whereas $g_k\in\mathbb{R}^n$ is a sample average approximation as in (5) with $I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}$. The other parameters are set as $x_0 = (0,0,\dots,0)^T$, $\delta_0 = 1$, $\delta_{\max} = 1$, $\gamma = 2$, $\eta_1 = 10^{-1}$, $\eta_2 = 10^{-6}$, $\mu = 100/N$. Note that Choices (44)–(45) for the sample sizes $N_{k+1}^t$, $N_{k+1,g}$ are not sufficient to guarantee that Assumption 4 holds, so Theorems 2 and 3 do not apply to this version of Algorithm 1.
  • SIRTR$_{\mathrm p}$: this implementation of Algorithm 1 differs from the previous one only in the choice of the sample sizes $N_{k+1}^t$, $N_{k+1,g}$. In this case, we force these two parameters to comply with Assumption 4. According to ([29], Theorem 7.2, Table 7.1), a subsampled estimator $f_S(x_k) = \frac{1}{S}\sum_{i\in I_S}\phi_i(x_k)$ with sample size $|I_S| = S$ satisfies the probabilistic requirement
$$P\left(|f_S(X_k) - f_N(X_k)| \le \nu\,\delta_k \,\middle|\, \mathcal{F}_{k-1}\right) \ge \pi, \tag{46}$$
where $\pi\in(0,1)$, if the sample size $S$ complies with the following lower bound:
$$S \ge N_{k+1}^{\chi,\nu,\pi} = \min\left\{N,\ \left\lceil\frac{4\chi}{\nu\delta_k}\left(\frac{2\chi}{\nu\delta_k} + \frac{1}{3}\right)\log\left(\frac{n+1}{1-\pi}\right)\right\rceil\right\}, \tag{47}$$
where $\chi = \frac{1}{5}\max_{i=1,\dots,N}\|a_i\|$. Based on the previous remark, we choose the sample sizes of SIRTR$_{\mathrm p}$ as follows:
$$N_{k+1}^t = \begin{cases}\max\{N_{k+1}^{\chi,\nu,\pi},\ \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil\}, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil\in[N_0,\ 0.95N],\\ \max\{N_{k+1}^{\chi,\nu,\pi},\ \tilde N_{k+1}\}, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil < N_0,\\ N, & \text{if } \lceil\tilde N_{k+1} - \mu N\delta_k^2\rceil > 0.95N,\end{cases} \tag{48}$$
$$N_{k+1,g} = \max\left\{N_{k+1}^{\chi,\nu,\pi},\ \lceil c\,N_{k+1}^t\rceil\right\}. \tag{49}$$
Setting $\pi > 1/2$, choosing $N_{k+1}^t$, $N_{k+1,g}$ as in (48) and (49), and sampling $I_{N_{k+1}^t}$ and $I_{N_{k+1,g}}$ uniformly and independently at random from $\{1,\dots,N\}$, we guarantee that Assumption 4 holds with $\pi_1 = \pi_2 = \pi$, thus ensuring the convergence in probability of SIRTR$_{\mathrm p}$ according to Theorems 2 and 3. For our tests, we set $\pi = 3/4$ and $\nu = 10\,|f_N(x_0) - f_{N_0}(x_0)|$.
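A small Python helper evaluating the lower bound (47) may clarify how the probabilistic requirement drives the sample size (a minimal sketch; the values of `chi` and `nu` below are illustrative assumptions, not the paper's tuned settings):

```python
import numpy as np

def sample_size_bound(N, n, chi, nu, delta_k, pi=0.75):
    """Lower bound N_{k+1}^{chi,nu,pi} of (47) for the accuracy requirement
    |f_S - f_N| <= nu * delta_k to hold with probability at least pi."""
    t = nu * delta_k                      # target accuracy level
    bound = (4.0 * chi / t) * (2.0 * chi / t + 1.0 / 3.0) * np.log((n + 1) / (1.0 - pi))
    return int(min(N, np.ceil(bound)))

# illustrative values: the required size grows as delta_k shrinks, until it hits N
N, n, chi, nu = 60_000, 784, 2.0, 0.5
for delta in (1.0, 0.5, 0.1, 0.05):
    print(delta, sample_size_bound(N, n, chi, nu, delta))
```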
For each data set, we perform 10 runs of both SIRTR$_{\mathrm{nop}}$ and SIRTR$_{\mathrm p}$ and assess their performance by measuring the following metrics:
  • training loss, given as $f_N(x_k)$ with $f_N$ defined in (42);
  • testing loss, defined as
$$f_{N_T}(x_k) = \frac{1}{N_T}\sum_{i\in I_{N_T}}\phi_i(x_k),$$
where $I_{N_T}$ denotes the testing set and $N_T$ its size;
  • classification error, defined as
$$e_k = \frac{1}{N_T}\sum_{i\in I_{N_T}}\left|b_i - b_i^{\mathrm{pred}}(x_k)\right|,$$
where $b_i$ denotes the true label of the $i$-th feature vector of the testing set and $b_i^{\mathrm{pred}}(x_k) = \max\{\mathrm{sign}(a_i^T x_k),\ 0\}$ is the corresponding predicted label at iteration $k$.
We note that (42) can be seen as the optimization problem arising from training a neural network with no hidden layers and the sigmoid function as the activation function of the output layer. Then, as in [30,31], we measure the computational cost of evaluating the objective function and its gradient in terms of forward and backward propagations. Namely, we count the number of full function and gradient evaluations by considering the computation of a single function $\phi_i$ equivalent to $\frac{1}{N}$ forward propagations and the evaluation of a single gradient $\nabla\phi_i$ equivalent to $\frac{2}{N}$ propagations. Regarding SIRTR$_{\mathrm{nop}}$, the computational cost per iteration is determined by $\frac{N_{k+1}^t + N_{k+1,g}}{N}$ propagations, since $I_{N_{k+1,g}}\subseteq I_{N_{k+1}^t}$. On the contrary, the computational cost of SIRTR$_{\mathrm p}$ is determined by $\frac{N_{k+1}^t}{N} + \frac{2N_{k+1,g}}{N}$ propagations, as $I_{N_{k+1}^t}$ and $I_{N_{k+1,g}}$ are sampled independently of one another. For both algorithms, the computational cost per iteration increases as the iterations proceed; indeed, since $\delta_k$ tends to zero as $k$ tends to infinity (Theorem 1), Rules (44) and (48) will eventually select the trial sample size $N_{k+1}^t$ equal to the reference sample size $\tilde N_{k+1}$, which increases geometrically. We expect the computational cost to increase faster in SIRTR$_{\mathrm p}$, as this algorithm also requires the gradient sample size $N_{k+1,g}$ to increase due to Conditions (47)–(49). Finally, we note that the computational cost per iteration of both algorithms is higher than that of the standard stochastic gradient algorithm, which is usually $\frac{2N_g}{N}$, with $N_g$ a prefixed gradient sample size. However, the increasing sample sizes result in more accurate function and gradient approximations, so the higher computational cost likely implies a larger reduction in the training loss $f_N$ per iteration, as seen in previous comparisons of SIRTR with a non-adaptive stochastic approach in [11].
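In code form, the per-iteration cost accounting just described reads as follows (a direct transcription of the counts above, in units of full passes over the data):

```python
def cost_nop(N_t, N_g, N):
    """Propagations per iteration of SIRTR_nop (I_g is a subset of I_t)."""
    return (N_t + N_g) / N

def cost_p(N_t, N_g, N):
    """Propagations per iteration of SIRTR_p (I_t, I_g sampled independently)."""
    return N_t / N + 2 * N_g / N

def cost_sg(N_g, N):
    """Propagations per iteration of the standard stochastic gradient method."""
    return 2 * N_g / N
```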
In Figure 1, we show the decrease in the training loss, testing loss, and classification error (all averaged over the 10 runs) versus the average computational cost for the first 20 propagations. For most data sets, we observe that SIRTR$_{\mathrm{nop}}$ performs comparably to or even better than SIRTR$_{\mathrm p}$. However, on one of the four data sets (mnist), the accuracy of SIRTR$_{\mathrm{nop}}$ deteriorates after the first propagations, whereas SIRTR$_{\mathrm p}$ provides a more accurate classification and a fairly steady decrease in the average training loss, testing loss, and classification error. This different behavior of the two algorithms can be explained by looking at Figure 2, which shows the increase in the percentage ratio $100\,\frac{N_{k+1}}{N}$ of the sample size $N_{k+1}$ over the data set size $N$ (averaged over the 10 runs) for both algorithms. As we can see, the sample size in SIRTR$_{\mathrm p}$ rapidly increases to $60\%$ of the data set size within the first 50 iterations, whereas the same percentage is reached by SIRTR$_{\mathrm{nop}}$ only after 150–200 iterations. Overall, we can conclude that the probabilistic requirements of Assumption 4 provide theoretical support for convergence in probability but might be excessively demanding in practice. In fact, the numerical examples show that a slower increase in the sample size than that imposed by the probabilistic requirements of Assumption 4 provides a good trade-off between the computational cost and the accuracy of the classification.
In Figure 3, we test the sensitivity of SIRTR$_{\mathrm p}$ with respect to the initial penalty parameter $\theta_0$ by reporting the average training loss versus the average computational cost obtained with three different values of $\theta_0$. We observe that the performance of the algorithm is not considerably affected by the choice of this parameter, although large oscillations in the average training loss occur on mnist for the smallest value $\theta_0 = 0.1$. As a general comment, small initial values of $\theta_0$ may not be advisable, as the sequence $\{\theta_k\}$ is non-increasing, and small values of $\theta_k$ promote a decrease in the infeasibility measure $h$ rather than in the training loss (see the definition of the actual reduction in (13)). Similar considerations can be made for SIRTR$_{\mathrm{nop}}$ in Figure 4.

5. Conclusions

In this paper, we have proved the convergence in probability of a stochastic trust-region method based on the inexact restoration approach (SIRTR) under the assumption that the function and gradient estimates are sufficiently accurate with sufficiently high probability. This result is novel for SIRTR and agrees with other results obtained in the existing literature [16,18,19]. The numerical experiments on binary classification show that the probabilistic requirements improve the numerical stability of the algorithm, ensuring satisfactory accuracy for all data sets. Future work could address the replacement of the probabilistic requirements considered here with alternative strategies for ensuring convergence in probability, such as variance reduction techniques, or the development of a second-order version of SIRTR for improving accuracy based on approximations of the Hessian obtained through subsampling.

Author Contributions

Conceptualization, S.B., B.M. and S.R.; methodology, S.B., B.M. and S.R.; software, S.B., B.M. and S.R.; validation, S.B., B.M. and S.R.; writing—original draft preparation, S.B., B.M. and S.R.; writing—review and editing, S.B., B.M. and S.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been partially supported by the INdAM GNCS project “Ottimizzazione adattiva per il machine learning” (CUP_E55F22000270001) and the Mobility Project “Second order methods for optimization problems in Machine Learning” (ID: RS19MO05) within the frame of the executive Program of Cooperation in the Field of Science and Technology between the Italian Republic and the Republic of Serbia 2019–2022. The third author also acknowledges the financial support received from the IEA CNRS project entitled “VaMOS”.

Data Availability Statement

The data sets employed in this paper were accessed on 1 September 2022 at the following repositories: http://www.csie.ntu.edu.tw/~cjlin/libsvm, http://yann.lecun.com/exdb/mnist, and https://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  2. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  3. Bottou, L.; Curtis, F.E.; Nocedal, J. Optimization Methods for Large-Scale Machine Learning. SIAM Rev. 2018, 60, 223–311.
  4. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015.
  5. Schmidt, M.; Le Roux, N.; Bach, F. Minimizing Finite Sums with the Stochastic Average Gradient. Math. Program. 2017, 162, 83–112.
  6. Bandeira, A.S.; Scheinberg, K.; Vicente, L.N. Convergence of trust-region methods based on probabilistic models. SIAM J. Optim. 2014, 24, 1238–1264.
  7. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Adaptive regularization algorithm for nonconvex optimization using inexact function evaluations and randomly perturbed derivatives. J. Complex. 2022, 68, 101591.
  8. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Trust-region algorithms: Probabilistic complexity and intrinsic noise with applications to subsampling techniques. EURO J. Comput. Optim. 2022, 10, 100043.
  9. Bellavia, S.; Krejić, N.; Krklec Jerinkić, N. Subsampled Inexact Newton methods for minimizing large sums of convex functions. IMA J. Numer. Anal. 2020, 40, 2309–2341.
  10. Bellavia, S.; Krejić, N.; Morini, B. Inexact restoration with subsampled trust-region methods for finite-sum minimization. Comput. Optim. Appl. 2020, 73, 701–736.
  11. Bellavia, S.; Krejić, N.; Morini, B.; Rebegoldi, S. A stochastic first-order trust-region method with inexact restoration for finite-sum minimization. Comput. Optim. Appl. 2022.
  12. Bergou, E.; Gratton, S.; Vicente, L.N. Levenberg–Marquardt Methods Based on Probabilistic Gradient Models and Inexact Subproblem Solution, with Application to Data Assimilation. SIAM-ASA J. Uncertain. 2016, 4, 924–951.
  13. Bergou, E.H.; Diouane, Y.; Kunc, V.; Kungurtsev, V.; Royer, C.W. A subsampling line-search method with second-order results. INFORMS J. Optim. 2022, 4, 403–425.
  14. Bollapragada, R.; Byrd, R.; Nocedal, J. Adaptive sampling strategies for stochastic optimization. SIAM J. Optim. 2018, 28, 3312–3343.
  15. Bollapragada, R.; Byrd, R.; Nocedal, J. Exact and Inexact Subsampled Newton Methods for Optimization. IMA J. Numer. Anal. 2019, 39, 545–578.
  16. Chen, R.; Menickelly, M.; Scheinberg, K. Stochastic optimization using a trust-region method and random models. Math. Program. 2018, 169, 447–487.
  17. di Serafino, D.; Krejić, N.; Krklec Jerinkić, N.; Viola, M. LSOS: Line-search Second-Order Stochastic optimization methods for nonconvex finite sums. arXiv 2021, arXiv:2007.15966v2.
  18. Larson, J.; Billups, S.C. Stochastic derivative-free optimization using a trust region framework. Comput. Optim. Appl. 2016, 64, 619–645.
  19. Paquette, C.; Scheinberg, K. A Stochastic Line Search Method with Expected Complexity Analysis. SIAM J. Optim. 2020, 30, 349–376.
  20. Xu, P.; Roosta-Khorasani, F.; Mahoney, M.W. Newton-Type Methods for Non-Convex Optimization Under Inexact Hessian Information. Math. Program. 2020, 184, 35–70.
  21. Blanchet, J.; Cartis, C.; Menickelly, M.; Scheinberg, K. Convergence Rate Analysis of a Stochastic Trust Region Method via Submartingales. INFORMS J. Optim. 2019, 1, 92–119.
  22. Cartis, C.; Scheinberg, K. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 2018, 169, 337–375.
  23. Birgin, E.G.; Krejić, N.; Martínez, J.M. Inexact restoration for derivative-free expensive function minimization and applications. J. Comput. Appl. Math. 2022, 410, 114193.
  24. Bueno, L.F.; Friedlander, A.; Martínez, J.M.; Sobral, F.N.C. Inexact Restoration Method for Derivative-Free Optimization with Smooth Constraints. SIAM J. Optim. 2013, 23, 1189–1213.
  25. Martínez, J.M.; Pilotta, E.A. Inexact restoration algorithms for constrained optimization. J. Optim. Theory Appl. 2000, 104, 135–163.
  26. Lichman, M. UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 1 September 2022).
  27. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. The MNIST Database. Available online: http://yann.lecun.com/exdb/mnist (accessed on 1 September 2022).
  28. Chang, C.C.; Lin, C.J. LIBSVM: A Library for Support Vector Machines. Available online: http://www.csie.ntu.edu.tw/~cjlin/libsvm (accessed on 1 September 2022).
  29. Bellavia, S.; Gurioli, G.; Morini, B.; Toint, P.L. Adaptive regularization algorithms with inexact evaluations for nonconvex optimization. SIAM J. Optim. 2019, 29, 2881–2915.
  30. Bellavia, S.; Gurioli, G. Stochastic analysis of an adaptive cubic regularization method under inexact gradient evaluations and dynamic Hessian accuracy. Optimization 2022, 71, 227–261.
  31. Xu, P.; Roosta-Khorasani, F.; Mahoney, M.W. Second-Order Optimization for Non-Convex Machine Learning: An Empirical Study. In Proceedings of the 2020 SIAM International Conference on Data Mining, Cincinnati, OH, USA, 7–9 May 2020.
Figure 1. From top to bottom row: data sets a8a, a9a, mnist, phishing. From left to right: average training loss, testing loss, and classification error versus average computational cost.
Figure 2. Average percentage ratio of the sample size $N_{k+1}$ over the data set size $N$ versus iterations. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Figure 3. Average training loss versus average computational cost of SIRTR$_{\mathrm p}$ equipped with different values of the initial penalty parameter $\theta_0$. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Figure 4. Average training loss versus average computational cost of SIRTR$_{\mathrm{nop}}$ equipped with different values of the initial penalty parameter $\theta_0$. Top row: a8a (left) and a9a (right). Bottom row: mnist (left) and phishing (right).
Table 1. Data sets used; $N$ and $n$ are the size and feature dimension of the training set, and $N_T$ is the size of the testing set.

Data Set        N (training)   n     N_T (testing)
a8a [26]        15,887         123   6809
a9a [26]        22,793         123   9768
mnist [27]      60,000         784   10,000
phishing [28]   7739           68    3316