1. Introduction
The solution of large-scale finite-sum optimization problems has become essential in several machine learning tasks, including binary or multinomial classification, regression, clustering, and anomaly detection [1,2]. Indeed, the training of models employed in such tasks is often performed by solving the optimization problem
$\underset{x\in {\mathbb{R}}^{n}}{min}\;{f}_{N}(x)\equiv \frac{1}{N}\sum_{i=1}^{N}{\varphi}_{i}(x),$ (1)
where N is the size of the available data set and the functions ${\varphi}_{i}:{\mathbb{R}}^{n}\to \mathbb{R}$ are continuously differentiable for all $i=1,\dots ,N$. As a result, the efficient solution of a machine learning problem calls for efficient numerical algorithms for (1).
When the data set is extremely large, the evaluation of the objective function ${f}_{N}$ and its derivatives may be computationally demanding, making deterministic optimization methods inadequate for solving (1). A common strategy consists of approximating both the function and its derivatives by employing a small number of randomly sampled loss functions ${\varphi}_{i}$, making stochastic optimization methods the preferred choice [3,4,5]. A major issue is the sensitivity of most stochastic algorithms to their parameters, such as the learning rate or the sample sizes used for building the function and gradient approximations, which usually need to be tuned through repeated trial and error before the algorithm exhibits acceptable performance. A possible remedy to burdensome tuning is to employ adaptive optimization methods, which compute the parameters according to appropriate globalization strategies [6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Most of these methods impose probabilistic accuracy requirements on the function and gradient estimates in order to ensure either an iteration complexity bound in expectation [7,8,9,10,11,13,15,19,21,22] or the convergence in probability of the iterates [6,12,16,17,18,20]. In turn, these requirements are reflected in the choice of the sample size, which needs to grow progressively as the iterations proceed, resulting in an increasing computational cost per iteration.
In [11], the authors proposed the so-called stochastic inexact restoration trust-region (SIRTR) method for solving (1). SIRTR employs subsampled function and gradient estimates and combines the classical trust-region scheme with the inexact restoration method for constrained optimization problems [23,24,25]. This combined strategy reformulates (1) as an optimization problem in two unknown variables $x,M$, where x is the object to be recovered and M is the sample size of the function estimate, upon which the constraint $M=N$ is imposed. Based on this reformulation of (1), the method acts on the two variables in a modular way: first, it selects the sample size with a deterministic rule aimed at improving feasibility with respect to the constraint $M=N$; then, it accepts or rejects the inexact trust-region step by improving optimality with respect to a suitable merit function. SIRTR has shown good numerical performance on a series of classification and regression test problems, as its inexact restoration strategy drastically reduces the computational burden of selecting the algorithmic parameters. From a theoretical viewpoint, the authors in [11] provided an upper bound on the expected number of iterations needed to reach a near-stationary point under appropriate probabilistic accuracy requirements on the random estimators; remarkably, such requirements are less stringent than others employed in the literature. However, the convergence in probability of SIRTR remains unproved, thus leaving open the question of whether the gradient of the objective function in (1) converges to zero with a probability of one. A positive answer to this question would be an important theoretical confirmation of the numerical stability of the method.
In this paper, we improve on the existing theoretical analysis of SIRTR, showing that its iterates drive the gradient to zero with a probability of one. The results are obtained by combining the theoretical properties of SIRTR with some tools from martingale theory, as is typically done in the convergence analysis of adaptive stochastic methods [6,16,18]. Furthermore, we report the numerical results obtained by applying SIRTR to nonconvex binary classification, discussing the impact of the probabilistic accuracy requirements on the performance of the method.
The paper is structured as follows. In Section 2, we briefly outline the method and its main steps. In Section 3, we perform the convergence analysis of the method, showing that its iterates converge with a probability of one. In Section 4, we provide a numerical illustration on a binary classification test problem. Finally, we report the conclusions and future work in Section 5.
Notations: Throughout the paper, $\mathbb{R}$ is the set of real numbers, whereas the symbol $\parallel \cdot \parallel $ denotes the standard Euclidean norm on ${\mathbb{R}}^{n}$. We denote with $(\mathrm{\Omega},\mathcal{A},\mathbb{P})$ a probability space, where $\mathrm{\Omega}$ is the sample space, $\mathcal{A}\subseteq \mathcal{P}\left(\mathrm{\Omega}\right)$ is the $\sigma$-algebra of events, and $\mathbb{P}:\mathcal{A}\to [0,1]$ is the probability function. Given an event $A\in \mathcal{A}$, the symbol $\mathbb{P}\left(A\right)$ stands for the probability of the event A, and ${\mathbb{1}}_{A}:\mathrm{\Omega}\to \{0,1\}$ denotes the indicator function of the event A, i.e., the function such that ${\mathbb{1}}_{A}\left(\omega \right)=1$ if $\omega \in A$, and ${\mathbb{1}}_{A}\left(\omega \right)=0$ otherwise. Given a random variable $X:\mathrm{\Omega}\to \mathbb{R}$, we denote with $\mathbb{E}\left(X\right)$ the expected value of X. Given n random variables ${X}_{1},\dots ,{X}_{n}$, the notation $\sigma ({X}_{1},\dots ,{X}_{n})$ stands for the $\sigma$-algebra generated by ${X}_{1},\dots ,{X}_{n}$.
2. The SIRTR Method
Here, we present the stochastic inexact restoration trust-region (SIRTR) method, which was originally proposed in [11]. SIRTR is a trust-region method with subsampled function and gradient estimates, which combines the first-order trust-region methodology with the inexact restoration method for constrained optimization [25]. In order to provide a detailed description of SIRTR, we reformulate (1) as the constrained problem
where ${I}_{M}\subseteq \{1,\dots ,N\}$ is a sample set of cardinality $|{I}_{M}|=M$. To measure the infeasibility of M with respect to the constraint $M=N$, we introduce a function h that measures the distance of $M\in \{1,\dots ,N\}$ from N. The function h is assumed to satisfy the following properties.
Assumption 1.
The function $h:\{1,2,\dots ,N\}\to \mathbb{R}$ is monotonically decreasing and satisfies $h\left(1\right)>0$, $h\left(N\right)=0$.
From Assumption 1, it follows that there exist positive constants $\underline{h}$ and $\overline{h}$ such that
$\underline{h}\le h\left(M\right)\le \overline{h},\qquad M=1,\dots ,N-1.$ (3)
An example of a function $h:\{1,\dots ,N\}\to \mathbb{R}$ satisfying Assumption 1 is $h\left(M\right)=\frac{N-M}{N}$.
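As a quick sanity check, the properties required by Assumption 1 are easy to verify numerically for this choice of h (a minimal sketch; the data set size $N=1000$ is arbitrary):

```python
# Sanity check of Assumption 1 for h(M) = (N - M) / N.
N = 1000

def h(M):
    """Infeasibility measure: relative distance of the sample size M from N."""
    return (N - M) / N

# h is monotonically decreasing on {1, ..., N} ...
values = [h(M) for M in range(1, N + 1)]
assert all(values[i] > values[i + 1] for i in range(len(values) - 1))
# ... with h(1) > 0 and h(N) = 0.
assert h(1) > 0 and h(N) == 0
```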
SIRTR is a stochastic variant of the classical first-order trust-region method, which accepts the trial point according to the decrease of a convex combination of the function estimate ${f}_{M}$ and the function h. We report SIRTR in Algorithm 1.
At the beginning of each iteration $k\ge 0$, we have at our disposal the iterate ${x}_{k}$, the trust-region radius ${\delta}_{k}$, the sample size ${N}_{k}\in \{1,\dots ,N\}$, the penalty parameter ${\theta}_{k}$, and the flag iflag, where iflag = succ if the previous iteration was successful, in the sense specified below, and iflag = unsucc otherwise. Then, Steps 1–5 perform the following tasks.
In Step 1, if iflag = succ, we reduce the current value $h\left({N}_{k}\right)$ of the infeasibility measure and find some ${\tilde{N}}_{k+1}\in \{1,\dots ,N\}$ satisfying $h({\tilde{N}}_{k+1})\le rh({N}_{k})$ with $r\in (0,1)$. On the other hand, if iflag=unsucc, ${\tilde{N}}_{k+1}$ remains the same from one iteration to the other, i.e., we set ${\tilde{N}}_{k+1}={\tilde{N}}_{k}$. Note that ${\tilde{N}}_{k+1}=N$ if ${N}_{k}=N$.
Step 2 determines a trial sample size ${N}_{k+1}^{t}$ that satisfies $h({N}_{k+1}^{t})-h({\tilde{N}}_{k+1})\le \mu {\delta}_{k}^{2}$ with $\mu >0$ and is used to form the random model. In principle, we could fix ${N}_{k+1}^{t}={\tilde{N}}_{k+1}$, but selecting a smaller sample size, if possible, yields a computational saving in the subsequent step. The relation between ${N}_{k+1}^{t}$ and ${\tilde{N}}_{k+1}$ depends on ${\delta}_{k}$: small values of ${\delta}_{k}$ give ${N}_{k+1}^{t}={\tilde{N}}_{k+1}$; otherwise, ${N}_{k+1}^{t}$ is allowed to be smaller than ${\tilde{N}}_{k+1}$.
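For the example $h(M)=(N-M)/N$, the condition of Step 2 reads ${N}_{k+1}^{t}\ge {\tilde{N}}_{k+1}-\mu N{\delta}_{k}^{2}$, so the smallest admissible trial size has a closed form (a sketch under this choice of h; the function name and the value $\mu =100/N$, borrowed from Section 4, are illustrative):

```python
import math

N = 1000
mu = 100 / N  # value used in the experiments of Section 4 (illustrative here)

def h(M):
    """Infeasibility measure h(M) = (N - M) / N."""
    return (N - M) / N

def trial_sample_size(N_tilde, delta):
    """Smallest trial size N_t with h(N_t) - h(N_tilde) <= mu * delta**2,
    which for this h reads N_t >= N_tilde - mu * N * delta**2."""
    N_t = max(math.ceil(N_tilde - mu * N * delta ** 2), 1)
    return min(N_t, N_tilde)

# A small trust-region radius forces N_t = N_tilde ...
assert trial_sample_size(800, delta=0.01) == 800
# ... while a larger radius allows a cheaper model (here N_t = 700).
assert trial_sample_size(800, delta=1.0) == 700
# The selected size satisfies the condition of Step 2.
assert h(trial_sample_size(800, 1.0)) - h(800) <= mu * 1.0 ** 2 + 1e-12
```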
Step 3 forms the random model
${m}_{k}(p)$ and the trial step
${p}_{k}$. The linear model is given by
${m}_{k}(p)={f}_{{N}_{k+1}^{t}}({x}_{k})+{g}_{k}^{T}p$, where
and
with ${I}_{{N}_{k+1,g}}\subset \{1,\dots ,N\}$ of cardinality $|{I}_{{N}_{k+1,g}}|={N}_{k+1,g}$.
Minimizing ${m}_{k}$ over the ball of center 0 and radius ${\delta}_{k}$ gives the trial step ${p}_{k}=-{\delta}_{k}\frac{{g}_{k}}{\parallel {g}_{k}\parallel}$.
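In code, this minimizer is simply a step of length ${\delta}_{k}$ along the negative normalized gradient estimate (a minimal sketch with a hypothetical gradient estimate g):

```python
import math

def trial_step(g, delta):
    """Minimizer of the linear model f + g^T p over the ball ||p|| <= delta:
    a step of length delta along the negative gradient estimate."""
    norm_g = math.sqrt(sum(gi ** 2 for gi in g))
    return [-delta * gi / norm_g for gi in g]

g = [3.0, -4.0]                 # ||g|| = 5
p = trial_step(g, delta=0.5)
# The step lies on the trust-region boundary ...
assert abs(math.sqrt(sum(pi ** 2 for pi in p)) - 0.5) < 1e-12
# ... and is a descent direction for the linear model: g^T p < 0.
assert sum(gi * pi for gi, pi in zip(g, p)) < 0
```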
Algorithm 1 Stochastic Inexact Restoration Trust-Region (SIRTR) 
Given ${x}_{0}\in {\mathbb{R}}^{n}$, ${N}_{0}\in \{1,\dots ,N\}$, ${\eta}_{1}\in (0,1)$, ${\theta}_{0}\in (0,1)$, $r\in (0,1)$, $\gamma >1$, $\mu >0$, ${\eta}_{2}>0$, $0<{\delta}_{0}<{\delta}_{max}$. 
 0.
Set $k=0$, iflag=succ.  1.
Reference sample size If iflag=succ find ${\tilde{N}}_{k+1}$ such that ${N}_{k}\le {\tilde{N}}_{k+1}\le N$ and
Else set ${\tilde{N}}_{k+1}={\tilde{N}}_{k}$.  2.
Trial sample size If ${N}_{k}=N$ set ${N}_{k+1}^{t}=N$ Else find ${N}_{k+1}^{t}$ such that
 3.
Trust-region model Choose ${I}_{{N}_{k+1}^{t}}\subseteq \{1,\dots ,N\}$ such that $|{I}_{{N}_{k+1}^{t}}|={N}_{k+1}^{t}$. Choose ${N}_{k+1,g}$ and ${I}_{{N}_{k+1,g}}\subseteq \{1,\dots ,N\}$ such that $|{I}_{{N}_{k+1,g}}|={N}_{k+1,g}$. Compute ${g}_{k}$ as in ( 5), and set ${p}_{k}=-{\delta}_{k}\frac{{g}_{k}}{\parallel {g}_{k}\parallel}$. Compute ${f}_{{N}_{k+1}^{t}}({x}_{k})$ as in ( 4), and set ${m}_{k}({p}_{k})={f}_{{N}_{k+1}^{t}}({x}_{k})+{g}_{k}^{T}{p}_{k}$.  4.
Penalty parameter If ${\mathrm{Pred}}_{k}({\theta}_{k})\ge {\eta}_{1}(h({N}_{k})-h({\tilde{N}}_{k+1}))$ Else  5.
Acceptance test If ${Ared}_{k}({x}_{k}+{p}_{k},{\theta}_{k+1})\ge {\eta}_{1}{Pred}_{k}({\theta}_{k+1})$ and $\parallel {g}_{k}\parallel \ge {\eta}_{2}{\delta}_{k}$ (success) set ${N}_{k+1}={N}_{k+1}^{t}$, $k=k+1$, iflag=succ and go to Step 1. Else (unsuccess) set ${N}_{k+1}={N}_{k}$, $k=k+1$, iflag=unsucc and go to Step 1.

In Step 4, we compute the penalty parameter
${\theta}_{k+1}\in (0,1)$ that governs the predicted reduction
${Pred}_{k}$ in the function and infeasibility measure, which we define as
If
$\theta ={\theta}_{k}$ satisfies
then, we set ${\theta}_{k+1}={\theta}_{k}$; otherwise, we compute ${\theta}_{k+1}$ as the largest value for which Inequality (12) is satisfied, which takes the explicit form given in (8).
In Step 5, we determine whether to accept (success) or reject (unsuccess) the trial point
${x}_{k}+{p}_{k}$. The actual reduction
${Ared}_{k}$ at point
$\widehat{x}$ is defined as
and we declare a successful iteration whenever the following conditions are both met:
Condition (
14) reduces to the standard acceptance criterion of deterministic trustregion methods when
${N}_{k}={\tilde{N}}_{k+1}={N}_{k+1}^{t}=N$. If both conditions are satisfied, we accept the step
${p}_{k}$ and set
${x}_{k+1}={x}_{k}+{p}_{k}$, increase the trustregion radius based on the update rule (
9), and set
${N}_{k+1}={N}_{k+1}^{t}$,
iflag=succ; otherwise, we retain the previous iterate, i.e.,
${x}_{k+1}={x}_{k}$, reduce the trustregion radius according to (
10), and set
${N}_{k+1}={N}_{k}$,
iflag = unsucc.
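The acceptance logic of Step 5 can be sketched as follows (assuming, consistently with the remarks in the proofs of Section 3, that the update rules (9) and (10) are $\delta \to min(\gamma \delta ,{\delta}_{max})$ on success and $\delta \to \delta /\gamma $ on failure; the parameter values mirror those used in Section 4):

```python
def acceptance_test(ared, pred, norm_g, delta, eta1=0.1, eta2=1e-6,
                    gamma=2.0, delta_max=1.0):
    """Step 5 of SIRTR: accept the trial point iff both conditions hold,
    then update the trust-region radius (increase/decrease rules assumed
    to be delta -> min(gamma*delta, delta_max) and delta -> delta/gamma)."""
    success = (ared >= eta1 * pred) and (norm_g >= eta2 * delta)
    new_delta = min(gamma * delta, delta_max) if success else delta / gamma
    return success, new_delta

# Sufficient actual reduction and a large enough gradient estimate: success.
ok, d = acceptance_test(ared=0.5, pred=1.0, norm_g=1.0, delta=0.25)
assert ok and d == 0.5
# Insufficient reduction: the step is rejected and the radius shrinks.
ok, d = acceptance_test(ared=0.01, pred=1.0, norm_g=1.0, delta=0.25)
assert not ok and d == 0.125
```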
3. Convergence Analysis
In this section, we are interested in the convergence properties of Algorithm 1. To this aim, we note that the function estimates
${f}_{{N}_{k+1}^{t}}({x}_{k})$ in (
4) and gradient estimates
${g}_{k}$ in (
5) are all random quantities. Consequently, Algorithm 1 generates a random process, that is, the iterates
${X}_{k}$, the trust-region radii
${\Delta}_{k}$, the gradient estimates
${G}_{k},\nabla {f}_{{N}_{k+1}^{t}}({X}_{k})$, and the values
${\mathrm{\Psi}}_{k}$ of the Lyapunov function
$\mathrm{\Psi}$ in (
21) at iteration
k are to be considered as random variables, with their realizations denoted as
${x}_{k}$,
${\delta}_{k}$,
${g}_{k}$, and
${\psi}_{k}$.
Our aim is to show the convergence in probability of the iterates generated by Algorithm 1, in the sense that
$\mathbb{P}\left({lim}_{k\to \infty}\parallel \nabla f({X}_{k})\parallel =0\right)=1,$ (16)
i.e., the event ${lim}_{k\to \infty}\parallel \nabla f({X}_{k})\parallel =0$ holds almost surely. We note that the authors in [11] derived a bound on the expected number of iterations required by Algorithm 1 to reach a desired accuracy in the gradient norm, but did not show convergence results of the type (16).
3.1. Preliminary Results
We recall some technical preliminary results that were obtained for Algorithm 1 in [
11]. First, we impose some basic assumptions on the functions in Problem (
1).
Assumption 2.
 (i)
Each function ${\varphi}_{i}:{\mathbb{R}}^{n}\to \mathbb{R}$ is continuously differentiable for $i=1,\dots ,N$.
 (ii)
The functions ${f}_{M}:{\mathbb{R}}^{n}\to \mathbb{R}$, $M=1,\dots ,N$, are bounded from below on ${\mathbb{R}}^{n}$, i.e., there exists ${f}_{low}\in \mathbb{R}$ such that ${f}_{M}(x)\ge {f}_{low}$ for all $x\in {\mathbb{R}}^{n}$.  (iii)
The functions ${f}_{M}:{\mathbb{R}}^{n}\to \mathbb{R}$, $M=1,\dots ,N$, are bounded from above on a subset $\mathrm{\Omega}\subseteq {\mathbb{R}}^{n}$, i.e., there exists ${f}_{up}\in \mathbb{R}$ such that ${f}_{M}(x)\le {f}_{up}$ for all $x\in \mathrm{\Omega}$. Furthermore, the iterates ${\left\{{x}_{k}\right\}}_{k\in \mathbb{N}}$ defined by Algorithm 1 are contained in Ω.
Combining Step 4 of Algorithm 1 with Bound (3) and Assumption 2(iii), it is possible to prove that, for any realization of the algorithm, the sequence ${\left\{{\theta}_{k}\right\}}_{k\in \mathbb{N}}$ is bounded away from zero.
Lemma 1 ([
11], Lemma 2)
. Let Assumptions 1 and 2 hold and consider a particular realization of Algorithm 1. Let ${\kappa}_{\varphi}>0$ be defined as in (17). Then, ${\left\{{\theta}_{k}\right\}}_{k\in \mathbb{N}}$ is a positive, nonincreasing sequence satisfying the lower bound (18). Furthermore, Condition (12) holds with $\theta ={\theta}_{k+1}$.
Since the acceptance test in Algorithm 1 employs function and gradient estimates, we cannot expect that the objective function
${f}_{N}$ is decreased from one iteration to the other; however, the authors in [
11] showed that an appropriate Lyapunov function
$\mathrm{\Psi}$ is reduced at each iteration. This Lyapunov function is defined as
where
$v\in (0,1)$ and
$\mathrm{\Sigma}\in \mathbb{R}$ are any constants that satisfy
where such constants exist thanks to Bound (3) and Assumption 2(ii). For all
$k\ge 0$, we denote the values of
$\mathrm{\Psi}$ along the iterates of Algorithm 1 as follows:
Thanks to (20) and the nonnegativity of h (see Assumption 1), we can easily deduce that the sequence ${\left\{{\psi}_{k}\right\}}_{k\in \mathbb{N}}$ is nonnegative; indeed,
Furthermore, the difference between two successive values
${\psi}_{k+1}$ and
${\psi}_{k}$ can be easily rewritten as
If
k is a successful iteration, then
${N}_{k+1}={N}_{k+1}^{t}$. By recalling (
20) and the fact that the sequence ${\left\{{\theta}_{k}\right\}}_{k\in \mathbb{N}}$ is monotone nonincreasing (see Lemma 1), Equality (23) yields
Otherwise, Algorithm 1 sets
${x}_{k+1}={x}_{k}$ and
${N}_{k+1}={N}_{k}$. By inserting these updates in (
23), together with (
20) and the fact that
${\left\{{\theta}_{k}\right\}}_{k\in \mathbb{N}}$ is nonincreasing, we obtain
Using (
24) and (
25) in combination with Step 5 of Algorithm 1, we can prove the following results.
Theorem 1 ([
11], Theorem 1)
. Let Assumptions 1–2 hold and consider a particular realization of Algorithm 1. In (19), choose $v\in ({v}^{\dagger},1)$, where ${v}^{\dagger}$ is defined in (26). Then, there exists a constant $\sigma =\sigma (v)>0$ such that (27) holds; hence, the sequence $\left\{{\delta}_{k}\right\}$ in Algorithm 1 satisfies ${lim}_{k\to \infty}{\delta}_{k}=0$.
We now introduce a Lipschitz continuity assumption on the gradients of the functions
${\varphi}_{i}$ appearing in (
1).
Assumption 3.
Each gradient $\nabla {\varphi}_{i}$ is ${L}_{i}$-Lipschitz continuous for $i=1,\dots ,N$. We use the notation $L=\frac{1}{2}{max}_{1\le i\le N}{L}_{i}$.
The gradient estimates are bounded under Assumptions 2 and 3, as stated in the following lemma.
Lemma 2 ([
11], Lemma 5)
. Let Assumptions 2 and 3 hold. Then, there exists ${g}_{max}$ such that, for any realization of Algorithm 1, $\parallel {g}_{k}\parallel \le {g}_{max}$ for all $k\ge 0$, where ${g}_{max}=\sqrt{8L{\kappa}_{\varphi}}$ and ${\kappa}_{\varphi}$ is given in (17).
Let us introduce the following events
${\mathcal{G}}_{k,1}=\left\{\parallel {G}_{k}-\nabla f({X}_{k})\parallel \le \nu {\Delta}_{k}\right\},\qquad {\mathcal{G}}_{k,2}=\left\{\parallel \nabla {f}_{{N}_{k+1}^{t}}({X}_{k})-\nabla f({X}_{k})\parallel \le \nu {\Delta}_{k}\right\},$
where $\nu$ is a positive parameter. Using a terminology similar to the one employed in [22], the iteration k is said to be true if the events ${\mathcal{G}}_{k,1}$ and ${\mathcal{G}}_{k,2}$ are both true.
The next lemma shows that the iteration k is successful whenever it is true and the trust-region radius ${\delta}_{k}$ is sufficiently small. This result is crucial for the analysis in the next section.
Lemma 3 ([
11], Lemma 6)
. Let Assumptions 1–3 hold and set ${\eta}_{3}=\frac{{\delta}_{max}{g}_{max}({\theta}_{0}(2\nu +L)+(1-\underline{\theta})\mu )}{{\eta}_{1}(1-{\eta}_{1})(1-r)\underline{h}}$. Suppose that, for a particular realization of Algorithm 1, the iteration k is true and Condition (31) holds. Then, iteration k is successful.
3.2. Novel Convergence Results
Here, we derive two novel convergence results in probability holding for Algorithm 1. The results are provided under the assumption that the random variables ${G}_{k}$ and $\nabla {f}_{{N}_{k+1}^{t}}({X}_{k})$ are sufficiently accurate estimators of the true gradient at ${X}_{k}$, in the probabilistic sense specified below.
Assumption 4.
Let ${\mathcal{F}}_{k-1}=\sigma ({G}_{0},\dots ,{G}_{k-1},\nabla {f}_{{N}_{1}^{t}}({X}_{0}),\dots ,\nabla {f}_{{N}_{k}^{t}}({X}_{k-1}))$. Then, the events ${\mathcal{G}}_{k,1},{\mathcal{G}}_{k,2}$ hold with sufficiently high probability conditioned on ${\mathcal{F}}_{k-1}$, and the estimators ${G}_{k}$ and $\nabla {f}_{{N}_{k+1}^{t}}({X}_{k})$ are conditionally independent random variables given ${\mathcal{F}}_{k-1}$, i.e.,
$\mathbb{P}\left({\mathcal{G}}_{k,1}\mid {\mathcal{F}}_{k-1}\right)\ge {\pi}_{1},\qquad \mathbb{P}\left({\mathcal{G}}_{k,2}\mid {\mathcal{F}}_{k-1}\right)\ge {\pi}_{2},\qquad \text{with}\;{\pi}_{1}{\pi}_{2}>\frac{1}{2}.$
First, we provide a liminf-type convergence result for SIRTR, which shows that the gradient of the objective function converges in probability to zero along a subsequence of the iterates.
Theorem 2.
Suppose that Assumptions 1–4 hold. Then, there holds
$\mathbb{P}\left({lim\;inf}_{k\to \infty}\parallel \nabla f({X}_{k})\parallel =0\right)=1.$ (32)
Proof. The proof parallels that of ([16], Theorem 4.16). By contradiction, assume that there exists
$\epsilon >0$ such that the event
holds with positive probability. Then, let
${\left\{{x}_{k}\right\}}_{k\in \mathbb{N}}$ be a realization of
${\left\{{X}_{k}\right\}}_{k\in \mathbb{N}}$ such that
$\parallel \nabla f({x}_{k})\parallel \ge \epsilon $ for all
$k\ge 0$, and
${\left\{{\delta}_{k}\right\}}_{k\in \mathbb{N}}$ is the corresponding realization of
${\left\{{\Delta}_{k}\right\}}_{k\in \mathbb{N}}$. From Theorem 1, we know that
${lim}_{k\to \infty}{\delta}_{k}=0$; therefore, there exists
$\overline{k}$ such that
Consider the random variable
${R}_{k}$ with realizations given by
Note that ${r}_{k}$ satisfies the following properties.
 (i)
If
$k\ge \overline{k}$, then
${r}_{k}\le 0$; this is a consequence of (
34).
 (ii)
If
k is a true iteration and
$k\ge \overline{k}$, then
${r}_{k+1}={r}_{k}+1$; indeed, since
${\mathcal{G}}_{k,1}$ is true and
${\delta}_{k}<\epsilon /(2\nu )$, it follows that
Then,
$\parallel \nabla f({x}_{k})\parallel \ge \epsilon $ yields
which, combined with (
34), implies that
${\delta}_{k}$ satisfies Inequality (
31). Thus, Lemma 3 implies that the iteration
k is successful. Since
${\delta}_{k}\le {\delta}_{max}/\gamma $ and the
kth iteration is successful, by (
9) it follows that
${\delta}_{k+1}=\gamma {\delta}_{k}$. Hence,
${r}_{k+1}={r}_{k}+1$.
 (iii)
If k is not a true iteration and $k\ge \overline{k}$, then ${r}_{k+1}\ge {r}_{k}-1$; indeed, since Lemma 3 cannot be applied (k is not true), all we can say about the trust-region radius is that ${\delta}_{k+1}\ge {\delta}_{k}/\gamma $.
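The mechanism behind properties (ii) and (iii) can be illustrated numerically: when true iterations occur with probability greater than $\frac{1}{2}$, ${r}_{k}$ behaves like a $\pm 1$ random walk with positive drift and is therefore eventually positive. The following simulation is purely illustrative and is not part of the proof (the probability p stands in for the lower bound ${\pi}_{1}{\pi}_{2}$):

```python
import random

random.seed(0)
p = 0.75          # stands in for pi_1 * pi_2 > 1/2
steps = 10_000

# r_{k+1} = r_k + 1 on "true" iterations (probability p), r_{k+1} = r_k - 1 otherwise.
r = 0
history = []
for _ in range(steps):
    r += 1 if random.random() < p else -1
    history.append(r)

# Positive drift (E[r_k] = (2p - 1) k): the walk ends far above zero ...
assert history[-1] > 0
# ... and is positive at most indices, illustrating "positive infinitely often".
assert sum(1 for v in history if v > 0) > steps // 2
```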
Then, defining the
$\sigma$-algebra ${\mathcal{F}}_{k-1}^{\mathcal{G}}=\sigma ({\mathbb{1}}_{{\mathcal{G}}_{0,1}}\cdot {\mathbb{1}}_{{\mathcal{G}}_{0,2}},\dots ,{\mathbb{1}}_{{\mathcal{G}}_{k-1,1}}\cdot {\mathbb{1}}_{{\mathcal{G}}_{k-1,2}})$, which is included in ${\mathcal{F}}_{k-1}$, it follows from properties (ii)–(iii) and Assumption 4 that
where the second inequality follows from
${\pi}_{1}{\pi}_{2}>\frac{1}{2}$. Hence, we have that
${\left\{{R}_{k}\right\}}_{k\in \mathbb{N}}$ is a submartingale. We also define the random variable
The sequence ${\left\{{W}_{k}\right\}}_{k\in \mathbb{N}}$ is also a submartingale, as
where, again, the last inequality is due to the fact that
${\pi}_{1}{\pi}_{2}>\frac{1}{2}$. Since
${W}_{k}$ cannot have a finite limit, from ([
16], Theorem 4.4) it follows that the event
${lim\; sup}_{k\to \infty}{W}_{k}=\infty $ holds almost surely. Since we have
${r}_{k}-{r}_{{k}_{0}}\ge {w}_{k}-{w}_{{k}_{0}}$ by definition of
${\left\{{R}_{k}\right\}}_{k\in \mathbb{N}}$ and
${\left\{{W}_{k}\right\}}_{k\in \mathbb{N}}$, it follows that
${R}_{k}$ has to be positive infinitely often with a probability of one. However, this contradicts property (i) listed above, which allows us to conclude that (
33) cannot occur. □
In the following, we show that the gradients of the objective function evaluated at the SIRTR iterates converge (in probability) to zero. The next lemma is similar to ([6], Lemma 4.2); however, some crucial modifications are needed here; indeed, unlike in [6], we take into account the fact that SIRTR enforces the decrease in the Lyapunov function $\mathrm{\Psi}$ defined in (19) rather than in the objective function.
Lemma 4.
Suppose that Assumptions 1–4 hold. Let $\left\{{X}_{k}\right\}$ and $\left\{{\Delta}_{k}\right\}$ be the random sequences generated by Algorithm 1. For a fixed $\epsilon >0$, define the subset of natural numbers ${\left\{{K}_{i}\right\}}_{i\in \mathbb{N}}=\left\{k\in \mathbb{N}:\parallel \nabla f({X}_{k})\parallel >\epsilon \right\}$. Then, ${\sum}_{i\in \mathbb{N}}{\Delta}_{{K}_{i}}<\infty $ holds almost surely.
Proof. Let us consider the generic realizations
${\left\{{x}_{k}\right\}}_{k\in \mathbb{N}}$,
${\left\{{g}_{k}\right\}}_{k\in \mathbb{N}}$,
${\left\{{\delta}_{k}\right\}}_{k\in \mathbb{N}}$,
${\left\{{\theta}_{k}\right\}}_{k\in \mathbb{N}}$, and
${\left\{{k}_{i}\right\}}_{i\in \mathbb{N}}$ of Algorithm 1. Furthermore, we let
$\left\{{p}_{i}\right\}$ be the subsequence of
$\left\{{k}_{i}\right\}$, where the iteration is true, whereas
$\left\{{n}_{i}\right\}$ denotes the complementary subsequence so that
$\left\{{k}_{i}\right\}=\left\{{p}_{i}\right\}\cup \left\{{n}_{i}\right\}$. First, we show that
${\sum}_{k\in \left\{{p}_{i}\right\}}{\delta}_{k}<\infty $. If
$\left\{{p}_{i}\right\}$ is finite, then there is nothing to prove. Otherwise, since
${lim}_{k\to \infty}{\delta}_{k}=0$, there exists
$\tilde{k}$ such that
${\delta}_{k}<b$ for all
$k\ge \tilde{k}$, where
b is given in (
34). Let us consider any
${p}_{i}\ge \tilde{k}$. Since
${\mathcal{G}}_{k,1}$ is true,
${\delta}_{{p}_{i}}<\epsilon /(2\nu )$, and $\parallel \nabla f({x}_{{p}_{i}})\parallel >\epsilon $, we can reason as in (36) and (37) to conclude that $\parallel {g}_{{p}_{i}}\parallel \ge \epsilon /2$. Combining this lower bound with
${\delta}_{{p}_{i}}<b$, we have that Inequality (
31) is satisfied with
$k={p}_{i}$. Hence, iteration
${p}_{i}$ is successful by Lemma 3 and we have
where the first inequality is the acceptance test (
14), the second follows from Step 4 of the SIRTR algorithm, the third follows from (
6), and the last follows from (
3) and Lemma 2. Now, starting from Inequality (
24) (which holds only for successful iterations), we can derive the following chain of inequalities
where the second inequality follows from (
39), (
9), and (
15). Now, recalling the definition of
${v}^{\u2020}$ given in (
26), we choose
v in (
19) as
and, consequently,
c in (
40) is positive while keeping Theorem 1 still applicable. Then, plugging
$\parallel {g}_{{p}_{i}}\parallel \ge \frac{\epsilon}{2}$ into (
40) yields
Summing the previous inequality over
$k\in \left\{{p}_{i}\right\}$,
$k\ge \tilde{k}$, and noting that
${\psi}_{k}-{\psi}_{k+1}>0$ for any
k (due to (
27)), we obtain
where the last inequality follows from (
22). Then, we have shown that
Furthermore, let us introduce the Bernoulli variable ${B}_{k}=2\cdot {\mathbb{1}}_{{\mathcal{G}}_{k,1}}\cdot {\mathbb{1}}_{{\mathcal{G}}_{k,2}}-1$, which takes the value 1 when the iteration k is true and the value $-1$ otherwise. Note that, due to Assumption 4,
Moreover, the sequence
$\left\{{\Delta}_{k}\right\}$ is a sequence of nonnegative uniformly bounded random variables. Then, we can proceed as in the proof in ([
6], Lemma 4.2), and using ([
6], Lemma 4.1) we obtain
This implies that almost surely
hence, the thesis follows. □
As a byproduct of the previous lemma, we obtain the expected convergence result in probability in the same way as in [6].
Theorem 3.
Suppose that Assumptions 1–4 hold. Let $\left\{{X}_{k}\right\}$ be the sequence of random iterates generated by Algorithm 1. Then, there holds
$\mathbb{P}\left({lim}_{k\to \infty}\parallel \nabla f({X}_{k})\parallel =0\right)=1.$
Proof. The proof follows exactly as in ([6], Theorem 4.3). □
4. Numerical Illustration
In this section, we evaluate the numerical performance of Algorithm 1 equipped with the probabilistic accuracy requirements imposed in Assumption 4. Algorithm 1 was implemented in MATLAB R2019a, and the numerical experiments were performed on an 8 GB RAM laptop with an Intel Core i7-4510U CPU @ 2.00–2.60 GHz. The related software can be downloaded from sites.google.com/view/optmlitalyserbia/home/software (accessed on 1 September 2022).
We perform our numerical experiments on a binary classification problem. Denoting with ${\left\{({a}_{i},{b}_{i})\right\}}_{i=1}^{N}$ a training set, where ${a}_{i}\in {\mathbb{R}}^{n}$ is the ith feature vector and ${b}_{i}\in \{0,1\}$ is the associated label, we address the following nonconvex optimization problem:
$\underset{x\in {\mathbb{R}}^{n}}{min}\;{f}_{N}(x)=\frac{1}{N}\sum_{i=1}^{N}{\left({b}_{i}-\frac{1}{1+{e}^{-{a}_{i}^{T}x}}\right)}^{2}.$ (42)
Note that (42) can be framed in Problem (1) by setting ${\varphi}_{i}(x)={\left({b}_{i}-1/(1+{e}^{-{a}_{i}^{T}x})\right)}^{2}$, $i=1,\dots ,N$, namely the composition of the least-squares loss with the sigmoid function. Furthermore, it is easy to see that the objective function ${f}_{N}$ satisfies Assumption 2 since each ${\varphi}_{i}$ is continuously differentiable and ${f}_{N}$ is bounded from below and above.
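As a concrete check, the gradient of each ${\varphi}_{i}$ is $\nabla {\varphi}_{i}(x)=-2({b}_{i}-s)s(1-s){a}_{i}$ with $s$ the sigmoid of ${a}_{i}^{T}x$; the following sketch verifies this analytic expression against central finite differences (the test point and data are arbitrary):

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def phi(x, a, b):
    """Least-squares loss composed with the sigmoid: (b - sigmoid(a^T x))^2."""
    return (b - sigmoid(sum(ai * xi for ai, xi in zip(a, x)))) ** 2

def grad_phi(x, a, b):
    """Analytic gradient: -2 (b - s) s (1 - s) a, with s = sigmoid(a^T x)."""
    s = sigmoid(sum(ai * xi for ai, xi in zip(a, x)))
    coeff = -2.0 * (b - s) * s * (1.0 - s)
    return [coeff * ai for ai in a]

# Check the analytic gradient against central finite differences.
x, a, b, eps = [0.3, -0.7], [1.0, 2.0], 1, 1e-6
g = grad_phi(x, a, b)
for j in range(len(x)):
    xp = list(x); xp[j] += eps
    xm = list(x); xm[j] -= eps
    fd = (phi(xp, a, b) - phi(xm, a, b)) / (2 * eps)
    assert abs(fd - g[j]) < 1e-6
```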
In
Table 1, we report the four data sets used for our experiments. For each data set, we specify the number of feature vectors
N, the number of components
n of each feature vector, and the size
${N}_{T}$ of the testing set
${I}_{{N}_{T}}$.
We implement two different versions of Algorithm 1, which differ from one another in the way the two sample sizes
${N}_{k+1}^{t}$ and
${N}_{k+1,g}$ for the estimators in (
4) and (
5) are selected.
${\mathrm{SIRTR}}_{\mathrm{nop}}$: this is Algorithm 1 implemented as in [
11]. In particular, the infeasibility measure
h and the initial penalty parameter
${\theta}_{0}$ are chosen as follows:
In Step 1, the reference sample size
${\tilde{N}}_{k+1}$ is computed as follows:
where $\tilde{c}=1.05$. It is easy to see that Rule (43) complies with Condition (6) by setting $r=(N-(\tilde{c}-1))/N$. In Step 2, the trial sample size
${N}_{k+1}^{t}$ is chosen in compliance with Condition (
7) as
In Step 3, the sample size
${N}_{k+1,g}$ is fixed as
where
$c=0.1$. Furthermore, the set
${I}_{{N}_{k+1}^{t}}$ for computing
${f}_{{N}_{k+1}^{t}}({x}_{k})$ and
${f}_{{N}_{k+1}^{t}}({x}_{k}+{p}_{k})$ is sampled uniformly at random using the MATLAB command randsample, whereas
${g}_{k}\in {\mathbb{R}}^{n}$ is a sample average approximation as in (
5) using
${I}_{{N}_{k+1,g}}\subseteq {I}_{{N}_{k+1}^{t}}$. The other parameters are set as
${x}_{0}={(0,0,\dots ,0)}^{T}$,
${\delta}_{0}=1$,
${\delta}_{max}=1$,
$\gamma =2$,
${\eta}_{1}={10}^{-1}$,
${\eta}_{2}={10}^{-6}$,
$\mu =100/N$. Note that Choices (
44)–(
45) for the sample sizes
${N}_{k+1}^{t},{N}_{k+1,g}$ are not sufficient to guarantee that Assumption 4 holds so that Theorems 2–3 do not apply to this version of Algorithm 1.
${\mathrm{SIRTR}}_{\mathrm{p}}$: this implementation of Algorithm 1 differs from the previous one only in the choice of the sample sizes
${N}_{k+1}^{t},{N}_{k+1,g}$. In this case, we force these two parameters to comply with Assumption 4. According to ([
29], Theorem 7.2, Table 7.1), a subsampled estimator
$\nabla {f}_{S}({x}_{k})=\frac{1}{S}{\sum}_{i\in {I}_{S}}\nabla {\varphi}_{i}({x}_{k})$ with sample size $|{I}_{S}|=S$ satisfies the probabilistic requirement
where
$\pi \in (0,1)$ if the sample size
S complies with the following lower bound
where
$\chi =\frac{1}{5}{max}_{i=1,\dots ,N}\parallel {a}_{i}\parallel $. Based on the previous remark, we choose the sample sizes of ${\mathrm{SIRTR}}_{\mathrm{p}}$ as follows
Setting
$\pi >1/\sqrt{2}$, choosing
${N}_{k+1}^{t},{N}_{k+1,g}$ as in (
48) and (
49), and sampling
${I}_{{N}_{k+1}^{t}}$ and
${I}_{{N}_{k+1,g}}$ uniformly and independently at random in
$\{1,\dots ,N\}$, we guarantee that Assumption 4 holds with
${\pi}_{1}={\pi}_{2}=\pi $, thus ensuring the convergence in probability of
${\mathrm{SIRTR}}_{\mathrm{p}}$ according to Theorems 2–3. For our tests, we set
$\pi =3/4$ and $\nu =10\parallel \nabla {f}_{N}({x}_{0})-\nabla {f}_{{N}_{0}}({x}_{0})\parallel $.
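The choice of $\nu $ can be reproduced on synthetic data as in the following sketch (assuming, as the formula above suggests, that $\nu $ is 10 times the norm of the difference between the full gradient and an initial subsampled gradient at ${x}_{0}$; the data set and the sample sizes are illustrative):

```python
import math
import random

random.seed(1)

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_phi(x, a, b):
    """Gradient of the sigmoid least-squares loss phi_i from (42)."""
    s = sigmoid(sum(ai * xi for ai, xi in zip(a, x)))
    c = -2.0 * (b - s) * s * (1.0 - s)
    return [c * ai for ai in a]

def subsampled_grad(x, A, labels, idx):
    """Average of grad phi_i over the index set idx."""
    g = [0.0] * len(x)
    for i in idx:
        gi = grad_phi(x, A[i], labels[i])
        for j in range(len(x)):
            g[j] += gi[j] / len(idx)
    return g

# Tiny synthetic data set and starting point x0 = 0.
N, n = 200, 3
A = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]
labels = [random.randint(0, 1) for _ in range(N)]
x0 = [0.0] * n

full = subsampled_grad(x0, A, labels, range(N))                    # full gradient
sub = subsampled_grad(x0, A, labels, random.sample(range(N), 20))  # N_0 = 20
nu = 10 * math.sqrt(sum((f - s) ** 2 for f, s in zip(full, sub)))
assert nu > 0  # the two estimates differ, so nu is a positive scale parameter
```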
For each data set, we perform 10 runs of both ${\mathrm{SIRTR}}_{\mathrm{nop}}$ and ${\mathrm{SIRTR}}_{\mathrm{p}}$ and assess their performances by measuring the following metrics:
training loss, given as
${f}_{N}({x}_{k})$ with
${f}_{N}$ defined in (
42);
testing loss, defined as
where
${I}_{{N}_{T}}$ denotes the testing set and ${N}_{T}$ its size;
classification error, defined as
where
${b}_{i}$ denotes the true label of the
ith feature vector of the testing set and
${b}_{i}^{pred}({x}_{k})=max\{sign({a}_{i}^{T}{x}_{k}),0\}$ is the corresponding predicted label at iteration
k.
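The predicted labels and the classification error can be computed as in the following sketch (the tiny testing set is illustrative):

```python
def predicted_label(a, x):
    """Predicted label max(sign(a^T x), 0): 1 if a^T x > 0, else 0."""
    s = sum(ai * xi for ai, xi in zip(a, x))
    return 1 if s > 0 else 0

def classification_error(A_test, b_test, x):
    """Fraction of misclassified feature vectors in the testing set."""
    wrong = sum(1 for a, b in zip(A_test, b_test) if predicted_label(a, x) != b)
    return wrong / len(b_test)

# Tiny synthetic testing set: x classifies the first two points, misses the third.
A_test = [[1.0, 0.0], [-1.0, 0.0], [0.5, 0.5]]
b_test = [1, 0, 0]
x = [1.0, 0.0]
assert predicted_label(A_test[0], x) == 1
assert abs(classification_error(A_test, b_test, x) - 1 / 3) < 1e-12
```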
We note that (
42) can be seen as the optimization problem arising from training a neural network with no hidden layers and the sigmoid function as the activation function for the output layer. Then, as in [
30,
31], we measure the computational cost of evaluating the objective function and its gradient in terms of forward and backward propagations. Namely, we count the number of full function and gradient evaluations by considering the computation of a single function ${\varphi}_{i}$ as equivalent to $\frac{1}{N}$ forward propagations, and the evaluation of a single gradient $\nabla {\varphi}_{i}$ as equivalent to $\frac{2}{N}$ propagations. Regarding
${\mathrm{SIRTR}}_{\mathrm{nop}}$, we note that the computational cost per iteration is determined by
$\frac{{N}_{k+1}^{t}+{N}_{k+1,g}}{N}$ propagations since
${I}_{{N}_{k+1,g}}\subseteq {I}_{{N}_{k+1}^{t}}$. By contrast, the computational cost of
${\mathrm{SIRTR}}_{\mathrm{p}}$ is determined by
$\frac{{N}_{k+1}^{t}}{N}+\frac{2{N}_{k+1,g}}{N}$ propagations, as
${I}_{{N}_{k+1}^{t}}$ and
${I}_{{N}_{k+1,g}}$ are sampled independently from one another. For both algorithms, the computational cost per iteration increases as the iterations proceed; indeed, since
${\delta}_{k}$ tends to zero as
k tends to infinity (Theorem 1), Rules (
44)–(
48) will eventually select the trial sample size
${N}_{k+1}^{t}$ equal to the reference sample size
${\tilde{N}}_{k+1}$, which is increasing geometrically. We expect that the computational cost increases faster in
${\mathrm{SIRTR}}_{\mathrm{p}}$, as this algorithm also requires the gradient sample size
${N}_{k+1,g}$ to increase due to Conditions (
47)–(
49). Finally, we note that the computational cost per iteration of both algorithms is higher than that of the standard stochastic gradient algorithm, which is usually
$\frac{2{N}_{g}}{N}$, with
${N}_{g}$ being a prefixed gradient sample size. However, the increasing sample sizes result in more accurate function and gradient approximations, so the higher computational cost typically translates into a larger reduction in the training loss ${f}_{N}$ per iteration, as seen in previous comparisons of SIRTR with a nonadaptive stochastic approach in [11].
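The two cost accountings described above can be sketched as follows (the sample sizes in the example are arbitrary):

```python
N = 10_000  # data set size (arbitrary for this illustration)

def cost_nop(N_t, N_g):
    """Cost per SIRTR_nop iteration: I_{N_g} is a subset of I_{N_t}, so the
    total is (N_t + N_g) / N propagations, as stated in the text."""
    return (N_t + N_g) / N

def cost_p(N_t, N_g):
    """Cost per SIRTR_p iteration: the two index sets are drawn independently,
    giving N_t / N plus 2 * N_g / N propagations."""
    return N_t / N + 2 * N_g / N

# With the same sample sizes, the independent sampling of SIRTR_p is dearer.
assert cost_p(2000, 500) > cost_nop(2000, 500)
```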
In
Figure 1, we show the decrease in the training loss, testing loss, and classification error (all averaged over the 10 runs) versus the average computational cost for the first 20 propagations. For most data sets, we observe that
${\mathrm{SIRTR}}_{\mathrm{nop}}$ performs comparably or even better than
${\mathrm{SIRTR}}_{\mathrm{p}}$. However, on one of the four data sets (mnist), the accuracy of ${\mathrm{SIRTR}}_{\mathrm{nop}}$ deteriorates after the first propagations, whereas ${\mathrm{SIRTR}}_{\mathrm{p}}$ provides a more accurate classification and a fairly steady decrease in the average training loss, testing loss, and classification error. This different performance of the two algorithms can be explained by looking at
Figure 2, which shows the increase in the percentage ratio
$\frac{100{N}_{k+1}}{N}$ of the sample size
${N}_{k+1}$ over the data set size
N (averaged over the 10 runs) for both algorithms. As we can see, the sample size in
${\mathrm{SIRTR}}_{\mathrm{p}}$ rapidly increases to
$60\%$ of the data set size in the first 50 iterations, whereas the same percentage is achieved by
${\mathrm{SIRTR}}_{\mathrm{nop}}$ only after 150–200 iterations. Overall, we can conclude that the probabilistic requirements of Assumption 4 provide theoretical support for convergence in probability but might be excessively demanding. In fact, the numerical examples show that a slower increase in the sample size than that imposed by Assumption 4 provides a good tradeoff between the computational cost and the classification accuracy.
In
Figure 3, we test the sensitivity of
${\mathrm{SIRTR}}_{\mathrm{p}}$ with respect to the initial penalty parameter
${\theta}_{0}$ by reporting the average training loss versus the average computational cost obtained with three different values of
${\theta}_{0}$. We observe that the performance of the algorithm is not considerably affected by the choice of this parameter, although large oscillations in the average training loss occur on mnist for the smallest value ${\theta}_{0}=0.1$. As a general comment, small initial values of
${\theta}_{0}$ may not be convenient, as the sequence
$\left\{{\theta}_{k}\right\}$ is nonincreasing and small values of
${\theta}_{k}$ promote a decrease in the infeasibility measure
h rather than a decrease in the training loss (see the definition of the actual reduction in (
13)). Similar considerations can be made for
${\mathrm{SIRTR}}_{\mathrm{nop}}$ in
Figure 4.