1. Introduction
Logistic regression (LR) is a widely used statistical classification model in cross-discipline applications such as machine learning [1,2], signal processing [3,4], and computer vision [5,6], to name a few. It is built on the probabilistic relationship between a continuous feature vector and a binary label and is closely related to maximum likelihood estimation. Since the LR model is fitted to the training samples, its out-of-sample performance is strongly affected by the characteristics of the training data. For instance, the LR model exhibits poor out-of-sample performance when the training data are sparse [7]. In practice, regularization is introduced to combat the overfitting triggered by sparse data. In addition, as more and more training data are collected, so-called “adversarial corruptions” arise [8], describing situations in which the data uncertainty cannot be captured by any specific probability distribution. To tackle both the sparsity and the uncertainty of the training data, the distributionally robust optimization (DRO) approach is introduced, which not only provides a probabilistic interpretation of the existing regularization techniques for data sparsity but also immunizes the LR model against the risk caused by data uncertainty [7,8].
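To fix ideas, the distributionally robust LR model studied in [7,8] takes the min–sup form sketched below. The notation here is generic and illustrative (it is not tied to this paper's equation numbering): the worst case is taken over an ambiguity ball of radius ρ centered at the empirical distribution of the N training pairs.

```latex
% Schematic distributionally robust LR model (generic, illustrative notation):
% beta is the classifier, l_beta the logistic loss, and B_rho an ambiguity ball
% of radius rho around the empirical distribution of the N training pairs (x_i, y_i).
\begin{equation*}
  \min_{\beta}\;
  \sup_{\mathbb{Q} \in \mathcal{B}_{\rho}(\widehat{\mathbb{P}}_{N})}
  \mathbb{E}_{\mathbb{Q}}\bigl[\ell_{\beta}(x, y)\bigr],
  \qquad
  \ell_{\beta}(x, y) = \log\bigl(1 + \exp(-y\,\beta^{\top} x)\bigr).
\end{equation*}
```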
The basic idea for solving a DRO problem is to reformulate it into a tractable counterpart and then apply off-the-shelf solvers [9,10,11]. It has been shown in [12,13,14] that DRO problems formulated with statistical measures, e.g., the χ²-divergence, the f-divergence, the Wasserstein metric, and so on, admit exact convex reformulations. In practice, however, solving large-scale DRO problems with off-the-shelf solvers is demanding because most of them rely on general-purpose interior-point algorithms [15]. Thus, it is necessary to devise fast iterative algorithmic frameworks for the convex reformulations of DRO problems. In [16], a stochastic gradient descent framework was proposed for the DRO problem with the f-divergence. In [17], a stochastic gradient descent algorithm was proposed for distributionally robust learning with the χ²-divergence. The χ²-divergence builds on the discrete support of the empirical distribution of the uncertainty, and its confidence bounds yield an asymptotically inexact confidence region, whereas the f-divergence, an extension of the χ²-divergence, yields an asymptotically exact confidence region [18]. Nevertheless, both the χ²-divergence and the f-divergence can only handle training data with identical support. To overcome this hindrance, the Wasserstein metric is introduced to formulate the DRO problem [19,20]; it is a useful instrument for handling heterogeneous training data that may be drawn from either discrete or continuous probability distributions. In [21], a branch-and-bound algorithm combined with a linear approximation scheme was presented to reformulate the DRO problem with the Wasserstein metric into decomposable semi-infinite programs (Table 1).
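For reference, the type-1 Wasserstein distance underlying such ambiguity sets can be written, in generic notation, as the optimal-transport cost below; the symbols are illustrative and are not tied to the distance metric (4) used later in this paper.

```latex
% Type-1 Wasserstein distance between distributions P and Q (generic notation):
% Pi(P, Q) is the set of couplings (joint distributions) whose marginals are P and Q,
% and d(.,.) is the chosen ground metric on the sample space.
\begin{equation*}
  W\!\left(\mathbb{P}, \mathbb{Q}\right)
  \;=\;
  \inf_{\pi \in \Pi(\mathbb{P}, \mathbb{Q})}
  \int d\!\left(\xi, \xi'\right)\, \pi\!\left(\mathrm{d}\xi, \mathrm{d}\xi'\right).
\end{equation*}
```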
Despite the benefits brought by regularization techniques, they simultaneously create a computational obstacle, since the regularization term may be nonsmooth. For instance, LASSO regularization arises widely in robust regression problems [22,23,24]. In [25], the classical proximal gradient method was proposed, which is effective provided that the objective function and its gradient can be evaluated at any given point. Following the proximal gradient idea, a proximal forward-backward splitting (FOBOS) framework was established in [26] for signal recovery problems. Then, in [27], a FOBOS algorithm that incorporates the subgradient of the loss function was devised for regularized convex optimization problems. Likewise, in [28,29], a fast iterative shrinkage-thresholding algorithm (FISTA) that utilizes the gradient of the loss function was proposed for regularized convex optimization problems. In [30], a FOBOS-based quasi-Newton method was proposed for nonsmooth optimization problems. Recently, in [31], a stochastic Douglas–Rachford splitting method was developed for sparse LR, which leverages a proximity operator combined with a stochastic gradient-like step. In [15], a first-order algorithmic framework was developed for the Wasserstein distributionally robust LR model: an ADMM scheme that incorporates the gradient of the log-loss function was devised, and its convergence performance was demonstrated to outperform an off-the-shelf solver invoked through YALMIP.
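For concreteness, a minimal sketch of a single FOBOS iteration for an ℓ1-regularized logistic loss is given below. The function names, the averaging convention, and the use of Python/NumPy are illustrative assumptions; this is not the formulation or implementation developed later in this paper.

```python
import numpy as np

def logistic_loss_grad(beta, X, y):
    """Gradient of the averaged log-loss (1/N) * sum_i log(1 + exp(-y_i * x_i^T beta)),
    with labels y_i in {-1, +1}."""
    margins = y * (X @ beta)
    weights = -y / (1.0 + np.exp(margins))   # derivative of the log-loss w.r.t. the margin
    return X.T @ weights / X.shape[0]

def fobos_l1_step(beta, X, y, eta, lam):
    """One FOBOS iteration: a forward (sub)gradient step on the smooth loss,
    followed by the backward (proximal) step that handles the nonsmooth l1
    regularizer in closed form via soft-thresholding."""
    interim = beta - eta * logistic_loss_grad(beta, X, y)                      # forward step
    return np.sign(interim) * np.maximum(np.abs(interim) - eta * lam, 0.0)     # backward step
```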
Compared with gradient-based methods, nonlinear conjugate gradient (CG) methods are more efficient due to their simplicity and low memory requirements [32,33,34,35,36]. At present, nonlinear CG methods have been generalized to nonsmooth optimization problems. In [37], a modified Hestenes–Stiefel (HS) CG method for nonsmooth convex optimization problems was proposed, whose numerical efficiency was verified on high-dimensional training samples. Furthermore, in [38], a modified CG method that inherits the advantages of both the HS and Dai–Yuan (DY) CG methods was constructed for nonsmooth optimization problems. It should be noted that the works in [37,38] only addressed the convergence of these CG methods and left their convergence rates open. Thus, this paper aims to develop a modified CG method for the Wasserstein distributionally robust LR model under the FOBOS framework, for which we not only prove convergence but also estimate the convergence rate. To the best of our knowledge, similar work has not been considered yet (Table 2).
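For orientation, the classical Hestenes–Stiefel update that the modified schemes in [37,38] build upon is sketched below in generic form. The modified direction formulas developed later in this paper (in Section 2) are not reproduced here, and the small safeguard constant is an illustrative assumption.

```python
import numpy as np

def hs_direction(grad_new, grad_old, d_old, eps=1e-12):
    """Classical Hestenes-Stiefel conjugate gradient update:
    d_new = -grad_new + beta_HS * d_old, where
    beta_HS = grad_new^T (grad_new - grad_old) / (d_old^T (grad_new - grad_old)).
    Modified nonsmooth CG rules typically alter beta_HS so that d_new remains a
    sufficient descent direction; only the unmodified classical rule is shown here."""
    y = grad_new - grad_old
    beta_hs = float(grad_new @ y) / (float(d_old @ y) + eps)  # eps avoids division by zero
    return -grad_new + beta_hs * d_old
```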
The outline of this paper is as follows. Section 2 consists of two parts: Section 2.1 furnishes the basic framework of the Wasserstein distributionally robust LR model, and Section 2.2 establishes the CG method under the FOBOS framework and derives some fundamental results that are vital for the convergence analysis in Section 3. Section 3 presents the convergence analysis of the modified CG method, in which the convergence rate is proven to be O(1/√T) for a predetermined iteration round T and a prescribed error bound; simultaneously, the modified CG method is proven to converge to the optimal solution of the Wasserstein distributionally robust LR model with a nonsummable steplength. Section 4 conducts the numerical experiments. Section 5 concludes with the main contributions and provides future research directions.
3. Convergence Analysis
The conjugate gradient steps (6) and (7) provide the following two propositions.
Proposition 1. For the search direction defined in (8) and (9), inequalities (10) and (11) hold for all t.

Proof. By (8) and (9), the two inequalities follow directly, where the second inequality is obtained from the Cauchy–Schwarz inequality. The proof is completed. □
By (10), we know that the direction generated by (8) and (9) is a descent direction for all t. In addition, since the functions involved are convex and their subgradients are bounded, (11) further yields (12).
Proposition 2. For any vector, inequality (13) holds for all t.

Proof. By (8) and (9), the bound follows, where the third inequality is obtained by utilizing convexity together with the Cauchy–Schwarz inequality. The proof is completed. □
The optimality condition of (7) gives (14), which involves the set of subgradients of the regularization term. Then, substituting (6) into (14), we obtain (15), in which a subgradient of the regularization term at the current iterate appears. Since this subgradient is bounded for all t, there exists a constant such that the corresponding bound holds for all t.
Lemma 1. Suppose that a vector is the optimal solution of (5). Then, inequality (16) holds, with the constant specified therein.

Proof. We begin by deriving the properties of the relevant functions and their subgradients. By (15), we have (17). By convexity, we obtain (18). Substituting (17) into (18), we obtain (19), where the second inequality is derived from the Cauchy–Schwarz inequality and the last inequality follows from the bound established above. Similarly, we have (20) and (21). We then bound the difference between the iterates and the optimal solution recursively, which yields (22). Considering the last term in (22), we obtain (23). Then, inequality (16) holds by substituting (13), (18)–(21), and (23) into (22). The proof is completed. □
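Although the displayed inequalities of Lemma 1 are not reproduced here, recursions of this type in FOBOS-style analyses commonly take the following generic form, stated under a standard bounded-subgradient assumption. This is an illustration in generic notation, not the exact inequality (16) of this paper.

```latex
% Generic recursion behind Lemma-1-type estimates in FOBOS/subgradient analyses
% (illustrative notation, not the exact inequality (16) of this paper):
% x_t is the current iterate, x* the optimum, eta_t the steplength, and G a
% bound on the (sub)gradient norms.
\begin{equation*}
  \|x_{t+1} - x^{\ast}\|^{2}
  \;\le\;
  \|x_{t} - x^{\ast}\|^{2}
  \;-\; 2\,\eta_{t}\bigl(f(x_{t}) - f(x^{\ast})\bigr)
  \;+\; \eta_{t}^{2}\, G^{2}.
\end{equation*}
```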
Lemma 1 establishes a fundamental result for the convergence properties of conjugate gradient steps (6) and (7). Now, we are ready to derive the convergence results based on the preceding discussion.
Theorem 1. Let D be a constant. Suppose that (i) the distance between each iterate and the optimal solution is bounded by D for all t; (ii) a vector is the optimal solution of (5); and (iii) the stated steplength condition holds. Then, (24) holds.

Proof. Rearranging the relevant terms in (16), we obtain (25). The remaining term yields (26), where the second inequality is obtained from the bounds derived above, which hold for all t, and the last inequality is obtained from the steplength condition, which also holds for all t. Then, by properly choosing the parameters such that the required condition is satisfied and combining (25) and (26), we derive (24). The proof is completed. □
In the remainder of this section, the convergence and convergence-rate properties of the conjugate gradient steps (6) and (7) are established.
Theorem 2. Suppose that the initial value is chosen within a neighborhood of the optimal solution. Then, for a nonsummable steplength satisfying the stated conditions for all t, the conjugate gradient descent method (6) and (7) converges to the optimal solution of the DRO problem (2). Moreover, for a predefined number of iterations T, the conjugate gradient descent method (6) and (7) yields a convergence rate of O(1/√T).
Proof. Under the stated steplength, (27) implies (28). Since the steplength is nonsummable, the right-hand side of (28) tends to 0, which indicates that the iterates converge to the optimal solution. Moreover, choosing the steplength in terms of the predefined T for all t, (27) yields a bound that decreases at the rate O(1/√T), which means that the conjugate gradient steps (6) and (7) achieve a convergence rate of O(1/√T). The proof is completed. □
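Although the displayed expressions (24)–(28) are not reproduced here, the standard argument behind a rate of this kind can be summarized in generic notation as follows, assuming the distance to the optimum is bounded by D, the (sub)gradient norms are bounded by G, and a constant steplength is used over the T predetermined iterations. This is an illustration of the argument, not this paper's exact inequality.

```latex
% Standard constant-steplength bound over T predetermined iterations
% (generic notation; an illustration, not this paper's exact inequality (27)):
\begin{equation*}
  \min_{1 \le t \le T} \, f(x_t) - f(x^{\ast})
  \;\le\;
  \frac{D^{2} + G^{2} \sum_{t=1}^{T} \eta_{t}^{2}}{2 \sum_{t=1}^{T} \eta_{t}}
  \;=\;
  \frac{D\,G}{\sqrt{T}}
  \quad \text{for } \eta_{t} \equiv \frac{D}{G\sqrt{T}},
\end{equation*}
which is of order $O(1/\sqrt{T})$.
```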
4. Numerical Experiments
In this section, the infinity norm is adopted in the distance metric (4); the optimization problem (5) then becomes an ℓ1-regularized convex optimization problem. Several numerical experiments are conducted to validate the efficiency and effectiveness of the modified CG method from different aspects. Firstly, a sensitivity analysis is carried out that compares the runtime of the modified CG method under different Wasserstein ball radii and steplengths to show how these parameters affect its performance. Secondly, the runtime of the modified CG method is compared with that of the IPOPT solver (https://github.com/coin-or/Ipopt (accessed on 21 January 2023)), where IPOPT is called via the YALMIP platform [40]. Lastly, the runtime is compared among several algorithmic frameworks for ℓ1-regularized problems, namely the modified CG method, the standard subgradient method, ISTA, and FISTA. All the experiments are conducted using MATLAB R2018a on a laptop running Windows 7 with an Intel(R) Core(TM) i3-2310 CPU (2.1 GHz) and 10 GB RAM. In addition, the experiments are implemented on the real datasets a1a–a9a from LIBSVM (https://www.csie.ntu.edu.tw/∼cjlin/libsvm/index.html (accessed on 23 February 2023)), which are extracted from UCI/Adult (http://archive.ics.uci.edu/ml/index.php (accessed on 23 February 2023)). The data characteristics are listed in Table 3. To begin with, the conjugate parameter is set to 5 and the initial values are chosen randomly within the prescribed neighborhood. To eliminate the impact of these random choices on the runtime, each experiment is repeated five times and the runtimes are recorded; the average runtime is then used to reflect the numerical efficiency of the solution methods. The programs stop once the prescribed stopping criterion is met.
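The reported experiments were run in MATLAB. Purely as an illustration of how the same LIBSVM-format datasets could be loaded in a different environment, a Python/scikit-learn sketch is given below; the file path is an assumption, not taken from the authors' scripts.

```python
# Illustrative only: load one of the LIBSVM a1a-a9a datasets in Python.
# The experiments in this paper were carried out in MATLAB; the file path below is an assumption.
from sklearn.datasets import load_svmlight_file

X, y = load_svmlight_file("a1a", n_features=123)   # the a1a-a9a files share 123 binary features
X = X.toarray()                                     # dense array, convenient for small sketches
print(X.shape, y[:5])                               # labels are +1 / -1
```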
4.1. Runtime Comparison: Among Different Parameters
In this part, the average runtime under different Wasserstein ball radii and steplengths is tested to show how these parameters affect the numerical efficiency of the modified CG method. In particular, for a fixed steplength, the average runtime is tested for five different choices of the Wasserstein ball radius; in addition, for a fixed Wasserstein ball radius, the average runtime is tested for five different choices of the steplength. The experimental outputs are listed in Table 4 and Table 5 and shown in Figure 1 and Figure 2. It can be seen that, as the problem scale increases, the runtime required by the modified CG method increases significantly. For a fixed steplength, a larger Wasserstein ball radius leads to a shorter average runtime. The reason is that the Wasserstein ball radius is inversely proportional to the radius of the searching neighborhood; thus, a larger radius means a smaller searching neighborhood, which results in a shorter average runtime. In addition, it is observed from Table 4 and Figure 2 that the average runtime of the modified CG method can be controlled by the steplength, with one of the five tested steplengths yielding the shortest average runtime.
4.2. Runtime Comparison: Modified CG Method versus IPOPT Solver
In this part, the average runtime of the IPOPT solver is compared with that of the modified CG method. IPOPT is a typical solver built upon the interior-point algorithmic framework. The Wasserstein ball radius and the steplength are fixed. The experimental output is given in Table 6 and Figure 3. It can be observed that the average CPU time of the IPOPT solver far exceeds that of the modified CG method, which demonstrates that the modified CG method is far more efficient than interior-point-based off-the-shelf solvers for large-scale problems.
4.3. Runtime Comparison: Among Different First-Order Algorithmic Frameworks
In this part, the average runtime of four first-order algorithmic frameworks is compared: the subgradient method, ISTA, FISTA, and the modified CG method. The Wasserstein ball radius is fixed, and the steplengths of the subgradient method and the modified CG method are set to the same value. The experimental output is illustrated in Table 7 and Figure 4. It can be observed that the modified CG method is more efficient than the other three widely employed first-order algorithmic frameworks.
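For reference, a generic FISTA loop of the kind included in this comparison is sketched below, following Beck and Teboulle's scheme with an abstract gradient and proximal operator. The steplength, iteration count, and operator arguments are placeholders, not the settings used to produce Table 7; combined with a soft-thresholding proximal operator such as the one in the FOBOS sketch of Section 1, it reproduces the standard ISTA/FISTA baselines for ℓ1-regularized problems.

```python
import numpy as np

def fista(x0, grad_f, prox_r, eta, n_iter):
    """Generic FISTA loop for min_x f(x) + r(x): a gradient step on the smooth
    part f, the proximal operator of r, and Nesterov-style momentum
    (Beck & Teboulle, 2009). All parameters are illustrative placeholders."""
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        x = prox_r(y - eta * grad_f(y), eta)                 # proximal gradient step
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum parameter update
        y = x + ((t - 1.0) / t_next) * (x - x_prev)          # extrapolation step
        x_prev, t = x, t_next
    return x_prev
```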
5. Conclusions
In this paper, we propose a modified CG method under the FOBOS framework for the Wasserstein distributionally robust LR model. The method consists of two phases: in the first phase, a conjugate gradient descent step is performed; in the second phase, an instantaneous optimization problem is solved that trades off minimizing the regularization term against staying close to the interim point obtained in the first phase. With a nonsummable steplength, it is further proven that the modified CG method attains the optimal solution of the Wasserstein distributionally robust LR model. Moreover, the convergence rate of the modified CG method is estimated as O(1/√T). Finally, several numerical experiments are conducted to test the numerical efficiency of the modified CG method; they demonstrate that this method outperforms the off-the-shelf solver and the existing first-order algorithmic frameworks. Future research in this field will focus on two aspects: (i) advancing the algorithmic framework through the exploration of advanced optimization techniques, such as proximal methods, accelerated gradient methods, and stochastic approximation methods, with the aim of refining the optimization algorithms, developing more efficient convergence criteria, and enhancing numerical performance; and (ii) applying distributionally robust LR to real-world domains, including healthcare, finance, marketing, and transportation, which entails examining the practical implications, interpretability, and performance of these models in various contexts. Furthermore, conducting case studies will allow for an assessment of the benefits and limitations of distributionally robust approaches in comparison to stochastic and robust methods.