Article

A Modified Gradient Method for Distributionally Robust Logistic Regression over the Wasserstein Ball

1 College of Economics and Management, Southwest University, Chongqing 400715, China
2 School of Economics, Chongqing Financial and Economic College, Chongqing 401320, China
3 College of Mathematics and Statistics, Chongqing Jiaotong University, Chongqing 400074, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(11), 2431; https://doi.org/10.3390/math11112431
Submission received: 19 March 2023 / Revised: 18 May 2023 / Accepted: 22 May 2023 / Published: 24 May 2023

Abstract: In this paper, a modified conjugate gradient method under the forward-backward splitting framework is proposed to further improve the numerical efficiency of solving the distributionally robust Logistic regression model over the Wasserstein ball. The method comprises two phases: in the first phase, a conjugate gradient descent step is performed, and in the second phase, an instantaneous optimization problem is formulated and solved, trading off minimization of the regularization term against staying in close proximity to the interim point obtained in the first phase. The modified conjugate gradient method is proven to attain the optimal solution of the Wasserstein distributionally robust Logistic regression model with a nonsummable steplength at a convergence rate of $1/\sqrt{T}$. Finally, several numerical experiments are conducted to validate the theoretical analysis; they demonstrate that the proposed method outperforms the off-the-shelf solver and the existing first-order algorithmic frameworks.

1. Introduction

Logistic regression (LR) has been widely used as a statistical classification model in cross-discipline applications such as machine learning [1,2], signal processing [3,4], and computer vision [5,6], to name a few. It is built on the probabilistic relationship between a continuous feature vector and a binary label and is closely related to maximum likelihood estimation. Since the LR model is fitted to the training samples, its out-of-sample performance is strongly affected by the characteristics of the training data. For instance, the LR model exhibits poor out-of-sample performance when the training data are sparse [7]. In practice, the regularization technique is introduced to combat the overfitting triggered by sparse data. In addition, as training data are collected ever more widely, so-called “adversarial corruptions” arise [8], which describe situations in which the data uncertainty cannot be captured by any specific probability distribution. To tackle the sparsity and uncertainty of the training data, the distributionally robust optimization (DRO) approach is introduced, which not only provides a probabilistic interpretation of the existing regularization technique for data sparsity, but also guarantees that the LR model is immune to the risk caused by data uncertainty [7,8].
The basic idea for solving a DRO problem is to reformulate it into a tractable counterpart and then utilize off-the-shelf solvers to tackle the resulting problem [9,10,11]. It has been shown in [12,13,14] that DRO problems formulated with statistical measures, e.g., the $\phi$-divergence, the $f$-divergence, the Wasserstein metric, and so on, possess exact convex reformulations. In practice, however, solving large-scale DRO problems via off-the-shelf solvers is demanding because most of them rely on general-purpose interior-point algorithms [15]. Thus, it is necessary to devise fast iterative algorithmic frameworks for the convex reformulations of DRO problems. In [16], a stochastic gradient descent framework was proposed for the DRO problem with the $f$-divergence. In [17], a stochastic gradient descent algorithm was proposed for distributionally robust learning with the $\phi$-divergence. The $\phi$-divergence builds on the discrete support of the empirical distribution of the uncertainty, and its confidence bounds yield an asymptotically inexact confidence region, while the $f$-divergence is an extension of the $\phi$-divergence whose confidence bounds yield an asymptotically exact confidence region [18]. Nevertheless, both the $\phi$-divergence and the $f$-divergence can only handle training data with identical support. To overcome this hindrance, the Wasserstein metric is introduced to formulate the DRO problem [19,20]; it is a useful instrument for handling heterogeneous training data that can be drawn from either discrete or continuous probability distributions. In [21], a branch-and-bound algorithm combined with a linear approximation scheme was presented to reformulate the DRO problem with the Wasserstein metric into decomposable semi-infinite programs (Table 1).
Despite the benefits brought by regularization techniques, they simultaneously introduce a computational obstacle, since the regularization term may be nonsmooth. For instance, LASSO regularization widely arises in robust regression problems [22,23,24]. In [25], the classical proximal gradient method was proposed, which is proven to be effective provided that the objective function and its gradient can be evaluated at a given point. Following the idea of the proximal gradient, a proximal forward-backward splitting (FOBOS) framework was established in [26] for signal recovery problems. Then, in [27], a FOBOS algorithm was devised for regularized convex optimization problems, in which the subgradient of the loss function was incorporated. Likewise, in [28,29], a fast iterative shrinkage-thresholding algorithm was proposed for regularized convex optimization problems, which utilizes the gradient of the loss function. In [30], a FOBOS-based quasi-Newton method was proposed for nonsmooth optimization problems. Recently, in [31], a stochastic Douglas–Rachford splitting method was developed for sparse LR, which leverages a proximity operator combined with a stochastic gradient-like method. In [15], a first-order algorithmic framework was developed for Wasserstein distributionally robust LR: an ADMM framework incorporating the gradient of the log-loss function was devised, and its convergence performance was demonstrated to outperform the YALMIP-based solver.
Compared with gradient-based methods, nonlinear conjugate gradient (CG) methods are more efficient due to their simplicity and low memory requirements [32,33,34,35,36]. At present, nonlinear CG methods have been generalized to nonsmooth optimization problems. In [37], a modified Hestenes–Stiefel (HS) CG method for nonsmooth convex optimization problems was proposed, whose numerical efficiency was verified on high-dimensional training samples. Furthermore, in [38], a modified CG method that inherits the advantages of both the HS and Dai–Yuan (DY) CG methods was constructed for nonsmooth optimization problems. It is noted that the works in [37,38] merely addressed the convergence of the CG methods, while their convergence rates were ignored. Thus, this paper aims to develop a modified CG method for the Wasserstein distributionally robust LR model under the FOBOS framework, for which not only the convergence can be proven but also the convergence rate can be estimated. To the best of our knowledge, similar work has not been considered yet (Table 2).
The outline of this paper is as follows. Section 2 consists of two parts: Section 2.1 furnishes the basic framework of the Wasserstein distributionally robust LR model, and Section 2.2 establishes the modified CG method under the FOBOS framework, in which some fundamental results that are vital for the convergence analysis in Section 3 are also derived. Section 3 presents the convergence analysis of the modified CG method, in which the convergence rate is proven to be $1/\sqrt{T}$, where $T$ denotes the predetermined number of iterations; simultaneously, the modified CG method is proven to converge to the optimal solution of the Wasserstein distributionally robust LR model with a nonsummable steplength. Section 4 conducts the numerical experiments. Section 5 summarizes the main contributions and provides future research directions.

2. Preliminaries

2.1. Distributionally Robust Logistic Regression

Let $\mathbb{R}^d$ denote the $d$-dimensional Euclidean space. Let $\hat{x} \in \mathbb{R}^d$ be a feature vector and $\hat{y} \in \{-1, +1\}$ be the associated binary label to be predicted. In Logistic regression, the conditional distribution of $\hat{y}$ given $\hat{x}$ is modeled as
$$\mathbb{P}\left[\hat{y} \mid \hat{x}\right] = \left(1 + e^{-\hat{y}\langle \beta, \hat{x}\rangle}\right)^{-1},$$
where the weight vector $\beta \in \mathbb{R}^d$ constitutes the regression parameter [8]. Then, the maximum likelihood estimator of the classical Logistic regression is formulated by solving the following program:
$$\min_{\beta \in \mathbb{R}^d} \; \frac{1}{N}\sum_{i=1}^{N} l_\beta\left(\hat{x}_i, \hat{y}_i\right),$$
where $\left(\hat{x}_i, \hat{y}_i\right)_{i=1}^{N}$ are the given training samples, $\beta$ is the unknown parameter to be estimated, and $l_\beta\left(\hat{x}_i, \hat{y}_i\right) = \log\left(1 + e^{-\hat{y}_i\langle \beta, \hat{x}_i\rangle}\right)$. Hereafter, for presentation convenience, define
$$f(\beta) = \frac{1}{N}\sum_{i=1}^{N} l_\beta\left(\hat{x}_i, \hat{y}_i\right).$$
Throughout this paper, the convex function is defined as follows.
Definition 1.
A function $f: \mathbb{R}^d \to \mathbb{R}$ is called a convex function if, for any $\beta_1, \beta_2 \in \mathbb{R}^d$ and $0 \le \lambda \le 1$, the following inequality holds:
$$f\left(\lambda\beta_1 + (1-\lambda)\beta_2\right) \le \lambda f\left(\beta_1\right) + (1-\lambda) f\left(\beta_2\right).$$
By Definition 1, it is known that $l_\beta\left(\hat{x}_i, \hat{y}_i\right)$ is convex with respect to $\beta \in \mathbb{R}^d$. Thus, $f(\beta)$ is also convex with respect to $\beta \in \mathbb{R}^d$.
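For concreteness, the empirical log-loss $f(\beta)$ and its gradient can be sketched in a few lines of NumPy. This is an illustrative snippet only: the array names `X`, `y` and the helper functions are ours, not the paper's, and the labels are assumed to be encoded as $\pm 1$.

```python
import numpy as np

def log_loss(beta, X, y):
    """f(beta) = (1/N) * sum_i log(1 + exp(-y_i * <beta, x_i>))."""
    margins = y * (X @ beta)                      # y_i * <beta, x_i>
    return np.mean(np.logaddexp(0.0, -margins))   # numerically stable log(1 + e^{-m})

def log_loss_grad(beta, X, y):
    """Gradient of f: -(1/N) * sum_i sigma(-y_i <beta, x_i>) * y_i * x_i."""
    margins = y * (X @ beta)
    weights = 1.0 / (1.0 + np.exp(margins))       # sigma(-margin)
    return -(X * (weights * y)[:, None]).mean(axis=0)

# toy usage: N = 5 samples, d = 3 features, labels in {-1, +1}
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.choice([-1.0, 1.0], size=5)
print(log_loss(np.zeros(3), X, y))                # equals log(2) at beta = 0
print(log_loss_grad(np.zeros(3), X, y))
```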
In the standard statistical learning setting, all training and test samples are drawn independently from a distribution $\mathbb{P}$ supported on $\Xi = \mathbb{R}^d \times \{-1, +1\}$. If $\mathbb{P}$ is exactly available, the weight parameter $\beta$ is found by solving the following stochastic optimization problem:
$$\min_{\beta \in \mathbb{R}^d} \; \mathbb{E}_{\mathbb{P}}\left[l_\beta\left(\hat{x}, \hat{y}\right)\right].$$
In practice, however, the distribution $\mathbb{P}$ is always unavailable. Instead, it has to be estimated from the independently observed training samples $\left(\hat{x}_i, \hat{y}_i\right)_{i=1}^{N}$. The distributionally robust optimization (DRO) approach provides an alternative perspective: it seeks the potential true distribution, with high confidence, within an ambiguity set constructed from such training samples. In this case, the following DRO problem is considered:
$$\inf_{\beta \in \mathbb{R}^d} \; \sup_{\mathbb{Q} \in \mathbb{B}_\varepsilon\left(\hat{\mathbb{P}}_N\right)} \mathbb{E}_{\mathbb{Q}}\left[l_\beta\left(\hat{x}, \hat{y}\right)\right], \tag{2}$$
where $\mathbb{B}_\varepsilon\left(\hat{\mathbb{P}}_N\right) = \left\{\mathbb{Q} : W\left(\mathbb{Q}, \hat{\mathbb{P}}_N\right) \le \varepsilon\right\}$ denotes the Wasserstein ball of radius $\varepsilon$ centered at the empirical distribution $\hat{\mathbb{P}}_N$, which is constructed from the training samples $\left(\hat{x}_i, \hat{y}_i\right)_{i=1}^{N}$. Hereafter, this paper only considers $\varepsilon > 0$. $W(\cdot, \cdot)$ denotes the Wasserstein distance between two distributions, which is defined as follows.
Definition 2
([7]). Let two distributions $\mathbb{P}_1$ and $\mathbb{P}_2$ share a common support set $\Xi$, and let $d(\cdot, \cdot)$ denote a metric on $\Xi \times \Xi$. The Wasserstein distance between $\mathbb{P}_1$ and $\mathbb{P}_2$ is defined as
$$W\left(\mathbb{P}_1, \mathbb{P}_2\right) = \inf_{\Pi} \left\{ \int_{\Xi^2} d\left(\xi, \xi'\right)\, \Pi\left(\mathrm{d}\xi, \mathrm{d}\xi'\right) \;:\; \Pi\left(\mathrm{d}\xi, \Xi\right) = \mathbb{P}_1\left(\mathrm{d}\xi\right),\; \Pi\left(\Xi, \mathrm{d}\xi'\right) = \mathbb{P}_2\left(\mathrm{d}\xi'\right) \right\},$$
where $\xi = \left(\hat{x}, \hat{y}\right)$, $\xi' = \left(\hat{x}', \hat{y}'\right)$, and the infimum is taken over all joint distributions $\Pi$ of $\xi$ and $\xi'$ on $\Xi^2$; $\mathbb{P}_1$ and $\mathbb{P}_2$ can be regarded as the marginal distributions of $\Pi$ with respect to $\xi$ and $\xi'$, respectively.
The Wasserstein distance is highly related to the well-known optimal transport problem, in which the distance metric $d(\cdot, \cdot)$ is interpreted as the unit mass transport cost from one element in $\Xi$ to another, which means that $W\left(\mathbb{P}_1, \mathbb{P}_2\right)$ denotes the minimum expected transport cost among the family of joint distributions [39].
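To make the optimal transport interpretation concrete, the sketch below computes the Wasserstein distance between two small discrete distributions by solving the transport linear program with `scipy.optimize.linprog`. The supports, weights, and absolute-value ground cost are made-up illustrative data, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# two discrete distributions on scalar supports (illustrative data)
xi1, p1 = np.array([0.0, 1.0, 2.0]), np.array([0.5, 0.3, 0.2])
xi2, p2 = np.array([0.5, 1.5]), np.array([0.6, 0.4])

# ground cost d(xi, xi') = |xi - xi'| between every pair of support points
C = np.abs(xi1[:, None] - xi2[None, :])

# decision variable: transport plan Pi, flattened row-wise; its marginals must equal p1 and p2
m, n = C.shape
A_eq, b_eq = [], []
for i in range(m):                  # row sums of Pi equal p1
    row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
    A_eq.append(row); b_eq.append(p1[i])
for j in range(n):                  # column sums of Pi equal p2
    col = np.zeros(m * n); col[j::n] = 1.0
    A_eq.append(col); b_eq.append(p2[j])

res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
              bounds=[(0, None)] * (m * n), method="highs")
print("Wasserstein distance:", res.fun)   # minimum expected transport cost
```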
Obviously, the DRO problem (2) is intractable in its present form. In order to solve it via numerical methods, a tractable reformulation has to be found. In [7,15], a tractable reformulation of (2) was established by considering the distance metric of type (3),
$$d\left((x, y), \left(x', y'\right)\right) = \left\|x - x'\right\|_p + \frac{\kappa}{2}\left|y - y'\right|, \tag{3}$$
where $\kappa$ is a positive constant and $\|\cdot\|_p$ is the $p$-norm on $\mathbb{R}^d$. The constant $\kappa$ represents the trust in the labels of the training samples: setting $\kappa = +\infty$ prevents different labels from being assigned to the same feature vector and thus reflects exact label measurements. Hence, the following distance metric on $\Xi$ is considered in this paper:
$$d\left((x, y), \left(x', y'\right)\right) = \begin{cases} \left\|x - x'\right\|, & \text{if } y = y', \\ \infty, & \text{otherwise}. \end{cases} \tag{4}$$
According to the reformulation results in [24], the DRO problem (2) with distance metric (4) can be reformulated into the following unconstrained optimization problem:
$$\min_{\beta \in \mathbb{R}^d} \; r(\beta) + f(\beta), \tag{5}$$
where $r(\beta) = \varepsilon\|\beta\|_1$ and $f(\beta) = \frac{1}{N}\sum_{i=1}^{N} l_\beta\left(\hat{x}_i, \hat{y}_i\right)$.
Remark 1.
The regularization term $r(\beta) = \varepsilon\|\beta\|_1$ in optimization problem (5) is usually referred to as the LASSO penalty, which is nondifferentiable at the origin. In practice, the LASSO regularization technique is an effective and widely investigated tool for sparse signal processing, including sparse signal reconstruction and sparsity pattern recovery [22].
Remark 2.
In [15], an ADMM-based algorithmic framework was proposed for a modified distributionally robust Logistic regression model, in which the gradient of $f(\beta)$ was incorporated. In practice, however, applications in statistical learning are often of large scale. For large-scale problems, conjugate gradient methods outperform gradient methods in both numerical efficiency and convergence performance due to their simplicity and low memory requirement [38]. In the following, an improved algorithmic framework that takes the conjugate gradient method into account is developed.

2.2. The Modified CG Method

To address the convex optimization problem (5), in which $f(\beta)$ is smooth and $r(\beta)$ is nonsmooth, the FOBOS framework is employed with conjugate gradient steps as a major ingredient. FOBOS is an extension of the projected gradient method: it replaces or augments the projection step with an instantaneous minimization problem that handles the nonsmooth term $r(\beta)$, taking analytical minimization steps interleaved with (conjugate) gradient steps [26,27]. The FOBOS framework consists of the two steps shown in (6) and (7). The first is a conjugate gradient descent step (6), which finds an interim vector $\beta_{t+\frac{1}{2}}$ along the searching direction $d_t$ that achieves a lower value of $f(\beta)$; the second is a proximal step (7), which finds an update vector $\beta_{t+1}$ that not only stays close to the interim point $\beta_{t+\frac{1}{2}}$ but also attains a low complexity value as measured by $r(\beta)$.
$$\beta_{t+\frac{1}{2}} = \beta_t + \alpha_t d_t, \tag{6}$$
$$\beta_{t+1} = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \left\{ \frac{1}{2}\left\|\beta - \beta_{t+\frac{1}{2}}\right\|_2^2 + \alpha_t r(\beta) \right\}, \tag{7}$$
where $\alpha_t$ is the searching steplength and $d_t$ is the searching direction, which is defined as follows:
$$d_{t+1} = -g_{t+1}^f + \vartheta_{t+1} d_t, \tag{8}$$
where $d_0 = -g_0^f$, $g_{t+1}^f$ is the gradient of $f(\beta)$ at $\beta_{t+1}$, and $\vartheta_{t+1}$ is the conjugate parameter, which is defined as
$$\vartheta_{t+1} = \frac{\left\|g_{t+1}^f\right\|_2}{\eta\left\|d_t\right\|_2}, \tag{9}$$
where $\eta > 1$ is a constant.
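To illustrate how steps (6)–(9) fit together, a minimal NumPy sketch of the modified CG iteration under the FOBOS framework is given below. It uses the componentwise soft-thresholding solution of (7) for $r(\beta) = \varepsilon\|\beta\|_1$; the function names, the small numerical safeguard in the conjugate parameter, and the toy data are our own illustrative choices and do not reproduce the authors' implementation.

```python
import numpy as np

def soft_threshold(z, tau):
    """Closed-form solution of argmin_b 0.5*||b - z||_2^2 + tau*||b||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def modified_cg_fobos(grad_f, beta0, eps, steplength, eta=5.0, tol=1e-3, max_iter=10000):
    """Modified CG method under FOBOS for min_b eps*||b||_1 + f(b) (illustrative sketch)."""
    beta = beta0.copy()
    d = -grad_f(beta)                                       # d_0 = -g_0^f
    for t in range(max_iter):
        alpha = steplength(t)
        beta_half = beta + alpha * d                        # step (6): CG descent to interim point
        beta_next = soft_threshold(beta_half, alpha * eps)  # step (7): proximal step on r
        g_next = grad_f(beta_next)
        theta = np.linalg.norm(g_next) / (eta * np.linalg.norm(d) + 1e-12)  # (9), with safeguard
        d = -g_next + theta * d                             # step (8): new search direction
        if np.linalg.norm(beta_next - beta) <= tol:         # stopping rule used in Section 4
            return beta_next
        beta = beta_next
    return beta

# toy usage with a logistic-loss gradient and one of the steplengths considered in Section 4
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = rng.choice([-1.0, 1.0], size=50)
grad_f = lambda b: -(X * ((1.0 / (1.0 + np.exp(y * (X @ b)))) * y)[:, None]).mean(axis=0)
beta_hat = modified_cg_fobos(grad_f, np.zeros(10), eps=0.1,
                             steplength=lambda t: 1.0 / (t + 1) ** (1.0 / 6.0))
```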
Remark 3.
In [37,38], modified CG methods were proposed under the Moreau–Yosida regularization framework for nonsmooth convex optimization problems, in which only the convergence of the proposed methods was considered, while the convergence rate was ignored. In this paper, an alternative solution approach is proposed that develops a modified CG method under the FOBOS framework, for which not only the convergence can be analyzed but also the convergence rate can be estimated.

3. Convergence Analysis

The conjugate gradient steps (6) and (7) lead to the following two propositions.
Proposition 1.
For the searching direction defined in (8) and (9), we have
$$\left\langle g_{t+1}^f, d_{t+1} \right\rangle \le -\left(1 - \frac{1}{\eta}\right)\left\|g_{t+1}^f\right\|_2^2 \tag{10}$$
and
$$\left\|d_{t+1}\right\|_2 \le \left(1 + \frac{1}{\eta}\right)\left\|g_{t+1}^f\right\|_2 \tag{11}$$
for all $t \ge 0$.
Proof. 
When $t \ge 0$, by (8) and (9), we have
$$\left\langle g_{t+1}^f, d_{t+1} \right\rangle = -\left\|g_{t+1}^f\right\|_2^2 + \frac{\left\|g_{t+1}^f\right\|_2}{\eta\left\|d_t\right\|_2}\left\langle g_{t+1}^f, d_t \right\rangle \le -\left\|g_{t+1}^f\right\|_2^2 + \frac{\left\|g_{t+1}^f\right\|_2}{\eta\left\|d_t\right\|_2}\left\|g_{t+1}^f\right\|_2\left\|d_t\right\|_2 = -\left(1 - \frac{1}{\eta}\right)\left\|g_{t+1}^f\right\|_2^2,$$
where the inequality is obtained by the Cauchy–Schwarz inequality.
On the other hand,
$$\left\|d_{t+1}\right\|_2 \le \left\|g_{t+1}^f\right\|_2 + \frac{\left\|g_{t+1}^f\right\|_2}{\eta\left\|d_t\right\|_2}\left\|d_t\right\|_2 = \left(1 + \frac{1}{\eta}\right)\left\|g_{t+1}^f\right\|_2.$$
The proof is completed. □
By (10), we know that $d_t$ is a descent direction, that is, $\left\langle g_t^f, d_t \right\rangle < 0$ for all $t \ge 0$. In addition, notice that $f(\beta) = \frac{1}{N}\sum_{i=1}^{N} l_\beta\left(\hat{x}_i, \hat{y}_i\right)$ with $l_\beta\left(\hat{x}_i, \hat{y}_i\right) = \log\left(1 + e^{-\hat{y}_i\langle \beta, \hat{x}_i\rangle}\right)$. Thus, $f(\beta)$ is convex and $\left\|\nabla f(\beta)\right\|_2$ is bounded by $L_f = \max_{1 \le i \le N}\left\|\hat{y}_i \hat{x}_i\right\|_2$, that is, $\left\|g_t^f\right\|_2 \le L_f$ for all $t \ge 0$. Then, by (11), we obtain
$$\left\|d_t\right\|_2 \le \left(1 + \frac{1}{\eta}\right) L_f. \tag{12}$$
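The gradient bound used above is easy to check numerically. The following sketch, on synthetic data of our own choosing, verifies that $\left\|\nabla f(\beta)\right\|_2 \le L_f = \max_{1 \le i \le N}\left\|\hat{y}_i \hat{x}_i\right\|_2$ at randomly drawn points.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 15))                 # illustrative data
y = rng.choice([-1.0, 1.0], size=200)

def grad_f(b):
    w = 1.0 / (1.0 + np.exp(y * (X @ b)))      # sigma(-y_i <b, x_i>)
    return -(X * (w * y)[:, None]).mean(axis=0)

L_f = np.max(np.linalg.norm(y[:, None] * X, axis=1))   # L_f = max_i ||y_i x_i||_2
grad_norms = [np.linalg.norm(grad_f(rng.normal(size=15))) for _ in range(100)]
print(max(grad_norms) <= L_f)                  # the bound ||grad f(b)||_2 <= L_f holds
```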
Proposition 2.
For any vector $\beta^* \in \mathbb{R}^d$, we have
$$\left\langle d_t, \beta_{t+1} - \beta^* \right\rangle \le \left\langle -g_t^f, \beta_{t+1} - \beta_t \right\rangle + f\left(\beta^*\right) - f\left(\beta_t\right) + \frac{1}{2}\left\|\beta_{t+1} - \beta^*\right\|_2^2 + \frac{L_f^2}{2\eta^2} \tag{13}$$
for all $t \ge 0$.
Proof. 
By (8) and (9), we obtain
$$\begin{aligned} \left\langle d_t, \beta_{t+1} - \beta^* \right\rangle &= \left\langle -g_t^f + \vartheta_t d_{t-1}, \beta_{t+1} - \beta^* \right\rangle \\ &= \left\langle -g_t^f, \beta_{t+1} - \beta^* \right\rangle + \frac{\left\|g_t^f\right\|_2}{\eta\left\|d_{t-1}\right\|_2}\left\langle d_{t-1}, \beta_{t+1} - \beta^* \right\rangle \\ &\le \left\langle -g_t^f, \beta_{t+1} - \beta_t \right\rangle + f\left(\beta^*\right) - f\left(\beta_t\right) + \frac{1}{2}\left\|\beta_{t+1} - \beta^*\right\|_2^2 + \frac{L_f^2}{2\eta^2}, \end{aligned}$$
where the last inequality is obtained by utilizing the convexity of $f(\beta)$ and the bound $\langle a, b \rangle \le \frac{1}{2}\langle a, a \rangle + \frac{1}{2}\langle b, b \rangle$, which follows from the Cauchy–Schwarz inequality. The proof is completed. □
The optimality condition of (7) gives that
$$0 \in \partial\left(\frac{1}{2}\left\|\beta - \beta_{t+\frac{1}{2}}\right\|_2^2 + \alpha_t r(\beta)\right)\bigg|_{\beta = \beta_{t+1}}, \tag{14}$$
where $\partial(\cdot)$ denotes the subdifferential (the set of subgradients). Then, substituting (6) into (14), we have
$$\beta_{t+1} = \beta_t + \alpha_t d_t - \alpha_t g_{t+1}^r, \tag{15}$$
where $g_{t+1}^r$ is a subgradient of $r(\beta)$ at $\beta_{t+1}$. Since $r(\beta) = \varepsilon\|\beta\|_1$, the subgradients of $r(\beta)$ are bounded for all $\beta \in \mathbb{R}^d$; thus, there exists a constant $L_r$ such that $\left\|g_{t+1}^r\right\|_2 \le L_r$ for all $t \ge 0$.
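Since $r(\beta) = \varepsilon\|\beta\|_1$ is separable, the minimization in (7) also admits the well-known componentwise soft-thresholding solution (a standard property of the $\ell_1$ proximal operator, stated here for completeness rather than taken from the paper):
$$\left[\beta_{t+1}\right]_j = \operatorname{sign}\left(\left[\beta_{t+\frac{1}{2}}\right]_j\right)\max\left\{\left|\left[\beta_{t+\frac{1}{2}}\right]_j\right| - \alpha_t \varepsilon,\; 0\right\}, \qquad j = 1, \ldots, d,$$
which is how the proximal step (7) can be evaluated in closed form in practice.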
Lemma 1.
Suppose the vector $\beta^* \in \mathbb{R}^d$ is the optimal solution of (5). Then, the following inequality holds:
$$\left(1 - \alpha_t\right)\left\|\beta_{t+1} - \beta^*\right\|_2^2 \le \left\|\beta_t - \beta^*\right\|_2^2 + 2\alpha_t\left[r\left(\beta^*\right) - r\left(\beta_{t+1}\right)\right] + 2\alpha_t\left[f\left(\beta^*\right) - f\left(\beta_t\right)\right] + \frac{L^2}{\eta^2}\alpha_t + 12 L^2\alpha_t^2, \tag{16}$$
where $L = \max\left\{\left(1 + \frac{1}{\eta}\right) L_f,\; L_r\right\}$.
Proof. 
We begin by deriving properties of $r(\beta)$ and its subgradients. By (15), we have
$$\beta_{t+1} - \beta_t = \alpha_t d_t - \alpha_t g_{t+1}^r. \tag{17}$$
Since $r(\beta)$ is convex, we obtain
$$\left\langle g_{t+1}^r, \beta_{t+1} - \beta^* \right\rangle \ge r\left(\beta_{t+1}\right) - r\left(\beta^*\right). \tag{18}$$
In addition, by (17), we obtain
$$\left\langle g_{t+1}^r, \beta_{t+1} - \beta_t \right\rangle = \left\langle g_{t+1}^r, \alpha_t d_t - \alpha_t g_{t+1}^r \right\rangle \le \left\|g_{t+1}^r\right\|_2\left\|\alpha_t d_t - \alpha_t g_{t+1}^r\right\|_2 \le \alpha_t\left\|g_{t+1}^r\right\|_2\left\|d_t\right\|_2 + \alpha_t\left\|g_{t+1}^r\right\|_2^2 \le 2\alpha_t L^2, \tag{19}$$
where the first inequality is derived from the Cauchy–Schwarz inequality, and the last inequality follows from (12) and $L = \max\left\{\left(1+\frac{1}{\eta}\right) L_f, L_r\right\}$. Similarly, we have
$$\left\langle d_t, \beta_{t+1} - \beta_t \right\rangle \le 2\alpha_t L^2 \tag{20}$$
and
$$\left\langle -g_t^f, \beta_{t+1} - \beta_t \right\rangle \le 2\alpha_t L^2. \tag{21}$$
We then derive a recursive bound on the distance between $\beta_{t+1}$ and $\beta^*$:
$$\begin{aligned} \left\|\beta_{t+1} - \beta^*\right\|_2^2 &= \left\|\beta_t - \beta^* + \alpha_t d_t - \alpha_t g_{t+1}^r\right\|_2^2 \\ &= \left\|\beta_t - \beta^*\right\|_2^2 + 2\left\langle \beta_t - \beta^*, \alpha_t d_t - \alpha_t g_{t+1}^r \right\rangle + \left\|\alpha_t d_t - \alpha_t g_{t+1}^r\right\|_2^2 \\ &= \left\|\beta_t - \beta^*\right\|_2^2 + 2\left\langle \beta_{t+1} - \beta^*, \alpha_t d_t - \alpha_t g_{t+1}^r \right\rangle + 2\left\langle \beta_t - \beta_{t+1}, \alpha_t d_t - \alpha_t g_{t+1}^r \right\rangle + \alpha_t^2\left\|d_t - g_{t+1}^r\right\|_2^2. \end{aligned} \tag{22}$$
Considering the last term in (22), we obtain
$$\left\|d_t - g_{t+1}^r\right\|_2^2 \le \left(\left\|d_t\right\|_2 + \left\|g_{t+1}^r\right\|_2\right)^2 \le 4L^2. \tag{23}$$
Then, inequality (16) holds by substituting (13), (18)–(21), and (23) into (22). The proof is completed. □
Lemma 1 establishes a fundamental result for the convergence properties of conjugate gradient steps (6) and (7). Now, we are ready to derive the convergence results based on the preceding discussion.
Theorem 1.
Let $D > 0$ and $\lambda > 0$ be constants. Suppose that (i) $\alpha_t \ge \alpha_{t+1} > 0$ for all $t \ge 0$; (ii) the vector $\beta^* \in \mathbb{R}^d$ is the optimal solution of (5); and (iii) $\left\|\beta_0 - \beta^*\right\|_2 \le \lambda D$. Then, we have
$$\sum_{t=0}^{T} 2\alpha_t\left[f\left(\beta_t\right) + r\left(\beta_t\right) - f\left(\beta^*\right) - r\left(\beta^*\right)\right] \le \lambda^2 D^2 + 12 L^2 \sum_{t=0}^{T} \alpha_t^2. \tag{24}$$
Proof. 
Rearranging the terms $f\left(\beta^*\right) - f\left(\beta_t\right)$ and $r\left(\beta^*\right) - r\left(\beta_{t+1}\right)$ in (16) and summing the resulting inequality over $t = 0, 1, \ldots, T$, we obtain
$$\sum_{t=0}^{T} 2\alpha_t\left[f\left(\beta_t\right) - f\left(\beta^*\right) + r\left(\beta_{t+1}\right) - r\left(\beta^*\right)\right] \le \left\|\beta_0 - \beta^*\right\|_2^2 - \left(1 - \frac{4\alpha_t L}{\eta}\right)\left\|\beta_{T+1} - \beta^*\right\|_2^2 + 12 L^2 \sum_{t=0}^{T} \alpha_t^2. \tag{25}$$
The term $\sum_{t=0}^{T} 2\alpha_t\left[r\left(\beta_{t+1}\right) - r\left(\beta^*\right)\right]$ satisfies
$$\begin{aligned} \sum_{t=0}^{T} 2\alpha_t\left[r\left(\beta_{t+1}\right) - r\left(\beta^*\right)\right] &\ge \sum_{t=0}^{T} 2\alpha_t\left[r\left(\beta_t\right) - r\left(\beta^*\right)\right] + 2\alpha_T\left[r\left(\beta_{T+1}\right) - r\left(\beta^*\right)\right] - 2\alpha_0\left[r\left(\beta_0\right) - r\left(\beta^*\right)\right] \\ &\ge \sum_{t=0}^{T} 2\alpha_t\left[r\left(\beta_t\right) - r\left(\beta^*\right)\right] + 2\left(\alpha_0 - \alpha_T\right) r\left(\beta^*\right) \\ &\ge \sum_{t=0}^{T} 2\alpha_t\left[r\left(\beta_t\right) - r\left(\beta^*\right)\right], \end{aligned} \tag{26}$$
where the second inequality is obtained since $\alpha_t > 0$ for all $t \ge 0$ and $r(\beta) \ge 0$ for all $\beta \in \mathbb{R}^d$, and the last inequality is obtained since the steplength satisfies $\alpha_t \ge \alpha_{t+1}$ for all $t \ge 0$. Then, since $\beta_0$ is chosen such that $\left\|\beta_0 - \beta^*\right\|_2 \le \lambda D$, combining (25) and (26) yields (24). The proof is completed. □
In the remainder of this section, the convergence properties of the conjugate gradient steps (6) and (7) are established.
Theorem 2.
Suppose the initial value $\beta_0$ is chosen within a $\lambda D$-neighborhood of $\beta^*$. Then, for a steplength sequence $\{\alpha_t\}$ that satisfies $\alpha_t \ge \alpha_{t+1} > 0$ for all $t \ge 0$ and $\sum_{t=0}^{\infty} \alpha_t = \infty$, the conjugate gradient descent method (6) and (7) converges to the optimal solution of the DRO problem (2). Moreover, for a predefined number of iterations $T$, the conjugate gradient descent method (6) and (7) yields a convergence rate of $1/\sqrt{T}$.
Proof. 
By Theorem 1, we have
$$\min_{t = 1, 2, \ldots, T}\left[f\left(\beta_t\right) + r\left(\beta_t\right)\right] - f\left(\beta^*\right) - r\left(\beta^*\right) \le \frac{\lambda^2 D^2 + 12 L^2 \sum_{t=0}^{T} \alpha_t^2}{\sum_{t=0}^{T} 2\alpha_t}. \tag{27}$$
Letting $T \to \infty$, (27) implies that
$$\liminf_{t \to \infty}\left[f\left(\beta_t\right) + r\left(\beta_t\right)\right] - f\left(\beta^*\right) - r\left(\beta^*\right) \le \frac{\lambda^2 D^2 + 12 L^2 \sum_{t=0}^{\infty} \alpha_t^2}{2\sum_{t=0}^{\infty} \alpha_t}. \tag{28}$$
Since $\sum_{t=0}^{\infty} \alpha_t = \infty$, the right-hand side of (28) tends to 0, which indicates that
$$\liminf_{t \to \infty}\left[f\left(\beta_t\right) + r\left(\beta_t\right)\right] = f\left(\beta^*\right) + r\left(\beta^*\right). \tag{29}$$
Moreover, let $\alpha_t = \frac{1}{2\sqrt{T+1}}$ for all $t \ge 0$ and $\lambda = \frac{1}{\sqrt{T+1}}$. Then $\sum_{t=0}^{T} 2\alpha_t = \sqrt{T+1}$ and $\sum_{t=0}^{T} \alpha_t^2 = \frac{1}{4}$, so (27) yields that
$$\min_{t = 1, 2, \ldots, T}\left[f\left(\beta_t\right) + r\left(\beta_t\right)\right] \le f\left(\beta^*\right) + r\left(\beta^*\right) + \frac{D^2 + 3 L^2}{\sqrt{T+1}}, \tag{30}$$
which means that the conjugate gradient steps (6) and (7) yield a convergence rate of $1/\sqrt{T}$. The proof is completed. □

4. Numerical Experiments

In this section, the infinity norm is used in the distance metric (4); then, the optimization problem (5) becomes an $\ell_1$-regularized convex optimization problem. Several numerical experiments are conducted to validate the efficiency and effectiveness of the modified CG method from different aspects. Firstly, a sensitivity analysis is carried out to compare the runtime of the modified CG method under different Wasserstein ball radii $\varepsilon$ and steplengths $\{\alpha_t\}_{t=0}^{\infty}$, to show how these parameters affect its performance. Secondly, the runtime of the modified CG method is compared with that of the IPOPT solver (https://github.com/coin-or/Ipopt (accessed on 21 January 2023)), where the IPOPT solver is implemented on the YALMIP platform [40]. Lastly, the runtime is compared among several algorithmic frameworks for $\ell_1$-regularized problems, including the modified CG method, the standard subgradient method, the iterative shrinkage-thresholding algorithm (ISTA), and its fast variant (FISTA). All the experiments are conducted using MATLAB R2018a on a laptop running Windows 7 with an Intel(R) Core(TM) i3-2310 CPU (2.1 GHz) and 10 GB RAM. In addition, the experiments are implemented on the real datasets a1a–a9a from LIBSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/index.html (accessed on 23 February 2023)), which are extracted from UCI/Adult (http://archive.ics.uci.edu/ml/index.php (accessed on 23 February 2023)). The data characteristics are listed in Table 3. The conjugate parameter $\eta$ is set to 5, and the initial value of $\beta$ is chosen randomly within the neighborhood $\left\{\beta \in \mathbb{R}^d : \|\beta\| \le 0.2785/\varepsilon\right\}$. To eliminate the impact of the random choice of $\beta_0$ on the runtime, each experiment is repeated five times and the runtimes are recorded; the average runtime is then used to reflect the numerical efficiency of the solution methods. The programs stop when $\left\|\beta_{t+1} - \beta_t\right\|_2 \le 10^{-3}$.
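For reproducibility, the timing protocol described above (random initialization in the stated neighborhood, five repetitions, averaged wall-clock time) might be scripted roughly as follows; the function, its name, and the 2-norm initialization ball are our own illustrative assumptions, and `solver` stands for any of the compared methods, e.g., the modified CG sketch from Section 2.2.

```python
import time
import numpy as np

def average_runtime(solver, d, eps, repeats=5, seed=0):
    """Average wall-clock time of `solver` over `repeats` random initial points,
    drawn from the ball {beta : ||beta||_2 <= 0.2785/eps} described in Section 4."""
    rng = np.random.default_rng(seed)
    radius = 0.2785 / eps
    times = []
    for _ in range(repeats):
        direction = rng.normal(size=d)
        beta0 = radius * rng.uniform() * direction / np.linalg.norm(direction)
        start = time.perf_counter()
        solver(beta0)
        times.append(time.perf_counter() - start)
    return float(np.mean(times))
```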

4.1. Runtime Comparison: Among Different Parameters

In this part, the average runtime for different Wasserstein ball radii $\varepsilon$ and steplengths $\alpha_t$ is tested to show how these parameters affect the numerical efficiency of the modified CG method. In particular, for the fixed steplength $\alpha_t = 1/(t+1)$, the average runtime is tested for five choices of $\varepsilon$, namely $\varepsilon = 0.1$, $\varepsilon = 0.11$, $\varepsilon = 0.12$, $\varepsilon = 0.13$, and $\varepsilon = 0.14$. In addition, for the fixed Wasserstein ball radius $\varepsilon = 0.1$, the average runtime is tested for five choices of $\alpha_t$, namely $\alpha_t = 1/(t+1)$, $\alpha_t = 1/\sqrt[3]{t+1}$, $\alpha_t = 1/\sqrt[4]{t+1}$, $\alpha_t = 1/\sqrt[5]{t+1}$, and $\alpha_t = 1/\sqrt[6]{t+1}$. The experimental outputs are listed in Table 4 and Table 5 and shown in Figure 1 and Figure 2. It can be seen that, as the problem scale increases, the required runtime of the modified CG method increases significantly. For a fixed steplength, a larger Wasserstein ball radius means a shorter average runtime. The reason is that the Wasserstein ball radius $\varepsilon$ is inversely proportional to the radius of the searching neighborhood $\left\{\beta \in \mathbb{R}^d : \|\beta\| \le 0.2785/\varepsilon\right\}$; thus, a larger $\varepsilon$ means a smaller searching neighborhood, which results in a shorter average runtime. In addition, it is observed from Table 5 and Figure 2 that the average runtime of the modified CG method can be controlled by the steplength $\alpha_t$: the average runtime with $\alpha_t = 1/\sqrt[6]{t+1}$ is the shortest among the five steplengths.

4.2. Runtime Comparison: Modified CG Method versus IPOPT Solver

In this part, the average runtime of the IPOPT solver and that of the modified CG method are compared. IPOPT is a typical solver built upon the interior-point algorithmic framework. The Wasserstein ball radius $\varepsilon$ and the steplength are set to $0.1$ and $1/(t+1)$, respectively. The experimental output is given in Table 6 and Figure 3. It can be observed that the average CPU time of the IPOPT solver far exceeds that of the modified CG method, which demonstrates that the modified CG method is far more effective than interior-point-based off-the-shelf solvers for large-scale problems.

4.3. Runtime Comparison: Among Different First-Order Algorithmic Frameworks

In this part, the average runtime is compared among four first-order algorithmic frameworks: the subgradient method, the ISTA, the FISTA, and the modified CG method. The Wasserstein ball radius $\varepsilon$ is set to $0.1$. The steplengths of the subgradient method and the modified CG method are set to $1/\sqrt[6]{t+1}$. The experimental output is illustrated in Table 7 and Figure 4. It can be observed that the modified CG method is more effective than the other three widely employed first-order algorithmic frameworks.

5. Conclusions

In this paper, we propose a modified CG method under the FOBOS framework for the Wasserstein distributionally robust LR model. This method consists of two phases: in the first phase, a conjugate gradient descent step is performed, and in the second phase, an instantaneous optimization problem is formulated and solved, trading off minimization of the regularization term against staying in close proximity to the interim point obtained in the first phase. With a nonsummable steplength, the modified CG method is proven to attain the optimal solution of the Wasserstein distributionally robust LR model. Moreover, the convergence rate of the modified CG method is estimated as $1/\sqrt{T}$. Finally, several numerical experiments are conducted to test the numerical efficiency of the modified CG method; they demonstrate that this method outperforms the off-the-shelf solver and the existing first-order algorithmic frameworks. Future research in this field will focus on two aspects: (i) advancing the algorithmic framework through the exploration of advanced optimization techniques, such as proximal methods, accelerated gradient methods, and stochastic approximation methods; these investigations aim to refine the optimization algorithms, develop more efficient convergence criteria, and enhance numerical performance; and (ii) applying distributionally robust LR to real-world domains, including healthcare, finance, marketing, and transportation, which entails examining the practical implications, interpretability, and performance of these models in various contexts. Furthermore, conducting case studies will allow for an assessment of the benefits and limitations of distributionally robust approaches in comparison to stochastic and robust methods.

Author Contributions

Conceptualization, B.Z.; methodology, L.W.; software, L.W.; validation, B.Z.; formal analysis, L.W.; investigation, B.Z.; resources, B.Z.; data curation, B.Z.; writing—original draft preparation, L.W.; writing—review and editing, B.Z.; visualization, L.W.; supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the National Natural Science Foundation of China [61803056].

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Obuchi, T.; Kabashima, Y. Feature subset selection for logistic regression via mixed integer optimization. Comput. Optim. Appl. 2016, 64, 865–880. [Google Scholar]
  2. Tian, X.C.; Wang, S.A. Cost-Sensitive Laplacian Logistic Regression for Ship Detention Prediction. Mathematics 2023, 11, 119. [Google Scholar] [CrossRef]
  3. Shen, X.Y.; Gu, Y.T. Nonconvex sparse Logistic regression with weakly convex regularization. IEEE Trans. Signal Process. 2018, 66, 1155–1169. [Google Scholar] [CrossRef]
  4. Jayawardena, S.; Epps, J.; Ambikairajah, E. Ordinal logistic regression with partial proportional odds for depression prediction. IEEE Trans. Affect. Comput. 2023, 14, 563–577. [Google Scholar] [CrossRef]
  5. Bogelein, V.; Duzaar, F.; Marcellini, P. A time dependent variational approach to image restoration. SIAM J. Imaging Sci. 2015, 8, 968–1006. [Google Scholar] [CrossRef]
  6. Zhou, J.; McNabb, J.; DeCapite, N.; Ruiz, J.R.; Fisher, D.A.; Grego, S.; Chakrabarty, K. Stool image analysis for digital health monitoring by smart toilets. IEEE Internet Things J. 2023, 10, 3720–3734. [Google Scholar] [CrossRef]
  7. Shafieezadeh-Abadeh, S.; Esfahani, P.M.; Kuhn, D. Distributionally robust logistic regression. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–8 December 2016. [Google Scholar]
  8. Feng, J.S.; Xu, H.; Mannor, S.; Yan, S.C. Robust logistic regression and classification. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  9. Wiesemann, W.; Kuhn, D.; Sim, M. Distributionally robust convex optimization. Oper. Res. 2014, 62, 1358–1376. [Google Scholar] [CrossRef]
  10. Faccini, D.; Maggioni, F.; Potra, F.A. Robust and distributionally robust optimization models for linear support vector machine. Comput. Oper. Res. 2022, 147, 105930. [Google Scholar] [CrossRef]
  11. Frogner, C.; Claici, S.; Chien, E.; Solomon, J. Incorporating unlabeled data into distributionally-robust learning. J. Mach. Learn. Res. 2021, 22, 1–46. [Google Scholar]
  12. Bertsimas, D.; Gupta, V.; Kallus, N. Data-driven robust optimization. Math. Program. 2018, 167, 235–292. [Google Scholar] [CrossRef]
  13. Ben-Tal, A.; Hertog, D.D.; Waegenaere, A.D.; Melenberg, B.; Rennen, G. Robust solutions of optimization problems affected by uncertain probabilities. Manag. Sci. 2013, 59, 341–357. [Google Scholar] [CrossRef]
  14. Kuhn, D.; Esfahani, P.M.; Nguyen, V.A.; Shafieezadeh-Abadeh, S. Wasserstein distributionally robust optimization: Theory and applications in machine learning. Informs Tutor. Oper. Res. 2019, 130–166. [Google Scholar] [CrossRef]
  15. Li, J.J.; Huang, S.; So, A.M.C. A first-order algorithmic framework for Wasserstein distributionally robust logistic regression. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  16. Namkoong, H.; Duchi, J.C. Stochastic gradient methods for distributionally robust optimization with f-divergence. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–8 December 2016. [Google Scholar]
  17. Ghosh, S.; Squillante, M.S.; Wollega, E.D. Efficient stochastic gradient descent for distributionally robust learning. arXiv 2018, arXiv:1805.08728. [Google Scholar]
  18. Duchi, J.C.; Glynn, P.W.; Namkoong, H. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv 2018, arXiv:1610.03425. [Google Scholar] [CrossRef]
  19. Gao, R.; Kleywegt, A.J. Distributionally robust stochastic optimization with Wasserstein distance. arXiv 2016, arXiv:1604.02199. [Google Scholar] [CrossRef]
  20. Esfahani, P.M.; Kuhn, D. Data-driven distributionally robust optimization using the Wasserstein metric: Performance guarantees and tractable reformulations. Math. Program. 2018, 171, 115–166. [Google Scholar] [CrossRef]
  21. Luo, F.G.; Mehrotra, S. Decomposition algorithm for distributionally robust optimization using Wasserstein metric with an application to a class of regression models. Eur. J. Oper. Res. 2019, 278, 20–35. [Google Scholar] [CrossRef]
  22. Xu, H.; Mannor, S. Robust regression and Lasso. IEEE Trans. Inf. Theory 2010, 56, 3561–3574. [Google Scholar] [CrossRef]
  23. Chen, R.D.; Paschalidis, I.C. A robust learning approach for regression models based on distributionally robust optimization. J. Mach. Learn. Res. 2018, 19, 1–48. [Google Scholar]
  24. Blanchet, J.; Kang, Y.; Murthy, K. Robust Wasserstein profile inference and applications to machine learning. J. Appl. Probab. 2019, 56, 830–857. [Google Scholar] [CrossRef]
  25. Rockafellar, R.T. Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 1976, 14, 877–898. [Google Scholar] [CrossRef]
  26. Combettes, P.L.; Wajs, V.R. Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 2005, 4, 1168–1200. [Google Scholar] [CrossRef]
  27. Duchi, J.; Singer, Y. Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 2009, 10, 2899–2934. [Google Scholar]
  28. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
  29. Chambolle, A.; Dossal, C. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 2015, 166, 968–982. [Google Scholar] [CrossRef]
  30. Stella, L.; Themelis, A.; Patrinos, P. Forward-backward quasi-Newton methods for nonsmooth optimization problems. Comput. Optim. Appl. 2017, 67, 443–487. [Google Scholar] [CrossRef]
  31. Briceno-Arias, L.M.; Chierchia, G.; Chouzenoux, E.; Pesquet, J.C. A random block-coordinate Douglas-Rachford splitting method with low computational complexity for binary logistic regression. Comput. Optim. Appl. 2019, 72, 707–726. [Google Scholar] [CrossRef]
  32. Dai, Y.H.; Yuan, Y.X. A nonlinear conjugate gradient method with a strong global convergence property. SIAM J. Optim. 1999, 10, 177–182. [Google Scholar] [CrossRef]
  33. Hager, W.W.; Zhang, H.C. A new conjugate gradient method with guaranteed descent and an efficient line search. SIAM J. Optim. 2005, 16, 170–192. [Google Scholar] [CrossRef]
  34. Hager, W.W.; Zhang, H.C. A survey of nonlinear conjugate gradient methods. Pac. J. Optim. 2006, 2, 35–58. [Google Scholar]
  35. Hager, W.W.; Zhang, H.C. The limited memory conjugate gradient method. SIAM J. Optim. 2013, 23, 2150–2168. [Google Scholar] [CrossRef]
  36. Goncalves, M.L.N.; Prudente, L.F. On the extension of the Hager-Zhang conjugate gradient method for vector optimization. Comput. Optim. Appl. 2020, 76, 889–916. [Google Scholar] [CrossRef]
  37. Yuan, G.L.; Meng, Z.H.; Li, Y. A modified Hestenes and Stiefel conjugate gradient algorithm for large-scale nonsmooth minimizations and nonlinear equations. J. Optim. Theory Appl. 2016, 168, 129–152. [Google Scholar] [CrossRef]
  38. Woldu, T.G.; Zhang, H.B.; Zhang, X.; Fissuh, Y.H. A modified nonlinear conjugate gradient algorithm for large scale nonsmooth convex optimization. J. Optim. Theory Appl. 2020, 185, 223–238. [Google Scholar] [CrossRef]
  39. Blanchet, J.; Murthy, K. Quantifying distributional model risk via optimal transport. Math. Oper. Res. 2019, 44, 565–600. [Google Scholar] [CrossRef]
  40. Lofberg, J. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the IEEE International Conference on Robotics and Automation, Taipei, Taiwan, 2–4 September 2004; pp. 284–289. [Google Scholar]
Figure 1. Average runtime comparison on Wasserstein ball radius.
Figure 2. Average runtime comparison on steplength.
Figure 3. Average runtime comparison: IPOPT solver versus modified CG method.
Figure 4. Average runtime comparison among different first-order frameworks.
Table 1. Statistical measures utilized in the DRO problems.

References       Statistical Measures
[13,17]          $\phi$-divergence
[16,18]          $f$-divergence
[12,19,20]       Wasserstein metric
This paper       Wasserstein metric
Table 2. Solution algorithms in the existing works.

References          Solution Algorithms
[25]                Classical proximal gradient method
[26,27,30,31]       FOBOS framework
[15]                First-order ADMM framework
[32,33,34,35,36]    CG method
[37,38]             Modified CG method
This paper          Modified CG method under the FOBOS framework
Table 3. The data characteristics of the UCI/Adult datasets.

Dataset    Number of Samples    Number of Features
a1a        1605                 123
a2a        2265                 123
a3a        3185                 123
a4a        4781                 123
a5a        6414                 123
a6a        11,220               123
a7a        16,100               123
a8a        22,696               123
a9a        32,561               123
Table 4. Comparison of average runtime for different Wasserstein ball radii $\varepsilon$ with fixed steplength $\alpha_t = 1/(t+1)$.

Dataset    Average Runtime (s)
           $\varepsilon = 0.1$    $\varepsilon = 0.11$    $\varepsilon = 0.12$    $\varepsilon = 0.13$    $\varepsilon = 0.14$
a1a        6.68       4.81       3.45       2.70       1.98
a2a        7.96       6.88       4.95       3.76       2.94
a3a        11.97      8.45       6.30       4.78       3.79
a4a        18.47      13.31      9.30       7.05       5.36
a5a        24.40      16.54      12.51      10.08      7.29
a6a        43.38      31.16      22.36      17.97      14.10
a7a        63.16      46.51      34.91      27.62      20.77
a8a        88.15      70.67      52.28      44.94      37.72
a9a        154.64     115.06     82.86      74.02      67.79
Table 5. Comparison of average runtime for different steplengths with fixed Wasserstein ball radius $\varepsilon = 0.1$.

Dataset    Average Runtime (s)
           $\alpha_t = 1/(t+1)$    $\alpha_t = 1/\sqrt[3]{t+1}$    $\alpha_t = 1/\sqrt[4]{t+1}$    $\alpha_t = 1/\sqrt[5]{t+1}$    $\alpha_t = 1/\sqrt[6]{t+1}$
a1a        6.68       2.47       1.75       1.39       1.26
a2a        7.96       3.31       2.44       1.99       1.74
a3a        11.97      4.71       3.46       2.89       2.60
a4a        18.47      7.06       5.59       4.35       4.10
a5a        24.40      10.07      6.68       5.89       5.11
a6a        43.38      17.53      13.46      10.24      9.04
a7a        63.16      26.20      20.93      15.31      13.07
a8a        88.15      39.08      30.47      23.63      20.33
a9a        154.64     59.20      49.64      33.94      28.51
Table 6. Comparison of average CPU time between the IPOPT solver and the modified CG method.

Dataset    Average Runtime (s)
           IPOPT Solver    Modified CG Method
a1a        103.50          6.68
a2a        153.43          7.96
a3a        259.12          11.97
a4a        447.91          18.47
a5a        891.84          24.40
a6a        1460.46         43.38
a7a        2784.30         63.16
a8a        5293.32         88.15
a9a        10,214.28       154.64
Table 7. Comparison of CPU time among different algorithmic frameworks.

Dataset    Average Runtime (s)
           Subgradient Method    ISTA       FISTA      Modified CG Method
a1a        37.26                 3.58       1.55       1.26
a2a        57.19                 4.53       2.26       1.74
a3a        76.57                 6.15       2.74       2.60
a4a        114.14                9.39       3.95       4.10
a5a        164.25                12.89      5.60       5.11
a6a        307.86                23.63      10.14      9.04
a7a        396.50                34.31      16.10      13.07
a8a        578.30                52.76      23.52      20.33
a9a        801.04                79.50      33.89      28.51
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, L.; Zhou, B. A Modified Gradient Method for Distributionally Robust Logistic Regression over the Wasserstein Ball. Mathematics 2023, 11, 2431. https://doi.org/10.3390/math11112431

AMA Style

Wang L, Zhou B. A Modified Gradient Method for Distributionally Robust Logistic Regression over the Wasserstein Ball. Mathematics. 2023; 11(11):2431. https://doi.org/10.3390/math11112431

Chicago/Turabian Style

Wang, Luyun, and Bo Zhou. 2023. "A Modified Gradient Method for Distributionally Robust Logistic Regression over the Wasserstein Ball" Mathematics 11, no. 11: 2431. https://doi.org/10.3390/math11112431
