Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression

Zhou, Xingcai; Jing, Zhaoyang; Huang, Chao

doi:10.3390/math12050735

Open AccessArticle

Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression

by

Xingcai Zhou

¹

,

Zhaoyang Jing

¹ and

Chao Huang

^2,*

¹

School of Statistics and Data Science, Nanjing Audit University, Nanjing 211815, China

²

Department of Statistics, Florida State University, Tallahassee, FL 32306, USA

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(5), 735; https://doi.org/10.3390/math12050735

Submission received: 21 January 2024 / Revised: 21 February 2024 / Accepted: 27 February 2024 / Published: 29 February 2024

(This article belongs to the Section Probability and Statistics)

Download

Browse Figures

Versions Notes

Abstract

:

Modern massive data with enormous sample size and tremendous dimensionality are usually impossible to process with a single machine. They are typically stored and processed in a distributed manner. In this paper, we propose a distributed bootstrap simultaneous inference for a high-dimensional quantile regression model using massive data. Meanwhile, a communication-efficient (CE) distributed learning algorithm is developed via the CE surrogate likelihood framework and ADMM procedure, which can handle the non-smoothness of the quantile regression loss and the Lasso penalty. We theoretically prove the convergence of the algorithm and establish a lower bound on the number of communication rounds

ι_{min}

that warrant statistical accuracy and efficiency. The distributed bootstrap validity and efficiency are corroborated by an extensive simulation study.

Keywords:

distributed statistical learning; multiplier bootstrap; quantile regression; communication efficiency; ADMM algorithm

MSC:

62H30; 62J07; 68W15

1. Introduction

With the rapid development of contemporary networks, science, and technology, large-scale and high-dimensional data have emerged. For a single computer or machine, due to the limitations of memory and computing power, the processing and storage of data have become a great challenge. Therefore, it is necessary to handle data scattered across various machines. Mcdonald et al. [1] considered a simple divide and conquer, that is, the parameters of interest of the model are learned separately on the local samples of each machine, and then these estimated parameters are averaged on a master machine. The divide and conquer is communication-efficient for large-scale data, but its accuracy of learning is low. Van de Geer et al. [2] proposed an AVG-debias method, which improves the accuracy under a strong hypothesis but is time-consuming on computation because of the debiasing step. Therefore, it is necessary to develop communication-efficient distributed learning frameworks.

Afterwards, a novel communication-efficient distributed learning algorithm was proposed by [3,4]. Wang et al. [3] developed an Efficient Distributed Sparse Learning (EDSL) algorithm, which optimizes a shifted

ℓ_{1}

regularized M-estimation problem, and other machines to compute the gradient on local data for the high-dimensional model. Jordan et al. [4] adopted the same framework of distributed learning, which is called Communication-efficient Surrogate Likelihood (CSL), for solving distributed statistical inference problems in low-dimensional learning, high-dimensional regularized learning, and Bayesian inference. The two algorithms significantly improve the communication efficiency of distributed learning and have been widely used to analyze the big data in medical research, economic development, social security, and other fields. Under the CSL framework, Wang and Lian [5] investigated a distributed quantile regression; Tong et al. [6] developed privacy-preserving and communication-efficient distributed learning, which accounts for the heterogeneity caused by a few clinical sites for a distributed electronic health records dataset; and Zhou et al. [7] developed two types of Byzantine-robust distributed learning with optimal statistical rates for strong convex losses and convex (non-smooth) penalties. It can be seen that the CSL framework plays an important role in distributed learning. In this paper, we will adopt the communication-efficient CSL for our distributed bootstrap simultaneous inference for high-dimensional data. When the data are complex with outliers or heteroscedasticity, the conventional mean regressions are unable to fully capture the information contained in the data. Quantile regression (QR) was proposed by [8], which not only captures the relationship between features and outcomes but also allows one to characterize the conditional distribution of the outcomes given these features. Compared with the mean regression model, quantile regression can handle heterogeneous data better, especially for these outcomes with heavy tail distribution or outliers. Quantile regression is widely used in many fields [9,10,11]. For quantile regression in high-dimensional sparse models, Belloni and Chernozhukov [12] considered

ℓ_{1}

-penalized QR and post-penalized QR and showed that under general conditions, the two estimators are consistent at the near-oracle rate uniformly. However, they did not consider large-scale distributed circumstances. Under the distributed framework, quantile regression has also received great attention. For example, Yu et al. [13] proposed a parallel QPADM algorithm for a large-scale heterogeneous high-dimensional quantile regression model; Chen et al. [14] proposed a computationally efficient method, which only requires an initial QR estimator on a small batch of data, and proved that the algorithm with only a few rounds of aggregations achieves the same efficiency as the QR estimator obtained on all the data; Chen et al. [15] developed a distributed learning algorithm that is both computationally and communicationally efficient and showed that distributed learning achieves a near-oracle convergence rate without any restriction on the number of machines; Wang et al. [5] analyzed the high-dimensional sparse quantile regression under the CSL; and Hu et al. [16] considered an ADMM distributed quantile regression model for massive heterogeneous data under the CSL. However, the above works mainly focus on the distributed learning to parameters of quantile regression models in circumstances of massive or high-dimensional data and have not yet involved their distributed statistical inference. Volgushev et al. [17] gave distributed statistical learning on quantile regression processes. So far, the statistical inferences for high-dimensional quantile models are still elusive, especially for distributed bootstrap simultaneous inference on high-dimensional quantile regression.

The bootstrap is a generic method for learning the sampling distribution of a statistic, typically by resampling one’s own data [18,19]. The bootstrap method can be used to evaluate the quality of estimators and can effectively solve the problem of statistical inference of high-dimensional parameters [20,21]. We refer to the fundamentals of the bootstrap method for high-dimensional data in [22]. Kleiner et al. [23] introduced the Bag of Little Bootstraps (BLB) for massive data via incorporating features of both the bootstrap and subsampling, which are suited to modern parallel and distributed computing architectures and maintain the statistical efficiency of the bootstrap. However, the BLB has restrictions on the number of machines in distributed learning. Recently, Yu et al. [24] proposed K-grad and n+K-1-grad distributed bootstrap algorithms for simultaneous inference for a linear model and a generalized linear model, which do not constrain the number of machines and provably achieve optimal statistical efficiency with minimal communication. Yu et al. [25] extended the K-grad and n+K-1-grad distributed bootstrap for simultaneous inference to high-dimensional data and adopted the CSL framework of [4], which not only relaxes the restrictions on the number of machines but also effectively reduces communication costs. In this paper, we will further extend the K-grad and n+K-1-grad distributed bootstrap for simultaneous inference in high-dimensional quantile regression models. This is challenging due to the non-smooth nature of the quantile regression loss function, which cannot directly use the existing methodology.

In this paper, we design a communication-efficient distributed bootstrap simultaneous inference algorithm for high-dimensional quantile regression and provide its theoretical analysis. The algorithm and its statistical theory are the focus of this article, which belongs to the topic of probability and statistics. Its specific sub-field is bootstrap statistical inference. It is a traditional issue in statistics, but it is novel in the context of big data. We consider these methods to fit our model best. First, we adopt a communication-efficient CSL framework for large-scale distributed data, which is a novel distributed learning algorithm proposed by [3,4]. Under the master-worker architectures, CSL makes full use of the total information of the data over the master machine while only merging the first-order gradients from all the workers. Especially, a quasi-newton optimization at the master is solved as the final estimator, instead of merely aggregating all the local estimators like one-shot methods [7]. It has been shown in [3,4] that CSL-based distributed learning can preserve sparsity structure and achieve optimal statistical estimation rates for convex problems in finite-step iterations. Second, we consider high-dimensional quantile regression for large-scale heterogeneous data, especially for these outcomes with heavy tail distribution or outliers. Thus, it brings more robust bootstrap inference. Third, we are motivated by communicate-efficient multiplier bootstrap methods K-grad/n+K-1-grad, which are originally proposed in [24,25] for mean regression, and propose our K-grad-Q/n+K-1-grad-Q Distributed Bootstrap Simultaneous Inference for high-dimensional quantile regression (Q-DistBoots-SI). Our proposed method relaxes the constraint on the number of machines and can provide more accurate and robust data for large-scale heterogeneous data. To the best of our knowledge, there is no more advanced distributed bootstrap simultaneous inference method available than our Q-DistBoots-SI for high-dimensional quantile regression.

Our main contributions are: (1) we develop a communication-efficient distributed bootstrap for simultaneous inference in high-dimensional quantile regression, under the CSL framework of distributed learning. Meanwhile, the ADMM is embedded for penalized quantile learning with distributed data, which is well suited for distributed convex optimization problems under minimum structural assumption [26] and can handle the non-smoothness of quantile loss and the Lasso penalty. (2) We theoretically prove the convergence of the algorithm and establish a lower bound on the number of communication rounds

τ_{m i n}

that warrants the statistical accuracy and efficiency. (3) The distributed bootstrap validity and efficiency are corroborated by an extensive simulation study.

The rest of this article is organized as follows. In Section 2, we present a communication-efficient distributed bootstrap inference algorithm. Some asymptotic properties of bootstrap validity for high-dimensional quantile learning are established in Section 3. The distributed bootstrap validity and efficiency are corroborated by an extensive simulation study in Section 4. Finally, Section 5 contains the conclusion and a discussion. The proof of the main results and additional experimental results is provided in Appendix A.

Notations

For every integer

k \geq 1

,

R^{k}

denotes a

k -

dimensional Euclidean space. The inter product of any two vectors is defined by

u^{T} v = 〈 u, v 〉 = \sum_{k = 1}^{p} u_{k} v_{k}

for

u = {(u_{1}, \dots, u_{p})}^{T}

and

v = {(v_{1}, \dots, v_{p})}^{T}

. We denote the

l_{q}

-norm (

q \geq 1

) of any vector

v = (v_{1}, v_{2}, \dots, v_{n})

by

{∥ v ∥}_{q} = (\sum_{i = 1}^{n} | v_{i} {|^{q})}^{(1 / q)}

, and

{∥ v ∥}_{\infty} = {max}_{1 \leq i \leq n} | v_{i} |

. The induced q-norm and the max-norm of any matrix

M \in R^{m * n}

are denoted by

{| | | M | | |}_{q} = {sup}_{x \in R^{n}, {∥ x ∥}_{q} = 1} {∥ M x ∥}_{q}

, and

{| | | M | | |}_{max} = {max}_{1 \leq i \leq m, 1 \leq j \leq n} | M_{i, j} |

, where the

M_{i j}

is the i-th row and j-th column element of M. The

Λ_{max} (\cdot)

denotes the largest eigenvalue of a real symmetric matrix. Let

f (\cdot | x)

and

F (\cdot | x)

be the conditional density and conditional cumulative distribution function of y given x, respectively. Denote

S = {1 \leq k \leq p : v_{k} \neq 0}

as the index set of nonzero coefficient and

| S |

as the cardinality of S.

M_{k}

denotes the k worker machine. We write

a_{n} ≍ b_{n}

for

a_{n} = O (b_{n})

and

b_{n} = O (a_{n})

,

a_{n} ≲ b_{n}

for

a_{n} = O (b_{n})

and

a_{n} ≳ b_{n}

for

b_{n} = O (a_{n})

, and

a ≪ b

if

a = o (b)

.

2. Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Learning

In this section, we introduce the distributed computing framework of the quantile regression model with high-dimensional data. In this framework, the bootstrap algorithm is used to establish the simultaneous statistical inference of high dimensional sparse parameters, and the ADMM algorithm is used to solve the non-convex optimization problem.

2.1. Problem Formulation of Distributed Quantile Learning

Quantile learning provides an effective way to depict the relationship between outcomes and features, especially for high-dimensional heterogeneity data. First, we define Y as the outcome and

x = {(x_{1}, x_{2}, \dots, x_{p})}^{T}

as the features. Assume that the observed data are independent and identically distributed, and denote

Z = {Z_{i}}_{i = 1}^{N}

with

Z_{i} = (y_{i}, x_{i})

, where N is a total sample size. The conditional quantile of y given the features x and a level of quantile

τ \in (0, 1)

is defined as a linear regression, as follows:

Q_{τ} (y | x) = x^{T} β^{*} (τ) = \sum_{j = 1}^{p} x_{j} β_{j}^{*} (τ),

(1)

where

β^{*} (τ) = {(β_{1}^{*} (τ), \dots, β_{p}^{*} (τ))}^{T} \in R^{p}

is regression coefficients and

p > N

. We know that

Q_{τ} (y | x)

is the

τ -

th quantile function of the response variable y. Let

ε = y - Q_{τ} (y | x)

, then

ε

satisfies

P (ε \leq 0 | x) = τ

. To simplify notation, we omit the subscript

τ

from the parameter

β (τ)

subsequently. For high dimensional inference, we assume that the coefficients

β^{*} (τ)

are sparse in model (1), that is, most of its components are zero. Let

S = {1 \leq k \leq p : β_{k}^{*} \neq 0}

be the index set of nonzero coefficient, and

s = | S | ≪ p

.

The interesting parameter

β^{*}

is a minimizer of the following expected loss, i.e.,

β^{*} = {argmin}_{β \in R^{p}} L^{*} (β), L^{*} (β) = E [ρ_{τ} (y - x^{T} β)],

(2)

where the check function

ρ_{τ} (u) = u (τ - I {u < 0})

and

I (\cdot)

is an indicator function. We learn

β^{*}

by minimizing the empirical loss based on the full observed data Z:

{\hat{β}}_{Z} = {argmin}_{β \in R^{p}} L_{N} (β) + λ {∥β∥}_{1}, L_{N} (β) = \frac{1}{N} \sum_{i = 1}^{N} ρ_{τ} (y_{i} - x_{i}^{T} β),

(3)

where

λ > 0

is a regularized parameter. However, it is not feasible because the data distributed across machines cannot be collected together. Even if these datasets can be gathered on a single machine, the number of data is too high for a single machine to process. Therefore, we consider statistical learning under distributed circumstances, in which the full data Z are distributed in K machines, and each machine has n data. Let

Z_{k} = {Z_{k i}, i = 1, \dots, n}

denote the data on the kth machine,

k = 1, \dots, K

, and

N = n K

. In addition, without a loss of generality, take the first machine as the host (central machine). Define the global and local loss functions as

Global loss : L_{N} (β) = \frac{1}{K} \sum_{k = 1}^{K} L_{k} (β),

Local loss : L_{k} (β) = \frac{1}{n} \sum_{i = 1}^{n} ρ_{τ} (y_{k i} - x_{k i}^{T} β), k \in [K] .

In distributed learning, it is necessary to reduce communication costs, which can be considered in two aspects: what is transmitted between the master and worker, and what kind of distributed framework to use. In this paper, we adopt the CSL framework [3,4], in which each worker machine only transmits its local gradients to the 1st worker machine. The structure of the CSL algorithm leverages both global first-order information and local higher-order information. Thus, CSL has a minimax-optimal estimator with controlled communication cost. Correspondingly, we have the global and local gradients as

Global gradients : \nabla L_{N} (β) = \frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} (β),

Local gradients : \nabla L_{k} (β) = - \frac{1}{n} \sum_{i = 1}^{n} x_{k i} ψ_{τ} (y_{k i} - x_{k i}^{T} β), k \in [K],

where

ψ_{τ} (u) = τ - I (u < 0)

. Based on the motivation of CSL for high-dimensional problems, the master machine minimizes the following

ℓ_{1}

-regularized surrogate loss:

\hat{β} = {argmin}_{β} L_{1} (β) + 〈 \nabla L_{N} (\tilde{β}) - \nabla L_{1} (\tilde{β}), β 〉 + λ {∥ β ∥}_{1},

(4)

where the

\tilde{β}

is a initial estimator, which can be obtained by the 1st machine via the optimization problem

\tilde{β} = {argmin}_{β} L_{1} (β) + λ {∥β∥}_{1} .

In the distributed environment, the

\hat{β}

of (4) is learned, starting from an initial estimator

β^{0}

. Let

β^{0} = \tilde{β}

as well. Then,

β^{0}

is broadcast to all other worker machines, which is used to compute the local gradients

\nabla L_{k} (β)

by each local worker machine and transmits the gradients back to the 1st machine. Thus, one round of communication is completed. At the

(t + 1)

th round, the 1st machine solves (4) as an iterative program

{\tilde{β}}^{t + 1} = {argmin}_{β} L_{1} (β) + 〈\frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t}), β〉 + λ^{t + 1} {∥ β ∥}_{1},

(5)

which is broadcast to other worker machines, and they are used to compute the local gradients based on the local data as before. The regularized parameter

λ^{t}

can be chosen as the way of [3], so that it decreases with the iteration number t. The iterative program (5) is the optimization problem of nonsmooth convex, which is solved via ADMM for computation efficiency. See Section 2.3 for details.

2.2. Distributed Bootstrap for Simultaneous Inference

First, we consider the case of nondistributed learning. Assume that

{\hat{β}}_{L a s s o}

can be obtain via (3) based on the entire data Z. Then, for a large

α \in (0, 1)

, we construct the simultaneous confidence intervals by learning the quantile

c (α) : = inf {t \in R : P (\hat{T} \leq t) \geq α},

(6)

where

\hat{T} : = {∥ \sqrt{N} ({\hat{β}}_{N} - β^{*}) ∥}_{\infty} .

(7)

The

{\hat{β}}_{N}

can be obtained through the debiased

ℓ_{1}

-penalized quantile learning

{\hat{β}}_{N} = {\hat{β}}_{Z} + Σ_{N}^{- 1} \{\frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} ψ_{τ} (y_{k i} - x_{k i}^{T} {\hat{β}}_{Z})\},

(8)

where

Σ_{N} = \frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} x_{k i}^{T} f (x_{k i}^{T} {\hat{β}}_{Z} | x_{k i})

, and

{\hat{β}}_{Z}

is a global estimator defined in (3). In high dimensions such as

p > N

,

Σ_{N}^{- 1}

cannot be defined because

rank (Σ_{N}) \leq N < p

. So, we need to replace it with an estimator of its inverse, say

{\hat{Θ}}_{N}

; then, the resulting debiased estimator is

ℓ_{1}

-penalized quantile learning

{\hat{\hat{β}}}_{N} = {\hat{β}}_{Z} + {\hat{Θ}}_{N} \{\frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} ψ_{τ} (y_{k i} - x_{k i}^{T} {\hat{β}}_{Z})\} .

(9)

Note that the debiasing mechanism is similar to [2]. Then,

\hat{\hat{T}} : = {∥\sqrt{N} ({\hat{\hat{β}}}_{N} - β^{*})∥}_{\infty} .

(10)

However, the simultaneous inference based on

\hat{\hat{T}}

and

{\hat{\hat{β}}}_{N}

is unfeasible in the distributed framework, which is regarded as an “oracle” in this paper because it is impossible to obtain it based on full data Z. Therefore, we consider the simultaneous inference in distributed learning and keep statistical accuracy as the oracle

\hat{\hat{T}}

.

In the distributed environment, we need to replace

{\hat{β}}_{Z}

with the CSL distributed estimator

\hat{β}

. In addition,

{\hat{Θ}}_{N}

generally is computed via the nodewise lasso [2], which cannot be translated to distributed learning because of communication inefficiency. To avoid large-scale communication, we compute the nodewise lasso only on the 1st machine based on its local data, which does not sacrifice statistical accuracy if

\hat{β}

or

{\tilde{β}}^{T}

for a large enough T sufficiently closes to

β^{*}

. By the ideas of [24,25] for bootstrap inference, we need to estimate the asymptotic quantile

c (α)

of

\hat{T}

by bootstrapping

{∥{\hat{Θ}}_{1} \sqrt{N} \nabla L_{N} (\hat{β})∥}_{\infty}

using K-grad or

n + K - 1

-grad bootstrap.

In practice, the

{∥{\hat{Θ}}_{1} \sqrt{N} \nabla L_{N} (\hat{β})∥}_{\infty}

still cannot be used to bootstrapping because the computation of

{\hat{Θ}}_{1}

is based on the value of

{\hat{Σ}}_{1} = \frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} f (x_{1 i}^{T} \hat{β} | x_{1 i})

, which depends on the unknown conditional density

f (\cdot | \cdot)

. We can estimate

f (x_{1 i}^{T} \hat{β} | x_{1 i})

by the standard kernel density estimator

ϕ ((y_{1 i} - x_{1 i}^{T} \hat{β}) / h) / h

with a small bandwidth

h > 0

, where

ϕ

is a kernel. For simplicity, we take

ϕ

as the standard normal density; of course, other popular kernels also can be taken. Here, we do not estimate it in isolation because we are more interested in the estimator of

{\hat{Σ}}_{1}

. Therefore, we estimate

{\hat{Σ}}_{1}

by

{\tilde{Σ}}_{1} = \frac{1}{n h} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} ϕ (\frac{y_{1 i} - x_{1 i}^{T} \hat{β}}{h}) .

(11)

It is easy to show that under a mild assumption

{\tilde{Σ}}_{1}

is sufficiently close to

{\hat{Σ}}_{1}

in the matrix norm

{∥\cdot∥}_{\infty}

. By the nodewise lasso, we obtain

{\tilde{Θ}}_{1}

from

{\tilde{Σ}}_{1}

. The multiplier bootstrap [22] can be applied to simulate the distribution of

A = : {\tilde{Θ}}_{1} \sqrt{N} \nabla L_{N} (\hat{β})

. Therefore, we approximate

c (α)

by the percentile of

{{\bar{W}}^{(b)}}_{b = 1}^{B}

via the K-grad algorithm, where

{\bar{W}}^{(b)} = : ∥ \underset{= : \bar{A}}{\underset{︸}{- {\tilde{Θ}}_{1} \frac{1}{\sqrt{K}} \sum_{k = 1}^{K} ϵ_{k}^{(b)} \sqrt{n} (g_{k} - \bar{g})}} ∥_{\infty},

(12)

where

ϵ_{k}^{(b)}_{\sim}^{i . i . d .} N (0, 1)

,

g_{k} = \nabla L_{k} (\tilde{β})

, and

\bar{g} = K^{- 1} \sum_{k = 1}^{K} g_{k}

. We call this method the K-grad-Q(uantile) distributed bootstrap. The K-grad algorithm cannot work well when K is small [24,25]. Further, the improved n+K-1-grad algorithm, such as the proposed method in [24,25], approximates

c (α)

by the percentile of

{{\tilde{W}}^{(b)}}_{b = 1}^{B}

, where

{\tilde{W}}^{(b)} : = ∥ \underset{= : \tilde{A}}{\underset{︸}{- {\tilde{Θ}}_{1} \frac{1}{\sqrt{n + K - 1}} (\sum_{i = 1}^{n} ϵ_{1 i}^{(b)} (g_{1 i} - \bar{g}) + \sum_{k = 2}^{K} ϵ_{k}^{(b)} \sqrt{n} (g_{k} - \bar{g}))}} ∥_{\infty},

(13)

where

ϵ_{1 i}^{(b)}_{\sim}^{i . i . d .} N (0, 1)

,

ϵ_{k}^{(b)}_{\sim}^{i . i . d .} N (0, 1)

,

g_{1 i} = \nabla L_{1} (\tilde{β}; Z_{1 i}) = - x_{1 i} ψ_{τ} (y_{1 i} - x_{1 i}^{T} \tilde{β})

,

g_{k} = \nabla L_{k} (\tilde{β})

, and

\bar{g} = K^{- 1} \sum_{k = 1}^{K} g_{k}

. We call this method the n+K-1-grad-Q distributed bootstrap. In Algorithm 1, we refer to K-grad-Q and n+K-1-grad-Q as Q-DistBoots algorithm. The K-grad-Q and n+K-1-grad-Q distributed bootstrap are communication-efficient because

{\bar{W}}^{(b)}

and

{\tilde{W}}^{(b)}

only compute in the 1st machine without communicating with other worker machines, once all local gradients are collected by the 1st machine.

Since the quantile loss is non-smooth, the Newton–Raphson algorithm cannot be used to obtain

\hat{β}

. We consider the proximal ADMM algorithm [27] for the Quantile regression program (5) in the CSL framework, which is also called the Q-ADMM-CSL algorithm. In addition, for computing

{\tilde{Θ}}_{1}

, we adopt the nodewise lasso technique [2], which is called the NodeLasso algorithm in this paper. Combined with our Q-ADMM-CSL, NodeLasso, and K-grad-Q and n+K-1-grad-Q algorithms under the CSL distributed framework, we propose Communication-Efficient K-grad-Q/n+K-1-grad-Q Distributed Bootstrap Simultaneous Inference for high-dimensional quantile regression (Q-DistBoots-SI), which is presented in Algorithm 1. Note that Q-DistBoots is provided in Algorithm 2, andQ-ADMM-CSL and NodeLasso algorithms are introduced in Algorithms 3 and 4 of Section 2.3.

In the Q-DistBoots-SI algorithm for high-dimensional quantile regression, we need running Q-DistBoots, which is the K-grad-Q/n+K-1-grad-Q distributed bootstrap learning algorithm. The key advantage of Q-DistBoots is that, once the 1st machine receives all local gradients

g_{k}

from the worker machines, simultaneous inferences can be run on the 1st machine only. See Algorithm 2 for a detailed description of K-grad-Q/n+K-1-grad-Q.

Algorithm 1: Q-DistBoots-SI algorithm for high-dimensional quantile

Algorithm 2: Q-DistBoots algorithm about K-grad-Q/n+K-1-grad-Q distributed bootstrap for high-dimensional learning only on

M_{1}

2.3. Q-ADMM-CSL Algorithm and NodeLasso Optimization

In the section, we first develop a Q-ADMM-CSL algorithm for the high-dimensional large-scale data under the CSL distributed framework. It can tackle the non-smoothness of our loss function and has simply closed-form solutions. Then, we introduce the NodeLasso algorithm to approximate inverse Hessian matrix

Σ_{1}

in Algorithm 1.

2.3.1. Q-ADMM-CSL Algorithm for Penalized Quantile Regression in CSL

The alternating direction method of multipliers (ADMM) algorithm is a distributed algorithm and can be parallelized and implemented to learn large-scale problems. We refer to [26] for a comprehensive review of ADMM. Quantile regression typically adopts the simplex method and the interior point method, which work well for small- to moderate-size data. However, they are difficult to handle with the high-dimensional large-scale data because the penalization is normally necessary. Refs. [13,27] considered ADMM-based algorithms for penalized quantile regression with both convex and folded-concave penalties. Based on the penalized ADMMs [13,27], we propose the Q-ADMM-CSL algorithm to solve large-scale

ℓ_{1}

-penalized quantile regression in the CSL distributed framework.

Recall that our optimization problem is to solve the following

ℓ_{1}

-penalized quantile regression problem:

\hat{β} = {argmin}_{β} L_{1} (β) + 〈 \nabla L_{N} (\tilde{β}) - \nabla L_{1} (\tilde{β}), β 〉 + λ {∥ β ∥}_{1},

(14)

Let the

n \times p

designed matrix

X_{1} = {(x_{11}, \dots, x_{1 n})}^{T}

, and the

n \times 1

response vector

y_{1} = {(y_{11}, \dots, y_{1 n})}^{T}

on

M_{1}

. Further, denote

Q_{τ} (r) = \frac{1}{n} \sum_{i = 1}^{n} ρ_{τ} (r_{i})

with

r = {(r_{1}, \dots, r_{n})}^{T} = y_{1} - X_{1} β

, and

g = \nabla L_{N} (\tilde{β}) - \nabla L_{1} (\tilde{β})

. Then, the problem (14) can be recast into an equivalent problem

min_{β \in R^{p}, r \in R^{n}} Q_{τ} (r) + g^{T} β + λ {∥ β ∥}_{1}, subject to X_{1} β + r = y_{1} .

(15)

Following standard convex optimization, problem (15) has the following Lagrangian:

L_{σ} (β, r, u) = Q_{τ} (r) + g^{T} β + λ {∥ β ∥}_{1} - u^{T} (X_{1} β + r - y_{1}) + \frac{σ}{2} {∥X_{1} β + r - y_{1}∥}_{2}^{2},

(16)

where

σ > 0

is a tunable augmentation parameter. That is, it solves the following updates at

m + 1

,

\begin{matrix} β^{m + 1} = {argmin}_{β \in R^{p}} g^{T} β + λ {∥ β ∥}_{1} - {(u^{m})}^{T} X_{1} β + \frac{σ}{2} {∥X_{1} β + r^{m} - y_{1}∥}_{2}^{2}, \\ r^{m + 1} = {argmin}_{r \in R^{n}} Q_{τ} (r) - {(u^{m})}^{T} r + \frac{σ}{2} {∥X_{1} β^{m + 1} + r - y_{1}∥}_{2}^{2}, \\ u^{m + 1} = u^{m} - σ (X_{1} β^{m + 1} + r^{m + 1} - y_{1}) . \end{matrix}

(17)

First, we introduce a proximal operator [27] as

\begin{matrix} {Prox}_{ρ_{τ}} (ξ, α) & = {argmin}_{r \in R} [ρ_{τ} (r) + \frac{α}{2} {(r - ξ)}^{2}] \\ = max (ξ - α^{- 1} τ, 0) - max (- ξ - α^{- 1} (1 - τ), 0) \\ = ξ - max (α^{- 1} (τ - 1), min (ξ, α^{- 1} τ)) . \end{matrix}

(18)

For r-update in (17), we have

\begin{matrix} r^{m + 1} & = & {argmin}_{r \in R^{n}} Q_{τ} (r) + \frac{σ}{2} {∥r - (y_{1} - X_{1} β^{m + 1} + σ^{- 1} u^{m})∥}_{2}^{2} \\ = & {argmin}_{r \in R^{n}} ρ_{τ} (r) + \frac{n σ}{2} {∥r - (y_{1} - X_{1} β^{m + 1} + σ^{- 1} u^{m})∥}_{2}^{2}, \end{matrix}

which has a closed form solutions

\begin{matrix} r^{m + 1} & = & {Prox}_{ρ_{τ}} (y_{1} - X_{1} β^{m + 1} + σ^{- 1} u^{m}, n σ) \\ = & {\{y_{1} - X_{1} β^{m + 1} + σ^{- 1} u^{m} - τ {(n σ)}^{- 1} 1_{n}\}}_{+} \\ - {\{- y_{1} + X_{1} β^{m + 1} - σ^{- 1} u^{m} + (τ - 1) {(n σ)}^{- 1} 1_{n}\}}_{+} . \end{matrix}

(19)

For

β

-update, it does not have a closed-form solution with the

β

-step in (17) in the standard ADMM. To this end, we add a proximal term to the objective function with respect to

β

and then construct an augmented

β

-update as follows:

β^{m + 1} = {argmin}_{β \in R^{p}} g^{T} β + λ {∥ β ∥}_{1} - {(u^{m})}^{T} X_{1} β + \frac{σ}{2} {∥X_{1} β + r^{m} - y_{1}∥}_{2}^{2} + \frac{1}{2} {∥β - β^{m}∥}_{S}^{2},

where

S = σ (η I_{p} - X_{1}^{T} X_{1})

with

η \geq Λ_{max} (X_{1}^{T} X_{1})

. Thus, the augmented

β

-update becomes a Lasso penalized least square problem

\begin{matrix} β^{m + 1} & = & {argmin}_{β \in R^{p}} λ {∥ β ∥}_{1} + \frac{σ η}{2} {∥β - \frac{σ η β^{m} - g + X_{1}^{T} [u^{m} - σ (X_{1} β^{m} + r^{m} - y_{1})]}{σ η}∥}_{2}^{2} \\ = & Shrink (β^{m} + \frac{X_{1}^{T} [u^{m} - σ (X_{1} β^{m} + r^{m} - y_{1})] - g}{σ η}, \frac{λ}{σ η}), \end{matrix}

(20)

where the soft shrinkage operator

Shrink (b, a) = {(sgn (b_{j}) max (| b_{j} | - a, 0))}_{1 \leq j \leq p}

for

b \in R^{p}

and

a \in R

, and

sgn (\cdot)

is the sign function.

Last, the r-update is

u^{m + 1} = u^{m} - σ (X_{1} β^{m + 1} + r^{m + 1} - y_{1}) .

(21)

Thus, (19)–(21) constitute our Q-ADMM-CSL algorithm, which is presented in Algorithm 3.

2.3.2. NodeLasso Algorithm to Approximate Inverse Hessian Matrix

In the Q-DistBoots-SI algorithm, we apply a nodewise Lasso procedure [2,28] to approximate the inverse Hessian matrix

{\tilde{Σ}}_{1}

, which is introduced in Algorithm 4. Let

ξ_{l} = \{ξ_{l, l^{'}} : l^{'} = 1, \dots, p, l^{'} \neq l\}

, and denote

M_{l, - l}

to be the lth row of M without the diagonal element

(l, l)

and

M_{- l, - l}

to be the submatrix without the lth row and lth column.

Algorithm 3: Q-ADMM-CSL for the

(t + 1)

th optimization of Q-DistBoots-SI algorithm on the 1st machine

M_{1}

Algorithm 4: NodeLasso procedure in Q-DistBoots-SI algorithm on the 1st machine

M_{1}

In the NodeLasso algorithm, we need to choose hyperparameters

{λ_{l}}_{l = 1}^{p}

and bandwidth h. As to

{λ_{l}}_{l = 1}^{p}

, Ref. [2] suggested taking the same value for all

λ_{l}

by Cross-Validation. Of course, the potentially good mode is to allow the different

λ_{l}

across l. The bandwidth also can be chosen via Cross-Validation.

Here, we discuss the resources required for our algorithm to solve the distributed sparse quantile regression problems. The NodeLasso algorithm is a debiasing step, which requires solving

O (p)

generalized lasso problems. In addition, our procedure requires solving one

ℓ_{1}

penalized objective (5) in each iteration. When

s^{2} log p ≲ n ≲ K s^{2} log p

, the computational complexity of Centralize learning is

O (K \cdot T_{lasso})

, while ours is only

O (log K \cdot T_{lasso})

, where

T_{lasso}

is the runtime for solving a generalized lasso problem of size

n \times p

. For communication costs, Centralize learning requires

O (n \cdot p)

bits, while ours is only

O (K \cdot p)

bits. Usually,

n ≫ K

. Therefore, our algorithm is effective in communication and computation. Regarding the discussion of computation and communication costs on CSL distributed learning framework, also see [3,4].

3. Theoretical Analysis

Recall that the quantile regression model has the conditional quantile

Q_{τ} (y | x) = x^{T} β^{*}

of Y given the feature x at quantile level

τ

, that is,

y = Q_{τ} (y | x) + ε

with

P (ε \leq 0 | x) = τ

. In this section, we establish theoretical results for distributed bootstrap simultaneous inference on high-dimensional quantile regression. We use the following assumptions.

Assumption 1.

x is sub-Gaussian, that is,

sup_{{∥ w ∥}_{2} \leq 1} E [exp ({(w^{T} x)}^{2} / L^{2})] = O (1),

for some absolute constant

L > 0

.

Assumption 2.

β^{*}

is the unique minimizer of the objective function

E {ρ_{τ} (Y - x^{T} β)}

, and

β^{*}

is an inner point in

B

, where

B \in R^{p}

is an compact subset.

Assumption 3.

F (y | x)

is absolutely continuous at y, its conditional density function

f (y | x)

is bounded and continuously differentiable at y for all x in the support of x, and

f^{'} (y | x)

is uniformly bounded by a constant. In addition,

f (x^{T} β^{*} | x))

is uniformly bounded away from zero.

Assumption 4.

β^{*}

and

Θ_{l}

are sparse for

l = 1, \dots, p

, where

Θ = {(Σ^{*})}^{- 1}

with

Σ^{*} = E (x x^{T} f (x^{T} β^{*} | x))

. Especially,

S = \{l : β_{l}^{*} \neq 0\}

,

S_{l} = \{l^{'} \neq l : Θ \neq 0\}

,

s = | S |

,

s^{*} = {max}_{l} | S_{l} |

, and

\bar{s} = s \lor s^{*} ≪ p

.

Remark 1.

Assumption 1 holds if the covariates are Gaussian. Under Assumption 1,

c_{n} : = {max}_{i} {∥x_{i}∥}_{\infty} = O_{p} (\sqrt{log n p}) = O_{p} (\sqrt{log p})

when

n < p

by Lemma 2.2.2 in [29]. Assumptions 2 and 3 are common in standard quantile regression [9]. Assumption 4 is a sparsity assumption typically adopted in penalized variable selection.

In order to state next assumptions, define a restricted set

Δ : = Δ (S, 3) = {δ \in R^{p} : ∥ δ_{S^{c}} ∥_{1} \leq 3 ∥ Δ_{S} ∥_{1}}

, and

\bar{S} (δ, J) \subset {1, \dots, p} ∖ S

is a support of J largest in absolute value components of

δ

outside S.

Assumption 5.

Assume that

κ_{J}^{2} : = inf_{δ \in Δ, δ \neq 0} \frac{δ^{T} E [x x^{T}] δ}{∥ δ_{S \cup \bar{S} (δ, J)} ∥_{2}^{2}} > 0

and

q : = inf_{δ \in Δ, δ \neq 0} \frac{(E | x_{i}^{T} δ {|^{2})}^{3 / 2}}{E | x_{i}^{T} {δ |}^{3}} > 0 .

Remark 2.

The assumptions of

κ_{J}^{2}

and q come from [12]; they are called restricted eigenvalue conditions and restricted nonlinear impact coefficients, respectively.

κ_{J}^{2} > 0

holds when x are mean zero with diagonal elements of

E (x x^{T})

being 1’s by Lemma 1 in [12]. The restricted eigenvalue condition is analogous to the condition in [30]. The q controls the quality of minoration of the quantile regression objective function by a quadratic function over the restricted set, which holds under Design 1 in [12].

First, we give the convergence rates of distributed learning for high-dimensional quantile regression models under the CSL framework.

Theorem 1.

Assume that Assumptions 1–5 hold, and

λ^{t + 1} \geq 2 {∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty}

. Then, with probability at least

1 - p^{- C}

, we have

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} \leq C (λ^{t + 1} s + \frac{s^{2} c_{n}^{2} log n}{n}),

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} \leq C (λ^{t + 1} \sqrt{s} + \frac{s^{3 / 2} c_{n}^{2} log n}{n}),

where

c_{n} = \sqrt{log p}

.

Remark 3.

Recall that Lemma A1,

{∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty} \leq C (\sqrt{\frac{log p}{N}} + \frac{s {log}^{3 / 2} p}{n} + δ_{t} (\sqrt{\frac{s {log}^{2} p}{n}} + \sqrt{\frac{{log}^{3} p}{n}}) + δ_{t}^{2} \sqrt{{log}^{3} p}),

where

δ_{t} = {∥{\tilde{β}}^{t} - β^{*}∥}_{1}

. We can take

λ^{t + 1} = C (\sqrt{\frac{log p}{N}} + \frac{s {log}^{3 / 2} p}{n} + δ_{t} (\sqrt{\frac{s {log}^{2} p}{n}} + \sqrt{\frac{{log}^{3} p}{n}}) + δ_{t}^{2} \sqrt{{log}^{3} p}) .

(22)

Thus, Theorem 1 upper bounds the learning error

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1}

as a function of

{∥{\tilde{β}}^{t} - β^{*}∥}_{1}

. So, applying it to the iterative program, we obtain the following learning error bound, which depends on the local

ℓ_{1}

-regularized estimation error

{∥{\tilde{β}}^{0} - β^{*}∥}_{1}

.

Corollary 1.

Suppose the conditions of Theorem 1 are satisfied;

λ^{t + 1}

takes as (22); and for all t,

{∥{\tilde{β}}^{0} - β^{*}∥}_{1} \leq C s \sqrt{\frac{log p}{n}}

. Then, with probability at least

1 - p^{- C}

, we have

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} ≲ {(1 - b_{n})}^{- 1} (1 - b_{n}^{t + 1}) (s \sqrt{\frac{log p}{N}} + \frac{s^{2} {log}^{2} p}{n}) + b_{n}^{t + 1} {∥{\tilde{β}}^{0} - β^{*}∥}_{1},

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} ≲ {(1 - b_{n})}^{- 1} (1 - b_{n}^{t + 1}) (\sqrt{\frac{s log p}{N}} + \frac{s^{3 / 2} {log}^{2} p}{n}) + b_{n}^{t} a_{n} {∥{\tilde{β}}^{0} - β^{*}∥}_{1},

where

b_{n} = s^{2} \sqrt{\frac{{log}^{4} p}{n}}

, and

a_{n} = s^{\frac{3}{2}} \sqrt{\frac{{log}^{4} p}{n}}

.

Remark 4.

For initialization estimation, we refer to Theorem 2 in [12],

∥ {\tilde{β}}^{(0)} - β^{*} ∥_{1} = O_{p} \{s \sqrt{\frac{log (p \lor n)}{n}}\} .

We further explain the bound and see the scaling with respect to n, K, s, and p. When

n ≫ s^{4} {log}^{4} p

, it is easy to see by taking

λ^{t + 1} = C \{\sqrt{\frac{log p}{N}} + \frac{s^{2} {log}^{2.5} p}{n} + (\sqrt{\frac{log p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{t + 1}\},

we have the following error bounds:

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} = O_{p} \{s \sqrt{\frac{log p}{N}} + (s \sqrt{\frac{l o g p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{t + 1}\},

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} = O_{p} \{\sqrt{\frac{s log p}{N}} + (\sqrt{\frac{s l o g p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{t + 1}\} .

Moreover, as long as t is large enough so that

{(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{t + 1} ≲ s \sqrt{\frac{log p}{N}}

, and

n ≫ s^{4} {log}^{4} p

, then

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} = O_{p} (s \sqrt{\frac{log p}{N}}),

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} = O_{p} (\sqrt{\frac{s log p}{N}}),

which match the centralized lasso without any additional error term [30] as [3] has done in sparse linear regression distributed learning.

Based on the proposed Q-DistBoots-SI algorithm, we define

T = {∥\sqrt{N} ({\tilde{β}}^{ι} - β^{*})∥}_{\infty} .

(23)

Theorem 2.

(K-gard-Q) Assume that Assumptions 1–5 hold, let

λ^{t} ≍ C \{\sqrt{\frac{log p}{N}} + \frac{s^{2} {log}^{2.5} p}{n} + (\sqrt{\frac{log p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{t}\}

(24)

for

t \geq 0

, and

λ_{l} ≍ \sqrt{\frac{log p}{n}}

for

l = 1, \dots, p

. Then, if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p + s^{4} {log}^{4} p + s^{6} {(s^{*})}^{2} {log}^{8 + 4 κ} p

,

K ≫ ({(s^{*})}^{2} s^{2} + {(s^{*})}^{2}) {log}^{5 + 2 κ} p

, and for

ι \geq ι_{m i n}

, where

\begin{matrix} ι_{min} = 1 + O & ⌊max \{\frac{log (s^{2}) + log ({(s^{*})}^{2}) + log ({log}^{5 + 2 κ} p)}{log n - log (s^{4}) - log {log}^{4} p}, \\ \frac{log (s^{3}) + log (s^{*}) + log K + log ({log}^{4 + 2 κ} p) - log (n^{\frac{1}{2}})}{log n - log (s^{4}) - log ({log}^{4} p)}\}⌋, \end{matrix}

we have

sup_{α \in (0, 1)} | P (T \leq c_{\bar{W}} (α)) - α | = o (1),

(25)

where

c_{\bar{W}} (α) : = inf {w \in R : P_{ϵ} (\bar{W} \leq w) \geq α}

, in which

{\bar{W}}^{(b)}

is the K-grad-Q bootstrap statistics with the same distribution as

\bar{W}

in (12) and

P_{ϵ}

denotes the probability with respect to the randomness from the multipliers. In addition,

sup_{α \in (0, 1)} |P (\hat{T} \leq c_{\bar{W}} (α)) - α| = o (1),

(26)

where

\hat{T}

is defined in (7).

Theorem 2 ensures the effectiveness of constructing simultaneous confidence intervals for quantile regression model parameters using the “K-grad-Q” bootstrap method in Algorithm 1. Moreover, it indicates that bootstrap quantiles can approximate the prior statistics, implying that our proposed bootstrap procedure possesses statistical validity similar to the prior estimation method.

Remark 5.

If

n = p^{Υ_{n}}

,

\bar{s} = s \lor s^{*} = p^{Υ_{\bar{s}}}

,

K = p^{Υ_{K}}

for some constants

Υ_{n}

,

Υ_{\bar{s}}

, and

Υ_{K}

, then a sufficient condition is

Υ_{n} ≫ 8 Υ_{\bar{s}}

,

Υ_{K} ≫ 4 Υ_{\bar{s}}

, and

ι_{min} = 1 + ⌊max \{\frac{4 Υ_{\bar{s}}}{Υ_{n} - 4 Υ_{\bar{s}}}, \frac{4 Υ_{\bar{s}} + Υ_{K} - \frac{1}{2} Υ_{n}}{Υ_{n} - 4 Υ_{\bar{s}}}\}⌋ .

Notice that the representation of

ι_{m i n}

mentioned above is independent of the dimension p; the direct effect of p only enters through an iterative logarithmic term

log (log p)

, which is dominated by

log \bar{s} ≍ log p

.

Theorem 3.

(n+K-1-grad-Q) Assume that Assumptions 1–5 hold; take

λ^{t}

as (24) for

t \geq 0

and

λ_{l} ≍ \sqrt{\frac{log p}{n}}

for

l = 1, \dots, p

. Then, if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p + s^{4} {log}^{4} p + s^{6} {(s^{*})}^{2} {log}^{8 + 4 κ} p

,

K ≫ ({(s^{*})}^{2} s^{2} + {(s^{*})}^{2}) {log}^{5 + 2 κ} p

, and when

ι \geq ι_{m i n}

, where

\begin{matrix} ι_{min} = 1 + O & ⌊max \{\frac{log (s^{2}) + log ({(s^{*})}^{2}) + log (n + {log}^{2} p) + log ({log}^{5 + 2 κ} p) - l o g n}{log n - log (s^{4}) - log ({log}^{4} p)}, \\ \frac{log (s^{3}) + log (s^{*}) + log K + log ({log}^{4 + 2 κ} p) - log (n^{\frac{1}{2}})}{log n - log (s^{4}) - log {log}^{4} p}\}⌋, \end{matrix}

we have

sup_{α \in (0, 1)} | P (T \leq c_{\tilde{W}} (α)) - α | = o (1),

(27)

where

c_{\tilde{W}} (α) : = inf {w \in R : P_{ϵ} (\tilde{W} \leq w) \geq α}

, in which

\tilde{W}

is the n+K-1-grad-Q bootstrap statistics with the same distribution as

{\tilde{W}}^{(b)}

in (13), and

P_{ϵ}

denotes the probability with respect to the randomness from the multipliers. In addition, (27) also holds.

Theorem 3 establishes the statistical validity of the distributed bootstrap method when using “n+K-1-grad-Q”. To gain insight into the difference between “K-grad-Q” and “n+K-1-grad-Q”, we compare the difference between the covariance of the oracle score A and the conditional covariance of

\bar{A}

and

\tilde{A}

conditioning on the data, and we obtain

{∥c o v_{ϵ} (\bar{A}) - c o v (A)∥}_{m a x} = O_{p} (s^{*} \sqrt{n} r_{\bar{β}} + s^{*} \sqrt{\frac{log p}{K}} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}} + s^{*} n r_{\bar{β}}^{2}),

and

{∥c o v_{ϵ} (\tilde{A}) - c o v (A)∥}_{m a x} = O_{p} (s^{*} \sqrt{n} r_{\bar{β}} + s^{*} \sqrt{\frac{log p}{n + K}} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}} + s^{*} (n \land K)_{\bar{β}}^{2}) .

Remark 6.

If

n = p^{Υ_{n}}

,

\bar{s} = s \lor s^{*} = p^{Υ_{\bar{s}}}

, and

K = p^{Υ_{K}}

for some constants

Υ_{n}

,

Υ_{\bar{s}}

, and

Υ_{K}

, then a sufficient condition is

Υ_{n} ≫ 8 Υ_{\bar{s}}

,

Υ_{n} + Υ_{K} ≫ 4 Υ_{\bar{s}}

, and

ι_{min} = 1 + ⌊max \{\frac{4 Υ_{\bar{s}}}{Υ_{n} - 4 Υ_{\bar{s}}}, \frac{4 Υ_{\bar{s}} + Υ_{K} - \frac{1}{2} Υ_{n}}{Υ_{n} - 4 Υ_{\bar{s}}}\}⌋ .

Notice that the representation of

ι_{m i n}

mentioned above is independent of the dimension p; the direct effect of p only enters through an iterative logarithmic term

log (log p)

, which is dominated by

log \bar{s} ≍ log p

.

Remark 7.

The rates of

{λ^{(t)}}_{t = 0}^{ι}

and

{λ_{l}}_{l = 1}^{p}

in Theorems 3.2 and 3.3 are motivated by Theorem 1 and [2]. Therefore, we fix

{λ^{(t)}}_{t = 0}^{ι}

(e.g., 0.01 in the simulation study) and use the cross-validation method to choose

{λ_{l}}_{l = 1}^{p}

.

Remark 8.

The total communication cost in our algorithm is of the order

ι_{min} K p

because in each iteration we communicate p-dimensional vectors between the master node and K-1 worker nodes, and

ι_{m i n}

only grows logarithmically with K when n and p are fixed. Our order matches those in the existing communication-efficient statistical inference literature, e.g., [3,4,25].

4. Simulation Experiments

In this section, we demonstrate the advantages of our proposed approach through numerical simulation. We consider the problem of parameter estimation for high-dimensional quantile regression models in a distributed environment. In Section 4.1, we compare our algorithm Q-ADMM-CSL with the oracle estimation (Q-Oracle) and simple divide and conquer (Q-Avg) for high-dimensional quantile regression, evaluating the computational effectiveness of our proposed algorithm. In Section 4.2, we construct confidence intervals and assess their validity. The data are generated from the following model:

y_{k i} = x_{k i}^{T} β^{*} + ε_{k i}, k = 1, \dots, K, i = 1, \dots, n,

where

ε_{k i} = {\tilde{ε}}_{k i} - F_{{\tilde{ε}}_{k i}}^{- 1} (τ)

by, respectively, taking

{\tilde{ε}}_{k i} \sim N (0, 0.25)

and

t (2)

to demonstrate the benefits of our method for large-scale high-dimensional data with heavy-tailed distribution.

In this section, we consider a high-dimensional quantile regression model with dimension of feature

p = 1000

; fix the total sample size

N = 3000

; and select the numbers of machines

K = 5, 10

, and 20, respectively. Therefore, the sample size on each machine is

n = N / K

, that is,

n =

600 for

K = 5

, 300 for

K = 10

and 150 for

K = 20

. Considering the scenario of parameter sparsity, we choose the real coefficient

β

to be p-dimensional, in which s coefficients are non-zero, and the remaining

p - s

parameters are 0. We consider the two cases: (1) sparsity with

s = 4

,

β_{5}^{*} = 1, β_{8}^{*} = 1, β_{11}^{*} = 1, β_{14}^{*} = 1

, and the rest of the components are 0; (2) sparsity with

s = 8

,

β_{5}^{*} = 1, β_{8}^{*} = 1, β_{11}^{*} = 1, β_{14}^{*} = 1, β_{16}^{*} = 1,

β_{19}^{*} = 1,

β_{21}^{*} = 1, β_{25}^{*} = 1

, and the rest are 0. We generate independent and identically distributed covariates

x_{i}

from a multivariate normal distribution

N (0, Σ)

, where the covariance matrix

Σ_{l, l^{^{'}}} = 0 . 5^{| l - l^{^{'}} |}

. Given a quantile level

τ \in (0, 1)

, we consider three levels: 0.25, 0.5, and 0.75.

4.1. Parameter Estimation

In this section, we study the effect of our proposed algorithm. We repeatedly generate 100 datasets of independent data, and we use the

l_{2}

- norm to evaluate the quality of parameter estimate. That is,

∥ \tilde{β} - β^{*} ∥_{2}

. Meanwhile, we compare the effect of

\tilde{β}

obtained by our proposed algorithm,

β_{o r c a l e}

, obtained by all data estimation and

β_{a v g}

, obtained by naively average data estimation.

For the choice of penalty parameter

λ

, in the oracle estimation, we refer to the method of selecting penalty parameter in [12]; in the construction of the average estimation, we choose

λ = 0.1

; in our proposed distributed multi-round communication process,

λ^{(0)} = 0.1

; and when

t > 0

we set

λ^{(t)} = 0.01

. For the parameters

σ

and

η

in ADMM, we refer to the selection in [27] to choose

σ = 0.01

and

η = Λ (x x^{T})

.

Figure 1 and Figure 2 show the relationship between the number of communication rounds and the estimation error of parameters, when the noise distributions are normal and

t (2)

, for the sparsity levels

s = 4

and

s = 8

, respectively. We consider various scenarios involving different quantile levels and number of machines. It can be observed that after sufficient communication rounds, our parameter estimation method (Q-ADMM-CSL) can approximate the performance of the Oracle estimation (Q-Oracle), and the performance is significantly better than Q-Avg after a round of communication. In addition, our proposed method converges quickly, and matches the Oracle method after only about 30 rounds of communication.

4.2. Simultaneous Confidence Intervals

In this section, we demonstrate the statistical validity of confidence intervals constructed by our proposed method. For each choice of s and K, we run Algorithm 1 with “K-grad“ and “n+K-1-grad” on 100 independently generated datasets and compute the empirical coverage probabilities and the average widths based on the 100 running. At each running, we draw

B = 500

bootstrap samples and calculate B bootstrap statistics (

\bar{W}

or

\tilde{W}

) simultaneously. We obtain the

90 %

and

95 %

empirical quantiles and further obtain the

90 %

and

95 %

simultaneous confidence intervals. For the selection of the adjustment parameter

λ_{l}

in the nodewise algorithm, we refer to the method proposed in [25].

In Figure 3, for the case of

ϵ \sim N (0, 0 . 5^{2})

, and Figure 4, for the one of

ϵ \sim t (2)

, the coverage probabilities and the ratios of the average widths to the prior widths, calculated using the “K-grad-Q“ and “n+K-1-grad-Q” methods for different quantile levels, are displayed. The confidence levels are 95%. The sparsity levels are

s = 4

and

s = 8

. To determine whether the true values of the non-zero elements in the parameter

β^{*}

lie within the intervals constructed by our proposed method, we examine different values for the different number of machines, K. We observe that the confidence intervals constructed by our method are capable of effectively encompassing the true values of unknown parameters. In Figure 5 for

ϵ \sim N (0, 0 . 5^{2})

and Figure 6 for

ϵ \sim t (2)

, we construct a 95% confidence interval for the fifth element of the true parameter (

β_{5}^{*} = 1

). The case for the confidence level of

90 %

is listed in the Appendix A.

When the round of communication is low, the accuracy of parameter estimation is poor and the coverage probabilities of both methods are low. However, when the round of communication is sufficiently large, the estimation accuracy is relatively high, and the “k-1-grad-Q” method tends to coverage. In addition, the “n+k-1-grad-Q” method is relatively more accurate. The confidence intervals via our method can effectively cover the unknown true parameters. We also find that when the number of machines is too large (with small amounts of data on each machine), the estimation accuracy is low, which also leads to a low coverage probability; when K is too small, both algorithms perform poorly, which is consistent with the results in Remarks 7 and 8.

5. Conclusions and Discussions

Constructing confidence intervals for parameters in high-dimensional sparse quantile regression models is a challenging task. The bootstrap, as a standard inference tool, has been shown useful in handling the issue. However, previous works that extended the bootstrap technique to high-dimensional models focus on non-distributed mean regression [25] or distributed mean regression [24]. We extend their “k-1-grad” and “n+k-1-grad” bootstrap techniques to “k-1-grad-Q” and “n+k-1-grad-Q” distributed bootstrap simultaneous inference for high-dimensional quantile regression, which is applicable to large-scale heterogeneous data. Our proposed Q-DistBoots-SI algorithm is based on a communication-efficient distributed learning framework [3,4]. Therefore, the Q-DistBoots-SI is a novel communication-efficient distributed bootstrap inference, which relaxes the constraint on the number of machines and is more accurate and robust for the large-scale heterogeneous data. We theoretically prove the convergence of the algorithm and establish a lower bound on the number of communication rounds

ι_{min}

that warrants statistical accuracy and efficiency. This also enriches the statistical theory of distributed bootstrap inference and provides a theoretical basis for its widespread application. In addition, our proposed Q-DistBoots-SI algorithm can also be applied to large-scale distributed data in various fields. In fact, the bootstrap method has been applied to statistical inference for a long time. For example, Chattergee and Lahiri [31] studied the performance of the bootstrapping Lasso estimators on the prostate cancer data and stated that the covariates log(cancer volume), log(prostate weight), seminal vesicle invasion, and Gleason score have a nontrivial effect on log(prostate specific antigen); the rest of the variables (age, log(benign prostatic hyperplasia amount), log(capsular penetration), and percentage Gleason scores 4 or 5) were judged insignificant at level

1 - α = 0.1

; Liu et al. [32] applied their proposed bootstrap lasso + partial ridge method to a data set containing 43,827 gene expression measurements from the Illumina RNA sequencing of 498 neuroblastoma samples and found some significant genes. Yu et al. [25] tested their distributed bootstrap for simultaneous inference on a semi-synthetic dataset based on the US Airline On-time Performance dataset and successfully selected the relevant variables associated with arrival delay. However, Refs. [25,31,32] mainly focused on bootstrap inference for mean regression. Therefore, they cannot select relevant predictive variables with the responses at the different quantile levels. In contrast, our method can be successfully applied to the US Airline On-time Performance dataset and gene expression dataset to infer predictors with significant effects on the response at each quantile level. This is very important because we may be more interested in the influencing factors of response variables at extreme quantile levels. For example, our approach can be applied to gene expression data to identify genes that have significant effects on predicting a cancer gene’s expression levels in a quantile regression model. Compared with mean regression methods, our method finds genes that should be biologically more reasonable and interpretable because of the characteristics of quantile regression. Future work is needed to investigate the applications of our distributed bootstrap simultaneous inference for quantile regression to large-scale distributed datasets in various fields. Although our Q-DistBoots-SI algorithm is communication-efficient, when the feature dimension of the data is extra-high, the gradient transmitted by each worker machine in the algorithm is still an ultra-high dimensional vector, which also has unbearable communication costs. Thus, we need to develop a more communication-efficient Q-DistBoots-SI algorithm via quantization and sparse techniques (such as Topk) for large-scale ultra-high-dimensional distributed data. In addition, our Q-DistBoots-SI algorithm cannot cope with Byzantine failure in distributed statistical learning. However, Byzantine failure has recently attracted significant attention [7] and is becoming more common in distributed learning frameworks because worker machines may exhibit abnormal behavior due to crashes, faulty hardware, and stalled computation or unpredictable communication channels. Byzantine-robust distributed bootstrap inference will also be a topic of our future research. Additionally, in the future, we can also extend our distributed bootstrap inference method into transfer learning and graph models for large-scale high-dimensional data.

Author Contributions

Conceptualization, X.Z.; methodology, X.Z. and Z.J.; software, Z.J.; validation, X.Z., Z.J. and C.H.; formal analysis, X.Z. and Z.J.; investigation, X.Z. and C.H.; data curation, Z.J.; writing—original draft preparation, Z.J. and X.Z.; writing—review and editing, X.Z.; supervision, X.Z.; project administration, X.Z.; and funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 12171242, 12371267) and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (Grant No. KYCX23_2258).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proof of Theorems

We give the proofs of the main results.

Denote

{\tilde{L}}_{1} (β, {\tilde{β}}^{t}) = L_{1} (β) + 〈\frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t}), β〉 .

So, we have

\nabla {\tilde{L}}_{1} (β, {\tilde{β}}^{t}) = \nabla L_{1} (β) + \frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t}) .

Notice that

L_{N} (β) = \frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} (β)

. Let

R ({\tilde{β}}^{t + 1}, β^{*}; {\tilde{β}}^{t}) = {\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) - 〈\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}), {\tilde{β}}^{t + 1} - β^{*}〉 .

Appendix A.1. Proof of Theorem 1

We first show how the estimation error bound decreases after one round of communication, i.e.,

∥{\tilde{β}}^{t + 1} - β^{*}∥

decreases with

∥{\tilde{β}}^{t} - β^{*}∥

. The proof is divided into three steps as follows:

Step 1: Choose

λ^{t + 1} \geq 2 {∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty}

. Then, with probability at least

1 - ϵ

, we have

{\tilde{β}}^{t + 1} - β^{*} \in Δ (S, 3) .

Since

{\tilde{L}}_{1} (β, {\tilde{β}}^{t})

is convex in

β

, we have

{\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) \geq \nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) ({\tilde{β}}^{t + 1} - β^{*}) .

(A1)

By the optimality of

{\tilde{β}}^{t + 1}

in (5), we have

{\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) \leq λ^{t + 1} {∥β^{*}∥}_{1} - λ^{t + 1} {∥{\tilde{β}}^{t + 1}∥}_{1} .

(A2)

By (A1) and (A2), we have

- {∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty} {∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} \leq {\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) \leq λ^{t + 1} {∥β^{*}∥}_{1} - λ^{t + 1} {∥{\tilde{β}}^{t + 1}∥}_{1} .

Let

δ = {\tilde{β}}^{t + 1} - β^{*}

. By

λ^{t + 1} \geq 2 {∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty}

, we obtain

- \frac{λ^{t + 1}}{2} {∥ δ ∥}_{1} \leq λ^{t + 1} {∥β^{*}∥}_{1} - λ^{t + 1} {∥β^{*} + δ∥}_{1} .

Since

{∥ δ ∥}_{1} = ∥ δ_{S} ∥_{1} + {∥ δ_{S^{c}} ∥}_{1}

,

∥ β^{*} ∥_{1} = {∥ β_{S}^{*} ∥}_{1}

, and

∥ β^{*} {+ δ ∥}_{1} = ∥ β_{S}^{*} + δ_{S} ∥_{1} + {∥ δ_{S^{c}} ∥}_{1}

, we obtain

- \frac{λ^{t + 1}}{2} ∥ δ_{S} ∥_{1} - \frac{λ^{t + 1}}{2} ∥ δ_{S^{c}} ∥_{1} \leq λ^{t + 1} ∥ δ_{S} ∥_{1} - λ^{t + 1} {∥ δ_{S^{c}} ∥}_{1} .

Thus, by rearranging, we have

∥ δ_{S^{c}} ∥_{1} \leq 3 {∥ δ_{S} ∥}_{1} .

Step 2: To show with probability

1 - n^{- C}

,

sup_{∥ δ ∥_{2} \leq ν} | R ({\tilde{β}}^{t + 1}, β^{*}; {\tilde{β}}^{t}) - E [R ({\tilde{β}}^{t + 1}, β^{*}; {\tilde{β}}^{t})] | \leq C (\frac{s^{3 / 4} {(c_{n} ν)}^{3 / 2} \sqrt{log n}}{\sqrt{n}} + \frac{c_{n} \sqrt{s} ν log n}{n}),

where

c_{n} = O (\sqrt{log p})

. The result can be obtained by Step 2 in [5] and the fact

R ({\tilde{β}}^{t + 1}, β^{*}; {\tilde{β}}^{t}) = L_{1} (β^{*} + δ) - L_{1} (β^{*}) - δ^{T} \nabla L_{1} (β^{*}) .

(A3)

Amazingly, the result is irrelevant to the

{\tilde{β}}^{t}

.

Step 3: To show

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} \leq C (λ^{t + 1} \sqrt{s} + \frac{s^{3 / 2} c_{n}^{2} log n}{n})

with probability at least

1 - ϵ

.

By Step 1,

δ = {\tilde{β}}^{t + 1} - β^{*} \in Δ (S, 3)

. Assuming

∥δ∥ = ν > 0

, by the optimality of

{\tilde{β}}^{t + 1}

in (5), we have

inf_{∥ δ ∥_{2} = ν, δ \in Δ (S, 3)} {\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) + λ^{t + 1} {∥{\tilde{β}}^{t + 1}∥}_{1} - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) - λ^{t + 1} {∥β^{*}∥}_{1} \leq 0 .

(A4)

Further, we have

\begin{matrix} λ^{t + 1} {∥β^{*}∥}_{1} - λ^{t + 1} {∥{\tilde{β}}^{t + 1}∥}_{1} \overset{(i)}{\geq} {\tilde{L}}_{1} ({\tilde{β}}^{t + 1}, {\tilde{β}}^{t}) - {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) \\ \overset{(i i)}{\geq} & E [R ({\tilde{β}}^{t + 1}, β^{*}; {\tilde{β}}^{t})] + δ^{T} \nabla L_{1} (β^{*}) - C (\frac{s^{3 / 4} {(c_{n} ν)}^{3 / 2} \sqrt{log n}}{\sqrt{n}} + \frac{c_{n} \sqrt{s} ν log n}{n}) \\ \overset{(i i i)}{\geq} & E [L_{1} (β^{*} + δ) - L_{1} (β^{*})] - {∥δ∥}_{1} {∥\nabla L_{1} (β^{*})∥}_{\infty} - C (\frac{s^{3 / 4} {(c_{n} ν)}^{3 / 2} \sqrt{log n}}{\sqrt{n}} + \frac{c_{n} \sqrt{s} ν log n}{n}) \\ \overset{(i v)}{\geq} & C (ν^{2} \land ν) - \frac{C}{2} λ^{t + 1} \sqrt{s} ν - C (\frac{s^{3 / 4} {(c_{n} ν)}^{3 / 2} \sqrt{log n}}{\sqrt{n}} + \frac{c_{n} \sqrt{s} ν log n}{n}), \end{matrix}

(A5)

where (i) because of (A4), (ii) by Step 2 and the fact (A3), (iii) by the fact (A3) and the Cauchy–Schwarz inequality, and (iv) by Equation (3.7) in Lemma 3 of [12],

{∥δ∥}_{1} = {∥δ_{S}∥}_{1} + {∥δ_{S^{c}}∥}_{1} \leq 4 {∥δ_{S}∥}_{1} \leq 4 \sqrt{s} {∥δ_{S}∥}_{2} \leq \frac{C}{2} \sqrt{s} ν,

(A6)

and

\nabla L_{1} (β^{*}) = \nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) + (\frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t})) \to \nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})

in probability since

(\frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t})) \to 0

when

n \to \infty

by the law of large numbers, thus

{∥\nabla L_{1} (β^{*})∥}_{\infty} \leq 2 ∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥ \leq λ^{t + 1}

. In addition,

{∥β^{*}∥}_{1} - {∥{\tilde{β}}^{t + 1}∥}_{1} \leq {∥δ∥}_{1} \leq \frac{C}{2} \sqrt{s} ν .

Combining (A5), we have

C (ν^{2} \land ν) - C λ^{t + 1} \sqrt{s} ν - C (\frac{s^{3 / 4} {(c_{n} ν)}^{3 / 2} \sqrt{log n}}{\sqrt{n}} + \frac{c_{n} \sqrt{s} ν log n}{n}) \leq 0 .

Some algebra shows that this implies

ν \leq C (λ^{t + 1} \sqrt{s} + \frac{c_{n} \sqrt{s} log n}{n} + \frac{s^{3 / 2} c_{n}^{2} log n}{n}) \leq C (λ^{t + 1} \sqrt{s} + \frac{s^{3 / 2} c_{n}^{2} log n}{n}) .

Thus, we have

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2} \leq C (λ^{t + 1} \sqrt{s} + \frac{s^{3 / 2} c_{n}^{2} log n}{n}) .

Then, by

{∥{\tilde{β}}^{t + 1} - β^{*}∥}_{1} \leq \sqrt{s} {∥{\tilde{β}}^{t + 1} - β^{*}∥}_{2}

, we easily obtain the second result.

Appendix A.2. Proof of Theorem 2

Our proof refers to the argument process of Theorem 3 in the literature [25]. First, if

n ≫ s^{4} l o g^{4} p

, we can obtain

{∥\bar{β} - β^{*}∥}_{1} = {∥{\tilde{β}}^{ι} - β^{*}∥}_{1} = O_{P} (s \sqrt{\frac{log p}{N}} + (s \sqrt{\frac{log p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{ι}) .

Then, as long as

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p + s^{4} l o g^{4} p

,

K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p

, and

s \sqrt{\frac{log p}{N}} + (s \sqrt{\frac{log p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{ι} ≪ min \{\frac{1}{s^{*} \sqrt{n} {log}^{2 + κ} p}, \frac{1}{\sqrt{s s^{*} K} n^{1 / 4} {log}^{3 / 2 + κ} p}\}

by Lemma A2, we have

s u p_{α \in (0, 1)} | P (T \leq c_{\bar{W}} (α)) - α | = o (1) .

These conditions hold if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p + s^{4} {log}^{4} p + s^{6} {(s^{*})}^{2} {log}^{8 + 4 κ} p

,

K ≫ ({(s^{*})}^{2} s^{2} + {(s^{*})}^{2}) {log}^{5 + 2 κ} p

, and

\begin{matrix} ι_{m i n} \geq & m a x \{\frac{log (s^{2}) + log ({(s^{*})}^{2}) + log (C {log}^{5 + 2 κ} p)}{log n - log (s^{4}) - log ({log}^{4} p)}, \\ \frac{log (s^{3}) + log (s^{*}) + l o g K + log (C {log}^{4 + 2 κ} p) - log (n^{\frac{1}{2}})}{log n - log (s^{4}) - log ({log}^{4} p)}\} . \end{matrix}

Appendix A.3. Proof of Theorem 3

Following the same proof as Theorem 3.2 in [25], by applying Theorem 3.1 in [25] and Lemma A8, we have that

|P (T \leq c_{\tilde{W}} (α)) - α| = o (1)

, as long as

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p

+ s^{4} {log}^{4} p

,

n + K ≫ s^{* 2} {log}^{5 + 2 κ} p

, and

s \sqrt{\frac{log p}{N}} + (s \sqrt{\frac{log p}{n}}) {(s^{2} \sqrt{\frac{{log}^{4} p}{n}})}^{ι} ≪ min \{\frac{1}{s^{*} (\sqrt{n} + log p) {log}^{2 + κ} p}, \frac{1}{\sqrt{s s^{*} K} n^{\frac{1}{4}} {log}^{3 / 2 + κ} p}\} .

These conditions holds if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p + s^{4} {log}^{4} p + s^{6} {(s^{*})}^{2} {log}^{8 + 4 κ} p

,

n + K ≫ s^{* 2} {log}^{5 + 2 κ} p

,

N ≳ {(s^{*})}^{2} s^{2} {log}^{7 + 2 κ} p

,

K ≫ {(s^{*})}^{2} s^{2} {log}^{5 + 2 κ} p

and

\begin{matrix} ι_{m i n} \geq m a x \{\frac{log s^{2} + log {(s^{*})}^{2} + log (n + {log}^{2} p) + log (C {log}^{5 + 2 κ} p) - log n}{log n - log (s^{4}) - log ({log}^{4} p)}, \\ \frac{log (s^{3}) + log s^{*} + log K + log (C {log}^{4 + 2 κ} p) - log (n^{\frac{1}{2}})}{log n - log (s^{4}) - log ({log}^{4} p)}\} . \end{matrix}

Appendix B. Lemmas and Their Proofs

Lemma A1.

Under the conditions of Theorem 1, with the probability at least

1 - p^{- C}

, we have

{∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty} \leq C (\sqrt{\frac{log p}{N}} + \frac{s {log}^{3 / 2} p}{n} + δ_{t} (\sqrt{\frac{s {log}^{2} p}{n}} + \sqrt{\frac{{log}^{3} p}{n}}) + δ_{t}^{2} \sqrt{{log}^{3} p}),

where

δ_{t} = {∥{\tilde{β}}^{t} - β^{*}∥}_{1}

.

Proof of Lemma A1.

Based on the definition, we have

\begin{matrix} \nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t}) & = & \nabla L_{1} (β^{*}) + \nabla L_{N} ({\tilde{β}}^{t}) - \nabla L_{1} ({\tilde{β}}^{t}) \\ = & \frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} (β^{*}) + (\nabla L_{1} (β^{*}) - \nabla L_{1} ({\tilde{β}}^{t})) - (\nabla L_{N} (β^{*}) - \nabla L_{N} ({\tilde{β}}^{t})) \\ = & \frac{1}{K} \sum_{k = 1}^{K} \nabla L_{k} (β^{*}) + [(\nabla L_{1} (β^{*}) - \nabla L_{1} ({\tilde{β}}^{t})) - E (\nabla L_{1} (β^{*}) - \nabla L_{1} ({\tilde{β}}^{t}))] \\ - [(\nabla L_{N} (β^{*}) - \nabla L_{N} ({\tilde{β}}^{t})) - E (\nabla L_{N} (β^{*}) - \nabla L_{N} ({\tilde{β}}^{t}))] \\ + [E (\nabla L_{1} (β^{*}) - \nabla L_{1} ({\tilde{β}}^{t})) - E (\nabla L_{N} (β^{*}) - \nabla L_{N} ({\tilde{β}}^{t}))] \\ = : & I_{1} + I_{2} - I_{3} + I_{4} . \end{matrix}

(A7)

For

I_{1}

, by Theorem 1 in [12], it shows that with probability at least

1 - p^{- C}

,

I_{1} ≲ \sqrt{1 + C} \sqrt{\frac{log p}{N}} .

For

I_{2}

, by Lemma A.1 in [5], it shows that with probability

1 - p^{- C}

,

\begin{matrix} {∥I_{2}∥}_{\infty} & = & {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} [I (y_{i} \leq x_{1 i}^{T} β^{*}) - I (y_{i} \leq x_{1 i}^{T} {\tilde{β}}^{t}) - (F (x_{1 i}^{T} β^{*} | x_{1 i}) - F (x_{1 i}^{T} {\tilde{β}}^{t} | x_{1 i}))]∥}_{\infty} \\ ≲ & δ_{t} \sqrt{\frac{s {log}^{2} p}{n}} + \frac{s {log}^{3 / 2} p}{n} . \end{matrix}

For

I_{3}

, similarly,

{∥I_{3}∥}_{\infty} ≲ δ_{t} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N} .

For

I_{4}

, we have

\begin{matrix} {∥I_{4}∥}_{\infty} \\ = & {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} (F (x_{1 i}^{T} β^{*} | x_{1 i}) - F (x_{1 i}^{T} {\tilde{β}}^{t} | x_{1 i})) - \frac{1}{n K} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} (F (x_{k i}^{T} β^{*} | x_{k i}) - F (x_{k i}^{T} {\tilde{β}}^{t} | x_{k i}))∥}_{\infty} \\ \overset{(i)}{\leq} & δ_{t} {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} f (x_{1 i}^{T} β^{*} | x_{1 i}) - \frac{1}{n K} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} x_{k i}^{T} f (x_{k i}^{T} β^{*} | x_{k i})∥}_{\infty} + C δ_{t}^{2} (max_{k, i} {∥x_{k i}∥}_{\infty}^{3}) \\ \leq & δ_{t} {∥\frac{1}{n} \sum_{i = 1}^{n} [x_{1 i} x_{1 i}^{T} f (x_{1 i}^{T} β^{*} | x_{1 i}) - E (x_{1 i} x_{1 i}^{T} f (x_{1 i}^{T} β^{*} | x_{1 i}))]∥}_{\infty} \\ + δ_{t} {∥\frac{1}{n K} \sum_{k = 1}^{K} \sum_{i = 1}^{n} [x_{k i} x_{k i}^{T} f (x_{k i}^{T} β^{*} | x_{k i}) - E (x_{k i} x_{k i}^{T} f (x_{k i}^{T} β^{*} | x_{k i}))]∥}_{\infty} + C δ_{t}^{2} (max_{k, i} {∥x_{k i}∥}_{\infty}^{3}) \\ \overset{(i i)}{≲} & δ_{t} (log (N p) \sqrt{\frac{log (p)}{n}}) + δ_{t} (log p \sqrt{\frac{log p}{N}}) + δ_{t}^{2} {(log (N p))}^{3 / 2} \\ \overset{(i i i)}{≲} & δ_{t} (\sqrt{\frac{{log}^{3} p}{n}}) + δ_{t}^{2} \sqrt{{log}^{3} p} \end{matrix}

in probability, where (i) by Taylor expansion, (ii) by Bernstein’s inequality and

{max}_{k, i} {∥x_{k i}∥}_{\infty} = O_{p} (log (\sqrt{N p}))

, and (iii) by

O_{p} (log (N p)) = O_{p} (log (n p)) = O_{p} (log (p))

.

Combining with the above results on

{∥I_{j}∥}_{\infty}

(

j = 1, \dots, 4

), and (A7), we have

{∥\nabla {\tilde{L}}_{1} (β^{*}, {\tilde{β}}^{t})∥}_{\infty} ≲ \sqrt{\frac{log p}{N}} + δ_{t} \sqrt{\frac{s {log}^{2} p}{n}} + \frac{s {log}^{3 / 2} p}{n} + δ_{t} (\sqrt{\frac{{log}^{3} p}{n}}) + δ_{t}^{2} \sqrt{{log}^{3} p} .

□

Lemma A2.

(K-grad-Q) Under Assumptions 1–5 hold, if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p

,

K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p

, and

{∥{\tilde{β}}^{t} - β^{*}∥}_{1} ≪ min \{\frac{1}{s^{*} \sqrt{n} {log}^{2 + κ} p}, \frac{1}{\sqrt{s s^{*} K} n^{1 / 4} l o g^{3 / 2 + κ} p}\}

for some

κ > 0

, then we obtain that

sup_{α \in (0, 1)} |P (T \leq c_{\bar{W}} (α)) - α| = o (1),

(A8)

sup_{α \in (0, 1)} |P (\hat{T} \leq c_{\tilde{W}} (α)) - α| = o (1) .

(A9)

Proof.

To simplify notation, let

\bar{β} = {\tilde{β}}^{ι - 1}

, and

\tilde{β} = {\tilde{β}}^{ι}

in T of (23). Denote

L (β; z_{k i}) = ρ_{τ} (y_{k i} - x_{k i}^{T} β)

,

\nabla L (β; z_{k i}) = x_{k i} ψ_{τ} (y_{k i} - x_{k i}^{T} β)

,

Σ^{*} = E (x x^{T} f (x^{T} β^{*} | x))

and

Σ_{x} = E (x x^{T})

. Recall that

{(Σ^{*})}^{- 1} = Θ

. Note that its proof needs Lemmas A3–A6.

Notice that

∥ \sqrt{N} (\tilde{β} - β^{*}) ∥_{\infty} = {max}_{l} \sqrt{N} | \tilde{β_{l}} - β_{l}^{*} | = \sqrt{N} {max}_{l} ((\tilde{β_{l}} - β_{l}^{*}) \lor (β_{l}^{*} - \tilde{β_{l}}))

. With the same arguments as [21], we only need to consider the bootstrap consistency for

T = max_{l} \sqrt{N} {(\tilde{β} - β^{*})}_{l},

(A10)

\hat{T} = max_{l} \sqrt{N} {({\hat{β}}_{N} - β^{*})}_{l},

(A11)

which imply the bootstrap consistency for

T = ∥ \sqrt{N} (\tilde{β} - β^{*}) ∥_{\infty}

and

\hat{T} = {∥ \sqrt{N} ({\hat{β}}_{N} - β^{*}) ∥}_{\infty}

. From now on, we only consider T and

\hat{T}

as (A10) and (A11), respectively. In addition, we also define an oracle multiplier bootstrap statistic as

W^{*} : = max_{1 \leq l \leq p} - \frac{1}{\sqrt{N}} \sum_{i = 1}^{n} \sum_{k = 1}^{K} {[{(Σ^{*})}^{- 1} \nabla L (β^{*}; z_{k i})]}_{l} ϵ_{k i}^{*},

(A12)

where

ϵ_{i k}_{\sim}^{i . i . d .} N (0, 1)

,

i = 1, \dots, n; k = 1, \dots, K

, and independent of the entire dataset. The proof is divided into three steps: Step 1 is to prove that

W^{*}

is bootstrap-consistent; Step 2 is to prove that T of our algorithm Q-DistBoots-SI with K-grad-Q is bootstrap-consistent by proving the quantile of W equals to

W^{*}

in probability; and Step 3 is to prove (A8) and (A9).

Step 1: To show that

sup_{α \in (0, 1)} | P (T \leq c_{W^{*}} (α)) - α | = o (1),

where

c_{W^{*}} (α) : = inf {w \in R : P_{ϵ} (W^{*} \leq w) \geq α}

.

Note that

{(Σ^{*})}^{- 1} \nabla L (β^{*}; z) = Θ [- x (τ - I (y - x^{T} β^{*} \leq 0))] = Θ x [I (ε \leq 0) - τ],

and

\begin{matrix} E [{(Σ^{*})}^{- 1} \nabla L (β^{*}; z) {({(Σ^{*})}^{- 1} \nabla L (β^{*}; z))}^{T}] & = Θ E [x x^{T} {[I (ε \leq 0) - τ]}^{2}] Θ^{T} \\ = τ (1 - τ) Θ E [x x^{T}] Θ^{T} \\ = τ (1 - τ) Θ Σ_{x} Θ^{T} . \end{matrix}

Thus, we have

E [{(Σ^{*})}^{- 1} \nabla L (β^{*}; z)] = 0

, and by Assumptions 3 and 5, one gets

\begin{matrix} min_{l} E \{{[{(Σ^{*})}^{- 1} \nabla L (β^{*}; z)]}_{l}^{2}\} & \geq & C τ (1 - τ) min_{l} Θ_{l, l} \\ \geq & C τ (1 - τ) Λ_{min} (Θ) \\ = & \frac{C τ (1 - τ)}{Λ_{max} (Σ)} > 0 . \end{matrix}

(A13)

Under Assumption 1,

w^{T} Θ x

is a sub-Gaussian with

O (∥Θ w∥) = O (Λ_{max} (Θ) = Λ_{min} (Σ) = O (1)

ψ_{2}

-norm, for any

w \in S^{p - 1}

. In addition,

I (ε \leq 0) - τ

is also sub-Gaussian because

| I (ε \leq 0) - τ | \leq 1 + τ

and is independent with

w^{T} Θ x

. Therefore,

w^{T} Θ x [I (ε \leq 0) - τ]

is sub-exponential with uniformly bounded

ψ_{1}

-norm for any

w \in S^{p - 1}

. So,

{[{(Σ^{*})}^{- 1} \nabla L (β^{*}; z)]}_{l}

(

l = 1, \dots, p

) are sub-exponential with uniformly bounded

ψ_{1}

-norm. Thus, combining with (A13), we have verified Assumption (E.1) and Comment 2.2 of [22] for

{(Σ^{*})}^{- 1} \nabla L (β^{*}; z)

.

Define a Bahadur representation of T as

T_{0} : = max_{1 \leq l \leq p} - \sqrt{N} {[{(Σ^{*})}^{- 1} \nabla L_{N} (β^{*})]}_{l} .

(A14)

By Theorem 3.2 and Corollary 2.1 in [22], if

{log}^{7 + κ} p / N ≲ N^{- c}

holds for some constant

c > 0

(the condition holds if

N ≳ {log}^{7 + κ} p

for some

κ > 0

), we have that for every

v, ζ > 0

,

\begin{matrix} sup_{α \in (0, 1)} | P (T \leq c_{W^{*}} (α)) - α | & ≲ & N^{- c} + v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{max} > v) \\ + ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (| T - T_{0} | > ζ), \end{matrix}

(A15)

where

\begin{matrix} \hat{Ω} : & = & {Cov}_{ϵ} (- \frac{1}{\sqrt{N}} \sum_{i = 1}^{n} \sum_{k = 1}^{K} [{(Σ^{*})}^{- 1} \nabla L (β^{*}; z_{i k})] ϵ_{i k}^{*}) \\ = & {(Σ^{*})}^{- 1} {Cov}_{ϵ} (\frac{1}{\sqrt{N}} \sum_{i = 1}^{n} \sum_{k = 1}^{K} [\nabla L (β^{*}; z_{i k})] ϵ_{i k}^{*}) {(Σ^{*})}^{- 1} \\ = & {(Σ^{*})}^{- 1} (\frac{1}{N} \sum_{i = 1}^{n} \sum_{k = 1}^{K} {Cov}_{ϵ} ([\nabla L (β^{*}; z_{i k})] ϵ_{i k}^{*})) {(Σ^{*})}^{- 1} \\ = & {(Σ^{*})}^{- 1} (\frac{1}{N} \sum_{i = 1}^{n} \sum_{k = 1}^{K} [\nabla L (β^{*}; z_{i k})] {[\nabla L (β^{*}; z_{i k})]}^{T}) {(Σ^{*})}^{- 1}, \end{matrix}

(A16)

E [\nabla L (β^{*};)] = E [- x_{i} (τ - I (y_{i} - x_{i}^{T} β^{*} \leq 0))]

, and

Ω_{0} : = Cov (- {(Σ^{*})}^{- 1} \nabla L (β^{*}; z)) = τ (1 - τ) {(Σ^{*})}^{- 1} Σ_{x} {(Σ^{*})}^{- 1} .

(A17)

By Lemmas A3 and A5, we have

ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (|T - T_{0}| > ζ) = o (1),

(A18)

v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{\max} > v) = o (1) .

(A19)

By (A15), (A18), and (A19), the result of Step 1 holds.

Step 2: To show that the quantiles of

\bar{W}

and

W^{*}

are close. That is,

P ({T \leq c_{\bar{W}} (α)} ⊖ {T \leq c_{W^{*}} (α)}) = o (1),

where ⊖ is a symmetric difference.

For any

ω

such that

α + ω, α - ω \in (0, 1)

, we have

\begin{matrix} P ({T \leq c_{\bar{W}} (α)} ⊖ {T \leq c_{W^{*}} (α)}) \\ = & P ([{T \leq c_{\bar{W}} (α)} - {T \leq c_{W^{*}} (α)}] \cup [T \leq c_{W^{*}} (α)} - {T \leq c_{\bar{W}} (α)}]) \\ \leq & 2 P (c_{W^{*}} (α - ω) < T \leq c_{W^{*}} (α + ω)) + P (c_{W^{*}} (α - ω) > c_{\bar{W}} (α)) \\ + P (c_{\bar{W}} (α) > c_{W^{*}} (α + ω)) \\ \overset{(i)}{\leq} & 2 P (c_{W^{*}} (α - π (u)) < T \leq c_{W^{*}} (α + π (u))) + 2 P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u) \\ \leq & 2 P (T \leq c_{W^{*}} (α + π (u))) - 2 P (T \leq c_{W^{*}} (α - π (u))) + 2 P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u) \\ \leq & 2 [P (T \leq c_{W^{*}} (α + π (u))) - (α + π (u))] - 2 [P (T \leq c_{W^{*}} (α - π (u))) - (α - π (u))] \\ + 4 π (u) + 2 P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u) \\ \overset{(i i)}{≲} & \underset{(A 18)}{\underset{︸}{ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (| T - T_{0} | > ζ)}} + \underset{(A 19)}{\underset{︸}{v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{max} > v)}} \\ + N^{- c} + \underset{L e m m a A 6}{\underset{︸}{π (u) + P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u)}} \\ \overset{(i i i)}{=} & o (1), \end{matrix}

where (i) follows the arguments in the proof of Lemma 3.2 in [22], and let

ω = π (u) : = u^{1 / 3} {(1 \lor log (p / u))}^{2 / 3}

, (ii) by (A15), and (iii) hold from (A18) and (A19) and Lemma A6. In addition,

\begin{matrix} \bar{Ω} : & = & {Cov}_{ϵ} (- \frac{1}{\sqrt{K}} \sum_{k = 1}^{K} {\tilde{Θ}}_{1} \sqrt{n} (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) ϵ_{k}) \\ = & {\tilde{Θ}}_{1} (\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T}) {\tilde{Θ}}_{1}^{T} . \end{matrix}

(A20)

Step 3: From Step 2, we have

\begin{matrix} sup_{α \in (0, 1)} | P (T \leq c_{\bar{W}} (α)) - α | & ≲ & \underset{(A 18)}{\underset{︸}{ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (| T - T_{0} | > ζ)}} \\ + \underset{(A 19)}{\underset{︸}{v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{max} > v)}} \\ + N^{- c} + \underset{L e m m a A 6}{\underset{︸}{π (u) + P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u)}} \\ = & o (1) . \end{matrix}

(A21)

Thus, the first result (A8) holds. Further, by Lemma A4, which has

ξ \sqrt{1 \lor log \frac{p}{ξ}}

+ P (|\hat{T} - T_{0}| > ξ) = o (1)

, we can obtain the second results (A9). □

Lemma A3.

T and

T_{0}

are defined as in (23) and (A14), respectively. Under Assumptions 1–3, provided that

∥ \bar{β} - β^{*} ∥_{1} = O_{P} (r_{\bar{β}})

and

n ≫ {(s^{*})}^{2} {log}^{3} p

, we have that

| T - T_{0} | = O_{P} (r_{β} \sqrt{s^{*} K {log}^{2} p} + \frac{s^{*} {log}^{3 / 2} p}{\sqrt{n}} + \sqrt{s^{*} N} {log}^{3 / 2} (p) r_{β}^{2}) .

Moreover, for some

κ > 0

, if

n ≫ {(s \sqrt{s^{*}} + s^{*})}^{2} {(log p)}^{4 + 2 κ}

and

∥ \bar{β} - β^{*} ∥_{1} ≪ \frac{1}{\sqrt{s s^{*} K} n^{1 / 4} l o g^{3 / 2 + κ} p},

then some

ζ > 0

exist,

ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (| T - T_{0} | > ζ) = o (1) .

Proof of Lemma A3.

Let

τ = τ 1_{N}

, where

1_{N} = {(1, \dots, 1)}^{T} \in R^{N}

. Denote

e_{N}^{*} = y_{N} - X_{N} β^{*}

,

{\bar{e}}_{N} = y_{N} - X_{N} \bar{β}

and

Σ_{N}^{*} = \frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} x_{k i}^{T} f [x_{k i}^{T} β^{*} | x_{k i}]

. Note that

\begin{matrix} \bar{β} - β^{*} - {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} [I_{{{\bar{e}}_{N} \leq 0}} - I_{{e_{N}^{*} \leq 0}}]}{N} \\ = & \bar{β} - β^{*} - {\tilde{Θ}}_{1} (\frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} (I [y_{k i} \leq x_{k i}^{T} \bar{β}] - I [y_{k i} \leq x_{k i}^{T} β^{*}])) \\ \overset{(i)}{=} & \bar{β} - β^{*} - {\tilde{Θ}}_{1} (\frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} x_{k i} (F [x_{k i}^{T} \bar{β} | x_{k i}] - F [x_{k i}^{T} β^{*} | x_{k i}])) \\ + {\tilde{Θ}}_{1} O_{p} (r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \\ \overset{(i i)}{=} & \bar{β} - β^{*} - {\tilde{Θ}}_{1} (\frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} \{x_{k i} x_{k i}^{T} f [x_{k i}^{T} β^{*} | x_{k i}] (\bar{β} - β^{*}) + C [x_{k i} {(x_{k i}^{T} (\bar{β} - β^{*}))}^{2}]\}) \\ + {\tilde{Θ}}_{1} O_{p} (r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \\ = & (I_{p} - {\tilde{Θ}}_{1} Σ_{N}^{*}) (\bar{β} - β^{*}) - C {\tilde{Θ}}_{1} \frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} [x_{k i} {(x_{k i}^{T} (\bar{β} - β^{*}))}^{2}] \\ + {\tilde{Θ}}_{1} O_{p} (r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}), \end{matrix}

(A22)

where (i) by Lemma A.1 in [5] and (ii) by Taylor expansion.

By (A22), we have

\begin{matrix} N^{- 1 / 2} | T - T_{0} | \leq max_{1 \leq l \leq p} |{(\tilde{β} - β^{*})}_{l} + \sqrt{N} {({(Σ^{*})}^{- 1} \nabla L_{N} (β^{*}))}_{l}| \\ = & {∥\tilde{β} - β^{*} + {(Σ^{*})}^{- 1} \nabla L_{N} (β^{*})∥}_{\infty} \\ = & {∥\tilde{β} - β^{*} + Θ \nabla L_{N} (β^{*})∥}_{\infty} \\ = & {∥\tilde{β} - β^{*} - {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N} + {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N} - Θ \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq & {∥\tilde{β} - β^{*} - {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} + {∥{\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N} - Θ \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq & {∥\bar{β} - β^{*} + {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{{\bar{e}}_{N} \leq 0}})}{N} - {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ + {∥({\tilde{Θ}}_{1} - Θ) \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq & {∥\bar{β} - β^{*} - {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} [I_{{{\bar{e}}_{N} \leq 0}} - I_{{e_{N}^{*} \leq 0}}]}{N}∥}_{\infty} + {∥({\tilde{Θ}}_{1} - Θ) \frac{X_{N}^{⊤} (τ - I_{{e_{N} \leq 0}})}{N}∥}_{\infty} \\ \leq & {∥I_{p} - {\tilde{Θ}}_{1} Σ_{N}^{*}∥}_{max} {∥\bar{β} - β^{*}∥}_{1} + {∥{\tilde{Θ}}_{1}∥}_{\infty} max_{i, k} {| x_{k i} |}^{3} r_{\bar{β}}^{2} + {∥{\tilde{Θ}}_{1}∥}_{\infty} O_{p} (r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \\ + {∥{\tilde{Θ}}_{1} - Θ∥}_{\infty} {∥\frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq & ({∥I_{p} - {\tilde{Θ}}_{1} Σ_{1}^{*}∥}_{max} + {∥{\tilde{Θ}}_{1}∥}_{\infty} {∥Σ_{1}^{*} - Σ_{N}^{*}∥}_{max}) {∥\bar{β} - β^{*}∥}_{1} + {∥{\tilde{Θ}}_{1}∥}_{\infty} max_{i, k} {| x_{i k} |}^{3} r_{β}^{2} \\ + {∥{\tilde{Θ}}_{1} - Θ∥}_{\infty} {∥\frac{X_{N}^{⊤} (τ - I_{{e_{N} \leq 0}})}{N}∥}_{\infty} + {∥{\tilde{Θ}}_{1}∥}_{\infty} O_{p} (r_{β} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \\ = & (O_{p} (\sqrt{\frac{log p}{n}}) + O_{p} (\sqrt{s^{*}}) O_{p} (\sqrt{\frac{log p}{n}})) O_{p} (r_{β}) + O_{p} (\sqrt{s^{*}}) O_{p} ({log}^{3 / 2} (N p)) O_{p} (r_{β}^{2}) \\ + O_{p} (s^{*} \sqrt{\frac{log p}{n}}) O_{p} (\sqrt{\frac{log p}{N}}) + O_{p} (r_{β} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \sqrt{s^{*}} \\ = & O_{p} (r_{\bar{β}} \sqrt{\frac{s s^{*} {log}^{2} p}{n}} + \frac{(s \sqrt{s^{*}} + s^{*}) {log}^{3 / 2} p}{n \sqrt{K}} + \sqrt{s^{*}} {log}^{3 / 2} (N p) r_{\bar{β}}^{2}), \end{matrix}

where we use

{∥A α∥}_{\infty} \leq {∥A∥}_{max} {∥α∥}_{1}

,

{∥A α∥}_{\infty} \leq {∥A∥}_{\infty} {∥α∥}_{\infty}

for any matrix A and vector

α

, and

{∥A B∥}_{max} \leq {∥A∥}_{\infty} {∥B∥}_{max}

for any matrices A and B; and

\begin{matrix} {∥I_{p} - {\tilde{Θ}}_{1} Σ_{1}^{*}∥}_{max} & = & O_{p} (\sqrt{\frac{log p}{n}}), \\ {∥{\tilde{Θ}}_{1}∥}_{\infty} = max_{l} {∥{\tilde{Θ}}_{l}∥}_{1} & = & O_{P} (\sqrt{s^{*}}), \\ {∥\tilde{Θ} - Θ∥}_{\infty} = max_{l} {∥{\tilde{Θ}}_{l} - Θ_{l}∥}_{1} & = & O_{P} (s^{*} \sqrt{\frac{log p}{n}}), \\ {∥Σ_{1}^{*} - Σ_{N}^{*}∥}_{max} & = & O_{p} (\sqrt{\frac{log p}{n}}), \\ {∥\frac{X_{N}^{⊤} (τ - I_{{e_{N} \leq 0}})}{N}∥}_{\infty} & = & O_{p} (\sqrt{\frac{log p}{N}}) . \end{matrix}\}

(A23)

The proofs of (A23) are similar to the ones of the corresponding formulas in [25]. Thus,

| T - T_{0} | = O_{p} (r_{β} \sqrt{s s^{*} K {log}^{2} p} + \frac{(s \sqrt{s^{*}} + s^{*}) {log}^{3 / 2} p}{\sqrt{n}} + \sqrt{s^{*} N} {log}^{3 / 2} (p) r_{β}^{2}) .

Choosing

ζ = {(r_{β} \sqrt{s s^{*} K {log}^{2} p} + \frac{(s \sqrt{s^{*}} + s^{*}) {log}^{3 / 2} p}{\sqrt{n}} + \sqrt{s^{*} N} {log}^{3 / 2} (p) r_{β}^{2})}^{1 - κ}

for any

κ > 0

, we deduce that

P (|T - T_{0}| > ζ) = o (1) .

We also have that

ζ \sqrt{1 \lor log \frac{p}{ζ}} = o (1),

provided that

(r_{\bar{β}} \sqrt{s s^{*} K {log}^{2} p} + \frac{(s \sqrt{s^{*}} + s^{*}) {log}^{3 / 2} p}{\sqrt{n}} + \sqrt{s^{*} N} {log}^{3 / 2} (p) r_{\bar{β}}^{2}) {log}^{1 / 2 + κ} p = o (1),

which holds if

n ≫ {(s \sqrt{s^{*}} + s^{*})}^{2} {log}^{4 + 2 κ} p,

and

r_{\bar{β}} ≪ \frac{1}{\sqrt{s s^{*} K} n^{\frac{1}{4}} l o g^{3 / 2 + κ} p} .

□

Lemma A4.

\hat{T}

and

T_{0}

are defined as in (7) and (A14) respectively. Under Assumptions 1–3, provided that

n ≫ s^{*} l o g p

, we have that

| \hat{T} - T_{0} | = O_{P} (\frac{(s^{2} \sqrt{s^{*}} + s^{*}) {log}^{2.5} p}{\sqrt{n}}) .

Moreover, if

n ≫ (s^{4} s^{*} + {(s^{*})}^{2}) {log}^{6 + 2 κ} p

for some

κ > 0

, and there exist some

ξ > 0

, such that

ξ \sqrt{1 \lor log \frac{p}{ξ}} + P (| \hat{T} - T_{0} | > ξ) = o (1) .

Proof of Lemma A4.

By the proof of Lemma A3, we have

\begin{matrix} N^{- 1 / 2} | \hat{T} - T_{0} | \\ = max_{1 \leq l \leq p} |{({\hat{β}}_{N} - β^{*})}_{l} + {(Θ \nabla L_{N} (β^{*}))}_{l}| \\ = {∥{\hat{β}}_{N} - β^{*} + Θ \nabla L_{N} (β^{*})∥}_{\infty} \\ = {∥{\hat{β}}_{Z} - β^{*} + {\tilde{Θ}}_{1} \frac{X_{N}^{⊤} (τ - I_{{{\hat{e}}_{N} \leq 0}})}{N} - Θ \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ = ∥{\hat{β}}_{Z} - β^{*} + {\tilde{Θ}}_{1} (\frac{X_{N}^{⊤} (τ - I_{{{\hat{e}}_{N} \leq 0}})}{N} - \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}) \\ {+ ({\tilde{Θ}}_{1} - Θ) \frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq {∥({\hat{β}}_{Z} - β^{*}) (I_{p} - {\tilde{Θ}}_{1} Σ_{N}^{*}) - C {\tilde{Θ}}_{1} \frac{1}{N} \sum_{k = 1}^{K} \sum_{i = 1}^{n} [x_{k i} {(x_{k i}^{T} ({\hat{β}}_{Z} - β^{*}))}^{2}]∥}_{\infty} \\ + {\tilde{Θ}}_{1} O_{p} ({∥{\hat{β}}_{Z} - β^{*}∥}_{1} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) + {∥({\tilde{Θ}}_{1} - Θ)∥}_{\infty} {∥\frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ \leq {∥({\hat{β}}_{Z} - β^{*})∥}_{1} {∥I_{p} - {\tilde{Θ}}_{1} Σ_{N}^{*}∥}_{max} + C {∥{\tilde{Θ}}_{1}∥}_{\infty} max_{i k} {| x_{k i} |}^{3} {∥({\hat{β}}_{Z} - β^{*})∥}_{1}^{2} \\ + {\tilde{Θ}}_{1} O_{p} ({∥{\hat{β}}_{Z} - β^{*}∥}_{1} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) + {∥({\tilde{Θ}}_{1} - Θ)∥}_{\infty} {∥\frac{X_{N}^{⊤} (τ - I_{{e_{N}^{*} \leq 0}})}{N}∥}_{\infty} \\ = O_{p} (s \sqrt{\frac{log p}{N}}) O_{p} (\sqrt{\frac{s^{*} log p}{n}}) + O_{p} (\sqrt{s^{*}}) O_{p} ({log}^{3 / 2} p) O_{p} (s^{2} \frac{log p}{N}) \\ + O_{p} (s \sqrt{\frac{log p}{N}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{3 / 2} p}{N}) \sqrt{s^{*}} + O_{p} (s^{*} \sqrt{\frac{log p}{n}}) O_{p} (\sqrt{\frac{log p}{N}}) \end{matrix}

\begin{matrix} = O_{p} (\frac{(s^{2} \sqrt{s^{*}} + s^{*}) {log}^{2.5} p}{\sqrt{N n}}) . \end{matrix}

Therefore,

| \hat{T} - T_{0} | = \frac{(s^{2} \sqrt{s^{*}} + s^{*}) {log}^{2.5} p}{\sqrt{n}} .

Taking

ξ = {(\frac{(s^{2} \sqrt{s^{*}} + s^{*}) {log}^{2.5} p}{\sqrt{n}})}^{1 - κ}

for any

κ > 0

, we deduce that

P (|\hat{T} - T_{0}| > ζ) = o (1) .

We also have that

ξ \sqrt{1 \lor log \frac{p}{ξ}} = o (1),

provided that

(\frac{(s^{2} \sqrt{s^{*}} + s^{*}) {log}^{2.5} p}{\sqrt{n}}) {log}^{1 / 2 + κ} p = o (1),

which holds if

n ≫ (s^{4} s^{*} + {(s^{*})}^{2}) {log}^{6 + 2 κ} p .

□

Lemma A5.

The

\hat{Ω}

and

Ω_{0}

defined in (A16) and (A17), respectively. Under Assumptions 1–4, we have

{∥\hat{Ω} - Ω_{0}∥}_{max} = O_{P} (\sqrt{\frac{log p}{N}} + \frac{{log}^{2} (p N) log p}{N}) .

In addition, if

N ≫ {log}^{5 + κ} p

for some

κ > 0

, there are for some

v > 0

,

v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{max} > v) = o (1) .

Proof of Lemma A5.

This proof is similar to Lemma C.10 in [25]. We omit it. □

Lemma A6.

\bar{Ω}

and

\hat{Ω}

defined in (A20) and (A16), respectively. Under Assumptions 1–4, provided that

∥ \bar{β} - β^{*} ∥_{1} = O_{P} (r_{\bar{β}})

,

n ≫ s^{2} {log}^{3} p + s^{*} log p

,

K ≳ {log}^{2} (p K) log p

,

r_{\bar{β}} {log}^{3} p ≲ 1

,

\sqrt{n} r_{\bar{β}} ≲ 1

, we have that

{∥\bar{Ω} - \hat{Ω}∥}_{max} = O_{p} [s^{*} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + \sqrt{\frac{s^{*} log p}{n}}] .

In addition, if

n ≫ {(s^{*})}^{3} {log}^{5 + 2 κ} p

,

K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p,

and

∥ \bar{β} - β^{*} ∥_{1} ≪ \frac{1}{s^{*} \sqrt{n} {log}^{2 + κ} p}

for some

κ > 0

, then some

u > 0

exist,

u^{1 / 3} {(1 \lor log \frac{p}{u})}^{2 / 3} + P ({∥\bar{Ω} - \hat{Ω}∥}_{\max} > u) = o (1) .

Proof of Lemma A6.

By the triangle inequality, one gets that

{∥\bar{Ω} - \hat{Ω}∥}_{max} \leq {∥\bar{Ω} - Ω_{0}∥}_{max} + {∥\hat{Ω} - Ω_{0}∥}_{max} .

From Lemma A5, we only need to show

{∥\bar{Ω} - Ω_{0}∥}_{max}

.

By the definitions of

\bar{Ω}

and

Ω_{0}

, we have that

\begin{matrix} {∥\bar{Ω} - Ω_{0}∥}_{max} \\ = & ∥{\tilde{Θ}}_{1} (\frac{1}{K} \sum_{i = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T}) {\tilde{Θ}}_{1}^{T} \\ {- Θ E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{⊤}] Θ^{T}∥}_{max} \\ \leq & {∥{\tilde{Θ}}_{1} (\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]) {\tilde{Θ}}_{1}^{⊤}∥}_{max} \\ + {∥{\tilde{Θ}}_{1} E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{⊤}] {\tilde{Θ}}_{1}^{T} - Θ E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{⊤}] Θ^{T}∥}_{max} \\ \leq & {∥{\tilde{Θ}}_{1}∥}_{\infty}^{2} {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max} \\ + {∥{\tilde{Θ}}_{1} E [{(τ - I {y - x^{T} β^{*} \leq 0})}^{2} x x^{T}] {\tilde{Θ}}_{1}^{T} - Θ {E [{(τ - I {y - x^{T} β^{*} \leq 0})}^{2} x x^{T}]} Θ∥}_{max} \\ \overset{(i)}{=} & O_{p} (s^{*}) O_{p} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + τ (1 - τ) {∥{\tilde{Θ}}_{1} Σ_{x} {\tilde{Θ}}_{1}^{T} - Θ Σ_{x} Θ^{T}∥}_{max} \\ \overset{(i i)}{=} & O_{p} (s^{*}) O_{p} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + O_{p} ({(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) \\ = & O_{p} [s^{*} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}], \end{matrix}

(A24)

where (i) by Lemma A7, provided that

∥ \bar{β} - β^{*} ∥_{1} = O_{P} (r_{\bar{β}})

,

n ≫ s^{2} {log}^{3} p

,

K ≳ {log}^{2} (p K) log p

,

r_{\bar{β}} {log}^{3} p ≲ 1

,

\sqrt{n} r_{\bar{β}} ≲ 1

and

n ≫ s^{*} l o g p

,

{∥{\tilde{Θ}}_{1}∥}_{\infty} = O_{p} (\sqrt{s^{*}})

, and (ii) is obtained from the following formula, which is similar to the result of

I_{2}

in Lemma C.9 of [25]

\begin{matrix} {∥{\tilde{Θ}}_{1} Σ_{x} {\tilde{Θ}}_{1}^{T} - Θ Σ_{x} Θ^{T}∥}_{max} \\ \leq {∥{\tilde{Θ}}_{1} Σ_{x} {\tilde{Θ}}_{1}^{T} - Θ Σ_{x} {\tilde{Θ}}_{1}^{T}∥}_{max} + {∥Θ Σ_{x} {\tilde{Θ}}_{1}^{T} - Θ Σ_{x} Θ^{T}∥}_{max} \\ \leq {∥({\tilde{Θ}}_{1} - Θ) Σ_{x} {\tilde{Θ}}_{1}^{T}∥}_{max} + {∥Θ Σ_{x} ({\tilde{Θ}}_{1}^{T} - Θ^{T})∥}_{max} \\ \leq ∥ {\tilde{Θ}}_{1} {- Θ ∥}_{\infty} ∥ Σ_{x} ∥_{m a x} {∥ {\tilde{Θ}}_{1}^{T} ∥}_{1} = O_{p} ({(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) . \end{matrix}

Taking

u = {[s^{*} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}]}^{1 - κ}

with any

κ > 0

, we deduce that

P ({∥\bar{Ω} - \hat{Ω}∥}_{max} > u) = o (1) .

We also have that

u^{1 / 3} {(1 \lor log \frac{p}{u})}^{2 / 3} = o (1),

provided that

[s^{*} (\sqrt{n} r_{\bar{β}} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}] {log}^{2 + κ} p = o (1),

which holds if

n ≫ {(s^{*})}^{3} {log}^{5 + 2 κ} p,

K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p,

and

r_{\bar{β}} ≪ \frac{1}{s^{*} \sqrt{n} {log}^{2 + κ} p}

. □

Lemma A7.

Under Assumptions 1–4, provided that

{∥\bar{β} - β^{*}∥}_{1} = O_{P} (r_{\bar{β}})

, and

n ≫ s^{2} {log}^{3} p

and

r_{\bar{β}} {log}^{3} p ≲ 1

, we have that

\begin{matrix} {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max} \\ = O_{P} ((1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \sqrt{n} r_{\bar{β}} + n r_{\bar{β}}^{2} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) . \end{matrix}

Proof of Lemma A7.

By Lemma C.15 in [25], we only need to bound the following

U_{1} (\bar{β})

,

U_{2}

, and

U_{3} (\bar{β})

, which are respectively denoted as

U_{1} (\bar{β}) = {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}))}^{T} - n \nabla L_{k} (β^{*}) \nabla L_{k} {(β^{*})}^{T}∥}_{max}, U_{2} = {∥\frac{1}{K} \sum_{k = 1}^{K} n \nabla L_{k} (β^{*}) \nabla L_{k} {(β^{*})}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max}, U_{3} (\bar{β}) = n {∥\nabla L_{N} (\bar{β}) - \nabla L^{*} (\bar{β})∥}_{\infty}^{2} .

(i) For

U_{2}

, with similar arguments to

U_{2}

in Lemma C.16 of [25], we also have

U_{2} = O_{P} (\sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) .

(A25)

(ii) For

U_{3} (\bar{β})

, we first consider

\begin{matrix} {∥\nabla L_{N} (\bar{β}) - \nabla L^{*} (\bar{β})∥}_{\infty} = {∥\frac{1}{N} \sum_{i = 1}^{N} [x_{i} I {y_{i} \leq x_{i}^{T} \bar{β}} - E (x_{i} I {y_{i} \leq x_{i}^{T} \bar{β}})]∥}_{\infty} \\ \leq & {∥\frac{1}{N} \sum_{i = 1}^{N} x_{i} (I {y_{i} \leq x_{i}^{T} \bar{β}} - I {y_{i} \leq x_{i}^{T} β^{*}} - F (x_{i}^{T} \bar{β} | x_{i}) + F (x_{i}^{T} β^{*} | x_{i}))∥}_{\infty} \\ + {∥\frac{1}{N} \sum_{i = 1}^{N} x_{i} (I {y_{i} \leq x_{i}^{T} β^{*}} - E (I {y_{i} \leq x_{i}^{T} β^{*}}))∥}_{\infty} \\ = & O_{p} (r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{N}} + \frac{s {log}^{1.5} p}{N}) + O_{p} (\sqrt{\frac{log p}{N}}) . \end{matrix}

(A26)

So, we have

U_{3} (\bar{β}) = O_{p} (r_{\bar{β}}^{2} \frac{s {log}^{2} p}{K} + \frac{s^{2} {log}^{3} p}{K N} + \frac{log p}{K}) .

(A27)

(iii) For

U_{1} (\bar{β})

, we can write

U_{1} (\bar{β}) \leq U_{11} (\bar{β}) + U_{12} (\bar{β}),

(A28)

where

U_{11} (\bar{β}) = {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*})) {(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*}))}^{T}∥}_{max},

U_{12} (\bar{β}) = {∥\frac{1}{K} \sum_{k = 1}^{K} n \nabla L_{k} (β^{*}) {(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*}))}^{T}∥}_{max} .

For

U_{11} (\bar{β})

, we have

\begin{matrix} U_{11} (\bar{β}) & = & {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*})) {(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*}))}^{T}∥}_{max} \\ \leq & \frac{1}{K} \sum_{k = 1}^{K} n {∥(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*})) {(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*}))}^{T}∥}_{max} \\ = & \frac{1}{K} \sum_{k = 1}^{K} n {∥(\nabla L_{k} (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L_{k} (β^{*}))∥}_{\infty}^{2} \\ \leq & \frac{2}{K} \sum_{k = 1}^{K} n ({∥(\nabla L_{k} (\bar{β}) - \nabla L_{k} (β^{*}))∥}_{\infty}^{2} + {∥\nabla L^{*} (\bar{β}) - \nabla L^{*} (β^{*})∥}_{\infty}^{2}) \\ = & O_{p} (n r_{\bar{β}}^{2} (1 + \sqrt{\frac{s {log}^{2} p}{n}}) + n r_{\bar{β}}^{4} {log}^{3} p + \frac{s^{2} {log}^{3} p}{n}) + O_{p} (n r_{\bar{β}}^{2} + n r_{\bar{β}}^{4} {log}^{3} p) \\ = & O_{p} (n r_{\bar{β}}^{2} (1 + \sqrt{\frac{s {log}^{2} p}{n}}) + n r_{\bar{β}}^{4} {log}^{3} p + \frac{s^{2} {log}^{3} p}{n}), \\ = & O_{p} (n r_{\bar{β}}^{2}), \end{matrix}

(A29)

where we use the triangle inequality and the fact that

∥ a a^{T} ∥_{max} = {∥ a ∥}_{\infty}^{2}

for any vector a,

n ≫ s^{2} {log}^{3} p

,

r_{\bar{β}} {log}^{3} p ≲ 1

, and similar arguments to (A26).

We apply the Cauchy–Schwarz inequality on matrix inner product, that is,

〈 A, B 〉 = {∥A B^{T}∥}_{max} \leq {∥A∥}_{max}^{1 / 2} {∥B∥}_{max}^{1 / 2}

, to

U_{12} (\bar{β})

, and by (A25) and (A29) obtain

\begin{matrix} U_{12}^{2} (\bar{β}) & \leq & {∥\frac{1}{K} \sum_{k = 1}^{K} n \nabla L_{k} (β^{*}) \nabla L_{k} {(β^{*})}^{T}∥}_{max} U_{11} (\bar{β}) \\ \leq & {∥\frac{1}{K} \sum_{k = 1}^{K} n \nabla L_{k} (β^{*}) \nabla L_{k} {(β^{*})}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max} U_{11} (\bar{β}) \\ + {∥E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max} U_{11} (\bar{β}) \\ = & {[U_{2} + τ (1 - τ) ∥Σ_{x}∥]}_{max} U_{11} (\bar{β}) \\ = & O_{P} (1 + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) O_{p} (n r_{\bar{β}}^{2}) . \end{matrix}

Therefore, one obtains

U_{12} (\bar{β}) = O_{P} ((1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \sqrt{n} r_{\bar{β}}) .

(A30)

By (A28)–(A30), we have

U_{1} (\bar{β}) = O_{P} ((1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \sqrt{n} r_{\bar{β}} + n r_{\bar{β}}^{2}) .

(A31)

Last, we combine (A25), (A27), and (A31) to obtain

\begin{matrix} {∥\frac{1}{K} \sum_{k = 1}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{T} - E [\nabla L (β^{*}; z) \nabla L {(β^{*}; z)}^{T}]∥}_{max} \\ \leq & U_{1} (\bar{β}) + U_{2} + U_{3} (\bar{β}) \\ = & O_{P} ((1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \sqrt{n} r_{\bar{β}} + n r_{\bar{β}}^{2} + \sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) . \end{matrix}

□

Lemma A8.

(n+K-1-grad-Q) Under Assumptions 1–5 hold, if

n ≫ (s^{4} s^{*} + {(s^{*})}^{3}) {log}^{6 + 2 κ} p

,

n + K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p

,

N ≫ {log}^{5 + κ} p

and

{∥{\tilde{β}}^{t} - β^{*}∥}_{1} ≪ min \{\frac{1}{s^{*} (\sqrt{n} + log p) {log}^{2 + κ} p}, \frac{1}{\sqrt{s s^{*} K} n^{\frac{1}{4}} {log}^{3 / 2 + κ} p}\}

for some

κ > 0

, then we obtain that

sup_{α \in (0, 1)} |P (T \leq c_{\tilde{W}} (α)) - α| = o (1),

(A32)

sup_{α \in (0, 1)} |P (\hat{T} \leq c_{\tilde{W}} (α)) - α| = o (1) .

(A33)

Proof of Lemma A8.

By the argument in the proof of Lemma 1, if for some

κ > 0

,

N ≳ {log}^{7 + κ} p

, we have that

\begin{matrix} sup_{α \in (0, 1)} | P (T \leq c_{\tilde{W}} (α)) - α | & ≲ & \underset{(A 18)}{\underset{︸}{ζ \sqrt{1 \lor log \frac{p}{ζ}} + P (| T - T_{0} | > ζ)}} \\ + \underset{(A 19)}{\underset{︸}{v^{1 / 3} {(1 \lor log \frac{p}{v})}^{2 / 3} + P ({∥\hat{Ω} - Ω_{0}∥}_{max} > v)}} \\ + N^{- c} + \underset{L e m m a A 9}{\underset{︸}{π (u) + P ({∥\tilde{Ω} - \hat{Ω}∥}_{\max} > u)}} \\ = & o (1), \end{matrix}

(A34)

where

\begin{matrix} \tilde{Ω} : & = {cov}_{ϵ} (- {\tilde{Θ}}_{1} \frac{1}{\sqrt{n + K - 1}} (\sum_{i = 1}^{n} ϵ_{1 i} (g_{1 i} - \bar{g}) + \sum_{k = 2}^{K} ϵ_{k} \sqrt{n} (g_{k} - \bar{g}))) \\ = {\tilde{Θ}}_{1} \frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ + \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) {\tilde{Θ}}_{1}^{T} . \end{matrix}

(A35)

Applying Lemmas A3, A5, and A9, we have that some

ζ, u, v > 0

exist such that (A18), (A19), and

\begin{matrix} u^{1 / 3} {(1 \lor log \frac{p}{u})}^{2 / 3} + P ({∥\tilde{Ω} - \hat{Ω}∥}_{max} > u) = o (1) \end{matrix}

(A36)

hold, and hence, after simplifying the conditions, obtain the first result (A32) in the lemma. To obtain the second result (A33), we use Lemma A4, which yields

ξ \sqrt{1 \lor log \frac{p}{ξ}} + P (|\hat{T} - T_{0}| > ξ) = o (1)

. □

Lemma A9.

\tilde{Ω}

and

\hat{Ω}

are defined as in (A35) and (A16). In the sparse quantile regression model, under Assumptions 1 and 4, prodided that

∥ \bar{β} - β^{*} ∥_{1} = O_{P} (r_{\bar{β}})

,

n ≫ s^{2} {log}^{3} p + s^{*} log p

and

r_{\bar{β}} {log}^{\frac{3}{2}} p ≲ 1

,

(log p + \sqrt{n}) r_{\bar{β}} ≲ 1

, and

{log}^{2} (p (n + K)) log p ≲ n + K

, we have that

\begin{matrix} {∥\tilde{Ω} - \hat{Ω}∥}_{max} = O_{P} (s^{*} (\sqrt{\frac{log p}{n + K}} + \frac{{log}^{2} (p (n + K)) log d}{n + K} + (\sqrt{n} + log p) r_{\bar{β}} \\ + (\frac{n K}{n + K} + {log}^{2} p) r_{\vec{β}}^{2}) + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) . \end{matrix}

In addition, for some

κ > 0

, if

n ≫ {(s^{*})}^{3} {log}^{5 + 2 κ} p

,

n + K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p

and

{∥\bar{β} - β^{*}∥}_{1} ≪ min \{\frac{1}{s^{*} (\sqrt{n} + log p) {log}^{2 + κ} p}, \frac{1}{\sqrt{s^{*}} (\sqrt{\frac{n K}{n + K} + {log}^{2} p}) {log}^{1 + κ} p}\},

then some

u > 0

exist, so Formula (A36) holds, i.e.,

u^{1 / 3} {(1 \lor log \frac{p}{u})}^{2 / 3} + P ({∥\tilde{Ω} - \hat{Ω}∥}_{\max} > u) = o (1) .

Proof of Lemma A9.

Note by the triangle inequality that

{∥\tilde{Ω} - \hat{Ω}∥}_{max} \leq {∥\tilde{Ω} - Ω_{0}∥}_{max} + {∥\hat{Ω} - Ω_{0}∥}_{max},

where

Ω_{0}

is defined as in (A17). By the proof of Lemma A5, we have that

{∥\hat{Ω} - Ω_{0}∥}_{max} = O_{P} (\sqrt{\frac{log p}{N}} + \frac{{log}^{2} (p N) log p}{N}) .

Next, we bound

{∥\tilde{Ω} - Ω_{0}∥}_{max}

using the same argument as in the proof of Lemma A6. By definitions of

\tilde{Ω}

and

Ω_{0}

, we have that

\begin{matrix} {∥\tilde{Ω} - Ω_{0}∥}_{max} \\ = ∥{\tilde{Θ}}_{1} \frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) {\tilde{Θ}}_{1}^{⊤} - Θ E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}] Θ∥}_{max} \\ \leq ∥{\tilde{Θ}}_{1} (\frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]) {\tilde{Θ}}_{1}^{⊤}∥}_{max} \\ + {∥{\tilde{Θ}}_{1} E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}] {\tilde{Θ}}_{1}^{⊤} - Θ E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}] Θ∥}_{max} \\ = I_{1}^{'} (\bar{β}) + I_{2} . \end{matrix}

We have shown in the proof of Lemma A6 that

\begin{matrix} I_{2} & = {∥{\tilde{Θ}}_{1} E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}] {\tilde{Θ}}_{1}^{⊤} - Θ E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}] Θ∥}_{max} \\ = O_{P} ({(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) . \end{matrix}

To bound

I_{1}^{'} (\bar{β})

, we note that

\begin{matrix} I_{1}^{'} (\bar{β}) & = ∥{\tilde{Θ}}_{1} (\frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]) {\tilde{Θ}}_{1}^{⊤}∥}_{max} \\ \leq {∥{\tilde{Θ}}_{1}∥}_{\infty}^{2} ∥\frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{i 1}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{k} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]∥}_{max} \\ \overset{(i)}{=} O_{p} (s^{*}) O_{p} (\sqrt{\frac{log p}{n + K}} + \frac{{log}^{2} (p (n + K)) log p}{n + K} + (log p + \sqrt{n}) r_{\bar{β}} + (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2}) \\ = O_{P} (s^{*} (\sqrt{\frac{log p}{n + K}} + \frac{{log}^{2} (p (n + K)) log p}{n + K} + (log p + \sqrt{n}) r_{\bar{β}} + (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2})), \end{matrix}

where (i) by Lemma A10 and if

n ≫ s^{*} log p

,

{∥{\tilde{Θ}}_{1}∥}_{\infty} = O_{p} (\sqrt{s^{*}})

, under Assumptions (A1) and (A2), provided that

∥ \bar{β} - β^{*} ∥_{1} = O_{P} (r_{\bar{β}})

,

(log p + \sqrt{n}) r_{\bar{β}} ≲ 1

, and

{log}^{2} (p (n + K)) log p ≲ n + K

.

Putting all the preceding bounds together, we obtain that

\begin{matrix} {∥\tilde{Ω} - Ω_{0}∥}_{max} & = O_{P} (s^{*} \sqrt{\frac{log p}{n + K}} + s^{*} \frac{{log}^{2} (p (n + K)) log p}{n + K} + s^{*} (log p + \sqrt{n}) r_{\bar{β}} \\ + s^{*} (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) \end{matrix}

and

\begin{matrix} {∥\tilde{Ω} - \hat{Ω}∥}_{max} = & O_{P} (s^{*} \sqrt{\frac{log p}{n + K}} + s^{*} \frac{{log}^{2} (p (n + K)) log p}{n + K} + s^{*} (log p + \sqrt{n}) r_{\bar{β}} \\ + s^{*} (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) . \end{matrix}

Choosing

\begin{matrix} u = & (s^{*} \sqrt{\frac{log p}{n + K}} + s^{*} \frac{{log}^{2} (p (n + K)) log p}{n + K} + s^{*} (log p + \sqrt{n}) r_{\bar{β}} \\ {+ s^{*} (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}})}^{1 - κ} \end{matrix}

with any

κ > 0

, we deduce that

P ({∥\tilde{Ω} - \hat{Ω}∥}_{max} > u) = o (1) .

We also have that

u^{\frac{1}{3}} {(1 \lor l o g (\frac{p}{u}))}^{\frac{2}{3}} = o (1),

provided that

\begin{matrix} (s^{*} \sqrt{\frac{log p}{n + K}} + s^{*} \frac{{log}^{2} (p (n + K)) log p}{n + K} + s^{*} (log p + \sqrt{n}) r_{\bar{β}} \\ + s^{*} (\frac{n K}{n + K} + {log}^{2} p) r_{\bar{θ}}^{2} + {(s^{*})}^{\frac{3}{2}} \sqrt{\frac{log p}{n}}) {log}^{2 + κ} p = o (1), \end{matrix}

which holds if

n ≫ {(s^{*})}^{3} {log}^{5 + 2 κ} p,

n + K ≫ {(s^{*})}^{2} {log}^{5 + 2 κ} p

and

r_{\bar{β}} ≪ min \{\frac{1}{s^{*} (\sqrt{n} + log p) {log}^{2 + κ} p}, \frac{1}{\sqrt{s^{*}} (\sqrt{\frac{n K}{n + K} + {log}^{2} p}) {log}^{1 + κ} p}\} .

□

Lemma A10.

In sparse quantile regression, under Assumptions 1–4, provided that

{∥\bar{β} - β^{*}∥}_{1} = O_{P} (r_{\bar{β}})

, and

n ≫ s^{2} {log}^{3} p + s * l o g p

and

r_{\bar{β}} {log}^{\frac{3}{2}} p ≲ 1

, we have

\begin{matrix} ∥\frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]∥}_{max} \\ = O_{P} (\sqrt{\frac{log p}{n + K}} + \frac{{log}^{2} (p (n + K)) log p}{n + K} + (\frac{n K}{n + K} + \frac{s l o g^{2} p}{n + K} + (1 + {log}^{2} p)) r_{\bar{β}}^{2} \\ + ((log p + \sqrt{n}) + \frac{{log}^{1 / 4} p (log p + \sqrt{n})}{{(n + K)}^{1 / 4}} + (log p + \sqrt{n}) \sqrt{\frac{{log}^{2} (p (n + K)) log p}{n + K}}) r_{\bar{β}}) . \end{matrix}

Proof of Lemma A10.

By Lemmas 31 and 32 in [25], we only need to bound the following

V_{1} (\bar{β}), V_{1}^{'} (\bar{β}), V_{2}, V_{2}^{'}, and V_{3} (\bar{β})

, which are respectively denoted as

V_{1} (\bar{β}) = \frac{k - 1}{n + k - 1} {∥\frac{1}{k - 1} \sum_{j = 2}^{k} n (\nabla L_{j} (\bar{β}) - \nabla L^{*} (\bar{β})) {(\nabla L_{j} (\bar{β}) - \nabla L^{*} (\bar{β}))}^{⊤} - n \nabla L_{j} (β^{*}) \nabla L_{j} {(β^{*})}^{⊤}∥}_{max}, V_{1}^{'} (\bar{β}) = \frac{n}{n + K - 1} {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}) - \nabla L^{*} (\bar{β})) {(\nabla L (\bar{β}) - \nabla L^{*} (\bar{β}))}^{⊤} - \nabla L (β^{*}) L {(β^{*})}^{⊤}∥}_{m a x}, V_{2} = \frac{k - 1}{n + k - 1} {∥\frac{1}{k - 1} \sum_{j = 2}^{k} n \nabla L_{k} (β^{*}) \nabla L_{j} {(β^{*})}^{⊤} - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]∥}_{\max}, V_{2}^{'} = \frac{n}{n + K - 1} {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L (β^{*}; Z_{1 i}) \nabla L {(β^{*}; Z_{1 i})}^{⊤} - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]∥}_{max}, V_{3} (\bar{β}) = \frac{n K}{N + K - 1} {∥ \nabla L_{N} (\bar{β}) - \nabla L^{*} (\bar{β}) ∥}_{\infty}^{2} .

(i) For

V_{1} (\bar{β})

,

V_{2}

, and

V_{3} (\bar{β})

, with similar arguments to

V_{1} (\bar{β})

,

V_{2}

, and

V_{3} (\bar{β})

in Lemma 32 of [25], and by the proof of Lemma A7, we have

\begin{matrix} V_{1} (\bar{β}) & = \frac{K - 1}{n + K - 1} O_{P} (n r_{\bar{β}}^{2} + (1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \sqrt{n} r_{\bar{β}}) \\ = O_{P} (\frac{K}{n + K} n r_{\bar{β}}^{2} + (1 + {(\frac{log p}{K})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p K) log p}{K}}) \frac{K}{n + K} \sqrt{n} r_{\bar{β}}), \end{matrix}

\begin{matrix} V_{2} = \frac{K - 1}{n + K - 1} O_{P} (\sqrt{\frac{log p}{K}} + \frac{{log}^{2} (p K) log p}{K}) = O_{P} (\frac{\sqrt{K log p}}{n + K} + \frac{{log}^{2} (p K) log p}{n + K}), \end{matrix}

\begin{matrix} V_{3} (\bar{β}) = \frac{n K}{n + K - 1} O_{P} (\frac{s {log}^{2} p}{N} r_{\bar{β}}^{2} + \frac{log p}{N} + \frac{s^{2} {log}^{3} p}{N^{2}}) = O_{P} (\frac{s {log}^{2} p}{n + K} r_{\bar{β}}^{2} + \frac{log p}{n + K} + \frac{s^{2} {log}^{3} p}{(n + K) N}) . \end{matrix}

(ii) To bound

V_{2}^{'}

, with similar arguments to

V_{2}^{'}

in Lemma 32 of [25], we have

V_{2}^{'} = \frac{n}{n + K - 1} O_{P} (\sqrt{\frac{log p}{n}} + \frac{{log}^{2} (p K) log p}{n}) = O_{P} (\frac{\sqrt{n log p}}{n + K} + \frac{{log}^{2} (p n) log p}{n + K}) .

(iii) For

V_{1}^{'} (\bar{β})

, we use the same argument as in bounding

U_{1} (\bar{β})

in the proof of Lemma A7. We write

\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β})

as

(\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}; Z_{1 i})) + \nabla L (β^{*}; Z_{1 i})

, and obtain by the triangle inequality that

\begin{matrix} \frac{n + k - 1}{n} V_{1}^{'} (\bar{β}) & = {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}) - \nabla L^{*} (\bar{β})) {(\nabla L (\bar{β}) - \nabla L^{*} (\bar{β}))}^{⊤} - \nabla L (β^{*}) L {(β^{*})}^{⊤}∥}_{max} \\ = ∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}) + \nabla L (β^{*})) \\ {{(\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}) + \nabla L (β^{*}))}^{⊤} - \nabla L (β^{*}) L {(β^{*})}^{⊤}∥}_{max} \\ \leq V_{11}^{'} (\bar{β}) + 2 V_{12}^{'} (\bar{β}), \end{matrix}

where

\begin{matrix} V_{11}^{'} (\bar{β}) & = {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}; Z_{1 i})) {(\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}; Z_{1 i}))}^{T}∥}_{max}, \end{matrix}

\begin{matrix} V_{12}^{'} (\bar{β}) & = {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L (β^{*}; Z_{1 i}) {(\nabla L (\bar{β}; Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}; Z_{1 i}))}^{T}∥}_{max} . \end{matrix}

It remains to bound

V_{11}^{'} (\bar{β})

. We have

\begin{matrix} V_{11}^{'} (\bar{β}) & = {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}, Z_{1 i})) {(\nabla L (\bar{β}, Z_{1 i}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}, Z_{1 i}))}^{T}∥}_{max} \\ \leq {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{i 1})) {(\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i}))}^{T}∥}_{max} \\ + 2 {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i})) \nabla L^{*} {(\bar{β})}^{T}∥}_{max} + {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L^{*} (\bar{β}) \nabla L^{*} {(\bar{β})}^{T}∥}_{max} \\ = F_{1} + F_{2} + F_{3} . \end{matrix}

For

F_{3}

, we have

{∥ Θ ∥}_{m a x} ≲ {∥ Θ ∥}_{2} = O (1)

and

\begin{matrix} ∥ \nabla L^{*} (\bar{β}) ∥_{\infty} & = ∥ \nabla L^{*} (\bar{β}) - \nabla L^{*} (β^{*}) ∥_{\infty} \\ = ∥ E [x (I (y \leq x^{T} \bar{β}) - I (y \leq x^{T} β^{*}))] ∥_{\infty} \\ = ∥ E [x (F (x^{T} \bar{β} | x) - F (x^{T} β^{*} | x))] ∥_{\infty} \\ \leq {∥E [x (f (x^{T} β^{*} | x) x^{T} (\bar{β} - β^{*}) + C {(x^{T} (\bar{β} - β^{*}))}^{2})]∥}_{\infty} \\ \leq {∥E [x x^{T} f (x^{T} β^{*} | x) (\bar{β} - β^{*})]∥}_{\infty} + C {∥x {(x^{T} (\bar{β} - β^{*}))}^{2}∥}_{\infty} \\ \leq r_{\bar{β}} {∥E [x x^{T} f (x^{T} β^{*} | x) (\bar{β} - β^{*})]∥}_{max} + r_{\bar{β}}^{2} {log}^{\frac{3}{2}} p \\ = O_{P} (r_{\bar{β}} + r_{\bar{β}}^{2} {log}^{\frac{3}{2}} p), \end{matrix}

then

\begin{matrix} F_{3} = {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L^{*} (\bar{β}) \nabla L^{*} {(\bar{β})}^{T}∥}_{max} = {∥\nabla L^{*} (\bar{β}) \nabla L^{*} {(\bar{β})}^{T}∥}_{max} \leq {∥ \nabla L^{*} (\bar{β}) ∥}_{\infty}^{2} = O_{P} (r_{\bar{β}}^{2} + r_{\bar{β}}^{4} {log}^{3} p) . \end{matrix}

For

F_{2}

, similar to the proof of (A23), we have

\begin{matrix} {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} (I (y \leq x^{T} \bar{β}) - I (y \leq x^{T} β^{*}))∥}_{\infty} = O_{P} (\frac{s {log}^{\frac{3}{2}} p}{n} + r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{n}} + r_{\bar{β}}^{2} {log}^{\frac{3}{2}} p), \end{matrix}

then

\begin{matrix} F_{2} = & 2 {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i})) \nabla L^{*} {(\bar{β})}^{T}∥}_{max} \\ \leq ∥ \nabla L^{*} (\bar{β}) ∥_{\infty} {∥ \frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i})) ∥}_{max} \\ = ∥ \nabla L^{*} (\bar{β}) ∥_{\infty} {∥ \frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i})) ∥}_{\infty} \\ = ∥ \nabla L^{*} (\bar{β}) ∥_{\infty} {∥ \frac{1}{n} \sum_{i = 1}^{n} x_{1 i} (I (y \leq x^{T} \bar{β}) - I (y \leq x^{T} β^{*})) ∥}_{\infty} \end{matrix}

\begin{matrix} = O_{P} (r_{\bar{β}} + r_{\bar{β}}^{2} {log}^{\frac{3}{2}} p) O_{P} (\frac{s {log}^{\frac{3}{2}} p}{n} + r_{\bar{β}} \sqrt{\frac{s {log}^{2} p}{n}} + r_{\bar{β}}^{2} {log}^{\frac{3}{2}} p) \\ = O_{p} (r_{\bar{β}} \frac{s {log}^{\frac{3}{2}} p}{n} + r_{\bar{β}}^{2} \sqrt{\frac{s^{2} {log}^{6} p}{n}} + r_{\bar{β}}^{3} {log}^{\frac{3}{2}} p + r_{\bar{β}}^{4} {log}^{3} p) . \end{matrix}

For

F_{1}

, we have

\begin{matrix} F_{1} = & {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i})) {(\nabla L (\bar{β}, Z_{1 i}) - \nabla L (β^{*}, Z_{1 i}))}^{T}∥}_{max} \\ = {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} {(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2}∥}_{max} \\ = ∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} {(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2} \\ {- x_{1 i} x_{1 i}^{T} E [{(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2}]∥}_{max} \\ + {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} E [{(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2}]∥}_{max} \\ \leq {(\sqrt{log p})}^{2} {∥\frac{1}{n} \sum_{i = 1}^{n} {(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I_{{y_{1 i} \leq x_{1 i}^{T} β^{*}}})}^{2} - E [{(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2}]∥}_{max} \\ + {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} E [{(I (y_{1 i} \leq x_{1 i}^{T} \bar{β}) - I (y_{1 i} \leq x_{1 i}^{T} β^{*}))}^{2}]∥}_{max} \\ \leq log p \sqrt{\frac{log p}{n}} + {∥\frac{1}{n} \sum_{i = 1}^{n} x_{1 i} x_{1 i}^{T} (F (x_{1 i}^{T} \bar{β} | x_{1 i}) - 2 F (x_{1 i}^{T} \bar{β} \land x_{1 i}^{T} β^{*} | x_{1 i}) + F (x_{1 i}^{T} β^{*} | x_{1 i}))∥}_{max} \\ \leq log p \sqrt{\frac{log p}{n}} + r_{\bar{β}} \sqrt{\frac{{log}^{2} p}{n}} + r_{\bar{β}}^{2} {log}^{2} p \\ = O_{p} (\sqrt{\frac{{log}^{3} p}{n}} + r_{\bar{β}} \sqrt{\frac{{log}^{2} p}{n}} + r_{\bar{β}}^{2} {log}^{2} p) . \end{matrix}

And then we can obtain that

\begin{matrix} V_{11}^{'} (\bar{β}) & = F_{1} + F_{2} + F_{3} \\ = O_{p} (\sqrt{\frac{{log}^{3} p}{n}} + r_{\bar{β}} \sqrt{\frac{{log}^{2} p}{n}} + r_{\bar{β}}^{2} {log}^{2} p) \\ + O_{p} (r_{\bar{β}} \frac{s {log}^{\frac{3}{2}} p}{n} + r_{\bar{β}}^{2} \sqrt{\frac{s^{2} {log}^{6} p}{n}} + r_{\bar{β}}^{3} {log}^{\frac{3}{2}} p + r_{\bar{β}}^{4} {log}^{3} p) + O_{P} (r_{\bar{β}}^{2} + r_{\bar{β}}^{4} {log}^{3} p) \\ = O_{p} ((1 + {log}^{2} p) r_{\bar{β}}^{2}), \end{matrix}

where provided that

n ≫ s^{2} {log}^{3} p

,

r_{\bar{β}} {log}^{\frac{3}{2}} p ≲ 1

.

Applying the Cauchy–Schwarz inequality and the result of

V_{11}^{'} (\bar{β})

and

V_{2}^{'}

, we obtain that

\begin{matrix} V_{12}^{'} (\bar{β}) & \leq {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L (β^{*}) \nabla L {(β^{*})}^{⊤}∥}_{max}^{1 / 2} \\ * {∥\frac{1}{n} \sum_{i = 1}^{n} (\nabla L (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*})) {(\nabla L (\bar{β}) - \nabla L^{*} (\bar{β}) - \nabla L (β^{*}))}^{T}∥}_{max}^{1 / 2} \\ \leq {∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L (β^{*}) \nabla L {(β^{*})}^{⊤}∥}_{max}^{1 / 2} V_{11}^{'} {(\bar{β})}^{1 / 2} \\ \leq [{∥\frac{1}{n} \sum_{i = 1}^{n} \nabla L (β^{*}) \nabla L {(β^{*})}^{⊤} - E [\nabla L (β^{*}, Z) \nabla L {(β^{*}, Z)}^{T}]∥}_{max} \\ {+ {∥E [\nabla L (β^{*}, Z) \nabla L {(β^{*}, Z)}^{T}]∥}_{max}]}^{1 / 2} V_{11}^{'} {(\bar{β})}^{1 / 2} \\ = {[\frac{n + K - 1}{n} V_{2}^{'} + τ (1 - τ) {∥Σ_{x}∥}_{m a x}]}^{1 / 2} V_{11}^{'} {(\bar{β})}^{1 / 2} \\ = O_{P} (1 + \sqrt{\frac{log p}{n}} + \frac{{log}^{2} (p n) log p}{n}) O_{p} {((1 + {log}^{2} p) r_{\bar{β}}^{2})}^{1 / 2} \\ = O_{P} ((1 + {(\frac{log p}{n})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p n) log p}{n}}) (1 + log p) r_{\bar{β}}), \end{matrix}

\begin{matrix} V_{1}^{'} (\bar{β}) & = \frac{n}{n + K - 1} (V_{11}^{'} (\bar{β}) + 2 V_{12}^{'} (\bar{β})) \\ = \frac{n}{n + K - 1} O_{P} ((1 + {(\frac{log p}{n})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p n) log p}{n}}) (1 + log p) r_{\bar{β}} + (1 + {log}^{2} p) r_{\bar{β}}^{2}) \\ = O_{P} ((1 + {(\frac{log p}{n})}^{1 / 4} + \sqrt{\frac{{log}^{2} (p n) log p}{n}}) (1 + log p) \frac{n}{n + K} r_{\bar{β}} + (1 + {log}^{2} p) \frac{n}{n + K} r_{\bar{β}}^{2}) . \end{matrix}

Finally, we have

\begin{matrix} ∥\frac{1}{n + K - 1} (\sum_{i = 1}^{n} (\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β})) {(\nabla L (\bar{β}; Z_{1 i}) - \nabla L_{N} (\bar{β}))}^{⊤} \\ {+ \sum_{k = 2}^{K} n (\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β})) {(\nabla L_{k} (\bar{β}) - \nabla L_{N} (\bar{β}))}^{⊤}) - E [\nabla L (β^{*}; Z) \nabla L {(β^{*}; Z)}^{⊤}]∥}_{max} \\ = O_{P} (\sqrt{\frac{log p}{n + K}} + \frac{{log}^{2} (p (n + K)) log p}{n + K} + (\frac{n K}{n + K} + (1 + {log}^{2} p) + \frac{s {log}^{2} p}{n + K}) r_{\bar{β}}^{2} \\ + ((log p + \sqrt{n}) + \frac{{log}^{1 / 4} p (log p + \sqrt{n})}{{(n + K)}^{1 / 4}} + (log p + \sqrt{n}) \sqrt{\frac{{log}^{2} (p (n + K)) log p}{n + K}}) r_{\bar{β}}) . \end{matrix}

□

Lemma A11.

In the high-dimensional quantile regression model, under Assumption 1, if

n ≫ s^{*} log p

, we have that

{∥\tilde{Θ}∥}_{\infty} = O_{P} (\sqrt{s^{*}}),

{∥\tilde{Θ} - Θ∥}_{\infty} = O_{P} (s^{*} \sqrt{\frac{log p}{n}}),

{∥\tilde{Θ} \frac{X_{1}^{⊤} X_{1}}{n} - I_{p}∥}_{max} = O_{P} (\sqrt{\frac{log p}{n}}),

max_{l} {∥{\tilde{Θ}}_{l} - Θ_{l}∥}_{2} = O_{P} (\sqrt{\frac{s^{*} log p}{n}}) .

Proof of Lemma A11.

In the high-dimensional setting,

\tilde{Θ}

is constructed using nodewise Lasso. We obtain the bounds from the proof of Lemma 5.3 and Theorem 2.4 of [2]. □

Appendix C. Additional Experimental Results

Figure A1 and Figure A2, respectively, depict the relationship between communication rounds and estimation error when the number of machines

K = 5

.

Figure A1. Comparison of three methods in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows a normal distribution. The number of machines

K = 5

; sparsity levels

s = 4

and

s = 8

. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure A1. Comparison of three methods in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows a normal distribution. The number of machines

K = 5

; sparsity levels

s = 4

and

s = 8

. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure A2. Comparison of three methods in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows

t (2)

. The number of machines

K = 5

; sparsity levels

s = 4

and

s = 8

. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure A2. Comparison of three methods in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows

t (2)

. The number of machines

K = 5

; sparsity levels

s = 4

and

s = 8

. The x-axis is the number of iterations or the rounds of communications, and the y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure A3 and Figure A4, respectively, indicate the coverage probability and estimation width ratio for the noise with normal and

t (2)

distribution when the confidence level is

90 %

.

Figure A5 and Figure A6 depict the

90 %

confidence bands constructed using the proposed method.

Figure A3. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method, when confidence level is

90 %

.

s = 4

and

s = 8

; noise

ϵ \sim N (0, 0 . 5^{2})

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure A3. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method, when confidence level is

90 %

.

s = 4

and

s = 8

; noise

ϵ \sim N (0, 0 . 5^{2})

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure A4. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method, when confidence level is

90 %

.

s = 4

and

s = 8

; noise

ϵ \sim t (2)

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure A4. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method, when confidence level is

90 %

.

s = 4

and

s = 8

; noise

ϵ \sim t (2)

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure A5. Confidence interval of non-zero elements with different quantile levels calculated by “k-grad-Q” and “n+K-1-grad-Q” for confidence level is

90 %

, when noise

ϵ \sim N (0, 0 . 5^{2})

. The quantile levels

τ = {0.25, 0.5, 0.75}

, and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure A5. Confidence interval of non-zero elements with different quantile levels calculated by “k-grad-Q” and “n+K-1-grad-Q” for confidence level is

90 %

, when noise

ϵ \sim N (0, 0 . 5^{2})

. The quantile levels

τ = {0.25, 0.5, 0.75}

, and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure A6. Confidence interval of non-zero elements with different quantile levels calculated by “k-grad-Q” and “n+K-1-grad-Q” for confidence level is

90 %

, when noise

ϵ \sim t (2)

. The quantile levels

τ = {0.25, 0.5, 0.75}

, and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure A6. Confidence interval of non-zero elements with different quantile levels calculated by “k-grad-Q” and “n+K-1-grad-Q” for confidence level is

90 %

, when noise

ϵ \sim t (2)

. The quantile levels

τ = {0.25, 0.5, 0.75}

, and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

References

Mcdonald, R.; Mohri, M.; Silberman, N.; Walker, D.; Mann, G. Efficient large-scale distributed training of conditional maximum entropy models. Adv. Neural Inf. Process. Syst. 2009, 22, 1231–1239. [Google Scholar]
Van de Geer, S.; Bühlmann, P.; Ritov, Y.; Dezeure, R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist. 2014, 42, 1166–1202. [Google Scholar] [CrossRef]
Wang, J.; Kolar, M.; Srebro, N.; Zhang, T. Efficient distributed learning with sparsity. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; Volume 70, pp. 3636–3645. [Google Scholar]
Jordan, M.I.; Lee, J.D.; Yang, Y. Communication-efficient distributed statistical inference. J. Am. Stat. Assoc. 2019, 114, 668–681. [Google Scholar] [CrossRef]
Wang, L.; Lian, H. Communication-efficient estimation of high-dimensional quantile regression. Anaysis Appl. 2020, 18, 1057–1075. [Google Scholar] [CrossRef]
Tong, J.; Duan, R.; Li, R.; Scheuemie, M.J.; Moore, J.H.; Chen, Y. Robust-ODAL: Learning from heterogeneous health systems without sharing patient-level data. In Proceedings of the Pacific Symposium on Biocomputing 2020, Fairmont Orchid, HI, USA, 3–7 January 2020; pp. 695–706. [Google Scholar]
Zhou, X.C.; Le, C.; Xu, P.F.; Lv, S.G. Communication-efficient Byzantine-robust distributed learning with statistical guarantee. Pattern Recognit. 2023, 137, 109312. [Google Scholar] [CrossRef]
Koenker, R.; Bassett, G. Regression quantiles. Econometrica 1978, 46, 33–50. [Google Scholar] [CrossRef]
Koenker, R. Quantile Regression; Cambridge University Press: Cambridge, UK, 2005. [Google Scholar]
Scheetz, T.E.; Kim, K.Y.; Swiderski, R.E.; Philp, A.R.; Braun, T.A.; Knudtson, K.L.; Dorrance, A.M.; DiBona, G.F.; Huang, J.; Casavant, T.L.; et al. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proc. Natl. Acad. Sci. USA 2006, 103, 14429–14434. [Google Scholar] [CrossRef]
Wang, L.; Wu, Y.; Li, R. Quantile regression for analyzing heterogeneity in ultra-high dimension. J. Am. Stat. Assoc. 2012, 107, 214–222. [Google Scholar] [CrossRef]
Belloni, A.; Chernozhukov, V. ℓ₁-Penalized quantile regression in high dimensional sparse models. Ann. Stat. 2011, 39, 82–130. [Google Scholar] [CrossRef]
Yu, L.; Lin, N.; Wang, L. A parallel algorithm for large-scale nonconvex penalized quantile regression. J. Comput. Graph. Stat. 2017, 26, 935–939. [Google Scholar] [CrossRef]
Chen, X.; Liu, W.; Zhang, Y. Quantile regression under memory constraint. Ann. Statist. 2019, 47, 3244–3273. [Google Scholar] [CrossRef]
Chen, X.; Liu, W.; Mao, X.; Yang, Z. Distributed High-dimensional Regression Under a Quantile Loss Function. J. Mach. Learn. Res. 2020, 21, 1–43. [Google Scholar]
Hu, A.; Jiao, Y.; Liu, Y.; Shi, Y.; Wu, Y. Distributed quantile regression for massive heterogeneous data. Neurocomputing 2021, 448, 249–262. [Google Scholar] [CrossRef]
Volgushev, S.; Chao, S.K.; Cheng, G. Distributed inference for quantile regression processes. Ann. Statist. 2019, 47, 1634–1662. [Google Scholar] [CrossRef]
Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
Efron, B.; Tibshirani, R. An Introduction to the Bootstrap; Chapman & Hall/CRC: Boca Raton, FL, USA, 1993. [Google Scholar]
Dezeure, R.; Bühlmann, P.; Zhang, C.H. High-Dimensional Simultaneous Inference with the Bootstrap; Springer: Berlin/Heidelberg, Germany, 2017; Volume 26, pp. 685–719. [Google Scholar]
Zhang, X.; Cheng, G. Simultaneous inference for high-dimensional linear models. J. Am. Stat. Assoc. 2017, 112, 757–768. [Google Scholar] [CrossRef]
Chernozhukov, V.; Chetverikov, D.; Kato, K. Gaussian approximations and multiplier bootstrap for maxima of sums of high-dimensional random vectors. Ann. Stat. 2013, 41, 2786–2819. [Google Scholar] [CrossRef]
Kleiner, A.; Talwalkar, A.; Sarkar, P.; Jordan, M.I. A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. Stat. Methodol. 2014, 795–816. [Google Scholar] [CrossRef]
Yu, Y.; Chao, S.K.; Cheng, G. Simultaneous inference for massive data: Distributed bootstrap. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 10892–10901. [Google Scholar]
Yu, Y.; Chao, S.K.; Cheng, G. Distributed bootstrap for simultaneous inference under high dimensionality. J. Mach. Learn. Res. 2022, 23, 8819–8895. [Google Scholar]
Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends® Mach. Learn. 2011, 1, 1–122. [Google Scholar]
Gu, Y.; Fan, J.; Kong, L.; Ma, S.; Zou, H. ADMM for high-dimensional sparse penalized quantile regression. Technometrics 2018, 60, 319–331. [Google Scholar] [CrossRef]
Tan, K.M.; Battey, H.; Zhou, W.X. Communication-constrained distributed quantile regression with optimal statistical guarantees. J. Mach. Learn. Res. 2022, 23, 1–61. [Google Scholar]
van der Vaart, A.W.; Wellner, J.A. Weak Convergence and Empirical Processes; Springer: New York, NY, USA, 1996. [Google Scholar]
Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 2009, 37, 1705–1732. [Google Scholar] [CrossRef]
Chatterjee, A.; Lahiri, S.N. Bootstrapping Lasso estimators. J. Am. Stat. Assoc. 2011, 106, 608–625. [Google Scholar] [CrossRef]
Liu, H.; Xu, X.; Li, J.J. A bootstrap lasso + partial ridge method to construct confidence intervals for parameters in high-dimensional sparse linear models. Stat. Sin. 2020, 30, 1333–1355. [Google Scholar] [CrossRef]

Figure 1. Comparison of three methods (Q-CSL-ADMM, Q-Oracle, and Q-Avg) in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows normal distribution. Number of machines

K = {10, 20}

, and sparsity levels

s = 4

and

s = 8

. The x-axis is the rounds of communications, and y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure 1. Comparison of three methods (Q-CSL-ADMM, Q-Oracle, and Q-Avg) in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows normal distribution. Number of machines

K = {10, 20}

, and sparsity levels

s = 4

and

s = 8

. The x-axis is the rounds of communications, and y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure 2. Comparison of three methods (Q-CSL-ADMM, Q-Oracle, and Q-Avg) in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows

t (2)

distribution. Number of machines

K = {10, 20}

and sparsity levels

s = 4

and

s = 8

. The x-axis is the rounds of communications, and y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure 2. Comparison of three methods (Q-CSL-ADMM, Q-Oracle, and Q-Avg) in terms of estimation errors for different quantile levels

τ = {0.25, 0.5, 0.75}

, when noise follows

t (2)

distribution. Number of machines

K = {10, 20}

and sparsity levels

s = 4

and

s = 8

. The x-axis is the rounds of communications, and y-axis is the estimation error

∥ β_{t} - β^{*} ∥_{2}

.

Figure 3. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method when confidence level is

95 %

.

s = 4

and

s = 8

; noise

ϵ \sim N (0, 0 . 5^{2})

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure 3. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method when confidence level is

95 %

.

s = 4

and

s = 8

; noise

ϵ \sim N (0, 0 . 5^{2})

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure 4. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method when confidence level is

95 %

.

s = 4

and

s = 8

; noise

ϵ \sim t (2)

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure 4. Empirical coverage probability and average width ratio of simultaneous confidence interval with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q” method when confidence level is

95 %

.

s = 4

and

s = 8

; noise

ϵ \sim t (2)

; and quantile levels

τ = {0.25, 0.5, 0.75}

.

Figure 5. Confidence Interval (CI) of non-zero elements with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q”, when confidence level is

95 %

. The quantile levels

τ = {0.25, 0.5, 0.75}

; noise

ϵ \sim N (0, 0 . 5^{2})

; and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure 5. Confidence Interval (CI) of non-zero elements with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q”, when confidence level is

95 %

. The quantile levels

τ = {0.25, 0.5, 0.75}

; noise

ϵ \sim N (0, 0 . 5^{2})

; and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure 6. Confidence Interval (CI) of non-zero elements with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q”, when confidence level is

95 %

. The quantile levels

τ = {0.25, 0.5, 0.75}

; noise

ϵ \sim t (2)

; and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Figure 6. Confidence Interval (CI) of non-zero elements with different quantile levels by “k-grad-Q” and “n+K-1-grad-Q”, when confidence level is

95 %

. The quantile levels

τ = {0.25, 0.5, 0.75}

; noise

ϵ \sim t (2)

; and true parameter

β_{5}^{*} = 1

. The blue, red and black lines indicate the true parameter, our estimator and oracle estimator, respectively.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhou, X.; Jing, Z.; Huang, C. Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression. Mathematics 2024, 12, 735. https://doi.org/10.3390/math12050735

AMA Style

Zhou X, Jing Z, Huang C. Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression. Mathematics. 2024; 12(5):735. https://doi.org/10.3390/math12050735

Chicago/Turabian Style

Zhou, Xingcai, Zhaoyang Jing, and Chao Huang. 2024. "Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression" Mathematics 12, no. 5: 735. https://doi.org/10.3390/math12050735

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression

Abstract

1. Introduction

Notations

2. Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Learning

2.1. Problem Formulation of Distributed Quantile Learning

2.2. Distributed Bootstrap for Simultaneous Inference

2.3. Q-ADMM-CSL Algorithm and NodeLasso Optimization

2.3.1. Q-ADMM-CSL Algorithm for Penalized Quantile Regression in CSL

2.3.2. NodeLasso Algorithm to Approximate Inverse Hessian Matrix

3. Theoretical Analysis

4. Simulation Experiments

4.1. Parameter Estimation

4.2. Simultaneous Confidence Intervals

5. Conclusions and Discussions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Appendix A. Proof of Theorems

Appendix A.1. Proof of Theorem 1

Appendix A.2. Proof of Theorem 2

Appendix A.3. Proof of Theorem 3

Appendix B. Lemmas and Their Proofs

Appendix C. Additional Experimental Results

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI