Next Article in Journal
Complex Dynamic Behaviour of Food Web Model with Generalized Fractional Operator
Next Article in Special Issue
Group Logistic Regression Models with lp,q Regularization
Previous Article in Journal
An Extended ORESTE Approach for Evaluating Rockburst Risk under Uncertain Environments
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Heterogeneous Overdispersed Count Data Regressions via Double-Penalized Estimations

1
Center for Statistics and Data Science, Beijing Normal University, Zhuhai 516087, China
2
Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
3
Department of Statistics, University of Chicago, Chicago, IL 60637, USA
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(10), 1700; https://doi.org/10.3390/math10101700
Submission received: 21 April 2022 / Revised: 6 May 2022 / Accepted: 11 May 2022 / Published: 16 May 2022
(This article belongs to the Special Issue New Advances in High-Dimensional and Non-asymptotic Statistics)

Abstract

:
Recently, the high-dimensional negative binomial regression (NBR) for count data has been widely used in many scientific fields. However, most studies assumed the dispersion parameter as a constant, which may not be satisfied in practice. This paper studies the variable selection and dispersion estimation for the heterogeneous NBR models, which model the dispersion parameter as a function. Specifically, we proposed a double regression and applied a double 1 -penalty to both regressions. Under the restricted eigenvalue conditions, we prove the oracle inequalities for the lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, derived from the oracle inequalities, the consistency and convergence rate for the estimators are the theoretical guarantees for further statistical inference. Finally, both simulations and a real data analysis demonstrate that the new methods are effective.

1. Introduction

In many scientific fields, such as biomedical science, ecology, and economics, experimental and observational studies often yield count data, a type of data in which the observations can take only the non-negative integer values. The Poisson regression models are commonly used for count data. However, it needs a restrictive assumption that the variance equals the mean. For many count data, the variance is often larger than the mean [1], which is called overdispersion. Because the Poisson regression model is invalid under the overdispersion case, a more general and flexible regression model, the negative binomial regression, has attracted lots of research attention and become popular in analyzing count data [2,3,4].
With the advance of modern data collection techniques, high-dimensional data are becoming increasingly common in scientific studies. The widely used estimations for the high-dimensional parameter include the lasso [5], the scad [6], the elastic net [7], the adaptive lasso [8], and so on. Recently, there has been much research on the high-dimensional NBR model, such as [9,10,11,12,13,14]. All of these works assumed the dispersion parameter as a constant. In practice, however, not all models satisfy the assumption. If the dispersion parameter is wrongly assumed to be a constant, the estimation of the mean regression will perform poorly as shown in the simulation in Section 4.1, thus the need to model the dispersion parameter as a function of some covariates. The heterogeneous negative binomial regression (HNBR) extends the NBR by observation-specific parameterization of the dispersion parameter [3]. The HNBR is a valuable tool for assessing the source of overdispersion. It belongs to the double-generalized linear models (DGLMs) or vector-generalized linear models (VGLMs), which are very useful in fitting more complex and potentially realistic models [15,16,17,18]. However, it appears that there is no study on selecting the dispersion explanation variables in the HNBR model.
In this paper, we study the variable selection and dispersion estimation for the heterogeneous NBR models. To the best of our knowledge and based on the literature, this study is the first. Specifically, we propose a double regression to estimate the coefficients of NB dispersion and NBR simultaneously. Because of the high dimension of the covariates, we apply a double 1 penalty to both regressions. The two adjustment parameters we set are different because the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters. We construct an algorithm to perform variable selection and dispersion estimation simultaneously. Similar studies on high-dimensional NBR models include [19], which assumed the dispersion parameter as a constant. Their method requires an iterative algorithm to estimate the mean regression and dispersion alternatively and implement a lasso in each iteration. If there are many iterations, such an algorithm is a waste of computing resources.
The rest of the paper is organized as follows. Section 2 introduces the heterogeneous overdispersed count data model and defines the double 1 -penalized estimators for the mean and dispersion regressions. Then we use a technique called the stochastic Lipschitz condition to derive the asymptotic results in Section 3. Simulation studies and a real data application are given in Section 4. Finally, Section 5 concludes the article with a discussion. All proofs and technical details are provided in Appendix A.

2. Double 1 -Penalized NBR

2.1. Heterogeneous Overdispersed Count Data Regressions

Suppose we have n count responses Y i and p-dimensional covariates X i = ( x i 1 , , x i p ) , i [ n ] : = { 1 , 2 , , n } . For the Poisson regression models, the response obeys the Poisson distribution
P ( Y i = y i λ i ) = λ i y i y i ! e λ i , i [ n ] .
with λ i = E ( Y i ) , we require that the positive parameter λ i is related to a linear combination of p covariates. A plausible assumption for the link function is η ( λ i ) = log ( λ i ) = X i β . It is worth noting that E ( Y i | X i ) = var ( Y i | X i ) = exp ( X i β ) > 0 .
For the traditional negative binomial regression, it assumes that the count data response obeys the NB distribution with overdispersion:
P ( Y i = y i | X i ) = : f ( y i ; k , μ i ) = Γ ( k + y i ) Γ ( k ) y i ! ( μ i k + μ i ) y i ( k k + μ i ) k , i [ n ] ,
with E ( Y i | X i ) = μ i = exp ( β X i ) and k is an unknown qualification of the overdispersion level. When k , we have var ( Y i | X i ) = μ i + μ i 2 k μ i = E ( Y i | X i ) , the Poisson regression for the mean parameter μ i . Thus, the Poisson regression is a limiting case of negative binomial regression when the dispersion parameter k tends to infinite.
In the heterogeneous negative binomial regression, k is proposed as a specific parameterization, i.e., k = k ( X i ) . More specifically, we assume in this paper that
μ ( x ) = exp { θ ( 1 ) x } , k ( x ) = exp { θ ( 2 ) x } .
For notation simplicity, we denote
P f : = E f ( X i , Y i ) , P n f : = 1 n i = 1 n f ( X i , Y i ) , G n f : = n ( P n P ) f ,
for any measurable and integrable function f.
Let θ = ( θ ( 1 ) , θ ( 2 ) ) R 2 p , the log-likelihood is
n ( θ ) = log i = 1 n f ( y i , k i , μ i ) = i = 1 n log Γ ( k i + Y i ) Γ ( k i ) Y i ! ( μ i k i + μ i ) Y i ( k i k i + μ i ) k i = i = 1 n [ log Γ exp { X i θ ( 2 ) } Γ Y i + exp { X i θ ( 2 ) } + Y i X i θ ( 1 ) θ ( 2 ) Y i + exp { X i θ ( 2 ) } log 1 + exp { X i θ ( 1 ) θ ( 2 ) } log Y i ! ]
We use the negative log-likelihood as the loss function γ , and define
γ ( θ ) : = log f ( y | x , θ ) + log y ! .
Denote j : = θ ( j ) , j = 1 , 2 , the score function for θ ( 1 ) is
1 ( θ ) = P n 1 γ ( θ ) = 1 n i = 1 n ( Y i e X i θ ( 1 ) ) e X i θ ( 2 ) X i e X i θ ( 1 ) + e X i θ ( 2 ) .
Furthermore, fix θ ( 1 ) , the score function for θ ( 2 ) is
2 ( θ ) = P n 2 γ ( θ ) = 1 n i = 1 n log 1 + e X i ( θ ( 1 ) θ ( 2 ) ) j = 0 Y i 1 1 j + e X i θ ( 2 ) + Y i e X i θ ( 1 ) e X i θ ( 1 ) + e X i θ ( 2 ) e X i θ ( 2 ) X i .
It is easy to verify that
P 1 ( θ ) = P 2 ( θ ) = 0 .
Thus, from now, we will suppose the true value of parameter θ is θ * .

2.2. Heterogeneous Overdispersed NBR via Double 1 Penalty

The weighted lasso estimator under our circumstance is defined as
θ ^ n = argmin θ Θ P n γ ( θ ) + λ θ ω , 1 ,
where λ > 0 is the tuning parameter and the weighted norm is defined by
λ θ ω , 1 = λ 1 θ ( 1 ) 1 + λ 2 θ ( 2 ) 1 = λ ω 1 θ ( 1 ) 1 + ω 2 θ ( 2 ) 1 ,
and ω = ( ω 1 , ω 2 ) = ( λ 1 / λ , λ 2 / λ ) [ 0 , 1 ] × [ 0 , 1 ] is the weight, · 1 means the 1 -norm. This technique is also used in [20]. Equation (2) is a weighted double 1 -penalized problem, which is a kind of convex penalty optimization, and when λ 1 = λ 2 , it becomes a single-penalized problem. In this paper, we use different λ 1 and λ 2 , as the first-order conditions for estimating the regression coefficients are entirely different from those for estimating the dispersion parameters, and take λ = λ 1 λ 2 .
Because the weighted group lasso estimator θ ^ n has no closed-form solution, we need to use iterative methods such as quasi-Newton or coordinate descent methods. We use BIC to choose the parameter λ 1 and λ 2 .
BIC ( λ 1 , λ 2 ) = 2 ( θ ^ n ) + log n n k ,
where k is the number of nonzero estimated coefficients. To illustrate the algorithm explicitly, we rewrite γ ( θ ) as γ ( θ ( 1 ) x , θ ( 2 ) x ) and define θ ( 3 ) = λ 2 / λ 1 θ ( 2 ) , θ = ( θ ( 1 ) , θ ( 3 ) ) . Converting θ ( 2 ) into θ ( 3 ) turns the double 1 -penalized problem into a single penalized one, which can be solved through some R packages, such as “lbfgs”. The algorithm is formally given in Algorithm 1.
Algorithm 1 Double 1 -Penalized Optimization
Input: the set of tuning parameters Λ = { ( λ 1 , i , λ 2 , i ) } i = 1 m
Output: the estimate θ ^ n
  for i = 1 , , m , do
    let x * = λ 1 , i λ 2 , i x ;
    solve θ ^ = ( θ ^ ( 1 ) , θ ^ ( 3 ) ) = argmin θ Θ P n γ ( θ ( 1 ) x , θ ( 3 ) x * ) ) + λ 1 , i θ 1 ;
    obtain the estimate θ ^ n , i = ( θ ^ ( 1 ) , λ 2 , i λ 1 , i θ ^ ( 3 ) ) ;
    compute BIC ( λ 1 , i , λ 2 , i ) = 2 ( θ ^ n , i ) + log n n k i ;
  end for
  find i o p t = argmin i = 1 , , m BIC ( λ 1 , i , λ 2 , i ) ;
  return θ ^ n , i o p t
The proposed algorithm can perform variable selection and dispersion estimation simultaneously. Similar studies on high-dimensional NBR models include [19], which assumed the dispersion parameter as a constant. However, their method requires an iterative algorithm to estimate the mean regression and dispersion alternatively and implement lasso in each iteration. If there are many iterations, such an algorithm is a waste of computing resources.

3. Main Results

3.1. Stochastic Lipschitz Conditions

We write the maximum of Y i from the sample of size n as M Y , n , then the sample space for { Y i } i = 1 n is Y : = { y N , y M y , n } ., i.e., M y , n = max i [ n ] Y i . Note that lim n P ( M y , n = ) = 1 ; what we need to tackle is actually an unbounded empirical process. However, for z : = x x R 2 × 2 p , we can assume the value space S for s : = z θ is bounded and satisfies
S : = s = ( s 1 , s 2 ) R 2 , < m s , n s j | s j | M s , n < , j = 1 , 2 .
As we can see, the most significant difference between this article and other conventional literature about lasso estimators is that we use s = z θ rather than θ as the explanatory variable to analyze the properties of the loss function γ . This is not a traditional way. At first glimpse, the combination may complicate the analysis in the next step because the KKT condition requires the story about θ γ , which is critical for the traditional convex penalty problem. However, this article will try a different approach, the stochastic Lipschitz conditions introduced in the event A of Proposition 1 in [14], to solve the 1 -penalization problem. Define the local stochastic Lipschitz constant by
Lip ( f ; θ * ) : = sup θ Θ / { θ * } n G n f ( θ ) f ( θ * ) θ θ * 1 .
The most apparent advantage of the stochastic Lipschitz conditions over the KKT condition is that it can easily deal with the several parameters involved in different locations of the model that need to impose the same penalty on them, which is why we do not need to derive the KKT condition in this paper.
To establish the stochastic Lipschitz conditions for this unbounded counting process, another assumption, called the strongly midpoint log-convex, for some positive γ should be satisfied, which states for the joint density from the sample Y : = ( Y 1 , , Y n ) Z n ’s negative log-density of n independent NB responses ψ ( y ) : = log p Y ( y ) satisfies
ψ ( x ) + ψ ( y ) ψ 1 2 x + 1 2 y ψ 1 2 x + 1 2 y γ 4 x y 2 2 , x , y Z n .
This assumption is a condition that ensures that the suprema of the multiplier empirical processes of n independent responses have sub-exponential concentration phenomena, which can be alternatively checked by the tail inequality for the suprema of the empirical processes corresponding to classes of unbounded functions ([21]).
Theorem 1.
Suppose max i [ n ] , 1 k p | X i k | M x < , the parameter space Θ is convex and its diameter D Θ < . If { Y i } i = 1 n and { Z i θ } i [ n ] , θ Θ are both in the value space Y and S defined as previous, then for any θ Θ ,
Lip ( γ ; θ * ) = sup θ Θ / { θ T * } n G n γ ( θ ) γ ( θ * ) θ θ * 1 n M q : = A 1 log ( 2 p / q 2 ) + A 2 log p + A 3 log ( p / q 3 ) max 1 k p i = 1 n X i k 2 + B log ( 2 p / q 1 ) max 1 k p i = 1 n X i k 4 1 / 2 C log ( 2 p / q 1 ) + D log ( p / q 3 ) ,
with probability at least 1 q 0 , where q 1 , q 2 , q 3 ( 0 , 1 ) satisfy q 1 + q 2 + q 3 = q 0 , and the constants are as follows:
A 1 = 2 F 1 , A 2 = 32 2 M x F 2 D Θ , A 3 = 2 2 ( F 1 + M y , n ) F 2 M x D Θ , B = 6 2 w ( 1 ) w ( 2 ) i = 1 n a ( μ i , k i ) 4 1 / 2 , C = 12 M x w ( 1 ) w ( 2 ) max 1 i n a ( μ i , k i ) , D = 8 2 ( F 1 + M y , n ) F 2 M x D Θ M x , w ( 1 ) = e M s , n e m s , n + e M s , n , w ( 2 ) = e + e M s , n m s , n 1 + e m s , n M s , n + 1 1 + e m s , n M s , n ,
where M y , n = max i [ n ] Y i is the suprema empirical process.
It is worthy to note that the M y , n in Theorem 1 is a random process; hence, the bound above is not deterministic. Fortunately, M y , n can use the strongly midpoint log-convex condition to be bounded, which we state in Lemma A3. Theorem 1 combined with Lemma A3 will give the following result as a step more.
Theorem 2.
Assume the conditions are the same as that in Theorem 1, then the stochastic Lipschitz constant has a nonrandom upper bound:
Lip ( γ ; θ * ) n M q : = ( A 1 log ( 2 p / q 2 ) + A 2 log p + 2 A 3 log ( 2 n / q 4 ) + log ( n p / q 3 ) ) max 1 k p i = 1 n X i k 2 + B log ( 2 p / q 1 ) max 1 k p i = 1 n X i k 4 1 / 2 C log ( 2 p / q 1 ) + D log ( p / q 3 ) ,
with probability at least 1 q 0 , where q 1 , q 2 , q 3 , q 4 ( 0 , 1 ) satisfy q 1 + q 2 + q 3 + q 4 = q 0 , and
A 3 = 2 2 F 1 + 2 γ max i [ n ] a ( μ i , k i ) μ i log 2 F 2 M x D Θ .
Theorem 1 gives us a different sight of the loss function far more than KKT conditions. However, the stochastic Lipschitz condition above does not compare the estimated and true values directly. We can resolve this issue by using an eigenvalue condition on the design matrix consisting of X i . Because the design matrix X is fixed, the eigenvalue condition in the next section is reasonable. It is worthy to note that this inequality is an oracle because it involves an unknown empirical process on the right side.

3.2. 2 -Estimation Error Oracle Inequalities RE Conditions

As we said previously, although we use stochastic Lipschitz conditions instead of KKT conditions, the restricted eigenvalue conditions (RE conditions) are still required. We denote by δ J the vector in R p with the same coordinates as v on J and zero coordinates on the complement J c of J, and spt ( v ) = { j : v j 0 } . We will assume that the minima in (2) can always be obtained in the following setting, but it may not be unique. In general, to bound θ ^ θ * , some conditions on the design matrix X R n × p are needed for obtaining abound in terms of the 2 norm of θ θ * . Here, we will utilize the restricted eigenvalue condition introduced in [22], which says that for some 1 s p and K > 0 ,
κ ( s , K ) = min X v 2 n v J 2 : 1 | J | s , v R p / { 0 } , v J c 1 K v J 1 > 0 .
It should be noted that omitting the weight ω and the sparse restricted set v J c 1 K v J 1 leads to v 1 n X X v / v v κ 2 ( s , K ) . Thus, it means that the smallest eigenvalue of the sample covariance matrix 1 n X X is positive, which is impossible when p > n because 1 n X X is not full rank. To avoid this problem, ref. [22] consider the restricted eigenvalue condition under the sparse restricted set v J c 1 K v J 1 as a considerable relation in sparse high-dimensional estimation. The restricted eigenvalue is from the restricted strong convexity, which enforces a strong convexity condition for the negative log-likelihood function of linear models under a certain sparse restrict set.
Due to the double penalty, besides the RE condition, we also require another condition similar to the RE condition, the so-called l-restricted isometry constant defined in [23], as follows
σ X , l 2 = max X v 2 2 / v 2 2 : v R p , 1 spt ( v ) l ( 0 , ) ,
which essentially requires the eigenvalue of the sample covariance matrix under every vector with cardinality less than l (l should be no more than n) approximately behaves normally like the low-dimensional case.
With the RE condition and l-restricted isometry constant, and the two theorems we established before, the lasso estimator in (2) can guarantee a good consistent property.
Lemma 1
(see Lemma 3.1 in [23]). Suppose T 0 is a set of cardinality S. For a vector h R p , we let T 1 be the S largest positions of h outside of T 0 . Put T 01 = T 0 T 1 , then
h 2 2 h T 01 2 2 + S 1 h T 0 c 1 2 .
Theorem 3.
Suppose the condition is the same as that in Theorem A1. Furthermore, assume p 1 = spt θ ( * 1 ) spt θ ( * 2 ) p / 2 , and there exists some K > 1 , κ : = κ ( 2 p 1 , K ) > 0 . Let λ = ( K + 1 ) M q n ( K 1 ) , then using this λ in (2), with probability at least 1 q ,
θ ^ θ * 2 2 8 p 1 M q 2 K 2 κ 4 n 2 C γ 2 ( K 1 ) 2 2 + K 2 + 2 ( 1 + 2 p 1 K 2 ) ( n κ 2 + 2 σ X , p 1 2 ) n κ 2 ,
where M q , C γ are defined in Theorems 1 and A1, respectively.
Remark 1.
Compared to the single lasso problem, in which we only have one unknown vectorized parameter, the oracle inequality in Theorem 3 has an extra term 2 ( 1 + 2 p 1 K 2 ) ( n κ 2 + 2 σ X , p 1 2 ) n κ 2 .
Remark 2.
From Theorem 3, we know that the 2 convergence rate is minimax optimal, as studied in [14].
Remark 3.
In this study, we use the lasso estimators of two partial regression coefficients because it is one of the most popular techniques for high-dimensional data. It is worth mentioning that the algorithms and theoretical results could be similarly generalized to other shrinkage estimators, such as the elastic net [7], the adaptive lasso [8], and so on.

4. Numerical Studies

4.1. Simulations

In this section, we evaluate the finite sample performance of the proposed method. The response is generated from the negative binomial regression model (1) with
μ ( x ) = exp { θ ( 1 ) x } , and k ( x ) = exp { θ ( 2 ) x } ,
where θ ( 1 ) and θ ( 2 ) are two p-dimensional parameters. The explanatory variables are generated from the multivariate normal distributions with mean vector 0 and C o v ( x i , x j ) = ρ | i j | , where ρ = 0 , 0.5 . The following two examples show the performance of the proposed estimator for the low-dimensional heterogeneous negative binomial regression and the variable selection in the high-dimensional case, respectively. The R package “lbfgs” is required to solve the optimization problem.
Example 1
(Low dimension). We set p = 3 and n = 100 , 200 , 400 . The true parameters are θ ( 1 ) = ( 1 , 2 , 1 ) and θ ( 2 ) = ( 1 , 0.5 , 1 ) , and their maximum likelihood estimators are denoted as θ ^ ( 1 ) and θ ^ ( 2 ) , respectively. We compare the estimator θ ^ ( 1 ) with θ ^ ( 1 ) * , which ignores the heterogeneity of the overdispersion and treats k ( x ) as a constant. Table 1 displays the average squared estimation errors θ ^ θ 2 2 based on 200 repetitions.
We can make the following observations from the table. Firstly, the performances of the three estimators become better and better as n increases. Secondly, the estimator θ ^ ( 1 ) , which estimates the parameter in the mean function μ ( x ) , performs better than θ ^ ( 2 ) , which estimates the parameter in the overdispersion function k ( x ) . Last, but the most important, θ ^ ( 1 ) * performs much worse than θ ^ ( 1 ) . For example, the average squared estimation error of θ ^ ( 1 ) * is about 5 times of θ ^ ( 1 ) ’s when n = 100 , and 10 times of θ ^ ( 1 ) ’s when n = 400 . The comparison between θ ^ ( 1 ) and θ ^ ( 1 ) * indicates the necessity of considering the heterogeneity of the overdispersion.
Example 2
(High dimension). The sample sizes are chosen to be n = 100 , 200 , 400 , with dimension p ( 25 , 50 , 150 ) , ( 50 , 100 , 250 ) and ( 100 , 200 , 500 ) , respectively. We set θ ( 1 ) = ( 1 , 2 , 1 , 0 , , 0 ) and θ ( 2 ) = ( 1 , 0.5 , 1 , 0 , , 0 ) . The unknown tuning parameters ( λ 1 , λ 2 ) for the penalty functions are chosen by BIC criterion in the simulation. Results over 200 repetitions are reported. We compared the variable selection performance of the proposed method to the previous method, which ignores the heterogeneity of the overdispersion and treats k ( x ) as a constant. For each case, Table 2 reports the number of repetitions that each important explanatory variable is selected in the final model and also the average number of unimportant explanatory variables being selected.
We see from the table that our method performs much better than the previous method that treats k ( x ) as a constant. Specifically, our method correctly selects important variables more times than the previous method, and it is less likely to select unimportant variables. Furthermore, the variable selection procedure performs better and better as the sample size n increases. When n = 400 , the important explanatory variables in μ ( x ) and k ( x ) are correctly selected in almost every repetitions. When the dimension p increases, the procedure may select more unimportant explanatory variables, but the average numbers are less than 1.3 . The important variables in k ( x ) are less likely to be selected than the important variables in μ ( x ) especially when the sample size is small, as well as the unimportant variables.

4.2. A Real Data Example

In this section, we apply the proposed method to the dataset of German health care demand. The data were employed in [24] and could be downloaded on http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022. The data contain 27,326 observations on 25 variables, including 2 dependent variables, Docvis (number of doctor visits in the last three months) and Hospvis (number of hospital visits in the last calendar year). For conciseness, we focus on Docvis in this study. We build the HNBR model based on the proposed variable selection procedure and make the standard NBR model a comparison. Define the fitting errors (FE) as n 1 i = 1 n ( y i y ^ i ) , where y i denotes the raw data of Hospvis, y ^ i is the predicted value, and n is the sample size. As the data are observed during 1984–1988, 1991, and 1994, we make the analysis for each observed year. Table 3 displays the variable selection results and fitting errors.
We have the following findings from the table. First, the important variables in the NBR are the same as HNBR models in each year, and the estimates are close. Second, the selected variables in μ ( x ) are almost the same every year, namely Age, Hsat (health satisfaction), Handper (degree of handicap), and Educ (years of schooling). Moreover, some of these variables still play an essential role in k ( x ) , and k ( x ) contains no variables other than these. Moreover, we can see that the fitting errors of the HNBR is less than that of the NBR. All of these illustrate the advantage of our method.

5. Conclusions and Future Study

We study the high-dimensional heterogeneous overdispersed count data via negative binomial regression models and propose a double 1 -regularized method for simultaneous variable selection and dispersion estimation. Under the restricted eigenvalue conditions, we prove the oracle inequalities with lasso estimators of two partial regression coefficients for the first time, using concentration inequalities of empirical processes. Furthermore, we derive the consistency and convergence rate for the estimators, which are the theoretical guarantees for further statistical inference. Simulation studies and a real example from the German health care demand data indicate that the proposed method works satisfactorily.
There are some limitations of this study. First, we assume that the responses are independent in this work. However, the NB responses are temporal dependent in the time-series data [25]. Thus, weak dependence conditions, including ρ -mixing, m-dependent types, could be considered in the future. Second, this study focuses little on the statistical inference, such as testing heterogeneous
H 0 : θ ( 2 ) = 0 vs . H 1 : θ ( 2 ) 0 .
The issues concerning the hypothesis testing are via the debiased lasso estimator; see [26] and references therein. This will comprise our future research work. Another possible study is the false discovery rate (FDR) control, which aims to identify some small number of statistically significantly nonzero results after obtaining the sparse penalized estimation of the HNBR; see [27,28].

Author Contributions

Conceptualization, S.L. and H.W.; methodology, H.W.; software, S.L.; validation, S.L., H.W. and X.L.; data curation, S.L.; writing—original draft preparation, S.L., H.W. and X.L.; writing—review and editing, S.L. and H.W.; supervision, S.L. and H.W.; funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 12101056.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in Section 4.2 could be downloaded on http://qed.econ.queensu.ca/jae/2003-v18.4/riphahn-wambach-million/, accessed on 1 January 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proofs

The first step is giving the property of the loss function. From mathematical analysis, we prefer bounded things to unlimited things. Denote j is the first partial differentiation with respect to s j . The bounded aspect for y and s gives a nice property for the loss function γ ( s , y ) = γ ( z θ , y ) .
Lemma A1.
We have
1 γ ( s , y ) = y e s 2 e s 1 + e s 2 + e s 1 + s 2 e s 1 + e s 2 , 2 γ ( s , y ) = ν ( s , y ) + y e s 2 e s 1 + e s 2
where ν ( s , y ) = e s 2 ψ ( e s 2 ) ψ ( y + e s 2 ) + e s 2 log 1 + e s 1 s 2 e s 1 + s 2 e s 1 + e s 2 satisfying
sup s S , y Y | ν ( s , y ) | F 1 , sup s t S , y Y | ν ( s , y ) ν ( t , y ) | s t F 2 ,
with F 1 = M y , n 1 + e m s , n + e M s , n + e 2 M s , n 2 e m s , n , and
F 2 = 2 e M s , n 1 + log ( M y , n + e M s , n ) 1 2 ( M y , n + e M s , n ) 1 m s , n e m s , n + e 2 M s , n e m s , n + 2 e 2 M s , n e m s , n + e M s , n + 3 2 e M s , n .
Proof. 
We will use the properties of the psi function, the logarithmic derivative of the gamma function, to prove this lemma. Write ψ ( x ) = Γ ( x ) / Γ ( x ) . For any s S , y Y , using the Binet ’s formula (see p. 18 of [29])
ψ ( x ) = log x 0 φ ( t ) e t x d t ,
where φ ( t ) = 1 / ( 1 e t ) 1 / t is strictly increasing on ( 0 , ) , it gives
0 < ψ ( x ) = 1 x + 0 t φ ( t ) e t x d t 1 x + 0 t e t x d t = 1 x + 1 x 2 .
and y 0 , we have
| ν ( s , y ) | = e s 2 ψ ( e s 2 ) ψ ( y + e s 2 ) + e s 2 log 1 + e s 1 s 2 e s 1 + s 2 e s 1 + e s 2 e s 2 y 1 e s 2 + 1 e s 2 + e s 2 e s 1 s 2 + e s 1 + s 2 e s 1 + e s 2 M y , n 1 + e m s , n + e M s , n + e 2 M s , n 2 e m s , n .
Then, the first inequality in the lemma has been verified. On the other hand, by using the fact that (see (2.2) in [30])
1 2 x < log ( x ) ψ ( x ) < 1 x , x > 0 ,
for the function f 1 ( x ) = e x ψ ( e x ) and f 2 ( x ) = e x ψ ( y + e x ) ,
f 1 ( x ) = e x ψ ( e x ) + e 2 x ψ ( e x ) e x x 1 2 e x + e 2 x 1 e x + 1 e 2 x = ( x + 1 ) e x + 1 2 , f 1 ( x ) = e x ψ ( e x ) e 2 x ψ ( e x ) e x ψ ( e x ) x e x 1 , f 2 ( x ) = e x ψ ( y + e x ) + e 2 x ψ ( y + e x ) e x 1 + log ( y + e x ) 1 2 ( y + e x ) + 1 , f 2 ( x ) e x log ( y + e x ) 1 y + e x x e x 1 ,
and for any s t S , y Y , we conclude that
| ψ ( e s 2 ) e s 2 ψ ( e t 2 ) e t 2 | = | f 1 ( s 2 ) f 1 ( t 2 ) | ( M s , n + 1 ) e M s , n + 1 / 2 1 m s , n e m s , n s t ,
and
| ψ ( y + e s 2 ) e s 2 ψ ( y + e t 2 ) e t 2 | = | f 2 ( s 2 ) f 2 ( t 2 ) | e M s , n 1 + log ( M y , n + e M s , n ) 1 2 ( M y , n + e M s , n ) 1 m s , n e m s , n s t .
In addition, using the median value theorem again, we also have
e s 2 log ( 1 + e s 1 s 2 ) e t 2 log ( 1 + e t 1 t 2 ) log ( 1 + e s 1 s 2 ) | e s 2 e t 2 | + e t 2 log ( 1 + e s 1 s 2 ) log ( 1 + e t 1 t 2 ) e 2 M s , n m s , n | s 2 t 2 | + e M s , n 1 1 + e ( M s , n m s , n ) | ( s 1 s 2 ) ( t 1 t 2 ) | e 2 M s , n e m s , n + 2 e 2 M s , n e m s , n + e M s , n s t ,
and
e s 1 + s 2 e s 1 + e s 2 e t 1 + t 2 e t 1 + e t 2 e s 1 1 1 + e s 1 s 2 1 1 + e t 1 t 2 + 1 1 + e t 1 t 2 | e s 1 e t 1 | e M s , n 1 4 | ( s 1 s 2 ) ( t 1 t 2 ) | + 1 × e M s , n | s 1 t 1 | 3 2 e M s , n s t ,
where the fact used is that f 3 ( x ) = 1 / ( 1 + e x ) satisfies | f 3 ( x ) | = 1 / ( e x + e x + 2 ) 1 / 4 . Because
| 2 γ ( s , y ) 2 γ ( t , y ) | ψ ( e s 2 ) e s 2 ψ ( e t 2 ) e t 2 + ψ ( y + e s 2 ) e s 2 ψ ( y + e t 2 ) e t 2 + e s 2 log ( 1 + e s 1 s 2 ) e t 2 log ( 1 + e t 1 t 2 ) + e s 1 + s 2 e s 1 + e s 2 e t 1 + t 2 e t 1 + e t 2 ,
we can conclude the second inequality in the lemma. □
The Lemma separates the partial derivative of γ into two parts: the first part is the linear about the response variable y (say y e s 2 / ( e s 1 + e s 2 ) , e s 1 + s 2 / ( e s 1 + e s 2 ) , and y e s 2 / ( e s 1 + e s 2 ) ), the second part is other complicated functions (not linear function) about y. The first part is relatively easy to analyze because the following concentration inequality gives a measure of dispersion about the weighted summation of negative binomial variables. This concentration inequality is a special case for the weighted summation of a series of random variables, which can be proved by sub-exponential concentration results in Proposition 4.2 in [31].
Lemma A2.
Suppose { Y i } i = 1 n are independently distributed as NB ( μ i , k i ) . Then, for any nonrandom weights w = ( w 1 , , w n ) R n independent with { Y i } i = 1 n and t 0 ,
P | i = 1 n w i ( Y i E Y i ) | t 2 exp 1 4 t 2 2 i = 1 n w i 2 a 2 ( μ i , k i ) t max 1 i n | w i | a ( μ i , k i ) ,
where q i : = μ i k i + μ i ( 0 , 1 ) and a ( μ , k ) : = log 1 ( 1 q ) / 2 k q 1 + μ log 2 .
Proof. 
We will use the sub-exponential norm. The moment-generating function (MGF) for Y i is
E e s Y i = 1 q i 1 q i e s k i .
Then, by letting E exp ( | Y i | / t ) 2 , we have
2 E exp ( | Y i | / t ) = E exp ( Y i / t ) = 1 q i 1 q i e 1 / t k i ,
which implies the sub-exponential norm for Y i is
Y i ψ 1 = inf { t > 0 : E exp ( | Y i | / t ) 2 } = log 1 ( 1 q i ) / 2 k i q i 1 .
Using the definition of a i , from Proposition 4.2 in [31], we can immediately obtain the result in the Lemma. □
It should be noted that a i = a ( μ i , k i ) naturally has a lower and upper bound for any i [ n ] because μ i and k i are both bounded between e m s , n and e M s , n .
Note that Y i is an unbounded random variable; the next step is to find a probabilistic bound for M y , n = max i [ n ] Y i . We will cite an important lemma for this type of problem. We say a distribution P γ is strongly discrete log-concave with γ > 0 if its density is strongly midpoint log-convex with the same γ > 0 .
Lemma A3
(Concentration for strongly log-concave discrete distributions). Let P γ be any strongly log-concave discrete distribution index by γ > 0 on Z n . Then, for any function f : R n R that is L-Lipschitz with respect to Euclidean norm, we have for X P γ ,
P P γ | f ( X ) E f ( X ) | t 2 exp γ t 2 4 L 2
for any t > 0 .
Lemma A4.
The maximal of the response M y , n = max i [ n ] Y i has the concentration
P M y , n 2 max i [ n ] a ( μ i , k i ) μ i log 2 log ( 2 n ) + 2 log ( 2 n ) + max i [ n ] μ i > t e γ t 2 / 4
for any t > 0 .
Proof. 
For the upper bound of expectation, we first note that Y i E Y i subE ( 2 Y i ψ 1 ) with Y i ψ 1 has calculated in Lemma A2, then we have Y i E Y i sub Γ ( 4 Y i ψ 1 2 , 2 Y i ψ 1 ) by Example 5.3 in [31], which further gives
E M y , n E max i [ n ] ( Y i E Y i ) + max i [ n ] E Y i 2 · max i [ n ] 4 Y i ψ 1 2 · log ( 2 n ) 1 2 + max i [ n ] 2 Y i ψ 1 · log ( 2 n ) + max i [ n ] μ i = 2 max i [ n ] Y i ψ 1 log ( 2 n ) + 2 log ( 2 n ) + max i [ n ] μ i ,
where the second ≤ is by Corollary 7.3 in [31] and the bound in the lemma comes from the explicit expression in Lemma A2.
By implementing Lemma A3, it remains that we need to verify that Y : = ( Y 1 , , Y n ) Z n belongs to some strongly log-concave discrete distribution P γ with the specifying γ > 0 after we take f : ( x 1 , , x n ) max i [ n ] x i which is 1-Lipschitz. By the definition, the derivative of log-density for y : = ( y 1 , , y n ) is
ψ ( y i ) : = log p ( y ) y y i = log Γ ( k i + y i ) Γ ( 1 + y i ) y i log ( k i + μ i ) ,
then the Taylor expansion gives
ψ ( y ) = ψ 1 2 x + 1 2 y + 1 2 ψ 1 2 x + 1 2 y ( y x ) + 1 8 ( y x ) 2 ψ a 1 , ψ ( x ) = ψ 1 2 x + 1 2 y + 1 2 ψ 1 2 x + 1 2 y ( x y ) + 1 8 ( y x ) 2 ψ a 2
where a 1 = t 1 y + 1 t 1 ( x + y ) / 2 , a 2 = t 2 y + 1 t 1 ( x + y ) / 2 with t 1 , t 2 [ 0 , 1 ] . Define the difference function
Δ ( x , y ) : = x y 4 ψ 1 2 x + 1 2 y ψ 1 2 x + 1 2 y + ψ a 1 + ψ a 2 16 ( y x ) 2 ,
the Taylor expression above immediately implies
Δ ( x , y ) | x y | 2 ψ a 1 + ψ a 2 16 sup x y ; x , y Z n ψ ( ( x + y ) / 2 ) ψ ( ( x + y ) / 2 ) 4 | x y | .
Let
C ψ : = sup x y ; x , y Z n ψ ( ( x + y ) / 2 ) ψ ( ( x + y ) / 2 ) 4 | x y | = sup x y ; x , y Z n log Γ ( k i + ( x + y ) / 2 ) Γ ( ( x + y ) / 2 + 1 ) Γ ( k i + ( x + y ) / 2 ) Γ ( ( x + y ) / 2 + 1 ) ( ( x + y ) / 2 ( x + y ) / 2 ) log 1 k i + μ i / 4 | x y | ,
and it is not hard to see C ψ log k i + μ i 4 or 0. Besides,
ψ ( y ) : = 2 log p ( y ) y 2 y = y i = d d y i log Γ θ + y i Γ y i + 1 = m = 1 1 m + 1 1 m + k i + y i m = 1 1 m + 1 1 m + y i + 1 = m = 1 1 m + y i + 1 1 m + k i + y i inf y i Z m = 1 1 m + y i + 1 1 m + k i + y i = C ψ .
Now, we have obtained
Δ ( x , y ) | x y | 2 ψ a 1 + ψ a 2 16 C ψ | x y | 2 C ψ 8 C ψ
which gives γ = : C ψ 8 C ψ > 0 from the strong log-concave assumption for Y , if C ψ | log ( k i + μ i ) | 4 is small. Hence, we can conclude from Lemma A3 and the upper bound of E M y , n
P M y , n 2 max i [ n ] a ( μ i , k i ) μ i log 2 log ( 2 n ) + 2 log ( 2 n ) + max i [ n ] μ i > t P ( M y , n E M y , n > t ) e γ t 2 / 4
which is exactly the result in the lemma. □
Remark A1.
For M y , n , it is distributed as sub-Gumbel, which is rarely studied by research. Another way to deal with using the extreme value theory (EVT) technique, we note that for any t R
P ( M y , n E M y , n > t ) = 1 i = 1 n P Y i t + E M y , n = 1 i = 1 n 1 exp 1 4 ( t + E M y , n μ i ) 2 2 a i 2 t + E M y , n μ i a i .
If Y i is i.i.d., then in asymptotic sense,
P ( M y , n E M y , n > t ) = 1 1 P Y 1 > t + E M y , n n 1 exp n P Y 1 > t + E M y , n + o ( 1 ) 1 exp n exp 1 4 ( t + E M y , n μ 1 ) 2 2 a 1 2 t + E M y , n μ 1 a 1 .
Unfortunately, this technique cannot be used in the above lemma because: (i) we need non-asymptotic version inequality instead of a vague expression with n and (ii) { Y i } is not an i.i.d. series, and then EVT theory will not be easily used in this particular setting. Hence, we adopt a discrete technique which has been used in [32] and fully illustrated in [14].
The stochastic Lipschitz conditions are established by using the properties of 1 γ ( s , y ) and 2 γ ( s , y ) . As we said before, they are divided into two parts. The linear parts in them can be solved by the concentration inequality for NB variables given in Lemma A2, but the non-linear part ν ( s , y ) needs some more advanced tools regarding the empirical process. They are given as the following lemmas.
Lemma A5
(3.12) in [33]). Suppose X 1 ( ω ) , , X n ( ω ) R are zero-mean independent stochastic processes indexed by ω Ω . If there exist M 0 and S 0 satisfying | X i ( ω ) | M 0 and i = 1 n var X i ( ω ) S 0 2 for all ω Ω . Denote S n = sup ω Ω | i = 1 n X i ( ω ) | , then for any t > 0 ,
P S n 2 E S n + S 0 2 t + 4 M 0 t e t .
A map ϕ : R R is called a contraction if | ϕ ( s ) ϕ ( t ) | | s t | for all s , t R . In addition, in the following lemmas, ε 1 , , ε n are always i.i.d. Rademacher variables.
Lemma A6
(Theorem 2.2 in [34]). Let T V n be a bounded set and f 1 , , f n be functions V R such that f i is ( M i , ) -Lipschitz with f i ( 0 ) = 0 . For j = 1 , , k N , let T j = { ( t 1 j , , t n j ) : ( t 1 , , t n ) T } R n . Then,
E sup t T | i = 1 n ε i f i ( t i ) | β k j = 1 k E sup s T j | i = 1 n ε i M i s i | ,
where β k is a universal constant that can be set no greater than 3 k + 3 k 1 2 k .
Lemma A7
(Theorem 4.12 in [35]). Let F : R + R + be convex and increasing. Let further ϕ i : R R , i n be contractions such that ϕ i ( 0 ) = 0 . Then, for any bounded subset T in R n ,
E F 1 2 sup T | i = 1 n ε i ϕ i ( t i ) | E F sup T | i = 1 n ε i t i | .
Lemma A8
(Lemma 5.2 in [36]). Let A be some finite subset of R n , let R = sup a A i = 1 n a i 2 1 / 2 , then
E sup a A i = 1 n ε i a i R 2 log card ( A ) .
With the assistance of these powerful tools, we can establish the stochastic Lipschitz condition as follows, which is one of the most important points in this article for establishing the oracle inequality of the 2 distance between the estimated value θ ^ and the real value θ * .
The proof of Theorem 1.
Denote c i = Z i θ * . For θ Θ , denote t i = Z i ( θ θ * ) = Z i θ c i . We also define the map π ¯ j : ( x 1 , , x p ) ( x 1 , , x j , 0 , , 0 ) and the function
φ i j ( s ) = γ ( c i + π ¯ j s , Y i ) γ ( c i + π ¯ j 1 s , Y i ) s j j γ ( c i , Y i ) , if s j 0 ; j γ ( c i + π ¯ j 1 s , Y i ) j γ ( c i , Y i ) , if s j = 0 .
Thus, φ i j : R 2 R is a real-value function for i = 1 , , n , j = 1 , 2 . Then, it is easy to check that
γ ( Z i θ , Y i ) γ ( Z i θ * , Y i ) = j = 1 2 j γ ( c i , Y i ) + φ i j ( t i ) t i j ,
and n P n γ ( θ ) γ ( θ * ) = i = 1 n j = 1 2 j γ ( c i , Y i ) + φ i j ( t i ) X i θ ( j ) θ * ( j ) in turn. It gives
n G n γ ( θ ) γ ( θ * ) = i = 1 n j = 1 2 j γ ( c i , Y i ) E j γ ( c i , Y i ) X i θ ( j ) θ * ( j ) + i = 1 n j = 1 2 φ i j ( t i ) E φ i j ( t i ) X i θ ( j ) θ * ( j ) .
First, we would like to give the explicit formula for φ i 1 and obtain an upper bound as well as a Lipschitz parameter for φ i 2 . Denote h i ( · ) = γ ( · , Y i ) , then
φ i j ( s ) = 0 1 j h i ( c i + π ¯ j 1 s + s j u e j ) j h i ( c i ) d u ,
where e j is the j-th basis vector of R 2 . Hence, for j = 1 ,
φ i 1 ( s ) = Y i 0 1 e s 2 e s 1 + u + e s 2 e s 2 e s 1 + e s 2 d u + 0 1 e s 1 + s 2 + u e s 1 + u + e s 2 e s 1 + s 2 e s 1 + e s 2 d u = log e s 1 + 1 + e s 2 e s 1 + e s 2 e s 1 e s 1 + e s 2 Y i + C 1 ( s ) ,
in which C 1 ( s ) is a function only related to s and free of Y and the index i. Using Lemma A1, for j = 2 , write F 3 = F 1 + M y , n ,
φ i 2 ( s ) 0 1 2 h i ( c i + π ¯ 1 s + s 2 u e 2 ) 2 h i ( u ) d u 2 F 3 ,
and
| φ i 2 ( s ) φ i 2 ( t ) | 0 1 | 2 h i ( c i + π ¯ 1 s + s 2 u e 2 ) 2 h i ( c i + π ¯ 1 t + t 2 u e 2 ) | d u 0 1 F 2 π ¯ 1 ( s t ) + ( s 2 t 2 ) u e 2 d u F 2 s t .
This implies φ i 2 is ( F 2 , ) Lipschitz. In particular, letting s = Z i ( θ θ * ) and t = 0 ,
φ i 2 Z i ( θ θ * ) Z i ( θ θ * ) F 2 M x D Θ .
Hence, we obtain an upper bound for φ i 2 that
φ i 2 Z i ( θ θ * ) 2 F 3 F 2 M x D Θ : = M 1
Now, for k = 1 , , p , define
ξ i k ( θ ) : = φ i 2 ( t i ) E φ i 2 ( t i ) X i k , S k = sup θ Θ i = 1 n ξ i k ( θ ) .
Then, we can approach the final conclusion in the theorem by
sup θ Θ / { θ * } n G n γ ( θ ) γ ( θ * ) θ θ * 1 max 1 k p i = 1 n 1 γ ( c i , Y i ) E 1 γ ( c i , Y i ) X i k + sup θ Θ / { θ * } max 1 k p i = 1 n φ i 1 ( t i ) E φ i 1 ( t i ) X i k + max 1 k p i = 1 n e c i 2 e c i 1 + e c i 2 Y i E Y i X i k + max 1 k p i = 1 n ν ( c i , Y i ) E ν ( c i , Y i ) X i k + sup θ Θ / { θ * } max 1 k p S k .
We will tickle with (A2) term by term.
(i). The first three terms in (A2):
We will use concentration inequality to deal with these terms. For any 1 k p and t 0 , by Lemma A2 and Cauchy–Schwartz inequality,
P | i = 1 n 1 γ ( c i , Y i ) E 1 γ ( c i , Y i ) X i k | t = P | e c i 2 e c i 1 + e c i 2 X i k Y i E Y i | t 2 exp 1 4 t 2 2 i = 1 n ( w i ( 1 ) ) 2 X i k 2 a i 2 t max 1 i n | w i ( 1 ) X i k | a i 2 exp 1 4 t 2 2 i = 1 n ( w i ( 1 ) ) 4 a i 4 max 1 k p i = 1 n X i k 4 t M x max 1 i n | w i ( 1 ) | a i ,
where w i ( 1 ) = e c i 2 / ( e c i 1 + e c i 2 ) and a i = a ( μ i , k i ) is defined in Lemma A2; they are both determined and free of θ and the index k. Hence,
P max 1 k p | i = 1 n 1 γ ( c i , Y i ) E 1 γ ( c i , Y i ) X i k | t 2 p exp 1 4 t 2 2 i = 1 n ( w i ( 1 ) ) 4 a i 4 max 1 k p i = 1 n X i k 4 t M x max 1 i n | w i ( 1 ) | a i .
By letting the right side of the above display be q 1 ( 0 , 1 ) , we can obtain
P ( max 1 k p | i = 1 n 1 γ ( c i , Y i ) E 1 γ ( c i , Y i ) X i k | 2 2 i = 1 n ( w i ( 1 ) ) 4 a i 4 1 / 2 max 1 k p i = 1 n X i k 4 1 / 2 log ( 2 p / q 1 ) 4 M x max 1 i n | w i ( 1 ) | a i log ( 2 p / q 1 ) ) q 1 .
Exactly the same, we can obtain for any q 3 ( 0 , 1 ) , regarding to the third term,
P ( max 1 k p | i = 1 n e c i 2 e c i 1 + e c i 2 Y i E Y i X i k | 2 2 i = 1 n ( w i ( 1 ) ) 4 a i 4 1 / 2 max 1 k p i = 1 n X i k 4 1 / 2 log ( 2 p / q 3 ) 4 M x max 1 i n | w i ( 1 ) | a i log ( 2 p / q 3 ) ) q 3 .
The situation is slightly different for the second term. Indeed,
P | i = 1 n φ i 1 ( t i ) E φ i 1 ( t i ) X i k | t = P i = 1 n log e t i 1 + 1 + e t i 2 e t i 1 + e t i 2 e t i 1 e t i 1 + e t i 2 X i k ( Y i E Y i ) t : = P | i = 1 n w i ( 2 ) ( θ ) X i k Y i E Y i | t
Because t i is a function of θ , so as the weights w i ( 2 ) ( θ ) , we cannot use the exact same method as previously. However, because Θ is convex, we have { t i } i = 1 n S . Then, it only needs to note that,
| w i ( 2 ) ( θ ) | = log e t i 1 + 1 + e t i 2 e t i 1 + e t i 2 e t i 1 e t i 1 + e t i 2 log e + e M s , n m s , n 1 + e m s , n M s , n + 1 1 + e m s , n M s , n : = w ( 2 ) ,
which gives
P ( max 1 k p | i = 1 n φ i 1 ( t i ) E φ i 1 ( t i ) X i k | 2 2 n w ( 2 ) 2 i = 1 n a i 4 1 / 2 max 1 k p i = 1 n X i k 4 1 / 2 log ( 2 p / q 2 ) 4 M x w ( 2 ) max 1 i n | a i | log ( 2 p / q 2 ) ) q 2 .
for any θ Θ and q 2 ( 0 , 1 ) .
(ii). The fourth term in (A2):
From Lemma A1, we know that | ν ( c i , Y i ) | F 1 . Thus, simply by Hoeffding inequality (see Corollary 2.1 (b) in [31]), for any t 0 and 1 k p ,
P | i = 1 n ν ( c i , Y i ) E ν ( c i , Y i ) X i k | t 2 exp t 2 2 F 1 2 i = 1 n X i k 2 2 exp t 2 2 F 1 2 max 1 k p i = 1 n X i k 2 .
For arbitrary q 4 ( 0 , 1 ) , let t = F 1 2 log ( 2 p / q 4 ) max 1 k p i = 1 n X i k 2 , we obtain
P max 1 k p | i = 1 n ν ( c i , Y i ) E ν ( c i , Y i ) X i k | F 1 2 log ( 2 p / q 4 ) max 1 k p i = 1 n X i k 2 q 4 .
(iii). The last term in (A2):
For any i = 1 , , n and k = 1 , , p , by (A1), | ξ i k ( θ ) | 2 M 1 M x : = M 0 . In addition, for any θ Θ , (A1) also implies
i = 1 n var ξ i k ( θ ) = i = 1 n E φ i 2 ( t i ) X i k 2 A 1 2 i = 1 n X i k 2 A 1 2 max 1 k p i = 1 n X i k 2 : = S 0 2
Therefore, from Lemma A5, it follows that
P S k 2 E S k + S 0 2 t + 4 M 0 t e t .
Thus, the last task is giving an upper bound for E S k . Note that E ξ i k ( θ ) = 0 , by symmetrization,
E S k = E sup θ Θ | i = 1 n φ i 2 ( t i ) E φ i 2 ( t i ) X i k | 2 E sup θ Θ | i = 1 n ε i φ i 2 ( t i ) X i k | = 2 E sup t T | i = 1 n ε i φ i 2 ( t i ) X i k | ,
where T = { t i = Z i ( θ θ * ) : θ Θ , i = 1 , , n } , and ε 1 , , ε n are i.i.d. Rademacher variables independent of Y 1 , , Y n . Here, using the fact φ i 2 ( · ) X i k is ( M x F 2 , ) -Lipschitz and Lemmas A6–A8,
E sup t T | i = 1 n ε i φ i 2 ( t i ) X i k | 8 M x F 2 j = 1 2 E sup t T | ε i t i j | = 8 M x F 2 j = 1 2 E sup θ Θ | i = 1 n ε i X i θ ( j ) θ * ( j ) | 16 M x F 2 D Θ E max 1 k p | i = 1 n ε i X i k | 16 2 log p M x F 2 D Θ max 1 k p i = 1 n X i k 2 .
Then, by (A3),
P S k 32 2 log p M x F 2 D Θ max 1 k p i = 1 n X i k 2 + M 1 2 t max 1 k p i = 1 n X i k 2 + 8 M 1 M x t e t .
Note that the right side of the inequality is free of θ , let t = log ( p / q 5 ) in the above inequality, and use the same technique as previous, we obtain the uniform bound for it. The Theorem is proved by letting q 2 = q 3 = q 1 , q 4 = q 2 , q 5 = q 3 , and | w i ( 1 ) | w ( 1 ) . □
The lower bound of the likelihood-based divergence
Recall the standard steps for establishing the oracle inequality for a lasso estimator are (see [37] for example):
  • To avoid the ill behavior of Hessian, propose the restricted eigenvalue condition or other analogous conditions about the design matrix.
  • Find the tuning parameter based on the high-probability event, i.e., the KKT conditions.
  • According to some restricted eigenvalue assumptions and tuning parameter selection, derive the oracle inequalities via the definition of the lasso optimality and the minimizer under unknown expected risk function and some basic inequalities. There are three sub-steps:
    (i)
    Under the KKT conditions, show that the error vector θ ^ θ * is in some restricted set with structure sparsity, and check that θ ^ θ * is in a big compact set;
    (ii)
    Show that the likelihood-based divergence of θ ^ and θ * can be lower bounded by some quadratic distance between θ ^ and θ * ;
    (iii)
    By some elementary inequalities and (ii), show that θ ^ θ * 1 is in a smaller compact set with a radius of optimal rate (proportional to λ ).
Under our approach, the KKT condition with a high probability is replaced by the stochastic Lipschitz condition, while other steps should remain the same. For most models belonging to the canonical exponential family, the step III.(ii) is quite trivial, see Lemma 1 in [38] for example. Nonetheless, it is worthy to note that our loss function is not in the canonical exponential family, so there is no extended discussion about the lower bound of the likelihood-based divergence of θ ^ and θ * in our setting. We will use the following theorem to clarify this thing.
Theorem A1.
Suppose the condition is the same as that in Theorem 1. Denote the true parameter for Y i is μ * and k * . If { Z i θ } i = 1 , , n , θ Θ S { s R 2 : 2 s 1 + ( 1 + s 2 ( 1 k * ) k * μ * ) μ * s 1 + μ * 2 s 2 2 } and μ * 1 , then
E γ ( Z i θ , Y i ) E γ ( Z i θ * , Y i ) C γ Z i ( θ θ * ) 2 2 ,
where C γ is a positive constant and its exact definition is in the proof.
Proof. 
For simplicity, we drop the index i. By the definition and the notation in Theorem 1,
E γ ( Z θ , Y ) E γ ( Z θ * , Y ) = D KL ( s , c ) ,
where D KL is the Kullback–Leibler divergence from the Y i ’s density f ( y | Z θ ) to f ( y | Z θ * ) , i.e.,
D KL ( s , c ) : = f ( y | c ) log f ( y | c ) f ( y | s ) d y .
Due to the identification of the negative binomial distribution, we have D KL ( s , c ) 0 with equality if and only if s = c . Using the Taylor theorem,
D KL ( s , c ) = D KL ( c , c ) + s D KL ( s , c ) s = c + 1 2 ( s c ) 2 s s D KL ( s , c ) s = c + ρ ( s c ) ( s c ) = 1 2 ( s c ) 2 s s D KL ( s , c ) s = c + ρ ( s c ) ( s c ) 1 2 inf ρ [ 0 , 1 ] λ m i n 2 s s D KL ( s , c ) s = c + ρ ( s c ) s c 2 2
where ρ [ 0 , 1 ] and λ m i n ( M ) is the smallest eigenvalue of the matrix M . Thus, it is enough to show that 2 s s D KL ( s , c ) s = c + ρ ( s c ) is strictly positive define for any ρ [ 0 , 1 ] . First, calculate directly,
2 s s D KL ( s , c ) = f ( y | c ) 2 s s γ ( s , y ) d y = f ( y | c ) e s 1 + s 2 ( e s 1 + e s 2 ) 2 ( e s 2 + y ) e s 1 + s 2 ( e s 1 + e s 2 ) 2 ( e s 1 y ) e s 1 + s 2 ( e s 1 + e s 2 ) 2 ( e s 1 y ) 2 ν ( s , y ) + e s 1 + s 2 ( e s 1 + e s 2 ) 2 y d y = : a 11 + b a 12 b a 21 b a 22 + b ,
where a 11 = e s 1 + 2 s 2 ( e s 1 + e s 2 ) 2 , a 12 = e 2 s 1 + s 2 ( e s 1 + e s 2 ) 2 , b = e s 1 + 2 s 2 ( e s 1 + e s 2 ) 2 E Y , and
a 22 = E 2 v ( s , Y ) = e s 2 ψ ( e s 2 ) + e s 2 ψ ( e s 2 ) + log ( 1 + e s 1 s 2 ) e s 1 e s 1 + e s 2 e s 1 e s 1 + e s 2 2 e s 2 E ψ ( Y + e s 2 ) + e s 2 E ψ ( Y + e s 2 ) .
For a 2 × 2 matrix M , it is strictly positive define if and only if tr ( M ) > 0 and det ( M ) > 0 . Denote μ = e s 1 , k = e s 2 , and μ * = e c 1 ,   k * = e c 2 are true parameters for Y. Then,
tr 2 s s D KL ( s , c ) = μ k 2 ( μ + k ) 2 + 2 μ k 2 ( μ + k ) 2 μ * k μ μ + k + μ μ + k 2 + k log ( 1 + μ / k ) + ψ ( k ) E ψ ( Y + k ) + k ψ ( k ) E ψ ( Y + k ) = 2 ( μ * 1 ) μ k 2 ( μ + k ) 2 + k log ( 1 + μ / k ) + g 1 ( k ) + k g 2 ( k ) k log ( 1 + μ / k ) + g 1 ( k ) + k g 2 ( k ) .
Now, we are going to deal with g 1 ( k ) = ψ ( k ) E ψ ( Y + k ) and g 2 ( k ) = ψ ( k ) E ψ ( Y + k ) . For ψ ( x ) ,
0 > ψ ( x ) = 1 x 2 0 t 2 φ ( t ) e t x d t 1 x 2 2 x 3 .
Therefore, ψ ( · ) is concave. Using Jensen inequality and median value theorem
g 1 ( k ) = ψ ( k ) E ψ ( Y + k ) ψ ( k ) ψ E Y + k 1 k + 1 k 2 E Y = μ * 1 k + 1 k 2 .
Similarly, for g 2 ( k ) , by using the fact that E ( 1 / Y ) = ( 1 k * ) k * μ * μ * and the assumption,
g 2 ( k ) = E ψ ( k ) ψ ( Y + k ) E Y 1 ( ξ ( Y ) + k ) 2 + 2 ( ξ ( Y ) + k ) 3 E Y ( Y + k ) 2 + 2 E Y ( Y + k ) 3 E ( Y + k ) 2 Y 1 + 2 E ( Y + k ) 3 ) Y 1 = 1 2 k + ( 1 + k 2 ( 1 k * ) k * μ * ) μ * + 2 k * 2 ( k * + μ * ) / μ * 2 + μ * 2 + 3 k μ * + 3 k 2 + k 3 ( 1 k * ) k * μ * μ * ( μ + μ * ) 1 2 k 2 + 1 k 3 .
where ξ ( Y ) lies between 0 and Y. The lower bounds for g 1 and g 2 , together with the fact that log ( 1 + x ) x x 2 / 2 for x 0 , we conclude that tr 2 s s D KL ( s , c ) > 0 . Similarly, we can also prove det 2 s s D KL ( s , c ) > 0 , so the theorem holds. □
The proof of Theorem 3.
The proof follows the idea in [22]. First, by the definition of θ ^ ,
P γ ( θ ^ ) γ ( θ * ) P γ ( θ ^ ) γ ( θ * ) + P n γ ( θ * ) + λ θ * ω , 1 P n γ ( θ ^ ) + λ θ ^ ω , 1 1 n G n γ ( θ * ) γ ( θ ^ ) + λ θ * ω , 1 θ ^ ω , 1 .
From Theorem A1, we also have
P γ ( θ ^ ) γ ( θ * ) C γ n i = 1 n Z i ( θ ^ θ * ) 2 2 = C γ n j = 1 2 X ( θ ^ ( j ) θ * ( j ) ) 2 2 .
Then, by Theorem 1 and the definition of λ ,
C γ j = 1 2 X ( θ ^ ( j ) θ * ( j ) ) 2 2 n G n γ ( θ * ) γ ( θ ^ ) + n λ θ * ω , 1 θ ^ ω , 1 M q θ * θ ^ 1 + ( 1 + 1 / a ) M q θ * ω , 1 θ ^ ω , 1 = M q j = 1 2 θ ^ ( j ) θ * ( j ) 1 + ( 1 + 1 / a ) ω j θ * ( j ) 1 θ ^ ( j ) 1
holds with probability at least 1 q , where a = ( K 1 ) / 2 . Now, let J 1 , J 2 { 1 , , p } be any sets with J j spt θ * ( j ) . It is easy to check
θ ^ ( j ) θ * ( j ) 1 + ( 1 + 1 / a ) ω j θ * ( j ) 1 θ ^ ( j ) 1 = θ ^ J j ( j ) θ * ( j ) 1 + θ ^ J j c ( j ) 1 + ( 1 + 1 / a ) ω j θ * ( j ) 1 θ ^ J j ( j ) 1 θ ^ J j c ( j ) 1 ( K / a ) θ ^ J j ( j ) θ * ( j ) 1 ( 1 / a ) θ ^ J j c ( j ) 1 .
by the fact ω j [ 0 , 1 ] . It gives that with probability at least 1 q ,
j = 1 2 X ( θ ^ ( j ) θ * ( j ) ) 2 2 M q a C γ j = 1 2 K θ ^ J j ( j ) θ * ( j ) 1 θ ^ J j c ( j ) 1 .
Let A 1 , A 2 { 1 , , p } satisfying spt θ * ( j ) A j and card ( A j ) = p 1 , and we also let B j be the union of A j and the indices of p 1 largest θ ^ ( j ) . Then, A j and B j also guarantee (A5). In addition, from Lemma 1, they also give
θ ^ B j c ( j ) 2 2 p 1 1 θ ^ A j c ( j ) 1 2 .
In addition, from the definition of A j and B j , we know that θ ^ A j c ( j ) 1 θ ^ B j c ( j ) 1 and θ ^ A j ( j ) θ * ( j ) 1 θ ^ B j ( j ) θ * ( j ) 1 .
Unlike the single lasso question, here we need to define I : = { j = 1 , 2 : K θ ^ A j ( j ) θ * ( j ) 1 θ ^ A j c ( j ) 1 } , and consider j I and j I separately. Obviously, I , or (A5) cannot be beholden. For j I , we have
K θ ^ B j ( j ) θ * ( j ) 1 θ ^ B j c ( j ) 1 K θ ^ A j ( j ) θ * ( j ) 1 θ ^ A j c ( j ) 1 0 .
Then, by the restricted eigenvalue condition,
n κ 2 θ ^ J j ( j ) θ * ( j ) 2 2 X ( θ ^ ( j ) θ * ( j ) ) 2 2
holds for J j = A j or J j = B j . Note that from (A5),
j I X ( θ ^ ( j ) θ * ( j ) ) 2 2 M q a C γ j I θ ^ A j ( j ) θ * ( j ) 1 θ ^ A j c ( j ) 1 M q a C γ j I θ ^ B j ( j ) θ * ( j ) 1 θ ^ B j c ( j ) 1 ,
then by Cauchy–Schwartz inequality,
n κ 2 j I θ ^ A j ( j ) θ * ( j ) 2 2 X ( θ ^ A j ( j ) θ * ( j ) ) 2 2 M q K a C γ j I θ ^ A j ( j ) θ * ( j ) 1 M q K p 1 a C γ j I θ ^ A j ( j ) θ * ( j ) 2 M q K 2 p 1 a C γ j I θ ^ A j ( j ) θ * ( j ) 2 2 1 / 2 .
It gives
j I θ ^ A j ( j ) θ * ( j ) 2 2 2 p 1 M q 2 K 2 a 2 κ 4 n 2 C γ 2 , j I θ ^ B j ( j ) θ * ( j ) 2 2 4 p 1 M q 2 K 2 a 2 κ 4 n 2 C γ 2 ,
where we use that fact card ( B j ) = 2 p 1 . Furthermore, because
θ ^ B j c ( j ) 2 2 j I p 1 1 θ ^ A j c ( j ) 1 2 K 2 p 1 j I θ ^ A j ( j ) θ * ( j ) 1 2 K 2 j I θ ^ A j ( j ) θ * ( j ) 2 2 ,
we can conclude that
j I θ ^ ( j ) θ * ( j ) 2 2 = j I θ ^ B j ( j ) θ * ( j ) 2 2 + θ ^ B j c ( j ) 2 2 j I θ ^ B j ( j ) θ * ( j ) 2 2 + K 2 θ ^ A j ( j ) θ * ( j ) 2 2 = 2 p 1 M q 2 ( 2 + K 2 ) K 2 a 2 κ 4 n 2 C γ 2 .
Now, we will tickle the situation that j I . For j I , K θ ^ A j ( j ) θ * ( j ) 1 < θ ^ A j c ( j ) 1 . Again from (A5), we have
j I X ( θ ^ ( j ) θ * ( j ) ) 2 2 M q K a C γ j I θ ^ A j ( j ) θ * ( j ) 1
and
0 j I θ ^ A j c ( j ) 1 K θ ^ A j ( j ) θ * ( j ) 1 K j I θ ^ A j ( j ) θ * ( j ) 1 .
Indeed, if the two inequalities above have the opposite direction, then for the first one, one can find that
j I X ( θ ^ ( j ) θ * ( j ) ) 2 2 M q a C γ j I K θ ^ A j ( j ) θ * ( j ) 1 θ ^ A j c ( j ) 1 j I θ ^ A j c ( j ) 1 < 0 ,
and
j = 1 2 X ( θ ^ ( j ) θ * ( j ) ) 2 2 M q a C γ j I θ ^ A j c ( j ) 1 < 0 .
Once again, by Cauchy–Schwartz inequality,
j I θ ^ A j ( j ) θ * ( j ) 1 p 1 j I θ ^ A j ( j ) θ * ( j ) 2 2 p 1 j I θ ^ A j ( j ) θ * ( j ) 2 2 1 / 2 2 p 1 M q K a κ 2 n C γ .
Denote Δ j : = θ ^ A j c ( j ) 1 K θ ^ A j ( j ) θ * ( j ) 1 . Then, for j J , Δ j > 0 , and
j I Δ j K j I θ ^ A j ( j ) θ * ( j ) 1 2 p 1 M q K 2 a κ 2 n C γ .
For any j I , define
θ ˜ ( j ) = θ ^ ( j ) + Δ j p 1 K k A j sgn θ ^ k ( j ) θ k * ( j ) e k .
Then, for k A j ,
θ ˜ k ( j ) θ k * ( j ) | = | θ ^ k ( j ) θ k * ( j ) | + Δ j p 1 K ,
while for k I , θ ˜ k ( j ) = θ ^ k ( j ) . Therefore,
K θ ˜ A j ( j ) θ * ( j ) 1 = K θ ^ A j ( j ) θ * ( j ) 1 + k A j Δ j p 1 K = θ ^ A j c ( j ) 1 = θ ˜ A j c ( j ) 1 ,
and consequently θ ˜ B j c ( j ) 1 K θ ˜ B j ( j ) θ * ( j ) 1 . Once again, by the restricted eigenvalue condition,
X ( θ ˜ ( j ) θ * ( j ) ) 2 2 n κ 2 θ ˜ B j ( j ) θ * ( j ) 2 2 n κ 2 θ ˜ A j ( j ) θ * ( j ) 2 2 .
On the other hand, note that for any s , t R m inequality s + t 2 2 2 ( s 2 2 + t 2 2 ) and s 2 s 1 m s 2 hold, we conclude
j I X ( θ ˜ ( j ) θ * ( j ) ) 2 2 2 j I X ( θ ^ ( j ) θ * ( j ) ) 2 2 + X ( θ ^ ( j ) θ ˜ ( j ) ) 2 2 2 M q K a C γ j I θ ^ A j ( j ) θ * ( j ) 1 + 2 j I X ( θ ^ ( j ) θ ˜ ( j ) ) 2 2 4 p 1 M q 2 K 2 n a 2 κ 2 C γ 2 + 2 j I X ( θ ^ ( j ) θ ˜ ( j ) ) 2 2 .
Next, we will use the definition of the p 1 -restricted isometry constant σ X , l 2 . Because spt θ ˜ ( j ) θ ^ ( j ) card ( A j ) = p 1 , then
j I X ( θ ^ ( j ) θ ˜ ( j ) ) 2 2 σ X , p 1 2 j I θ ^ ( j ) θ ˜ ( j ) 2 2 = σ X , p 1 2 j I k A j Δ j p 1 K 2 = σ X , p 1 2 p 1 K 2 j I Δ j 2 σ X , p 1 2 p 1 K 2 j I Δ j 2 4 p 1 σ X , p 1 2 K 2 a 2 κ 4 n 2 C γ 2 .
The above inequality together with (A7) and (A8) gives
j I θ ˜ A j ( j ) θ * ( j ) 2 2 j I θ ˜ B j ( j ) θ * ( j ) 2 2 4 p 1 ( n κ 2 + 2 σ X , p 1 2 ) M q 2 K 2 a 2 C γ 2 n 3 κ 6 .
Finally, because
θ ˜ B j c ( j ) 2 2 θ ˜ B j c ( j ) 1 2 K 2 θ ˜ B j ( j ) θ * ( j ) 1 2 2 p 1 K 2 θ ˜ B j ( j ) θ * ( j ) 2 2 ,
we obtain that
j I θ ^ ( j ) θ * ( j ) 2 2 j I θ ˜ ( j ) θ * ( j ) 2 2 = j I θ ˜ B j ( j ) θ * ( j ) 2 2 + θ ˜ B j c ( j ) 2 2 ( 1 + 2 p 1 K ) j I θ ˜ B j ( j ) θ * ( j ) 2 2 4 p 1 ( 1 + 2 p 1 K ) ( n κ 2 + 2 σ X , p 1 2 ) M q 2 K 2 a 2 C γ 2 n 3 κ 6 .
Combining (A6) and (A9), it is easy to see what remains. □

References

1. Dai, H.; Bao, Y.; Bao, M. Maximum likelihood estimate for the dispersion parameter of the negative binomial distribution. Stat. Probab. Lett. 2013, 83, 21–27.
2. Allison, P.D.; Waterman, R.P. Fixed–effects negative binomial regression models. Sociol. Methodol. 2002, 32, 247–265.
3. Hilbe, J.M. Negative Binomial Regression; Cambridge University Press: Cambridge, UK, 2011.
4. Weißbach, R.; Radloff, L. Consistency for the negative binomial regression with fixed covariate. Metrika 2020, 83, 627–641.
5. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 1996, 58, 267–288.
6. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360.
7. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320.
8. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429.
9. Qiu, Y.; Chen, S.X.; Nettleton, D. Detecting rare and faint signals via thresholding maximum likelihood estimators. Ann. Stat. 2018, 46, 895–923.
10. Xie, F.; Xiao, Z. Consistency of l1 penalized negative binomial regressions. Stat. Probab. Lett. 2020, 165, 108816.
11. Li, Y.; Rahman, T.; Ma, T.; Tang, L.; Tseng, G.C. A sparse negative binomial mixture model for clustering RNA-seq count data. Biostatistics 2021, kxab025.
12. Jankowiak, M. Fast Bayesian Variable Selection in Binomial and Negative Binomial Regression. arXiv 2021, arXiv:2106.14981.
13. Lisawadi, S.; Ahmed, S.; Reangsephet, O. Post estimation and prediction strategies in negative binomial regression model. Int. J. Model. Simul. 2021, 41, 463–477.
14. Zhang, H.; Jia, J. Elastic-net Regularized High-dimensional Negative Binomial Regression: Consistency and Weak Signals Detection. Stat. Sin. 2022, 32, 181–207.
15. Xu, D.; Zhang, Z.; Wu, L. Variable selection in high-dimensional double generalized linear models. Stat. Pap. 2014, 55, 327–347.
16. Yee, T.W. Vector Generalized Linear and Additive Models: With an Implementation in R; Springer: Berlin/Heidelberg, Germany, 2015.
17. Nguelifack, B.M.; Kemajou-Brown, I. Robust rank-based variable selection in double generalized linear models with diverging number of parameters under adaptive Lasso. J. Stat. Comput. Simul. 2019, 89, 2051–2072.
18. Cavalaro, L.L.; Pereira, G.H. A procedure for variable selection in double generalized linear models. J. Stat. Comput. Simul. 2022, 1–18.
19. Wang, Z.; Ma, S.; Zappitelli, M.; Parikh, C.; Wang, C.Y.; Devarajan, P. Penalized count data regression with application to hospital stay after pediatric cardiac surgery. Stat. Methods Med. Res. 2016, 25, 2685–2703.
20. Huang, H.; Zhang, H.; Li, B. Weighted Lasso estimates for sparse logistic regression: Non-asymptotic properties with measurement errors. Acta Math. Sci. 2021, 41, 207–230.
21. Adamczak, R. A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electron. J. Probab. 2008, 13, 1000–1034.
22. Bickel, P.J.; Ritov, Y.; Tsybakov, A.B. Simultaneous analysis of Lasso and Dantzig selector. Ann. Stat. 2009, 37, 1705–1732.
23. Candes, E.; Tao, T. The Dantzig selector: Statistical estimation when p is much larger than n. Ann. Stat. 2007, 35, 2313–2351.
24. Riphahn, R.T.; Wambach, A.; Million, A. Incentive effects in the demand for health care: A bivariate panel count data estimation. J. Appl. Econom. 2003, 18, 387–405.
25. Yang, X.; Song, S.; Zhang, H. Law of iterated logarithm and model selection consistency for generalized linear models with independent and dependent responses. Front. Math. China 2021, 16, 825–856.
26. Shi, C.; Song, R.; Chen, Z.; Li, R. Linear hypothesis testing for high dimensional generalized linear models. Ann. Stat. 2019, 47, 2671.
27. Xie, F.; Lederer, J. Aggregating Knockoffs for False Discovery Rate Control with an Application to Gut Microbiome Data. Entropy 2021, 23, 230.
28. Cui, C.; Jia, J.; Xiao, Y.; Zhang, H. Directional FDR Control for Sub-Gaussian Sparse GLMs. arXiv 2021, arXiv:2105.00393.
29. Bateman, H. Higher Transcendental Functions [Volumes i–iii]; McGraw-Hill Book Company: New York, NY, USA, 1953; Volume 1.
30. Alzer, H. On some inequalities for the gamma and psi functions. Math. Comput. 1997, 66, 373–389.
31. Zhang, H.; Chen, S.X. Concentration inequalities for statistical inference. Commun. Math. Res. 2021, 37, 1–85.
32. Moriguchi, S.; Murota, K.; Tamura, A.; Tardella, F. Discrete midpoint convexity. Math. Oper. Res. 2020, 45, 99–128.
33. Sen, B. A Gentle Introduction to Empirical Process Theory and Applications; Columbia University: New York, NY, USA, 2018.
34. Chi, Z. Stochastic Lipschitz continuity for high dimensional Lasso with multiple linear covariate structures or hidden linear covariates. arXiv 2010, arXiv:1011.1384.
35. Ledoux, M.; Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
36. Massart, P. Some applications of concentration inequalities to statistics. Ann. Fac. Sci. Toulouse Math. 2000, 9, 245–303.
37. Xiao, Y.; Yan, T.; Zhang, H.; Zhang, Y. Oracle inequalities for weighted group lasso in high-dimensional misspecified Cox models. J. Inequalities Appl. 2020, 2020, 1–33.
38. Abramovich, F.; Grinshtein, V. Model selection and minimax estimation in generalized linear models. IEEE Trans. Inf. Theory 2016, 62, 3721–3730.
Table 1. The average squared estimation errors of the estimators.
(Columns: n; then $\hat{\theta}^{(1)*}$, $\hat{\theta}^{(1)}$, $\hat{\theta}^{(2)}$ for ρ = 0; then $\hat{\theta}^{(1)*}$, $\hat{\theta}^{(1)}$, $\hat{\theta}^{(2)}$ for ρ = 0.5.)
n        ρ = 0                               ρ = 0.5
100      0.1597    0.0335    0.72414         0.1809    0.0397    0.68904
200      0.0862    0.01      0.22149         0.0837    0.0169    0.33048
400      0.05      0.0047    0.08847         0.0619    0.0067    0.15066
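As a reading aid (not stated explicitly in the table), the reported quantity is naturally interpreted as the Monte Carlo average of the squared $\ell_2$ estimation error; writing $R$ for the number of simulation replications and $\hat{\theta}^{(j)}_{[r]}$ for the estimate in replication $r$ (both symbols are our notation here, not the paper's), this reads
\[
\mathrm{ASE}\big(\hat{\theta}^{(j)}\big) = \frac{1}{R}\sum_{r=1}^{R}\big\|\hat{\theta}^{(j)}_{[r]}-\theta^{*(j)}\big\|_2^2 .
\]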
Table 2. The results of variable selection.
(Columns: n, p; then $\theta_1^{(1)}$, $\theta_2^{(1)}$, $\theta_3^{(1)}$, Other $\theta^{(1)}$s for the previous method, which fits $\mu(x)$ only; then $\theta_1^{(1)}$, $\theta_2^{(1)}$, $\theta_3^{(1)}$, Other $\theta^{(1)}$s for the $\mu(x)$ part of the proposed method; then $\theta_1^{(2)}$, $\theta_2^{(2)}$, $\theta_3^{(2)}$, Other $\theta^{(2)}$s for the $k(x)$ part of the proposed method.)
ρ = 0
n     p     Previous μ(x)               Proposed μ(x)               Proposed k(x)
100   25    173  198  171  2.33         192  200  190  0.37         180  184  180  0.32
      50    164  197  147  2.885        196  200  193  0.52         182  180  188  0.41
      150   136  182  111  2.725        194  194  192  1.02         188  182  186  0.41
200   50    196  200  192  1.435        200  200  200  0.59         200  190  198  0.53
      100   193  200  193  2.05         200  200  200  0.91         196  186  196  0.69
      250   162  198  155  1.5          199  199  198  1.18         198  198  198  0.69
400   100   200  200  200  0.605        200  200  200  0.4          200  198  200  0.55
      200   200  200  199  0.88         200  200  200  0.6          200  200  200  0.51
      500   197  200  198  1.29         200  200  200  1.21         200  200  200  0.61
ρ = 0.5
n     p     Previous μ(x)               Proposed μ(x)               Proposed k(x)
100   25    183  199  179  2.3          194  198  194  0.41         179  184  180  0.35
      50    172  197  150  2.66         196  196  190  0.63         178  182  180  0.42
      150   134  191  99   2.32         194  196  192  1.01         180  184  182  0.43
200   50    195  200  197  1.48         200  200  198  0.38         196  183  190  0.32
      100   189  200  179  1.52         199  200  198  0.53         194  186  194  0.44
      250   178  200  154  1.39         196  198  196  1.1          196  196  194  0.55
400   100   200  200  200  0.435        200  200  200  0.28         200  199  194  0.34
      200   200  200  199  0.675        200  200  198  0.47         200  198  196  0.36
      500   199  200  194  1.12         200  200  198  1.07         200  198  196  0.56
Table 3. The variable selection results and the fitting errors (FE) of NBR and HNBR models. The variable Others = {Married, Haupts, Reals, Fachhs, Abitur, Univ, Working, Bluec, Whitec, Self, Beamt, Public, Addon}. Because these variables are not selected in any year, we put them in "Others" for brevity.
(For each year, the three columns after the variable name are: the NBR estimate, the HNBR estimate in μ(x), and the HNBR estimate in k(x); the FE row reports the fitting errors of NBR and HNBR, respectively.)
Variables   1984                    1985                    1986                    1987
Female      0       0       0       0       0       0       0       0       0       0       0       0
Age         −0.013  −0.013  −0.012  −0.009  −0.01   −0.007  −0.006  −0.006  −0.013  −0.002  −0.001  −0.018
Hsat        −0.205  −0.2    −0.025  −0.244  −0.237  0       −0.188  −0.195  −0.045  −0.158  −0.153  −0.043
Handdum     0       0       0       0       0       0       0       0       0       0       0       0
Handper     0.005   0.005   0.004   0.007   0.006   0.007   0.007   0.007   0       0.007   0.007   0.01
Hhninc      0       0       0       0       0       0       0       0       0       0       0       0
Hhkids      0       0       0       0       0       0       0       0       0       0       0       0
Educ        0       0       −0.027  0       0       −0.064  −0.035  −0.038  0       −0.095  −0.106  −0.003
Others      0       0       0       0       0       0       0       0       0       0       0       0
FE          0.798   0.602           2.203   1.874           0.735   0.581           1.314   1.027

Variables   1988                    1991                    1994
Female      0       0       0       0       0       0       0       0       0
Age         −0.015  −0.014  −0.012  −0.022  −0.019  −0.003  −0.005  −0.004  −0.011
Hsat        −0.191  −0.187  −0.015  −0.112  −0.132  −0.049  −0.226  −0.224  −0.06
Handdum     0       0       0       0       0       0       0       0       0
Handper     0.011   0.009   0.006   0.014   0.013   0       0.007   0.008   0.004
Hhninc      0       0       0       0       0       0       0       0       0
Hhkids      0       0       0       0       0       0       0       0       0
Educ        −0.016  −0.023  −0.002  −0.074  −0.068  0       −0.064  −0.069  0
Others      0       0       0       0       0       0       0       0       0
FE          1.144   0.912           1.007   0.787           0.713   0.58
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
