Article

Explicit Gaussian Variational Approximation for the Poisson Lognormal Mixed Model

Xiaoping Shi 1, Xiang-Sheng Wang 2 and Augustine Wong 3
1 Department of Computer Science, Mathematics, Physics and Statistics, University of British Columbia, Kelowna, BC V1V 1V7, Canada
2 Department of Mathematics, University of Louisiana at Lafayette, Lafayette, LA 70503, USA
3 Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(23), 4542; https://doi.org/10.3390/math10234542
Submission received: 14 October 2022 / Revised: 18 November 2022 / Accepted: 25 November 2022 / Published: 1 December 2022

Abstract

In recent years, the Poisson lognormal mixed model has been frequently used in modeling count data because it can accommodate both the over-dispersion of the data and the existence of within-subject correlation. Since the likelihood function of this model is expressed in terms of an intractable integral, estimating the parameters and obtaining inference for the parameters are challenging problems. Some approximation procedures have been proposed in the literature; however, they are computationally intensive. Moreover, the existing studies of approximate parameter inference using the Gaussian variational approximation method are usually restricted to models with only one predictor. In this paper, we consider the Poisson lognormal mixed model with more than one predictor. By extending the Gaussian variational approximation method, we derive explicit forms for the estimators of the parameters and examine their properties, including the asymptotic distributions of the estimators of the parameters. Accurate inference for the parameters is also obtained. A real-life example demonstrates the applicability of the proposed method, and simulation studies illustrate the accuracy of the proposed method.

1. Introduction

The variational method of approximating intractable computations has its roots in the calculus of variations and is reviewed in Stephenson [1]. Ormerod and Wang [2] gave an overview of how the variational approximations facilitate approximate inference for the parameters in complex statistical models, and provide a fast and deterministic alternative to the Monte Carlo method. Moreover, Teschendorff et al. [3] and Flandin and Penny [4] applied the variational method to model gene-expression data and fMRI data, respectively. Hall et al. [5] and Hall et al. [6] applied the variational method to approximate the likelihood function for a Poisson linear mixed model with only one predictor and derived the maximum likelihood estimators and their asymptotic distributions.
The main idea of the variational method for approximating the density of a statistic from a complex model is to find a density from a pre-determined family of distributions such that the approximating density and the target density have the smallest Kullback–Leibler divergence. Mathematically, let $q(x)$ be the target density and $p(x)$ be a specific density from a pre-determined family of distributions $\mathcal{G}$. Then, the variational approximation of $q(x)$ is
$$p^*(x) = \operatorname*{arg\,min}_{p(x)\in\mathcal{G}} \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,dx.$$
When $\mathcal{G}$ is the Gaussian family, the variational approximation is referred to as the Gaussian variational approximation (GVA).
Recent advances in GVA include links to Bayesian posterior inference in [7,8] and Bayesian Gaussian graphical model selection [9]; see also other applications to particle inference [10], GVA with a factorial covariance structure [11], and variational approximation in a Poisson-lognormal model [12].
In this paper, we consider the Poisson lognormal mixed model studied in [5] but with more than one predictor in the model. The GVA method is applied to obtain the maximum likelihood estimators of the parameters in the model. The advantages of the proposed method over the method discussed in [5] are:
  • The method discussed in [5] can only be applied to the Poisson lognormal mixed model with only one predictor, whereas the proposed method allows the model to have more than one predictor.
  • Although the explicit closed forms of the estimators are not available, the structure of the set of estimating equations allows us to easily establish the consistency property of the estimators and derive the limiting distributions of the estimators.
The rest of the paper is organized as follows. Section 2 examines a special member of the exponential family. For this specific distribution, it is shown that the first three moments can be obtained directly from the normalizing constant. Although the explicit form of the normalizing constant is unavailable, it can be approximated by the GVA method. Section 3 reviews the Poisson lognormal mixed model and shows that the density of the model can be expressed in the form of the specific exponential model introduced in Section 2.
Thus, an explicit approximation to the likelihood function of the Poisson lognormal mixed model becomes available, and the maximum likelihood estimators and their limiting distributions are derived. A real-life example is presented in Section 4 to demonstrate the application of the proposed method. Moreover, results from simulation studies are also presented in Section 4 to illustrate the accuracy of the proposed method. Finally, we give our concluding remarks and discussion in Section 5.

2. A Special Member of the Exponential Family Model

Consider a special exponential family model with density
$$f(x;\alpha,\beta,\sigma^2) = C^{-1}(\alpha,\beta,\sigma^2)\exp\left\{\alpha x - \beta e^{x} - \frac{x^2}{2\sigma^2}\right\}, \quad -\infty < x < \infty,$$
where $\alpha$, $\beta$, and $\sigma$ are non-negative real numbers, and
$$C(\alpha,\beta,\sigma^2) = \int_{-\infty}^{\infty}\exp\left\{\alpha x - \beta e^{x} - \frac{x^2}{2\sigma^2}\right\}dx$$
is the normalizing constant. Then,
$$E(X) = \frac{1}{C(\alpha,\beta,\sigma^2)}\int_{-\infty}^{\infty} x\exp\left\{\alpha x - \beta e^{x} - \frac{x^2}{2\sigma^2}\right\}dx = \frac{\partial\log C(\alpha,\beta,\sigma^2)}{\partial\alpha},$$
and
$$E(X^2) = \frac{1}{C(\alpha,\beta,\sigma^2)}\int_{-\infty}^{\infty} x^2\exp\left\{\alpha x - \beta e^{x} - \frac{x^2}{2\sigma^2}\right\}dx = -\frac{\partial\log C(\alpha,\beta,\sigma^2)}{\partial(2\sigma^2)^{-1}}.$$
The variance of $X$ can be obtained by $\mathrm{Var}(X) = E(X^2) - \{E(X)\}^2$. Following the same argument as above, it can be shown that
$$E(X^3) = -\frac{\partial^2\log C(\alpha,\beta,\sigma^2)}{\partial\alpha\,\partial(2\sigma^2)^{-1}} + E(X)E(X^2) \quad\text{and}\quad E(e^X) = -\frac{\partial\log C(\alpha,\beta,\sigma^2)}{\partial\beta}.$$
Note that, as $\beta \to 0$, Model (1) converges to the normal model with mean $\alpha\sigma^2$ and variance $\sigma^2$. Furthermore, as $\sigma^2 \to \infty$, Model (1) converges to the log-gamma model with shape $\alpha$ and rate $\beta$.
Although Model (1) is not commonly discussed in the statistics literature, it occurs naturally in information theory. In particular, relative entropy, which is generally known as the Kullback–Leibler (KL) divergence, is a measure of the closeness of two probability distributions $P$ and $Q$. If both $P$ and $Q$ are continuous distributions, the KL divergence is defined as
$$\mathrm{KL}(p,q) = \int_{-\infty}^{\infty} p(x)\log\frac{p(x)}{q(x)}\,dx,$$
where $p(x)$ and $q(x)$ are the density functions of $P$ and $Q$, respectively. Note that, by Jensen's inequality, $\mathrm{KL}(p,q)$ is always non-negative.
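To make the definition concrete, here is a small numerical illustration (not from the paper) that evaluates the KL divergence between two normal densities by quadrature; the chosen densities and the integration limits are arbitrary.

```r
# KL(p, q) computed by numerical integration over a range that covers both densities
kl_div <- function(p, q, lower, upper) {
  integrate(function(x) p(x) * log(p(x) / q(x)), lower, upper)$value
}
# Example: P = N(0, 1), Q = N(1, 2^2); the analytic value is log(2) + 2/8 - 1/2 = 0.443
kl_div(function(x) dnorm(x, 0, 1), function(x) dnorm(x, 1, 2), -15, 15)
```

As expected, the computed value is strictly positive, in line with Jensen's inequality.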
The variational method is to find $g \in \mathcal{G}$ that is closest to Model (1) in the sense of having the smallest KL divergence. When $\mathcal{G}$ is restricted to the family of normal distributions with mean $\mu_U$ and variance $\sigma_U^2$, we have
$$\mathcal{G} = \left\{ g(x;\mu_U,\sigma_U^2) = \frac{1}{\sqrt{2\pi\sigma_U^2}}\exp\left\{-\frac{(x-\mu_U)^2}{2\sigma_U^2}\right\} : -\infty<\mu_U<\infty,\ \sigma_U^2>0 \right\},$$
where
$$\int_{-\infty}^{\infty} g(x;\mu_U,\sigma_U^2)\,dx = 1.$$
Mathematically, the variational method gives
$$\begin{aligned}
\min_{g(x)\in\mathcal{G}} \mathrm{KL}(g,f) &= \min_{(\mu_U,\sigma_U^2)} \mathrm{KL}\bigl(g(x;\mu_U,\sigma_U^2),\, f(x;\alpha,\beta,\sigma^2)\bigr) \\
&= \min_{(\mu_U,\sigma_U^2)} \int_{-\infty}^{\infty} g(x;\mu_U,\sigma_U^2)\log\frac{g(x;\mu_U,\sigma_U^2)}{f(x;\alpha,\beta,\sigma^2)}\,dx \\
&= \min_{(\mu_U,\sigma_U^2)}\left[\log C(\alpha,\beta,\sigma^2) - \int_{-\infty}^{\infty} g(x;\mu_U,\sigma_U^2)\log\frac{f(x;\alpha,\beta,\sigma^2)\,C(\alpha,\beta,\sigma^2)}{g(x;\mu_U,\sigma_U^2)}\,dx\right].
\end{aligned}$$
Since the KL divergence is always non-negative, the variational method gives
$$\operatorname*{arg\,max}_{(\mu_U,\sigma_U^2)} E_U\left[\log\frac{f(U;\alpha,\beta,\sigma^2)\,C(\alpha,\beta,\sigma^2)}{g(U;\mu_U,\sigma_U^2)}\right] = \operatorname*{arg\,max}_{(\mu_U,\sigma_U^2)}\left[\alpha\mu_U - \beta e^{\mu_U+\sigma_U^2/2} - \frac{\sigma_U^2+\mu_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma_U^2)\right],$$
where $U \sim N(\mu_U,\sigma_U^2)$. Note that Equation (3) indicates that minimizing the KL divergence is the same as minimizing the gap between $\log C(\alpha,\beta,\sigma^2)$ and its lower bounds obtained from Jensen's inequality. Moreover, Equation (4) shows that (3) is equivalent to maximizing an expectation with respect to $\mu_U$ and $\sigma_U^2$.
Since the main idea of GVA is to approximate the log of an intractable integral by the maximum of all possible lower bounds derived from Jensen's inequality, and all the possible lower bounds of the log of the normalizing constant of Model (1) can be determined by Jensen's inequality, with $U \sim N(\mu_U,\sigma_U^2)$ we have
$$\begin{aligned}
\log C(\alpha,\beta,\sigma^2) &= \log\int_{-\infty}^{\infty}\exp\left\{\alpha u - \beta e^{u} - \frac{u^2}{2\sigma^2}\right\}\frac{\exp\{-(u-\mu_U)^2/(2\sigma_U^2)\}\big/\sqrt{2\pi\sigma_U^2}}{\exp\{-(u-\mu_U)^2/(2\sigma_U^2)\}\big/\sqrt{2\pi\sigma_U^2}}\,du \\
&= \log E_U\left[\exp\left\{\alpha U - \beta e^{U} - \frac{U^2}{2\sigma^2} + \frac{(U-\mu_U)^2}{2\sigma_U^2}\right\}\sqrt{2\pi\sigma_U^2}\right] \\
&\ge E_U\left[\log\left(\exp\left\{\alpha U - \beta e^{U} - \frac{U^2}{2\sigma^2} + \frac{(U-\mu_U)^2}{2\sigma_U^2}\right\}\sqrt{2\pi\sigma_U^2}\right)\right] \\
&= \alpha\mu_U - \beta e^{\mu_U+\sigma_U^2/2} - \frac{\sigma_U^2+\mu_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma_U^2),
\end{aligned}$$
where $-\infty < \mu_U < \infty$ and $0 < \sigma_U^2 < \infty$. The Jensen gap between $\log C(\alpha,\beta,\sigma^2)$ and its lower bounds can be further narrowed by maximizing the lower bounds over all values of $\mu_U$ and $\sigma_U^2$. Finally, we have the following theorem:
Theorem 1.
The log of the normalizing constant $C(\alpha,\beta,\sigma^2)$ approximated by GVA is
$$\underline{\log C(\alpha,\beta,\sigma^2)} = \max_{\mu_U,\sigma_U^2}\left[\alpha\mu_U - \beta e^{\mu_U+\sigma_U^2/2} - \frac{\sigma_U^2+\mu_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma_U^2)\right] = \alpha\tilde{\mu}_U - \beta e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} - \frac{\tilde{\sigma}_U^2+\tilde{\mu}_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\tilde{\sigma}_U^2) + O(\alpha^{-2}),$$
where $\tilde{\mu}_U$ and $\tilde{\sigma}_U^2$ satisfy
$$\alpha - \beta e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} - \frac{\tilde{\mu}_U}{\sigma^2} = 0 \quad\text{and}\quad -\beta e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} - \frac{1}{\sigma^2} + \frac{1}{\tilde{\sigma}_U^2} = 0.$$
In particular, for large $\alpha$ and $\beta$ with $s = \log(\alpha/\beta) = O(1)$,
$$\underline{\log C(\alpha,\beta,\sigma^2)} = \alpha s - \alpha - \frac{1}{2}\log\alpha + \frac{1}{2}\log(2\pi) - \frac{s^2}{2\sigma^2} + \frac{s\sigma^2 - \sigma^2 + s^2 + (\sigma^4/6)}{2\alpha\sigma^4} + O(\alpha^{-2}).$$
It is important to note that the drawback of Equation (5) is that a numerical procedure is needed to solve the nonlinear system for the estimated mean and variance. In contrast, Equation (6) has an explicit form with second-order accuracy. A comparison of these two equations is given in Figure 1. The inference for the Poisson lognormal mixed model will be provided in Section 3.
Proof of Theorem 1.
Let
$$\underline{\ell}(\mu_U,\sigma_U^2) = \alpha\mu_U - \beta e^{\mu_U+\sigma_U^2/2} - \frac{\sigma_U^2+\mu_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\sigma_U^2).$$
If $\tilde{\mu}_U$ and $\tilde{\sigma}_U^2$ maximize $\underline{\ell}(\mu_U,\sigma_U^2)$, then we have
$$\frac{\partial\underline{\ell}(\mu_U,\sigma_U^2)}{\partial\mu_U}\bigg|_{(\tilde{\mu}_U,\tilde{\sigma}_U^2)} = 0 \quad\text{and}\quad \frac{\partial\underline{\ell}(\mu_U,\sigma_U^2)}{\partial\sigma_U^2}\bigg|_{(\tilde{\mu}_U,\tilde{\sigma}_U^2)} = 0.$$
Hence, $\tilde{\mu}_U$ and $\tilde{\sigma}_U^2$ satisfy the equations
$$\alpha - \beta e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} - \frac{\tilde{\mu}_U}{\sigma^2} = 0,$$
$$-\beta e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} - \frac{1}{\sigma^2} + \frac{1}{\tilde{\sigma}_U^2} = 0.$$
Therefore, (7) gives $e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2} = \frac{\alpha}{\beta} - \frac{\tilde{\mu}_U}{\beta\sigma^2}$ and $\tilde{\mu}_U = O(1)$. Furthermore,
$$\tilde{\mu}_U = -\tilde{\sigma}_U^2/2 + \log\left(\frac{\alpha}{\beta} - \frac{\tilde{\mu}_U}{\beta\sigma^2}\right) = -\tilde{\sigma}_U^2/2 + \log\frac{\alpha}{\beta} - \frac{\tilde{\mu}_U}{\alpha\sigma^2} - \frac{\tilde{\mu}_U^2}{2\alpha^2\sigma^4} + O(\alpha^{-3}).$$
Substituting $e^{\tilde{\mu}_U+\tilde{\sigma}_U^2/2}$ obtained from (7) into (8) and solving it for $\tilde{\sigma}_U^2$, we have
$$\frac{1}{\tilde{\sigma}_U^2} = \alpha + \frac{1}{\sigma^2} - \frac{\tilde{\mu}_U}{\sigma^2},$$
$$\tilde{\sigma}_U^2 = \frac{1}{\alpha} + \frac{\tilde{\mu}_U-1}{\alpha^2\sigma^2} + O(\alpha^{-3}),$$
$$\log\tilde{\sigma}_U^2 = -\log\alpha + \frac{\tilde{\mu}_U-1}{\alpha\sigma^2} + O(\alpha^{-2}).$$
By substituting $\tilde{\sigma}_U^2$ in (11) back into (9), we have
$$\tilde{\mu}_U = \log\frac{\alpha}{\beta} - \frac{\sigma^2+2\log(\alpha/\beta)}{2\alpha\sigma^2} + O(\alpha^{-2}).$$
Finally, replacing $\tilde{\mu}_U$ and $\tilde{\sigma}_U^2$ in $\underline{\log C(\alpha,\beta,\sigma^2)}$ by (9) and (11), respectively, we have
$$\begin{aligned}
\underline{\log C(\alpha,\beta,\sigma^2)} &= -\frac{\alpha\tilde{\sigma}_U^2}{2} + \alpha\log\frac{\alpha}{\beta} - \alpha - \frac{\tilde{\mu}_U^2}{2\alpha\sigma^4} - \frac{\tilde{\sigma}_U^2+\tilde{\mu}_U^2}{2\sigma^2} + \frac{1}{2} + \frac{1}{2}\log(2\pi\tilde{\sigma}_U^2) + O(\alpha^{-2}) \\
&= -\frac{1}{2} - \frac{\tilde{\mu}_U-1}{2\alpha\sigma^2} + \alpha\log\frac{\alpha}{\beta} - \alpha - \frac{\tilde{\sigma}_U^2+\tilde{\mu}_U^2}{2\sigma^2} - \frac{\tilde{\mu}_U^2}{2\alpha\sigma^4} + \frac{1}{2} + \frac{1}{2}\log(2\pi\tilde{\sigma}_U^2) + O(\alpha^{-2}).
\end{aligned}$$
With $\alpha$ and $\beta$ large and $s = \log(\alpha/\beta) = O(1)$, by substituting $\tilde{\mu}_U$ with (13) and $\tilde{\sigma}_U^2$ with (12) into (14), we have
$$\underline{\log C(\alpha,\beta,\sigma^2)} = \alpha s - \alpha - \frac{1}{2}\log\alpha + \frac{1}{2}\log(2\pi) - \frac{s^2}{2\sigma^2} + \frac{s\sigma^2 - \sigma^2 + s^2 + (\sigma^4/6)}{2\alpha\sigma^4} + O(\alpha^{-2}).$$
To demonstrate the difference in the numerical accuracy of the GVA, we consider the density $f(x;\alpha,\beta,\sigma^2)$ as stated in (1) with $\alpha = 5$, $\beta = 5$, and $\sigma = 0.5, 1, \ldots, 10$. The exact $C(\alpha,\beta,\sigma^2)$ is obtained by numerical integration. The plot of the result is given in Figure 1.
From Figure 1, we observe that the GVA by (5) is accurate for small $\sigma$, while the GVA by (6) gives a good approximation for large $\sigma$. Overall, Equation (6), which is explicit, provides a better approximation than the numerical approximation required by Equation (5).
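The comparison in Figure 1 can be sketched in a few lines of R. The code below is a minimal illustration (not the authors' implementation): it computes the exact $\log C(\alpha,\beta,\sigma^2)$ by numerical integration, the bound (5) by numerically maximizing the Jensen lower bound, and the explicit formula (6); the starting values for the optimizer are ad hoc choices.

```r
log_C_exact <- function(alpha, beta, sigma2) {
  log(integrate(function(x) exp(alpha * x - beta * exp(x) - x^2 / (2 * sigma2)),
                lower = -Inf, upper = Inf)$value)
}

# Jensen lower bound as a function of (mu_U, log sigma_U^2)
gva_bound <- function(par, alpha, beta, sigma2) {
  mu <- par[1]; s2 <- exp(par[2])
  alpha * mu - beta * exp(mu + s2 / 2) - (s2 + mu^2) / (2 * sigma2) +
    0.5 + 0.5 * log(2 * pi * s2)
}

# Approximation (5): maximize the lower bound numerically
log_C_gva5 <- function(alpha, beta, sigma2) {
  optim(c(log(alpha / beta), -log(alpha)), gva_bound, alpha = alpha, beta = beta,
        sigma2 = sigma2, control = list(fnscale = -1))$value
}

# Approximation (6): explicit second-order formula
log_C_gva6 <- function(alpha, beta, sigma2) {
  s <- log(alpha / beta)
  alpha * s - alpha - 0.5 * log(alpha) + 0.5 * log(2 * pi) - s^2 / (2 * sigma2) +
    (s * sigma2 - sigma2 + s^2 + sigma2^2 / 6) / (2 * alpha * sigma2^2)
}

sigma <- seq(0.5, 10, by = 0.5)
cbind(exact = sapply(sigma^2, log_C_exact, alpha = 5, beta = 5),
      gva5  = sapply(sigma^2, log_C_gva5,  alpha = 5, beta = 5),
      gva6  = sapply(sigma^2, log_C_gva6,  alpha = 5, beta = 5))
```

Plotting the three columns against $\sigma$ reproduces the qualitative behavior shown in Figure 1.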

3. GVA for Poisson Lognormal Mixed Model

In this section, we apply the results from Theorem 1 to obtain the estimates of the parameters in a Poisson lognormal mixed model. The estimates are simpler to obtain than the ones in Hall et al. [5,6].
Let $Y_{it}$, which takes on non-negative integer values, be the number of occurrences of an event for subject $i$ at time $t$, where $i = 1,\ldots,m$ and $t = 1,\ldots,n$. Moreover, let $\mathbf{X}_{it} = (X_{it}^1, X_{it}^2, \ldots, X_{it}^p)$, where $X_{it}^j$ is the $j$th power of the covariate $X_{it}$ for subject $i$ at time $t$. Then, the conditional Poisson model is
$$Y_{it} \mid \mathbf{X}_{it} \sim \mathrm{Poisson}(\lambda_{it}),$$
where $\lambda_{it} = \exp\left\{\sum_{j=0}^{p}\beta_jX_{it}^j\right\}$ is the mean of the conditional Poisson model, $X_{it}^0 \equiv 1$, and $\beta_0, \beta_1, \ldots, \beta_p$ are the regression coefficients. An unobservable latent random variable $U_i$ for the $i$th subject is introduced into Model (16) to capture the within-subject correlation. The resulting model, which is known as the Poisson generalized linear mixed model, can be written as
$$Y_{it} \mid (\mathbf{X}_{it}, U_i) \sim \mathrm{Poisson}(\lambda_{it}),$$
where the mean of the model is $\lambda_{it} = \exp\left\{U_i + \sum_{j=0}^{p}\beta_jX_{it}^j\right\}$, and $U_i$ and $\mathbf{X}_{it}$ are assumed to be independently distributed. Furthermore, if $U_1,\ldots,U_m$ are assumed to be independently and identically distributed as $N(0,\sigma^2)$, the model is referred to as the Poisson lognormal mixed model, though [5] referred to it as the Poisson linear mixed model. McCulloch et al. [13] gave a detailed discussion of the connections between the Poisson lognormal mixed model and the models studied in longitudinal data analysis.
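For concreteness, the following sketch (not the authors' code) generates data from Model (17) with a single covariate entering through its powers; the parameter values in the example call are illustrative only.

```r
simulate_plnm <- function(m, n, beta, sigma2) {
  p <- length(beta) - 1                        # beta = c(beta_0, ..., beta_p)
  U <- rnorm(m, 0, sqrt(sigma2))               # random effects U_i ~ N(0, sigma^2)
  X <- matrix(rnorm(m * n), m, n)              # covariate X_it
  Y <- matrix(0L, m, n)
  for (i in 1:m) {
    # linear predictor U_i + sum_{j=0}^{p} beta_j * X_it^j, with X_it^0 = 1
    eta <- U[i] + sapply(1:n, function(t) sum(beta * X[i, t]^(0:p)))
    Y[i, ] <- rpois(n, exp(eta))
  }
  list(Y = Y, X = X, U = U)
}

# Example: 17 subjects, 50 observations each, a quadratic predictor
dat <- simulate_plnm(m = 17, n = 50, beta = c(0.5, 0.05, -0.0005), sigma2 = 0.1)
```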
It is challenging to obtain the estimators of the regression coefficients, $\beta_0,\ldots,\beta_p$, and $\sigma^2$, because the likelihood involves the unobservable latent variables $U_i$. Rue et al. [14] suggested eliminating the effect of the unobservable latent variables by using the marginal log-likelihood function, which can be written as
$$\begin{aligned}
\ell(\boldsymbol{\beta},\sigma^2) ={}& \sum_{i=1}^{m}\sum_{t=1}^{n}\left\{Y_{it}\left(\beta_0+\beta_1X_{it}+\beta_2X_{it}^2+\cdots+\beta_pX_{it}^p\right) - \log(Y_{it}!)\right\} - \frac{m}{2}\log(2\pi\sigma^2) \\
&+ \sum_{i=1}^{m}\log\int_{-\infty}^{+\infty}\exp\left\{\sum_{t=1}^{n}\left(Y_{it}\,u - e^{\beta_0+\beta_1X_{it}+\beta_2X_{it}^2+\cdots+\beta_pX_{it}^p+u}\right) - \frac{u^2}{2\sigma^2}\right\}du.
\end{aligned}$$
Likelihood-based inference for the parameters is hindered by the last term in (18), which involves the summation of $m$ integrals. Hall et al. [5] applied the GVA method to overcome such integration problems. They considered only the model with one covariate ($p = 1$), and the resulting estimating equations involved $2m$ nuisance parameters. Due to the complexity of their method, it is difficult, if not impossible, to extend it to models with more than one covariate ($p > 1$).
Using the notation introduced in Section 2, the marginal log-likelihood function (18) can be rewritten as
$$\ell(\boldsymbol{\beta},\sigma^2) = \sum_{i=1}^{m}\left\{\log C(a_{in}, b_{in}, \sigma^2) + \sum_{t=1}^{n}Y_{it}\sum_{j=0}^{p}\beta_jX_{it}^j - \frac{\log(\sigma^2)}{2}\right\} + \ell_0(\mathbf{Y}),$$
where
$$\ell_0(\mathbf{Y}) = -\sum_{i=1}^{m}\sum_{t=1}^{n}\log(Y_{it}!) - \frac{m}{2}\log(2\pi),$$
$$a_{in} = \sum_{t=1}^{n}Y_{it} \quad\text{and}\quad b_{in} = b_{in}(\beta_0,\ldots,\beta_p) = \sum_{t=1}^{n}\exp\left(\sum_{j=0}^{p}\beta_jX_{it}^j\right).$$
From (6), we have
$$\sum_{i=1}^{m}\log C(a_{in}, b_{in}, \sigma^2) \approx \sum_{i=1}^{m}\left\{-\frac{\log^2(a_{in}/b_{in})}{2\sigma^2} - a_{in}\log(b_{in})\right\} + \ell_1(\mathbf{Y}),$$
where $\ell_1(\mathbf{Y}) = \sum_{i=1}^{m}\left\{a_{in}\log a_{in} - a_{in} - \log(a_{in})/2 + \log(2\pi)/2\right\}$. Therefore, the log-likelihood function (19) can be approximated by
$$\ell(\boldsymbol{\beta},\sigma^2) \approx \sum_{i=1}^{m}\left\{-\frac{\log^2(a_{in}/b_{in})}{2\sigma^2} - a_{in}\log(b_{in}) + \sum_{t=1}^{n}Y_{it}\sum_{j=0}^{p}\beta_jX_{it}^j - \frac{\log(\sigma^2)}{2}\right\} + \ell_0(\mathbf{Y}) + \ell_1(\mathbf{Y}).$$
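As a quick check of this approximation, the function below (a sketch under the same notation, not the authors' code) evaluates the parameter-dependent part of the approximate log-likelihood (21), dropping $\ell_0(\mathbf{Y})$ and $\ell_1(\mathbf{Y})$, for count and covariate matrices Y and X of dimension m by n.

```r
loglik_gva <- function(beta, sigma2, Y, X) {
  p <- length(beta) - 1
  val <- 0
  for (i in 1:nrow(Y)) {
    # sum_{j=0}^{p} beta_j * X_it^j for every t
    eta <- Reduce(`+`, lapply(0:p, function(j) beta[j + 1] * X[i, ]^j))
    a   <- sum(Y[i, ])                          # a_in
    b   <- sum(exp(eta))                        # b_in
    val <- val - log(a / b)^2 / (2 * sigma2) - a * log(b) +
           sum(Y[i, ] * eta) - log(sigma2) / 2
  }
  val                                           # assumes every row of Y has a positive total
}
```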
For $j = 1,\ldots,p$, denote
$$b_{in}^{(j)} = b_{in}^{(j)}(\beta_0,\ldots,\beta_p) = \frac{\partial b_{in}}{\partial\beta_j} = \sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=0}^{p}\beta_kX_{it}^k\right),$$
and $b_{in}^{(0)} = b_{in}$. Hence, the partial derivatives of the log-likelihood function with respect to $\beta_0,\ldots,\beta_p$ and $\sigma^2$ are
$$\frac{\partial\ell(\boldsymbol{\beta},\sigma^2)}{\partial\beta_j} \approx \sum_{i=1}^{m}\left\{\sum_{t=1}^{n}Y_{it}X_{it}^j - \frac{a_{in}b_{in}^{(j)}}{b_{in}} + \log\left(\frac{a_{in}}{b_{in}}\right)\frac{b_{in}^{(j)}}{\sigma^2 b_{in}}\right\}, \quad j = 0,\ldots,p,$$
$$\frac{\partial\ell(\boldsymbol{\beta},\sigma^2)}{\partial\sigma^2} \approx \sum_{i=1}^{m}\left\{\frac{\log^2(a_{in}/b_{in})}{2\sigma^4} - \frac{1}{2\sigma^2}\right\}.$$
Note that
$$\frac{b_{in}^{(j)}}{b_{in}} = \begin{cases} 1 & \text{for } j = 0,\\[4pt] \dfrac{\sum_{s=1}^{n}X_{is}^j\exp\left(\sum_{k=1}^{p}\beta_kX_{is}^k\right)}{\sum_{s=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{is}^k\right)} & \text{otherwise,} \end{cases}$$
and $\log(b_{in}) = \beta_0 + \log\left\{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)\right\}$. Since the maximum likelihood estimators $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2$ must satisfy
$$\frac{\partial\ell(\boldsymbol{\beta},\sigma^2)}{\partial\beta_j}\bigg|_{(\hat{\boldsymbol{\beta}},\hat{\sigma}^2)} = 0 \quad\text{and}\quad \frac{\partial\ell(\boldsymbol{\beta},\sigma^2)}{\partial\sigma^2}\bigg|_{(\hat{\boldsymbol{\beta}},\hat{\sigma}^2)} = 0,$$
we have
$$\hat{\beta}_0 = \frac{1}{m}\sum_{i=1}^{m}\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\},$$
$$\sum_{i=1}^{m}\sum_{t=1}^{n}Y_{it}X_{it}^j = \sum_{i=1}^{m}\left\{\sum_{t=1}^{n}Y_{it}\right\}\frac{\sum_{s=1}^{n}X_{is}^j\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{is}^k\right)}{\sum_{s=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{is}^k\right)},$$
and
$$\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}\log^2\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=0}^{p}\hat{\beta}_kX_{it}^k\right)}\right\}.$$
Although the maximum likelihood estimators $\hat{\beta}_k$ do not have a closed-form solution, Equation (23) is well-structured enough that a numerical procedure, such as the Newton–Raphson method or the EM algorithm, can be applied to obtain the numerical solution.
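The following sketch (not the authors' implementation) makes this concrete: the moment equations for $\beta_1,\ldots,\beta_p$ are driven to zero by minimizing their sum of squares with optim(), after which $\hat{\beta}_0$ and $\hat{\sigma}^2$ follow in closed form; the starting values and optimizer choice are arbitrary, and every subject is assumed to have a positive total count.

```r
fit_plnm_gva <- function(Y, X, p = 2) {
  m <- nrow(Y)
  a <- rowSums(Y)                                        # a_in = sum_t Y_it
  lin <- function(b, x) Reduce(`+`, lapply(1:p, function(k) b[k] * x^k))
  eqs <- function(b) {                                   # estimating equations for beta_1..beta_p
    sapply(1:p, function(j) {
      rhs <- sum(sapply(1:m, function(i) {
        w <- exp(lin(b, X[i, ]))
        a[i] * sum(X[i, ]^j * w) / sum(w)                # a_in * b_in^(j) / b_in
      }))
      sum(Y * X^j) - rhs
    })
  }
  b1p <- optim(rep(0, p), function(b) sum(eqs(b)^2), method = "BFGS")$par
  lr  <- log(a) - log(sapply(1:m, function(i) sum(exp(lin(b1p, X[i, ])))))
  beta0  <- mean(lr)                                     # closed form for beta_0
  sigma2 <- mean((lr - beta0)^2)                         # closed form for sigma^2
  list(beta = c(beta0, b1p), sigma2 = sigma2)
}

# Example with the simulated data from the earlier sketch
fit <- fit_plnm_gva(dat$Y, dat$X, p = 2)
```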
To study the asymptotic properties of the maximum likelihood estimators, we need to make the following three assumptions.
Assumption 1.
Let $\boldsymbol{\xi}_{mn} = (\xi_{mn1},\ldots,\xi_{mnp})^T \xrightarrow{d} N(\mathbf{0},\mathbf{B})$ as $mn \to \infty$, where
$$\xi_{mnj} = (mn)^{-1/2}\sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})(X_{it}^j - r_j), \quad j = 1,\ldots,p,$$
the $X_{it}$'s are independent and have the same distribution as $X$, whose moment generating function (MGF), $M_X(s) = E\{\exp(sX)\}$, is well-defined on the whole real line, $\lambda_{it} = E(Y_{it}\mid X_{it}, U_i)$, and $r_j$ and $B_{j_1,j_2}$ (the entries of $\mathbf{B}$) are defined as
$$r_j = r_j(\beta_1,\ldots,\beta_p) = E(Y_{it}X_{it}^j)/E(Y_{it}) = \phi_j/\phi_0,$$
where
$$\phi_0 = E\left\{\exp\left(\sum_{k=1}^{p}\beta_kX^k\right)\right\} \quad\text{and}\quad \phi_j = \phi_j(\beta_1,\ldots,\beta_p) = E\left\{X^j\exp\left(\sum_{k=1}^{p}\beta_kX^k\right)\right\} \quad\text{for } j = 1,\ldots,p,$$
and
$$B_{j_1,j_2} = \mathrm{cov}\left\{(Y_{it}-\lambda_{it})(X_{it}^{j_1}-r_{j_1}),\,(Y_{it}-\lambda_{it})(X_{it}^{j_2}-r_{j_2})\right\}, \quad 1\le j_1, j_2\le p.$$
Assumption 2.
For $j = 1,\ldots,p$, $\sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\left(r_j - b_{in}^{(j)}/b_{in}\right) = o_p(\sqrt{mn})$.
Assumption 3.
For $j = 1,\ldots,p$,
$$\frac{\hat{b}_{in}^{(j)}}{\hat{b}_{in}} - \frac{b_{in}^{(j)}}{b_{in}} = \sum_{k=1}^{p}\delta_{jk}(\hat{\beta}_k-\beta_k) + o_p\left(\sum_{k=1}^{p}|\hat{\beta}_k-\beta_k|\right),$$
where $\delta_{jk} = r_{j+k} - r_jr_k = \delta_{jk}(\beta_1,\ldots,\beta_p)$, $\hat{b}_{in}^{(j)} = \partial b_{in}/\partial\beta_j$ evaluated at $(0,\hat{\beta}_1,\ldots,\hat{\beta}_p)$, that is, $\hat{b}_{in}^{(j)} = b_{in}^{(j)}(0,\hat{\beta}_1,\ldots,\hat{\beta}_p)$, $\hat{b}_{in} = \sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right) = b_{in}(0,\hat{\beta}_1,\ldots,\hat{\beta}_p)$, and $\hat{\beta}_1,\ldots,\hat{\beta}_p$ are the solutions obtained from (22).
We give a few remarks on each assumption.
Remark 1.
Assumption 1 is implied by the Central Limit Theorem. Thus, we only need to calculate the expectation and the covariance:
$$E\left\{(Y_{it}-\lambda_{it})(X_{it}^j-r_j)\right\} = E\left[E\left\{(Y_{it}-\lambda_{it})(X_{it}^j-r_j)\mid X_{it},U_i\right\}\right] = E\left[(X_{it}^j-r_j)E\left\{(Y_{it}-\lambda_{it})\mid X_{it},U_i\right\}\right] = 0.$$
Similarly, from the definitions of $r_j$ and $\phi_j$, we have
$$B_{j_1,j_2} = B_{j_1,j_2}(\beta_0,\ldots,\beta_p,\sigma^2) = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)E\left\{(X^{j_1}-r_{j_1})(X^{j_2}-r_{j_2})\exp\left(\sum_{k=1}^{p}\beta_kX^k\right)\right\}.$$
To be more specific, when $p = 1$,
$$B_{1,1} = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left(\phi_2-\phi_1^2/\phi_0\right).$$
When $p = 2$,
$$\begin{aligned}
B_{1,2} &= \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left(\phi_3-\phi_1r_2-\phi_2r_1+\phi_0r_1r_2\right) = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left(\phi_3-\phi_1\phi_2/\phi_0\right),\\
B_{2,2} &= \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left(\phi_4-2\phi_2r_2+\phi_0r_2^2\right) = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left(\phi_4-\phi_2^2/\phi_0\right).
\end{aligned}$$
Remark 2.
In Assumption 2, $E\left\{(Y_{it}-\lambda_{it})\left(r_j-b_{in}^{(j)}/b_{in}\right)\right\} = 0$, and thus the variance is
$$\mathrm{Var}\left\{\sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\left(r_j-\frac{b_{in}^{(j)}}{b_{in}}\right)\right\} = m\,\mathrm{Var}\left\{\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\left(r_j-\frac{b_{in}^{(j)}}{b_{in}}\right)\right\} = m\,E\left[\left(r_j-\frac{b_{in}^{(j)}}{b_{in}}\right)^2\mathrm{Var}\left\{\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\right\}\right] = mn\,E\left\{\lambda_{it}\left(r_j-\frac{b_{in}^{(j)}}{b_{in}}\right)^2\right\}.$$
On account of Markov's inequality, we need to show $E\left\{\lambda_{it}\left(r_j-b_{in}^{(j)}/b_{in}\right)^2\right\} = o(1)$. Note that $r_j = E(b_{in}^{(j)})/E(b_{in})$. Deriving the second-order Taylor expansion of the ratio $b_{in}^{(j)}/b_{in}$ about the value $\left(E(b_{in}^{(j)}), E(b_{in})\right)$ and using it, we obtain $E\left\{\lambda_{it}\left(r_j-b_{in}^{(j)}/b_{in}\right)^2\right\} = O(n^{-1})$.
Remark 3.
In Assumption 3, we obtain the first-order Taylor expansions of $\hat{b}_{in}^{(j)}$ and $\hat{b}_{in}$ about the value $(\beta_1,\ldots,\beta_p)$, which yields
$$\begin{aligned}
\frac{\hat{b}_{in}^{(j)}}{\hat{b}_{in}} - \frac{b_{in}^{(j)}}{b_{in}} &= \frac{\sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)} - \frac{\sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)} \\
&= \frac{\sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right) + \sum_{s=1}^{p}\sum_{t=1}^{n}X_{it}^{j+s}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)(\hat{\beta}_s-\beta_s)}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right) + \sum_{s=1}^{p}\sum_{t=1}^{n}X_{it}^{s}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)(\hat{\beta}_s-\beta_s)} - \frac{\sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)} + o_p\left(\sum_{k=1}^{p}|\hat{\beta}_k-\beta_k|\right) \\
&= \frac{\sum_{s=1}^{p}\sum_{t=1}^{n}X_{it}^{j+s}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)(\hat{\beta}_s-\beta_s)}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)} - \frac{\left\{\sum_{t=1}^{n}X_{it}^j\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)\right\}\sum_{s=1}^{p}\sum_{t=1}^{n}X_{it}^{s}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)(\hat{\beta}_s-\beta_s)}{\left\{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\beta_kX_{it}^k\right)\right\}^2} + o_p\left(\sum_{k=1}^{p}|\hat{\beta}_k-\beta_k|\right) \\
&= \sum_{s=1}^{p}\left(r_{j+s}-r_jr_s\right)(\hat{\beta}_s-\beta_s) + o_p\left(\sum_{k=1}^{p}|\hat{\beta}_k-\beta_k|\right).
\end{aligned}$$
If $p = 1$, $\delta_{11} = r_2 - r_1^2$. Furthermore, we have $\rho_{11} = \delta_{11}^{-1}/E(Y_{it})$, where
$$E(Y_{it}) = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)E\left\{\exp\left(\sum_{k=1}^{p}\beta_kX^k\right)\right\}, \quad\text{and}\quad \gamma_1 = \rho_{11}^2B_{11} = \frac{\exp\left(-\beta_0-\tfrac{1}{2}\sigma^2\right)}{\phi_0\left(r_2-r_1^2\right)}.$$
Note that $\gamma_1$ agrees with Equation (3.5) in Theorem 3.1 of Hall et al. [5].
If $p = 2$,
$$\boldsymbol{\Lambda} = \begin{pmatrix}\delta_{11} & \delta_{12}\\ \delta_{21} & \delta_{22}\end{pmatrix} = \begin{pmatrix} r_2-r_1^2 & r_3-r_1r_2\\ r_3-r_1r_2 & r_4-r_2^2\end{pmatrix},$$
$$\frac{\boldsymbol{\Lambda}^{-1}}{E(Y_{it})} = \begin{pmatrix}\rho_{11} & \rho_{12}\\ \rho_{21} & \rho_{22}\end{pmatrix} = \frac{\exp\left(-\beta_0-\frac{1}{2}\sigma^2\right)\phi_0^{-1}}{(r_2-r_1^2)(r_4-r_2^2)-(r_3-r_1r_2)^2}\begin{pmatrix} r_4-r_2^2 & r_1r_2-r_3\\ r_1r_2-r_3 & r_2-r_1^2\end{pmatrix}.$$
Furthermore, we have
$$\gamma_1 = \rho_{11}^2B_{11} + 2\rho_{11}\rho_{12}B_{12} + \rho_{12}^2B_{22},$$
$$\gamma_2 = \rho_{21}^2B_{11} + 2\rho_{21}\rho_{22}B_{21} + \rho_{22}^2B_{22}.$$
After tedious calculations, we obtain the simplified versions of $\gamma_1$ and $\gamma_2$:
$$\gamma_1 = \frac{\exp\left(-\beta_0-\frac{1}{2}\sigma^2\right)\phi_0\left(\phi_4\phi_0-\phi_2^2\right)}{(\phi_2\phi_0-\phi_1^2)(\phi_4\phi_0-\phi_2^2)-(\phi_3\phi_0-\phi_1\phi_2)^2}, \qquad \gamma_2 = \frac{\exp\left(-\beta_0-\frac{1}{2}\sigma^2\right)\phi_0\left(\phi_2\phi_0-\phi_1^2\right)}{(\phi_2\phi_0-\phi_1^2)(\phi_4\phi_0-\phi_2^2)-(\phi_3\phi_0-\phi_1\phi_2)^2}.$$
Theorem 2.
With Assumptions 1, 2, and 3, for $j = 1,\ldots,p$,
$$\sqrt{mn}\,(\hat{\beta}_j-\beta_j) = \sum_{k=1}^{p}\rho_{jk}\,\xi_{mnk} + o_p(1),$$
where the $\hat{\beta}_j$'s are the solutions obtained from (22), $\rho_{jk} = \rho_{jk}(\beta_1,\ldots,\beta_p)$ is the entry in the $j$th row and $k$th column of the $p\times p$ matrix $\boldsymbol{\Lambda}^{-1}/E(Y_{it})$ with $\boldsymbol{\Lambda} = (\delta_{jk})$, and $E(Y_{it}) = E(\lambda_{it})$ with $\lambda_{it} = \exp\left\{U_i+\sum_{j=0}^{p}\beta_jX_{it}^j\right\}$. Thus, as $m, n \to \infty$, $\hat{\beta}_j$ is a consistent estimator, and its asymptotic distribution is
$$\sqrt{mn}\,(\hat{\beta}_j-\beta_j) \xrightarrow{d} N(0,\gamma_j), \quad j = 1,\ldots,p,$$
where $\gamma_j = \sum_{k_1=1}^{p}\sum_{k_2=1}^{p}\rho_{jk_1}\rho_{jk_2}B_{k_1,k_2}$. Moreover, $\hat{\beta}_0$ is a consistent estimator with asymptotic distribution
$$\sqrt{m}\,(\hat{\beta}_0-\beta_0) \xrightarrow{d} N(0,\sigma^2),$$
where $\hat{\beta}_0$ is the solution of $\beta_0$ provided in (23). Finally, $\hat{\sigma}^2$ is also a consistent estimator, and its asymptotic distribution is
$$\sqrt{m}\,(\hat{\sigma}^2-\sigma^2) \xrightarrow{d} N(0,2\sigma^4),$$
where $\hat{\sigma}^2$ is the solution presented in (24).
Proof of Theorem 2.
From (23), we consider the following algebraic manipulation:
$$\begin{aligned}
0 &= \sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it}+\lambda_{it})\left\{X_{it}^j - r_j + r_j - \frac{b_{in}^{(j)}}{b_{in}} + \frac{b_{in}^{(j)}}{b_{in}} - \frac{\hat{b}_{in}^{(j)}}{\hat{b}_{in}}\right\} \\
&= \sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})(X_{it}^j-r_j) + \sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\left(r_j-\frac{b_{in}^{(j)}}{b_{in}}\right) + \sum_{i=1}^{m}\sum_{t=1}^{n}Y_{it}\left(\frac{b_{in}^{(j)}}{b_{in}}-\frac{\hat{b}_{in}^{(j)}}{\hat{b}_{in}}\right) + \sum_{i=1}^{m}\sum_{t=1}^{n}\lambda_{it}\left(X_{it}^j-\frac{b_{in}^{(j)}}{b_{in}}\right),
\end{aligned}$$
where the first term is $\sqrt{mn}\,\xi_{mnj}$, the second term is $o_p(\sqrt{mn})$, and the last term is zero.
Consequently,
$$\sum_{i=1}^{m}\sum_{t=1}^{n}Y_{it}\left(\frac{\hat{b}_{in}^{(j)}}{\hat{b}_{in}}-\frac{b_{in}^{(j)}}{b_{in}}\right) = \sqrt{mn}\,\xi_{mnj} + o_p(\sqrt{mn}).$$
As $(mn)^{-1}\sum_{i=1}^{m}\sum_{t=1}^{n}Y_{it} \xrightarrow{p} E(Y_{it})$ and $(mn)^{-1}\left\{\sqrt{mn}\,\xi_{mnj} + o_p(\sqrt{mn})\right\} = o_p(1)$, we have $\hat{b}_{in}^{(j)}/\hat{b}_{in} - b_{in}^{(j)}/b_{in} \xrightarrow{p} 0$ for each $j > 0$. By continuity, $\hat{\beta}_j \xrightarrow{p} \beta_j$ for each $j > 0$. This shows that the estimators are consistent. In addition,
$$\sqrt{mn}\sum_{k=1}^{p}\delta_{jk}(\hat{\beta}_k-\beta_k) = \frac{\xi_{mnj}}{E(Y_{it})} + o_p(1).$$
This gives the asymptotic result below by the definition of $\rho_{jk}$:
$$\sqrt{mn}\,(\hat{\beta}_j-\beta_j) = \sum_{k=1}^{p}\rho_{jk}\,\xi_{mnk} + o_p(1).$$
Therefore, by the Central Limit Theorem, we have
$$\sqrt{mn}\,(\hat{\beta}_j-\beta_j) \xrightarrow{d} N(0,\gamma_j),$$
where $\gamma_j = \sum_{k_1=1}^{p}\sum_{k_2=1}^{p}\rho_{jk_1}\rho_{jk_2}B_{k_1,k_2}$.
Moreover, we have
$$\frac{1}{m}\sum_{i=1}^{m}(U_i+\beta_0) = \beta_0 + m^{-1/2}\zeta_{m1},$$
where $\zeta_{m1} \xrightarrow{d} N(0,\sigma^2)$. From the estimator of $\beta_0$ in (22),
$$\hat{\beta}_0 = \frac{1}{m}\sum_{i=1}^{m}\left[\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} - (U_i+\beta_0)\right] + \frac{1}{m}\sum_{i=1}^{m}(U_i+\beta_0).$$
Since
$$\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} - (U_i+\beta_0) = \log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\exp(U_i+\beta_0)\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} = o_p(1),$$
we have
$$\frac{1}{m}\sum_{i=1}^{m}\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} - \frac{1}{m}\sum_{i=1}^{m}(U_i+\beta_0) = o_p(1).$$
Therefore, $\hat{\beta}_0 - \beta_0 = o_p(1)$, and hence $\hat{\beta}_0$ is a consistent estimator. Moreover, by the Central Limit Theorem, we have
$$\sqrt{m}\,(\hat{\beta}_0-\beta_0) \xrightarrow{d} N(0,\sigma^2).$$
From the estimator of $\sigma^2$ in (24), combining the consistency of $\hat{\beta}_0$ and (25), we have
$$\begin{aligned}
\hat{\sigma}^2 &= \frac{1}{m}\sum_{i=1}^{m}\left[\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} - \hat{\beta}_0\right]^2 \\
&= \frac{1}{m}\sum_{i=1}^{m}\left[\log\left\{\frac{\sum_{t=1}^{n}Y_{it}}{\sum_{t=1}^{n}\exp\left(\sum_{k=1}^{p}\hat{\beta}_kX_{it}^k\right)}\right\} - (U_i+\beta_0) + (\beta_0-\hat{\beta}_0) + \left\{(U_i+\beta_0)-\beta_0\right\}\right]^2 \\
&= \frac{1}{m}\sum_{i=1}^{m}\left\{(U_i+\beta_0)-\beta_0\right\}^2 + o_p(1) = \frac{1}{m}\sum_{i=1}^{m}U_i^2 + o_p(1) = \sigma^2 + m^{-1/2}\zeta_{m2} + o_p(1),
\end{aligned}$$
where $\zeta_{m2} \xrightarrow{d} N(0,2\sigma^4)$. Hence, we have $\hat{\sigma}^2 - \sigma^2 = o_p(1)$, and thus $\hat{\sigma}^2$ is a consistent estimator. Again, by the Central Limit Theorem, we have
$$\sqrt{m}\,(\hat{\sigma}^2-\sigma^2) \xrightarrow{d} N(0,2\sigma^4).$$
Remark 4.
It is natural to consider independent covariates, in which case $\lambda_{it}$ can be modeled as $\lambda_{it} = \exp\left\{U_i+\sum_{j=0}^{p}\beta_jX_{it}^{(j)}\right\}$, where $X_{it}^{(0)} \equiv 1$ and $X_{it}^{(j_1)}$ and $X_{it}^{(j_2)}$ are independent for $0 < j_1 \neq j_2 \le p$. We still have
$$\xi_{mnj} = (mn)^{-1/2}\sum_{i=1}^{m}\sum_{t=1}^{n}(Y_{it}-\lambda_{it})\left(X_{it}^{(j)}-r_j\right),$$
where
$$r_j = E\left(Y_{it}X_{it}^{(j)}\right)\big/E(Y_{it}) = E\left\{X_{it}^{(j)}\exp\left(\beta_jX_{it}^{(j)}\right)\right\}\Big/E\left\{\exp\left(\beta_jX_{it}^{(j)}\right)\right\} = \phi_{j,1}/\phi_{j,0},$$
$$B_{j,j} = \exp\left(\beta_0+\tfrac{1}{2}\sigma^2\right)\left\{\phi_{j,2}-\phi_{j,1}^2\phi_{j,0}^{-1}\right\}\prod_{k\neq j}\phi_{k,0},$$
where $\phi_{j,2} = E\left\{\left(X_{it}^{(j)}\right)^2\exp\left(\beta_jX_{it}^{(j)}\right)\right\}$, and $B_{j_1,j_2} = 0$ if $j_1 \neq j_2$. Furthermore, $\delta_{jj} = \phi_{j,2}\phi_{j,0}^{-1} - \phi_{j,1}^2\phi_{j,0}^{-2}$ and $\delta_{jk} = 0$ if $k \neq j$. The other quantities involving $B_{j,j}$ can be modified similarly.
Remark 5.
According to Theorem 2, we can construct approximate confidence intervals (CIs) for the parameters based on the limiting distributions; they take the form:
Parameter | $(1-\alpha)100\%$ CI
$\beta_0$ | $\hat{\beta}_0 \pm \Phi^{-1}\left(1-\tfrac{1}{2}\alpha\right)\sqrt{\hat{\sigma}^2/m}$
$\beta_j$ $(j=1,\ldots,p)$ | $\hat{\beta}_j \pm \Phi^{-1}\left(1-\tfrac{1}{2}\alpha\right)\sqrt{\hat{\gamma}_j/(mn)}$
$\sigma^2$ | $\hat{\sigma}^2 \pm \Phi^{-1}\left(1-\tfrac{1}{2}\alpha\right)\hat{\sigma}^2\sqrt{2/m}$
where $\Phi$ denotes the distribution function of $N(0,1)$. Note that, when $p = 1$, these confidence intervals agree with those given in Hall et al. [6].
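As an illustration of Remark 5, the sketch below (not the authors' code) computes plug-in confidence intervals for the $p = 2$ case, estimating the moments $\phi_0,\ldots,\phi_4$ by sample averages; it assumes a fit object from the fit_plnm_gva() sketch given earlier and an m by n covariate matrix X.

```r
plnm_ci <- function(fit, X, m, n, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  b  <- fit$beta; s2 <- fit$sigma2
  w  <- exp(b[2] * X + b[3] * X^2)                  # exp(beta_1 X + beta_2 X^2)
  phi <- sapply(0:4, function(k) mean(X^k * w))     # phi[k + 1] estimates phi_k
  den <- (phi[3] * phi[1] - phi[2]^2) * (phi[5] * phi[1] - phi[3]^2) -
         (phi[4] * phi[1] - phi[2] * phi[3])^2
  g1 <- exp(-b[1] - s2 / 2) * phi[1] * (phi[5] * phi[1] - phi[3]^2) / den
  g2 <- exp(-b[1] - s2 / 2) * phi[1] * (phi[3] * phi[1] - phi[2]^2) / den
  rbind(beta0  = b[1] + c(-1, 1) * z * sqrt(s2 / m),
        beta1  = b[2] + c(-1, 1) * z * sqrt(g1 / (m * n)),
        beta2  = b[3] + c(-1, 1) * z * sqrt(g2 / (m * n)),
        sigma2 = s2   + c(-1, 1) * z * s2 * sqrt(2 / m))
}
```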

4. Numerical Studies

The proposed method is applied to a study by the Philippine Statistics Authority to illustrate how the methodology works. The simulation studies are performed to demonstrate the accuracy of the proposed method.

4.1. Real Data Analysis

The Philippine Statistics Authority spearheads the nationwide Family Income and Expenditure Survey. One of the objectives of this survey is to identify the age of the head of the household that is likely to have the maximum number of family members in the household. This would allow the government of the Philippines to plan and set policies, as well as to provide psychological support to help Filipinos adjust to the possibility of loneliness in old age.
The results of the nationwide Family Income and Expenditure Survey were published in Philippine Statistics Authority (2015). For each of the 17 regions ($m = 17$) reported in [15], a sample of 1249 households ($n = 1249$) was randomly selected. Let $Y_{it}$ be the count of the number of family members in the household in the $t$th sample of region $i$, $X_{it}$ the age of the head of the $t$th household of region $i$, and $U_i$ the random effect in the $i$th region. The model considered is
$$Y_{it} \mid (X_{it}, U_i) \sim \mathrm{Poisson}(\lambda_{it}), \quad i = 1,\ldots,17 \text{ and } t = 1,\ldots,1249,$$
where
$$\lambda_{it} = \exp\left\{U_i + \sum_{j=0}^{2}\beta_jX_{it}^j\right\}.$$
Figure 2 plots the log of the averaged counts of the number of family members in the household versus the age of the head of the household. At age 80 years or below, there is a clear maximum of the log of the average counts of the number of family members in the household. After age 80, this number increases because of grandchildren and great-grandchildren joining the household. In this case, the method in Hall et al. [6] is not applicable because we have more than one predictor. Figure 2 also suggests that $X_{it}^2$ should be included in Model (26). Based on the proposed method, we have the estimates and confidence intervals of the parameters in Table 1.
Moreover, the age of the head of the household that maximizes the fitted mean count, obtained from the vertex $-\hat{\beta}_1/(2\hat{\beta}_2)$ of the fitted quadratic, is 48.4 years, which is indicated by the red line in Figure 2.

4.2. Simulations

Simulation studies are used to demonstrate the accuracy of the proposed method. In all studies, we set $m = 10, 20, \ldots, 100$ and $n = 10m$.
For the first study, we consider Model (16) with only one predictor ($p = 1$) and two sets of parameter values:
$$(\beta_0, \beta_1, \sigma^2) = (2.2, 0.1, 0.16) \quad\text{and}\quad (1.2, 0.4, 0.3).$$
Let $X_{it}$ be generated from $N(0,1)$, where $i = 1,\ldots,m$ and $t = 1,\ldots,n$. For each simulated sample, we obtain the 95% confidence interval for $\beta_0$ and record whether the true $\beta_0$ is within the 95% confidence interval. Figure 3 shows the coverage proportion, where the nominal value is 0.95. Moreover, since we have only one predictor, the method in Hall et al. [6] can be applied as well. The results are also plotted in Figure 3 for comparison. We repeated the same procedure to obtain the coverage proportions for $\beta_1$ and $\sigma^2$, which are also reported in Figure 3. As shown in Figure 3, for the variance $\sigma^2 = 0.16$, both methods are comparable; for the larger variance $\sigma^2 = 0.3$, however, the proposed method is slightly better for small $m$.
For the second study, we consider Model (16) with two predictors ($p = 2$) and four sets of parameter values:
Set | $\beta_0$ | $\beta_1$ | $\beta_2$ | $\sigma^2$
1 | 2.2 | 0.1 | −0.1 | 0.16
2 | 1.2 | 0.4 | −0.2 | 0.1
3 | 2.2 | −0.1 | 0.1 | 0.16
4 | 1.2 | 0.4 | 0.1 | 0.1
Let $X_{it}$ be generated from $N(0,1)$, where $i = 1,\ldots,m$ and $t = 1,\ldots,n$. The 95% coverage probabilities are plotted in Figure 4. In this case, the method in Hall et al. [6] is not applicable because $p = 2$. As shown in Figure 4, the convergence rates for the estimates of $\beta_1$ and $\beta_2$ are faster than those for $\beta_0$ and $\sigma^2$. That may explain the differences in coverage percentage among them. A sketch of one replicate of this coverage study is given below.
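The following sketch (not the authors' code) runs a small number of replicates for one illustrative setting, reusing the simulate_plnm(), fit_plnm_gva(), and plnm_ci() sketches from earlier; the replicate count and seed are arbitrary.

```r
one_rep <- function(m, beta, sigma2) {
  n   <- 10 * m
  dat <- simulate_plnm(m, n, beta, sigma2)
  fit <- fit_plnm_gva(dat$Y, dat$X, p = 2)
  ci  <- plnm_ci(fit, dat$X, m, n)
  truth <- c(beta, sigma2)
  ci[, 1] <= truth & truth <= ci[, 2]       # TRUE where the true value is covered
}
set.seed(1)
cover <- replicate(200, one_rep(m = 20, beta = c(2.2, 0.1, -0.1), sigma2 = 0.16))
rowMeans(cover)                              # empirical coverage proportions
```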
Our last study is similar to the previous study, except that the predictors are independent. In other words, we consider the model given in Remark 4 after Theorem 2. The same parameter settings as in the second study were used, and the two predictors were generated from independent standard normal distributions. The 95% coverage probabilities are plotted in Figure 5. In this case, again, the method in Hall et al. [6] is not applicable because $p = 2$. As shown in Figure 5, the results are similar to those obtained in our second study, even though the predictors were generated differently.
Other parameter settings were also considered, and similar results were obtained. Thus, they are not reported here. They are available from the authors upon request.

5. Conclusions and Discussion

We proposed a special exponential family model and studied the GVA of its normalizing constant by connecting the KL divergence and Jensen's inequality. A closed-form expansion of the normalizing constant was given up to the second order, which simplifies both the calculation and the theoretical justification of inference for the Poisson lognormal mixed model. The real data analysis justified the importance of the extension to multiple predictors. Simulation studies showed that the coverage percentages were consistent with the nominal level.
The proposed method was implemented in R and is available from the authors upon request. The flowchart of the proposed algorithm by GVA is shown in Figure 6.
Note that the glmer function in the R package lme4 performs similar calculations; however, our proposed method has two advantages over glmer, namely:
  • glmer does not provide the standard deviation of the estimate of the variance $\sigma^2$, and hence asymptotic inference for $\sigma^2$ is not available.
  • glmer is slow. Figure 7 compares the required computational times for different $m$. The computational time for glmer increases dramatically as the sample size $m$ or $n$ increases. For example, for m = 10,000, our proposed method was about 56 times faster than glmer. For reference, a call of the form sketched after this list fits the corresponding random-intercept Poisson model in lme4.
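A minimal sketch of such a call, assuming a long-format data frame dat_long with columns y (counts), x (age of the household head), and id (region); it is shown only for comparison and is not part of the proposed method.

```r
library(lme4)
# Random-intercept Poisson regression with a quadratic age effect
fit_glmer <- glmer(y ~ x + I(x^2) + (1 | id), family = poisson, data = dat_long)
summary(fit_glmer)   # reports SEs for the fixed effects, but no SE for the random-effect variance
```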
In summary, the Gaussian variational approximation method was developed for the Poisson lognormal mixed model with one or more predictors in this paper. The explicit forms of the estimators of the parameters and the corresponding limiting distributions were derived. Hence, inference for the parameters was obtained. A real-life example demonstrated the applicability of the proposed method, and simulation studies illustrated the accuracy of the proposed method. Although the same problem has been studied in Hall et al. [5] and Hall et al. [6], their studies were restricted to one-predictor models and are difficult to generalize to multiple-predictor models. The proposed method gives results that are comparable to their results but are applicable to multiple-predictor models.

Author Contributions

X.S., X.-S.W. and A.W. designed research; X.S., X.-S.W. and A.W. performed research; X.S. and A.W. analyzed data; X.S., X.-S.W. and A.W. wrote the paper. All authors have read and agreed to the published version of the manuscript.

Funding

Shi's work was supported by NSERC Discovery Grant RGPIN 2022-03264, the Interior Universities Research Coalition and the BC Ministry of Health, and the University of British Columbia Okanagan (UBC-O) Vice Principal Research in collaboration with the UBC-O Irving K. Barber Faculty of Science. Wong's work was supported by NSERC Discovery Grant RGPIN 2017-05179.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data can be accessed from https://www.kaggle.com/grosvenpaul/family-income-and-expenditure, accessed on 13 October 2022.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Stephenson, G. Mathematical Methods for Science Students; Pearson: Toronto, ON, Canada, 1973.
  2. Ormerod, J.T.; Wand, M.P. Explaining variational approximation. Am. Stat. 2010, 64, 140–153.
  3. Teschendorff, A.E.; Wang, Y.; Barbosa-Morais, N.L.; Brenton, J.D.; Caldas, C. A variational Bayesian mixture modelling framework for cluster analysis of gene-expression data. Bioinformatics 2005, 21, 3025–3033.
  4. Flandin, G.; Penny, W.D. Bayesian fMRI data analysis with sparse spatial basis function priors. NeuroImage 2007, 45, S173–S186.
  5. Hall, P.; Ormerod, J.T.; Wand, M.P. Theory of Gaussian variational approximation for a Poisson linear mixed model. Stat. Sin. 2011, 21, 369–389.
  6. Hall, P.; Pham, T.; Wand, M.P.; Wang, S.S.J. Asymptotic normality and valid inference for Gaussian variational approximation. Ann. Stat. 2011, 39, 2502–2532.
  7. Wang, Y.; Blei, D.M. Frequentist consistency of variational Bayes. J. Am. Stat. Assoc. 2019, 114, 1147–1161.
  8. Yang, Y.; Pati, D.; Bhattacharya, A. α-variational inference with statistical guarantees. Ann. Stat. 2020, 48, 886–905.
  9. Dai, W.; Hu, T.; Jin, B.; Shi, X. Incorporating Grouping Information into Bayesian Gaussian Graphical Model Selection. Commun. Stat. Theory Methods 2022, 1–18.
  10. Galy-Fajou, T.; Perrone, V.; Opper, M. Flexible and Efficient Inference with Particles for the Variational Gaussian Approximation. Entropy 2021, 23, 990.
  11. Ong, V.M.-H.; Nott, D.J.; Smith, M.S. Gaussian variational approximation with a factor covariance structure. J. Comput. Graph. Stat. 2018, 27, 465–478.
  12. Chiquet, J.; Mariadassou, M.; Robin, S. Variational inference for probabilistic Poisson PCA. Ann. Appl. Stat. 2018, 12, 2674–2698.
  13. McCulloch, C.E.; Searle, S.R.; Neuhaus, J.M. Generalized, Linear, and Mixed Models, 2nd ed.; Wiley: New York, NY, USA, 2008.
  14. Rue, H.; Martino, S.; Chopin, N. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. J. R. Stat. Soc. Ser. B 2009, 71, 319–392.
  15. Philippine Statistics Authority. Family Income and Expenditure Survey. 2005. Available online: https://www.kaggle.com/grosvenpaul/family-income-and-expenditure (accessed on 13 September 2022).
Figure 1. Approximation accuracy comparison. The exact $C$ is the black solid line, the GVA as in (5) is the dashed red line, and the GVA as in (6) is the dotted purple line.
Figure 2. Plot of the log of the averaged counts for each age in years (dots) and fitted solid curve of the log of the true rate.
Figure 3. Comparison of the method in Hall et al. [6] (purple) and our method (red) for the 95% coverage percentage for $\beta_0$, $\beta_1$, and $\sigma^2$.
Figure 4. Our 95% coverage percentage for $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma^2$.
Figure 5. Our 95% coverage percentage for $\beta_0$, $\beta_1$, $\beta_2$, and $\sigma^2$.
Figure 6. Flowchart of the proposed algorithm using GVA.
Figure 7. Comparison of the computation time for lme4 and our method for different $m$, $n = m/10$, and $p = 2$.
Table 1. Estimates and confidence intervals of the parameters.
Parameter | Estimate | 95% Confidence Interval
$\beta_0$ | 0.55262 | (0.52585, 0.57939)
$\beta_1$ | 0.04403 | (0.04363, 0.04443)
$\beta_2$ | −0.00045 | (−0.00046, −0.00045)
$\sigma^2$ | 0.00317 | (0.00104, 0.0053)