Article

Objective Bayesian Inference in Probit Models with Intrinsic Priors Using Variational Approximations

Río Piedras Campus, University of Puerto Rico, 00925 San Juan, Puerto Rico
* Author to whom correspondence should be addressed.
14 Ave. Universidad Ste. 1401, San Juan, PR 00925, USA.
These authors contributed equally to this work.
Entropy 2020, 22(5), 513; https://doi.org/10.3390/e22050513
Submission received: 3 March 2020 / Revised: 14 April 2020 / Accepted: 26 April 2020 / Published: 30 April 2020
(This article belongs to the Special Issue Data Science: Measuring Uncertainties)

Abstract

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. On the other hand, variational inference methods have been employed to solve classification problems using probit regression and logistic regression with normal priors. In this article, we propose to apply variational approximations to probit regression models with an intrinsic prior. We review the mean-field variational method and the procedure for developing the intrinsic prior for the probit regression model. We then present our work on implementing the variational Bayesian probit regression model with the intrinsic prior. Publicly available data from the world's largest peer-to-peer lending platform, LendingClub, is used to illustrate how model output uncertainties are addressed through the framework we propose. With the LendingClub data, the target variable is the final status of a loan, either charged-off or fully paid. Investors may very well be interested in how predictive features such as FICO score, amount financed, and income affect the final loan status.

1. Introduction

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. So far, only two articles have explored intrinsic prior related methods for classification problems. Reference [1] implements integral priors in generalized linear models with various link functions, and reference [2] considers intrinsic priors for probit models. On the other hand, variational inference methods have been employed to solve classification problems with logistic regression ([3]) and probit regression ([4,5]) under normal priors. Variational approximation methods are reviewed in [6,7] and, more recently, in [8].
In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. In Section 4, we review the mean-field variational method used in this article. In Section 3, the procedure for developing intrinsic priors for probit models is introduced following [2]. Our own work is presented in Section 5. Our motivations for combining the intrinsic prior methodology with variational inference are as follows:
  • Avoiding manually specified, ad hoc plug-in priors by automatically generating a family of non-informative priors that are less sensitive to subjective choices.
  • References [1,2] do not consider inference on the posterior distributions of the parameters; their focus is on model comparison. Although the development of intrinsic priors comes from a model selection background, we thought it would be interesting to apply intrinsic priors to inference problems. In fact, some recently developed priors proposed to solve inference or estimation problems turned out to also be intrinsic priors, for example, the Scaled Beta2 prior [9] and the Matrix-F prior [10].
  • Intrinsic priors concentrate probability near the null hypothesis, a condition that is widely accepted and should be required of a prior for testing a hypothesis.
  • Intrinsic priors also have flat tails that prevent finite sample inconsistency [11].
  • For inference problems with large data sets, variational approximation methods are much faster than MCMC-based methods.
As for model comparison, because the output of variational inference methods cannot be used directly to compare models, we propose in Section 5.3 to simply use the variational approximation of the posterior distribution as an importance function and obtain a Monte Carlo estimate of the marginal likelihood by importance sampling.

2. Background and Development of Intrinsic Prior Methodology

2.1. Bayes Factor

The Bayesian framework of model selection coherently involves the use of probability to express all uncertainty in the choice of model, including uncertainty about the unknown parameters of a model. Suppose that models M 1 , M 2 , . . . , M q are under consideration. We shall assume that the observed data x = ( x 1 , x 2 , . . . , x n ) is generated from one of these models but we do not know which one it is. We express our uncertainty through prior probability P ( M j ) , j = 1 , 2 , . . . , q . Under model M i , x has density f i ( x | θ i , M i ) , where θ i are unknown model parameters, and the prior distribution for θ i is π i ( θ i | M i ) . Given observed data and prior probabilities, we can then evaluate the posterior probability of M i using Bayes’ rule
$$P(M_i \mid \mathbf{x}) = \frac{p_i(\mathbf{x} \mid M_i)\, P(M_i)}{\sum_{j=1}^{q} p_j(\mathbf{x} \mid M_j)\, P(M_j)},$$
where
$$p_i(\mathbf{x} \mid M_i) = \int f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$$
is the marginal likelihood of x under M_i, also called the evidence for M_i [12]. A common choice of prior model probabilities is P(M_j) = 1/q, so that each model has the same initial probability. However, there are other ways of assigning prior probabilities that correct for multiple comparisons (see [13]). From (1), the posterior odds are therefore the prior odds multiplied by the Bayes factor
$$\frac{P(M_j \mid \mathbf{x})}{P(M_i \mid \mathbf{x})} = \frac{P(M_j)\, p_j(\mathbf{x})}{P(M_i)\, p_i(\mathbf{x})} = \frac{P(M_j)}{P(M_i)} \times B_{ji},$$
where the Bayes factor of M_j to M_i is defined by
$$B_{ji} = \frac{p_j(\mathbf{x})}{p_i(\mathbf{x})} = \frac{\int f_j(\mathbf{x} \mid \theta_j)\, \pi_j(\theta_j)\, d\theta_j}{\int f_i(\mathbf{x} \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i}.$$
Here we omit the dependence on models M j , M i to keep the notation simple. The marginal likelihood, p i ( x ) expresses the preference shown by the observed data for different models. When B j i > 1 , the data favor M j over M i , and when B j i < 1 the data favor M i over M j . A scale for interpretation of B j i is given by [14].
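To make the role of the marginal likelihoods concrete, here is a minimal, self-contained sketch (not from the paper) that computes a Bayes factor for a toy binomial comparison; the models, data, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import integrate, stats

# Toy illustration: Bayes factor for a binomial observation under
# M1: theta = 0.5 (point null) versus M2: theta ~ Uniform(0, 1).
y, n = 7, 10                                   # hypothetical data: 7 successes in 10 trials

# Marginal likelihood under M1 is just the likelihood at theta = 0.5.
p1 = stats.binom.pmf(y, n, 0.5)

# Marginal likelihood under M2 integrates the likelihood against the uniform prior.
p2, _ = integrate.quad(lambda t: stats.binom.pmf(y, n, t), 0.0, 1.0)

B21 = p2 / p1                                  # Bayes factor of M2 against M1
print(f"B21 = {B21:.3f}")                      # > 1 favours M2, < 1 favours M1
```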

2.2. Motivation and Development of Intrinsic Prior

Computing B_ji requires specification of π_i(θ_i) and π_j(θ_j). Often in Bayesian analysis, when prior information is weak, one can use non-informative (or default) priors π_i^N(θ_i). Common choices for non-informative priors are the uniform prior, π_i^U(θ_i) ∝ 1, and the Jeffreys prior, π_i^J(θ_i) ∝ det(I_i(θ_i))^{1/2}, where I_i(θ_i) is the expected Fisher information matrix corresponding to M_i.
Using any of the π i N in (4) would yield
$$B_{ji}^N = \frac{p_j^N(\mathbf{x})}{p_i^N(\mathbf{x})} = \frac{\int f_j(\mathbf{x} \mid \theta_j)\, \pi_j^N(\theta_j)\, d\theta_j}{\int f_i(\mathbf{x} \mid \theta_i)\, \pi_i^N(\theta_i)\, d\theta_i}.$$
The difficulty with (5) is that the π_i^N are typically improper and hence are defined only up to an unspecified constant c_i, so B_ji^N is defined only up to the ratio c_j/c_i of two unspecified constants.
An attempt to circumvent the ill definition of Bayes factors under improper non-informative priors is the intrinsic Bayes factor introduced by [15], which is a modification of the partial Bayes factor [16]. To define the intrinsic Bayes factor, we consider the subsamples x(l) of the data x of minimal size l such that 0 < p_i^N(x(l)) < ∞. These subsamples are called training samples (not to be confused with training samples in machine learning), and there are L such subsamples in total.
The main idea here is that training sample x ( l ) will be used to convert the improper π i N ( θ i ) to proper posterior
$$\pi_i^N(\theta_i \mid \mathbf{x}(l)) = \frac{f_i(\mathbf{x}(l) \mid \theta_i)\, \pi_i^N(\theta_i)}{p_i^N(\mathbf{x}(l))},$$
where p_i^N(x(l)) = ∫ f_i(x(l) | θ_i) π_i^N(θ_i) dθ_i. Then the Bayes factor for the remainder of the data x(n−l), where x(l) ∪ x(n−l) = x, using π_i^N(θ_i | x(l)) as the prior, is called a "partial" Bayes factor,
$$B_{ji}^N(\mathbf{x}(n-l) \mid \mathbf{x}(l)) = \frac{\int f_j(\mathbf{x}(n-l) \mid \theta_j)\, \pi_j^N(\theta_j \mid \mathbf{x}(l))\, d\theta_j}{\int f_i(\mathbf{x}(n-l) \mid \theta_i)\, \pi_i^N(\theta_i \mid \mathbf{x}(l))\, d\theta_i}.$$
This partial Bayes factor is a well-defined Bayes factor and can be written as B_ji^N(x(n−l) | x(l)) = B_ji^N(x) B_ij^N(x(l)), where B_ji^N(x) = p_j^N(x)/p_i^N(x) and B_ij^N(x(l)) = p_i^N(x(l))/p_j^N(x(l)). Clearly, B_ji^N(x(n−l) | x(l)) depends on the choice of the training sample x(l). To eliminate this arbitrariness and increase stability, reference [15] suggests averaging over all training samples, which yields the arithmetic intrinsic Bayes factor (AIBF)
$$B_{ji}^{AIBF}(\mathbf{x}) = B_{ji}^N(\mathbf{x}) \cdot \frac{1}{L} \sum_{l=1}^{L} B_{ij}^N(\mathbf{x}(l)).$$
The strongest justification of the arithmetic IBF is its asymptotic equivalence with a proper Bayes factor arising from Intrinsic priors. These intrinsic priors were identified through an asymptotic analysis (see [15]). For the case where M i is nested in M j , it can be shown that the intrinsic priors are given by
$$\pi_i^I(\theta_i) = \pi_i^N(\theta_i) \quad \text{and} \quad \pi_j^I(\theta_j) = \pi_j^N(\theta_j)\, E^{M_j}\!\left[ \frac{m_i^N(\mathbf{x}(l))}{m_j^N(\mathbf{x}(l))} \,\Big|\, \theta_j \right].$$

3. Objective Bayesian Probit Regression Models

3.1. Bayesian Probit Model and the Use of Auxiliary Variables

Consider a sample y = (y_1, ..., y_n), where Y_i, i = 1, ..., n, is a 0–1 random variable such that, under model M_j, it follows a probit regression model with a (j+1)-dimensional vector of covariates x_i, where j ≤ p. Here, p is the total number of covariate variables under consideration. The probit model M_j has the form
$$Y_i \mid \beta_0, \ldots, \beta_j, M_j \sim \text{Bernoulli}\big(\Phi(\beta_0 x_{0i} + \beta_1 x_{1i} + \ldots + \beta_j x_{ji})\big), \quad 1 \le i \le n,$$
where Φ denotes the standard normal cumulative distribution function and β j = ( β 0 , . . . , β j ) is a vector of dimension j + 1 . The first component of the vector x i is set equal to 1 so that when considering models of the form (10), the intercept is in any submodel. The maximum length of the vector of covariates is p + 1 . Let π ( β ) , proper or improper, summarize our prior information about β . Then the posterior density of β is given by
$$\pi(\beta \mid \mathbf{y}) = \frac{\pi(\beta) \prod_{i=1}^n \Phi(\mathbf{x}_i^\top \beta)^{y_i} \big(1 - \Phi(\mathbf{x}_i^\top \beta)\big)^{1 - y_i}}{\int \pi(\beta) \prod_{i=1}^n \Phi(\mathbf{x}_i^\top \beta)^{y_i} \big(1 - \Phi(\mathbf{x}_i^\top \beta)\big)^{1 - y_i}\, d\beta},$$
which is largely intractable.
As shown by [17], the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. Following the data augmentation approach of [18], we introduce n latent variables Z_1, ..., Z_n, where
$$Z_i \mid \beta \sim N(\mathbf{x}_i^\top \beta, 1).$$
The probit model (10) can be thought of as a regression model with incomplete sampling information by considering that only the sign of z i is observed. More specifically, define Y i = 1 if Z i > 0 and Y i = 0 otherwise. This allows us to write the probability density of y i given z i
p ( y i | z i ) = I ( z i > 0 ) I ( y i = 1 ) + I ( z i 0 ) I ( y i = 0 ) .
Expansion of the parameter set from { β } to { β , Z } is the key to achieving a tractable solution for variational approximation.
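To make the augmentation concrete, the following minimal Python sketch simulates from the latent-variable representation and checks that thresholding Z at zero reproduces the probit success probabilities; the coefficient vector and design matrix below are hypothetical, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Latent-variable (data augmentation) view of the probit model; beta and X are hypothetical.
n, p = 10_000, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # first column is the intercept
beta = np.array([-0.5, 1.0, 0.3])

# Latent regression Z_i | beta ~ N(x_i' beta, 1); only the sign of Z_i is observed.
Z = X @ beta + rng.normal(size=n)
Y = (Z > 0).astype(int)                        # Y_i = 1 exactly when Z_i > 0

# The implied success probability matches Phi(x_i' beta) of the probit model (10).
print(Y.mean(), norm.cdf(X @ beta).mean())     # the two numbers should be close
```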

3.2. Development of Intrinsic Prior for Probit Models

For the sample z = ( z 1 , . . . , z n ) , the null normal model is
M 1 : { N n ( z | α 1 n , I n ) , π ( α ) } .
For a generic model M j with j + 1 regressors, the alternative model is
M j : { N n ( z | X j β j , I n ) , π ( β j ) } ,
where the design matrix X j has dimensions n × ( j + 1 ) . Intrinsic prior methodology for the linear model was first developed by [19], and was further developed in [20] by using the methods of [21]. This intrinsic methodology gives us an automatic specification of the priors π ( α ) and π ( β ) , starting with the non-informative priors π N ( α ) and π N ( β ) for α and β , which are both improper and proportional to 1.
The marginal distributions for the sample z under the null model, and under the alternative model with intrinsic prior, are formally written as
$$p_1(\mathbf{z}) = \int N_n(\mathbf{z} \mid \alpha \mathbf{1}_n, \mathbf{I}_n)\, \pi^N(\alpha)\, d\alpha, \qquad p_j(\mathbf{z}) = \int\!\!\int N_n(\mathbf{z} \mid \mathbf{X}_j \beta_j, \mathbf{I}_n)\, \pi^I(\beta \mid \alpha)\, \pi^N(\alpha)\, d\alpha\, d\beta.$$
However, these are marginals of the sample z , but our selection procedure requires us to compute the Bayes factor of model M j versus the reference model M 1 for the sample y = ( y 1 , . . . , y n ) . To solve this problem, reference [2] proposed to transform the marginal p j ( z ) into the marginal p j ( y ) by using the probit transformations y i = 1 ( z i > 0 ) , i = 1 , . . . , n . These latter marginals are given by
$$p_j(\mathbf{y}) = \int_{A_1 \times \ldots \times A_n} p_j(\mathbf{z})\, d\mathbf{z},$$
where
$$A_i = \begin{cases} (0, \infty) & \text{if } y_i = 1, \\ (-\infty, 0) & \text{if } y_i = 0. \end{cases}$$

4. Variational Inference

4.1. Overview of Variational Methods

Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the calculus of variations (The derivation in this section is standard in the literature on variational approximation and will at times follow the arguments in [22,23]). Variational inference is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. Variational approximations are a much faster alternative to Markov Chain Monte Carlo (MCMC), especially for large models, and are a richer class of methods than the Laplace approximation [6].
Suppose we have a Bayesian model with a prior distribution for the parameters; the model may also have latent variables. We shall denote the set of all latent variables and parameters by θ, and the set of all observed variables by X. Given a set of n independent, identically distributed observations, for which X = {x_1, ..., x_n} and θ = {θ_1, ..., θ_n}, our probabilistic model (e.g., a probit regression model) specifies the joint distribution p(X, θ), and our goal is to find an approximation for the posterior distribution p(θ | X) as well as for the marginal likelihood p(X). For any probability distribution q(θ), we have the following decomposition of the log marginal likelihood
ln p ( X ) = L ( q ) + KL ( q | | p )
where we have defined
$$\mathcal{L}(q) = \int q(\theta)\, \ln\left\{ \frac{p(\mathbf{X}, \theta)}{q(\theta)} \right\} d\theta,$$
$$\text{KL}(q \,\|\, p) = -\int q(\theta)\, \ln\left\{ \frac{p(\theta \mid \mathbf{X})}{q(\theta)} \right\} d\theta.$$
We refer to (14) as the lower bound of the log marginal likelihood with respect to the density q, and (15) is by definition the Kullback–Leibler divergence of the posterior p(θ | X) from the density q. Based on this decomposition, we can maximize the lower bound L(q) by optimizing with respect to the distribution q(θ), which is equivalent to minimizing the KL divergence. The lower bound is attained when the KL divergence is zero, which happens when q(θ) equals the posterior distribution p(θ | X). In general it is hard to find such a density, since the true posterior distribution is intractable.
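The decomposition (13) is easy to verify numerically. The sketch below does so for a toy model with a discrete parameter taking two values; the joint probabilities and the choice of q are made up purely for illustration.

```python
import numpy as np

# Numerical check of the decomposition log p(X) = L(q) + KL(q || p(theta | X))
# for a toy model with theta in {0, 1}; the joint p(X, theta) is hypothetical.
p_joint = np.array([0.10, 0.30])               # p(X, theta=0), p(X, theta=1)
p_X = p_joint.sum()                            # marginal likelihood p(X)
post = p_joint / p_X                           # exact posterior p(theta | X)

q = np.array([0.6, 0.4])                       # an arbitrary approximating density q(theta)

L_q = np.sum(q * np.log(p_joint / q))          # lower bound (14)
KL = np.sum(q * np.log(q / post))              # KL divergence (15)

print(L_q + KL, np.log(p_X))                   # identical up to rounding error
```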

4.2. Factorized Distributions

The essence of the variational inference approach is to approximate the posterior distribution p(θ | X) by a density q(θ) for which the q-dependent lower bound L(q) is more tractable than the original model evidence. Tractability is achieved by restricting q to a more manageable class of distributions and then maximizing L(q) over that class.
Suppose we partition elements of θ into disjoint groups { θ i } where i = 1 , . . . , M . We then assume that the q density factorizes with respect to this partition, i.e.,
q ( θ ) = i = 1 M q i ( θ i ) .
The product form is the only assumption we make about the distribution. Restriction (16) is also known as the mean-field approximation and has its roots in physics [24].
For all distributions q ( θ ) with the form (16), we need to find the distribution for which the lower bound L ( q ) is largest. Restriction of q to a subclass of product densities like (16) gives rise to explicit solutions for each product component in terms of the others. This fact, in turn, leads to an iterative scheme for obtaining the solutions. To achieve this, we first substitute (16) into (14) and then separate out the dependence on one of the factors q j ( θ j ) . Denoting q j ( θ j ) by q j to keep the notation clear, we obtain
$$\begin{aligned}
\mathcal{L}(q) &= \int \prod_{i=1}^M q_i \left\{ \ln p(\mathbf{X}, \theta) - \sum_{i=1}^M \ln q_i \right\} d\theta \\
&= \int q_j \left\{ \int \ln p(\mathbf{X}, \theta) \prod_{i \ne j} q_i \, d\theta_i \right\} d\theta_j - \int q_j \ln q_j \, d\theta_j + \text{constant} \\
&= \int q_j \ln \tilde{p}(\mathbf{X}, \theta_j)\, d\theta_j - \int q_j \ln q_j \, d\theta_j + \text{constant},
\end{aligned}$$
where p ˜ ( X , θ j ) is given by
ln p ˜ ( X , θ j ) = E i j [ ln p ( X , θ ) ] + constant .
The notation E_{i≠j}[·] denotes an expectation with respect to the q distributions over all variables θ_i for i ≠ j, so that
E i j [ ln p ( X , θ ) ] = ln p ( X , θ ) i j q i d θ i .
Now suppose we keep the {q_{i≠j}} fixed and maximize L(q) in (17) with respect to all possible forms for the density q_j(θ_j). Recognizing that (17) is the negative KL divergence between p̃(X, θ_j) and q_j(θ_j), we see that maximizing (17) is equivalent to minimizing this KL divergence, and the minimum occurs when q_j(θ_j) = p̃(X, θ_j). The optimal q_j*(θ_j) is then
ln q j * ( θ j ) = E i j [ ln p ( X , θ ) ] + constant .
The above solution says that the log of the optimal q_j is obtained simply by considering the log of the joint distribution over all parameters, latent variables, and observed variables, and then taking the expectation with respect to all of the other factors q_i for i ≠ j. Normalizing the exponential of (19), we have
$$q_j^*(\theta_j) = \frac{\exp\big( E_{i \ne j}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne j}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_j}.$$
The set of equations in (19) for j = 1 , . . . , M are not an explicit solution because the expression on the right hand side of (19) for the optimal q j * depends on expectations taken with respect to the other factors q i for i j . We will need to first initialize all of the factors q i ( θ i ) and then cycle through the factors one by one and replace each in turn with an updated estimate given by the right hand side of (19) evaluated using the current estimates for all of the other factors. Convexity properties can be used to show that convergence to at least local optima is guaranteed [25]. The iterative procedure is described in Algorithm 1.
Algorithm 1 Iterative procedure for obtaining the optimal densities under factorized density restriction (16). The updates are based on the solutions given by (19).
1:
Initialize q 2 ( θ 2 ) , , q M ( θ M ) .
2:
Cycle through
$$q_1(\theta_1) \leftarrow \frac{\exp\big( E_{i \ne 1}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne 1}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_1}, \quad \ldots, \quad q_M(\theta_M) \leftarrow \frac{\exp\big( E_{i \ne M}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne M}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_M},$$
until the increase in L ( q ) is negligible.
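For illustration, the sketch below runs the coordinate updates of Algorithm 1 on a standard textbook example, namely a factorized Gaussian approximation to a bivariate Gaussian target; the target mean and precision are hypothetical, and this example is not the probit model of this paper. For a Gaussian target the mean-field updates recover the exact mean.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (mean-field coordinate updates) for a toy target:
# a bivariate Gaussian "posterior" with known mean mu and precision Lam.
# The factors q_1, q_2 are univariate Gaussians; only their means need updating here.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])                   # hypothetical precision matrix

m = np.zeros(2)                                # initialize the factor means (step 1)
for _ in range(50):                            # cycle through the factors (step 2)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m, mu)                                   # the factorized means converge to mu
```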

5. Incorporate Intrinsic Prior with Variational Approximation to Bayesian Probit Models

5.1. Derivation of Intrinsic Prior to Be Used in Variational Inference

Let X l be the design matrix of a minimal training sample (mTS) of a normal regression model M j for the variable Z N ( X j β j , I j + 1 ) . We have, for the j + 1 -dimensional parameter β j ,
$$\int N_{j+1}(\mathbf{z}_l \mid \mathbf{X}_l \beta_j, \mathbf{I}_{j+1})\, d\beta_j = \begin{cases} |\mathbf{X}_l^\top \mathbf{X}_l|^{-1/2} & \text{if rank}(\mathbf{X}_l) = j+1, \\ \infty & \text{otherwise.} \end{cases}$$
Therefore, it follows that the mTS size is j + 1 [2]. Given that priors for α and β are proportional to 1, the intrinsic prior for β conditional on α could be derived. Let β 0 denote the vector with the first component equal to α and the others equal to zero. Based on Formula (9), we have
$$\begin{aligned}
\pi^I(\beta \mid \alpha) &= \pi_j^N(\beta)\, E_{\mathbf{z}_l \mid \beta}^{M_j}\!\left[ \frac{p_1(\mathbf{z}_l \mid \alpha)}{\int p_j(\mathbf{z}_l \mid \beta)\, \pi_j^N(\beta)\, d\beta} \right] \\
&= E_{\mathbf{z}_l \mid \beta}^{M_j}\!\left[ \frac{\exp\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta_0)^\top (\mathbf{z}_l - \mathbf{X}_l \beta_0) \}}{\int \exp\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta)^\top (\mathbf{z}_l - \mathbf{X}_l \beta) \}\, d\beta} \right] \\
&= (2\pi)^{-\frac{j+1}{2}}\, \big|(\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big|^{-\frac{1}{2}} \times E_{\mathbf{z}_l \mid \beta}^{M_j}\!\Big[ \exp\big\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta_0)^\top (\mathbf{z}_l - \mathbf{X}_l \beta_0) \big\} \Big] \\
&= (2\pi)^{-\frac{j+1}{2}}\, \big|2 (\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\beta - \beta_0)^\top \tfrac{\mathbf{X}_l^\top \mathbf{X}_l}{2} (\beta - \beta_0) \Big\}.
\end{aligned}$$
Therefore,
$$\pi^I(\beta \mid \alpha) = N_{j+1}\big(\beta \mid \beta_0,\, 2 (\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big), \quad \text{where } \beta_0 = (\alpha, 0, \ldots, 0)^\top \in \mathbb{R}^{j+1}.$$
Notice that X_l^⊤ X_l is unknown because it is a theoretical design matrix corresponding to the training sample z_l. It can be estimated by averaging over all submatrices containing j+1 rows of the n × (j+1) design matrix X_j. This average is ((j+1)/n) X_j^⊤ X_j (see [26] and Appendix A in [2]), and therefore
$$\pi^I(\beta \mid \alpha) = N_{j+1}\Big(\beta \,\Big|\, \beta_0,\, \frac{2n}{j+1}\, (\mathbf{X}_j^\top \mathbf{X}_j)^{-1}\Big).$$
Next, based on π I ( β | α ) , the intrinsic prior for β can be obtained by
π I ( β ) = π I ( β | α ) π I ( α ) d α .
Since we assume that π^I(α) = π^N(α) is proportional to one, set π^N(α) = c, where c is an arbitrary positive constant. Denoting (2n/(j+1)) (X_j^⊤ X_j)^{-1} by Σ_{β|α}, we obtain
$$\begin{aligned}
\pi^I(\beta) &= \int \pi^I(\beta \mid \alpha)\, \pi^I(\alpha)\, d\alpha \\
&= \int c \cdot (2\pi)^{-\frac{j+1}{2}} |\Sigma_{\beta|\alpha}|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\beta - \beta_0)^\top \Sigma_{\beta|\alpha}^{-1} (\beta - \beta_0) \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \big[ \beta_0^\top \Sigma_{\beta|\alpha}^{-1} \beta_0 - 2 \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta_0 \big] \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \big( \Sigma_{\beta|\alpha(1,1)}^{-1}\, \alpha^2 - 2 \beta^\top \Sigma_{\beta|\alpha(\cdot 1)}^{-1}\, \alpha \big) \Big\}\, d\alpha,
\end{aligned}$$
where Σ_{β|α(1,1)}^{-1} is the element of Σ_{β|α}^{-1} in row 1, column 1, and Σ_{β|α(·1)}^{-1} is the first column of Σ_{β|α}^{-1}. Denoting Σ_{β|α(1,1)}^{-1} by σ_11 and Σ_{β|α(·1)}^{-1} by γ_1, we then obtain
$$\begin{aligned}
\pi^I(\beta) &\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \sigma_{11} \Big( \alpha - \frac{\beta^\top \gamma_1}{\sigma_{11}} \Big)^{\!2} + \tfrac{1}{2} \frac{(\beta^\top \gamma_1)^2}{\sigma_{11}} \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \Big( \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta - \beta^\top \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \beta \Big) \Big\} \times \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big) \beta \Big\}.
\end{aligned}$$
Therefore, we have derived that
$$\pi^I(\beta) \propto N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big).$$
For model comparison, the specific form of the intrinsic prior may be needed, including the constant factor. Therefore, by following (21) and (22) we have
$$\begin{aligned}
\pi^I(\beta) &= c \cdot (2\pi)^{-\frac{j+1}{2}} |\Sigma_{\beta|\alpha}|^{-\frac{1}{2}}\, (2\pi)^{\frac{j+1}{2}} \Big| \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big|^{\frac{1}{2}} \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big) \\
&= c \cdot \Big| \Sigma_{\beta|\alpha} \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big) \Big|^{-\frac{1}{2}} \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big) \\
&= c \cdot \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \Big| \mathbf{I} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Sigma_{\beta|\alpha} \Big|^{-\frac{1}{2}} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big).
\end{aligned}$$
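As a small numerical illustration of the construction above, the sketch below (Python with numpy; the design matrix is hypothetical) builds the conditional covariance Σ_{β|α} = (2n/(j+1)) (X_j^⊤ X_j)^{-1} and the marginal intrinsic prior precision Σ_{β|α}^{-1} − γ_1 γ_1^⊤/σ_11.

```python
import numpy as np

# Sketch of the intrinsic prior covariance structure derived above, for a hypothetical
# design matrix with an intercept column plus j covariates (so beta has dimension j + 1).
rng = np.random.default_rng(1)
n, j = 500, 3
Xj = np.column_stack([np.ones(n), rng.normal(size=(n, j))])

# Conditional intrinsic prior: pi^I(beta | alpha) = N(beta_0, 2n/(j+1) (Xj' Xj)^{-1}).
Sigma_cond = 2 * n / (j + 1) * np.linalg.inv(Xj.T @ Xj)

# Marginalizing alpha out gives the precision Sigma_cond^{-1} - gamma_1 gamma_1' / sigma_11.
P = np.linalg.inv(Sigma_cond)          # Sigma_{beta|alpha}^{-1}
gamma_1 = P[:, 0]                      # first column of the precision
sigma_11 = P[0, 0]                     # its (1, 1) element
Prec_beta = P - np.outer(gamma_1, gamma_1) / sigma_11

# This precision has a zero first row/column (flat in the intercept direction, inherited
# from the flat prior on alpha); the updates in Section 5.2 only need Sigma_beta^{-1},
# so the precision is used directly rather than inverted.
print(np.round(Prec_beta[0], 8))
```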

5.2. Variational Inference for Probit Model with Intrinsic Prior

5.2.1. Iterative Updates for Factorized Distributions

We have that
$$Z_i \mid \beta \sim N(\mathbf{x}_i^\top \beta, 1) \quad \text{and} \quad p(y_i \mid z_i) = I(z_i > 0) I(y_i = 1) + I(z_i \le 0) I(y_i = 0)$$
in Section 3.1. We have shown in Section 5.1 that
$$\pi^I(\beta) \propto N_{j+1}(\mu_\beta, \Sigma_\beta),$$
where μ β = 0 and Σ β = ( Σ β | α 1 γ 1 γ 1 σ 11 ) 1 . Since y is independent of β given z , we have
p ( y , z , β ) = p ( y | z , β ) p ( z | β ) p ( β ) = p ( y | z ) p ( z | β ) p ( β ) .
To apply the variational approximation to probit regression model, unobservable variables are considered in two separate groups, coefficient parameter β and auxiliary variable Z . To approximate the posterior distribution of β , consider the product form
q ( Z , β ) = q Z ( Z ) q β ( β ) .
We proceed by first describing the distribution for each factor of the approximation, q Z ( Z ) and q β ( β ) . Then variational approximation is accomplished by iteratively updating the parameters of each factor distribution.
Starting with q_Z(Z): when y_i = 1, we have
$$\log p(\mathbf{y}, \mathbf{z}, \beta) = \log \left[ \prod_i \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\frac{(z_i - \mathbf{x}_i^\top \beta)^2}{2} \Big\} \times \pi^I(\beta) \right], \quad \text{where } z_i > 0.$$
Now, according to (19) and Algorithm 1, the optimal q Z is proportional to
$$\exp\big\{ E_\beta[\log p(\mathbf{y}, \mathbf{z}, \beta)] \big\} = \exp\Big\{ E_\beta\Big[ -\tfrac{1}{2} \sum_{i=1}^n (z_i - \mathbf{x}_i^\top \beta)^2 \Big] + \text{constant} \Big\}, \quad \text{subject to } z_i > 0 \text{ if } y_i = 1 \text{ and } z_i \le 0 \text{ if } y_i = 0.$$
So, we have the optimal q Z ,
$$q_Z^*(\mathbf{Z}) \propto \exp\Big\{ -\tfrac{1}{2} \mathbf{z}^\top \mathbf{z} + E_\beta[\beta]^\top \mathbf{X}^\top \mathbf{z} + \text{constant} \Big\} \propto \exp\Big\{ -\tfrac{1}{2} \big( \mathbf{z} - \mathbf{X} E_\beta[\beta] \big)^\top \big( \mathbf{z} - \mathbf{X} E_\beta[\beta] \big) \Big\}.$$
A similar derivation applies when y_i = 0. Therefore, the optimal approximation for each component of q_Z is a truncated normal distribution,
$$q_Z^*(Z_i) = \begin{cases} N_{[0, +\infty)}\big( (\mathbf{X} E_\beta[\beta])_i,\, 1 \big) & \text{if } y_i = 1, \\ N_{(-\infty, 0]}\big( (\mathbf{X} E_\beta[\beta])_i,\, 1 \big) & \text{if } y_i = 0. \end{cases}$$
Denote X E_β[β] by μ_z, the location of the distribution q_Z^*(Z). The expectation E_β is taken with respect to the density q_β(β), which we derive next.
For q β ( β ) , given the joint form in (25), we have
$$\log p(\mathbf{y}, \mathbf{z}, \beta) = -\tfrac{1}{2} (\mathbf{z} - \mathbf{X}\beta)^\top (\mathbf{z} - \mathbf{X}\beta) - \tfrac{1}{2} (\beta - \mu_\beta)^\top \Sigma_\beta^{-1} (\beta - \mu_\beta) + \text{constant}.$$
Taking expectation with respect to q Z ( z ) , we have
$$E_Z[\log p(\mathbf{y}, \mathbf{z}, \beta)] = -\tfrac{1}{2} \beta^\top \mathbf{X}^\top \mathbf{X} \beta + E_Z[\mathbf{Z}]^\top \mathbf{X} \beta - \tfrac{1}{2} (\beta - \mu_\beta)^\top \Sigma_\beta^{-1} (\beta - \mu_\beta) + \text{constant}.$$
Again, based on (19) and Algorithm 1, the log of the optimal q_β(β) equals E_Z[log p(y, z, β)] up to an additive constant, so
$$\log q_\beta^*(\beta) = -\tfrac{1}{2} \beta^\top \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big) \beta + \big( E_Z[\mathbf{Z}]^\top \mathbf{X} + \mu_\beta^\top \Sigma_\beta^{-1} \big) \beta + \text{constant}.$$
Notice that all constant terms, including the constant factor in the intrinsic prior, cancel out upon normalization. Recognizing the quadratic form in the above expression, we have
q β * ( β ) = N ( μ q β , Σ q β ) ,
where
$$\Sigma_{q_\beta} = \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1}, \qquad \mu_{q_\beta} = \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1} \big( \mathbf{X}^\top E_Z[\mathbf{Z}] + \Sigma_\beta^{-1} \mu_\beta \big).$$
Notice that μ q β , i.e., E β [ β ] , depends on E Z [ Z ] . In addition, from our previous derivation, we found that the update for E Z [ Z ] depends on E β [ β ] . Given that the density form of q Z is truncated normal, we have
$$E_Z[Z_i] = \begin{cases} (\mathbf{X} E_\beta[\beta])_i + \dfrac{\phi\big( (\mathbf{X} E_\beta[\beta])_i \big)}{\Phi\big( (\mathbf{X} E_\beta[\beta])_i \big)} & \text{if } y_i = 1, \\[2ex] (\mathbf{X} E_\beta[\beta])_i - \dfrac{\phi\big( (\mathbf{X} E_\beta[\beta])_i \big)}{1 - \Phi\big( (\mathbf{X} E_\beta[\beta])_i \big)} & \text{if } y_i = 0, \end{cases}$$
where φ is the standard normal density and Φ is the standard normal cumulative distribution function. Denote E_Z[Z] by μ_{q_Z}. See properties of the truncated normal distribution in Appendix A and Appendix B. Updating procedures for the parameters μ_{q_β} and μ_{q_Z} of each factor distribution are summarized in Algorithm 2.
Algorithm 2 Iterative procedure for updating parameters to reach optimal factor densities q β and q Z in Bayesian probit regression model. The updates are based on the solutions given by (26) and (27).
1:
Initialize μ q Z .
2:
Cycle through
$$\mu_{q_\beta} \leftarrow \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1} \big( \mathbf{X}^\top \mu_{q_Z} + \Sigma_\beta^{-1} \mu_\beta \big), \qquad \mu_{q_Z} \leftarrow \mathbf{X} \mu_{q_\beta} + \frac{\phi(\mathbf{X} \mu_{q_\beta})}{\Phi(\mathbf{X} \mu_{q_\beta})^{\mathbf{y}} \big[ \Phi(\mathbf{X} \mu_{q_\beta}) - \mathbf{1} \big]^{\mathbf{1} - \mathbf{y}}},$$
until the increase in L ( q ) is negligible.
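A minimal Python sketch of the iterative updates in Algorithm 2 follows, assuming μ_β = 0 as derived above. The function name, the zero initialization, and the convergence check on μ_{q_β} (instead of monitoring L(q)) are our own choices for the sketch; with the intrinsic prior, Prec_beta would be the (singular) precision Σ_{β|α}^{-1} − γ_1 γ_1^⊤/σ_11 from Section 5.1.

```python
import numpy as np
from scipy.stats import norm

def vb_probit_intrinsic(X, y, Prec_beta, n_iter=100, tol=1e-8):
    """Sketch of Algorithm 2: coordinate updates for q(beta, Z) = q_beta(beta) q_Z(Z).
    Prec_beta is the prior precision Sigma_beta^{-1} (the prior mean mu_beta is 0)."""
    n, d = X.shape
    Sigma_q = np.linalg.inv(X.T @ X + Prec_beta)        # Sigma_{q_beta}, fixed across iterations
    mu_qz = np.zeros(n)                                 # initialize mu_{q_Z} (step 1)
    mu_qb = np.zeros(d)
    sign = 2 * y - 1                                    # +1 if y_i = 1, -1 if y_i = 0
    for _ in range(n_iter):                             # cycle through the updates (step 2)
        mu_qb_new = Sigma_q @ (X.T @ mu_qz)             # update for mu_{q_beta} (prior mean is 0)
        eta = X @ mu_qb_new                             # linear predictor X mu_{q_beta}
        # truncated-normal mean: eta + phi(eta)/Phi(eta) if y=1, eta - phi(eta)/(1-Phi(eta)) if y=0
        mu_qz = eta + sign * norm.pdf(eta) / norm.cdf(sign * eta)
        if np.max(np.abs(mu_qb_new - mu_qb)) < tol:     # the paper monitors L(q); we track mu_{q_beta}
            mu_qb = mu_qb_new
            break
        mu_qb = mu_qb_new
    return mu_qb, Sigma_q, mu_qz
```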

5.2.2. Evaluation of the Lower Bound L ( q )

During the optimization of the variational approximation densities, the lower bound on the log marginal likelihood needs to be evaluated and monitored to determine when the iterative updating process has converged. Based on the derivations in the previous section, we now have the exact form of the variational inference density,
q ( β , Z ) = q β ( β ) q Z ( Z ) .
According to (14), we can write down the lower bound L ( q ) with respect to q ( β , Z ) .
$$\begin{aligned}
\mathcal{L}(q) &= \int\!\!\int q(\beta, \mathbf{Z}) \log \frac{p(\mathbf{Y}, \beta, \mathbf{Z})}{q(\beta, \mathbf{Z})}\, d\beta\, d\mathbf{Z} = \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log \frac{p(\mathbf{Y}, \beta, \mathbf{Z})}{q_\beta(\beta)\, q_Z(\mathbf{Z})}\, d\beta\, d\mathbf{Z} \\
&= \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log\{ p(\mathbf{Y}, \beta, \mathbf{Z}) \}\, d\beta\, d\mathbf{Z} - \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log\{ q_\beta(\beta)\, q_Z(\mathbf{Z}) \}\, d\beta\, d\mathbf{Z} \\
&= E_{\beta, Z}\big[ \log\{ p(\mathbf{Y}, \mathbf{Z} \mid \beta) \} \big] + E_{\beta, Z}\big[ \log \pi^I(\beta) \big] - E_{\beta, Z}\big[ \log\{ q_\beta(\beta) \} \big] - E_{\beta, Z}\big[ \log\{ q_Z(\mathbf{Z}) \} \big].
\end{aligned}$$
As we can see in (28), L ( q ) has been divided into four different parts with expectation taken over the variational approximation density q ( β , Z ) = q β ( β ) q Z ( Z ) . We now find the expression of these expectations one by one.

Part 1: E β , Z [ log { p ( Y , Z | β ) } ]

= log ( 2 π ) n 2 + q β ( β ) q Z ( Z ) { 1 2 ( z X β ) ( z X β ) } d β d z = log ( 2 π ) n 2 + q Z ( Z ) q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β d z
Dealing with the inner integral first, we have
q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β = 1 2 q β ( β ) [ β X X β ] d β + z X E β [ β ] 1 2 z z = 1 2 q β ( β ) [ β X X β ] d β + z X μ q β 1 2 z z
where
1 2 q β ( β ) [ β X X β ] d β = 1 2 q β ( β ) [ ( β μ q β + μ q β ) X X ( β μ q β + μ q β ) ] d β = 1 2 trace ( X X E β [ ( β μ q β ) ( β μ q β ) ] ) 1 2 μ q β X X μ q β = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) .
Substituting (31) into (30), we get
q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + z X μ q β 1 2 z z .
Substituting (32) back into (29) gives
E β , Z [ log { p ( Y , Z | β ) } ] = log ( 2 π ) n 2 + q Z ( z ) { 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + z X μ q β 1 2 z z } d z = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) 1 2 E Z [ z z ] + μ q z μ z = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + μ q z μ z 1 2 i = 1 n [ 1 + μ z i 2 μ z i ϕ ( μ z i ) Φ ( μ z i ) ] I ( y i = 0 ) [ 1 + μ z i 2 + μ z i ϕ ( μ z i ) 1 Φ ( μ z i ) ] I ( y i = 1 ) = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + μ q z μ z 1 2 i = 1 n [ 1 + μ q z i μ z i ] I ( y i = 0 ) [ 1 + μ q z i μ z i ] I ( y i = 1 ) = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ q z μ z n 2 .
We applied properties of truncated normal distribution in Appendix B to find the expression of the second moment E Z [ z z ] .

Part 2: E β , Z [ log q Z ( z ) ]

= q β ( β ) q Z ( z ) log q Z ( z ) d β d Z = q Z ( z ) log q Z ( z ) d Z = n 2 ( log ( 2 π ) + 1 ) + i = 1 n { [ log ( Φ ( μ z i ) ) + μ z i ϕ ( μ z i ) 2 Φ ( μ z i ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) μ z i ϕ ( μ z i ) 2 ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } = n 2 ( log ( 2 π ) + 1 ) 1 2 μ z μ z + 1 2 μ q z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) }
Again, see Appendix B for well-known properties of the truncated normal distribution. Subtracting (34) from (33), we get
E β , Z [ log { p ( Y , Z | β ) } ] E β , Z [ log q Z ( z ) ] = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } .
Based on the exact expression of the intrinsic prior π I ( β ) , denoting all constant terms by C, we have

Part 3: E β , Z [ log p β ( β ) ]

= q Z ( z ) q β ( β ) log π I ( β ) d β d z = log C ( j + 1 ) 2 log ( 2 π ) 1 2 log | Σ β | 1 2 q β ( β ) [ β Σ β 1 β ] d β
To find the expression for the integral, we have
q β ( β ) [ β Σ β 1 β ] d β = q β ( β ) ( β μ q β + μ q β ) Σ β 1 ( β μ q β + μ q β ) d β = E [ trace ( Σ β 1 ( β μ q β ) ( β μ q β ) ) ] + μ q β Σ β 1 μ q β = trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β
Substituting (37) back into (36), we obtain
E β , Z [ log p β ( β ) ] = log C ( j + 1 ) 2 log ( 2 π ) 1 2 log | Σ β | 1 2 [ trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β ] .

Part 4: E β , Z [ log q β ( β ) ]

= q Z ( z ) q β ( β ) log q β ( β ) d β = j + 1 2 log ( 2 π ) 1 2 log | Σ q β | 1 2 q β ( β ) ( β μ q β ) Σ q β 1 ( β μ q β ) d β = j + 1 2 log ( 2 π ) 1 2 log | Σ q β | 1 2 trace ( Σ β 1 Σ β ) = j + 1 2 ( log ( 2 π ) + 1 ) 1 2 log | Σ q β |
Combining all four parts together, we get
L ( q ) = E β , Z [ log { p ( Y , Z | β ) } ] + E β , Z [ π I ( β ) ] E β , Z [ log { q β ( β ) } ] E β , Z [ log { q Z ( Z ) } ] = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } E β , Z [ log { p ( Y , Z | β ) } ] E β , Z [ log { q Z ( Z ) } ] + log C 1 2 log | Σ β | 1 2 [ trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β ] + j + 1 2 + 1 2 log | Σ q β | E β , Z [ log p β ( β ) ] E β , Z [ log q β ( β ) ] .

5.3. Model Comparison Based on Variational Approximation

Suppose we want to compare two models, M_1 and M_0, where M_0 is the simpler model. An intuitive way of comparing two models with variational approximation methods is simply to compare the lower bounds L(q_1) and L(q_0). However, by comparing the lower bounds we are implicitly assuming that the KL divergences in the two approximations are the same, so that the lower bounds alone can serve as a guide. Unfortunately, it is not easy to measure how tight any particular bound is; if this could be accomplished, we could estimate the log marginal likelihood more accurately from the beginning. As clarified in [27], when comparing two exact log marginal likelihoods, we have
$$\log p_1(\mathbf{X}) - \log p_0(\mathbf{X}) = \big[ \mathcal{L}(q_1) + KL(q_1 \| p_1) \big] - \big[ \mathcal{L}(q_0) + KL(q_0 \| p_0) \big]$$
$$= \mathcal{L}(q_1) - \mathcal{L}(q_0) + \big[ KL(q_1 \| p_1) - KL(q_0 \| p_0) \big]$$
$$\approx \mathcal{L}(q_1) - \mathcal{L}(q_0).$$
The difference in log marginal likelihoods, log p_1(X) − log p_0(X), is the quantity we wish to estimate. However, if we base the comparison on the difference of lower bounds, we are basing our model comparison on the approximation in the last display rather than on (41). Therefore, there is a systematic bias towards the simpler model whenever KL(q_1‖p_1) − KL(q_0‖p_0) is not zero.
Realizing that we have a variational approximation for the posterior distribution of β , we propose the following method to estimate p ( X ) based on our variational approximation q β ( β ) (27). First, writing the marginal likelihood as
$$p(\mathbf{x}) = \int \frac{p(\mathbf{x} \mid \beta)\, \pi^I(\beta)}{q_\beta(\beta)}\, q_\beta(\beta)\, d\beta,$$
we can interpret it as the expectation
$$p(\mathbf{x}) = E\left[ \frac{p(\mathbf{x} \mid \beta)\, \pi^I(\beta)}{q_\beta(\beta)} \right]$$
with respect to q_β(β). Next, draw samples β^(1), ..., β^(n) from q_β(β) and obtain the estimated marginal likelihood
$$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(\mathbf{x} \mid \beta^{(i)})\, \pi^I(\beta^{(i)})}{q_\beta(\beta^{(i)})}.$$
Note that the proposed method is equivalent to importance sampling with q_β(β) as the importance function; we know its exact form, and generating the random draws β^(i) is easy and inexpensive.
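The estimator above is straightforward to code. The sketch below (Python; the function name log_marginal_importance is our own) works on the log scale for numerical stability and, since the intrinsic prior is specified only up to the arbitrary constant c, substitutes a proper normal prior as a stand-in; with the intrinsic prior one would plug in its exact expression from Section 5.1, with c cancelling in Bayes factors.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_marginal_importance(X, y, prior_mean, prior_cov, mu_q, Sigma_q, n_draws=2500, seed=0):
    """Importance-sampling estimate of the log marginal likelihood, using the variational
    posterior q_beta = N(mu_q, Sigma_q) as the importance function (Section 5.3).
    A proper normal prior N(prior_mean, prior_cov) stands in for the prior here."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu_q, Sigma_q, size=n_draws)       # beta^(1..n) ~ q_beta
    # log p(x | beta) for the probit likelihood, evaluated at each draw
    eta = draws @ X.T                                                  # shape (n_draws, n)
    log_lik = np.where(y == 1, norm.logcdf(eta), norm.logcdf(-eta)).sum(axis=1)
    log_prior = multivariate_normal.logpdf(draws, mean=prior_mean, cov=prior_cov)
    log_q = multivariate_normal.logpdf(draws, mean=mu_q, cov=Sigma_q)
    log_w = log_lik + log_prior - log_q                                # log importance weights
    m = log_w.max()                                                    # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```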

6. Modeling Probability of Default Using Lending Club Data

6.1. Introduction

LendingClub (https://www.lendingclub.com/) is the world’s largest peer-to-peer lending platform. LendingClub enables borrowers to create unsecured personal loans between $1000 and $40,000. The standard loan period is three or five years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee. To attract lenders, LendingClub publishes most of the information available in borrowers’ credit reports as well as information reported by borrowers for almost every loan issued through its website.

6.2. Modeling Probability of Default—Target Variable and Predictive Features

Publicly available LendingClub data, from June 2007 to the fourth quarter of 2018, contain a total of 2,260,668 issued loans. Each loan has a status: Paid-off, Charged-off, or Ongoing. We only kept loans with a final status, i.e., either paid-off or charged-off, and that final loan status is the target variable. We then selected the following loan features as our predictive covariates.
  • Loan term in months (either 36 or 60)
  • FICO
  • Issued loan amount
  • DTI (Debt to income ratio, i.e., customer’s total debt divided by income)
  • Number of credit lines opened in past 24 months
  • Employment length in years
  • Annual income
  • Home ownership type (own, mortgage, or rent)
We took a sample from the original data set with customer yearly income between $15,000 and $60,000, ending up with a data set of 520,947 rows.

6.3. Addressing Uncertainty of Estimated Probit Model Using Variational Inference with Intrinsic Prior

Using the process developed in Section 5, we can update the intrinsic prior for the parameters of the probit model (see Figure 1) using variational inference and obtain the posterior distribution of the estimated parameters. Based on the derived parameter distributions, questions of interest can be explored with model uncertainty taken into account.
Investors will be interested in understanding how each loan feature affects the probability of default, given a certain loan term, either 36 or 60 months. To answer this question, we sampled 6000 cases from the original data set and drew from the derived posterior distribution 100 times. We end up with 6000 × 100 calculated probabilities of default, where each of the 6000 sampled cases yields 100 different probit estimates based on 100 different posterior draws. We summarize some of our findings in Figure 2, where red represents 36-month loans and green represents 60-month loans.
  • In general, 60-month loans have a higher risk of default.
  • Given the loan term, there is a clear trend showing that a higher FICO score means lower risk.
  • Given the loan term, there is a trend showing that a higher DTI indicates higher risk.
  • Given the loan term, there is a trend showing that more credit lines opened in the past 24 months indicates higher risk.
  • There is no clear pattern regarding income. This is probably because we only included customers with income between $15,000 and $60,000 in our training data, which may not represent the true income level of the whole population.
Model uncertainty can also be measured through credible intervals. Again, with the derived posterior distribution, a credible interval is just the range containing a particular percentage of the estimated effect/parameter values. For instance, the 95% credible interval for the estimated parameter of FICO is simply the central portion of the posterior distribution that contains 95% of the estimated values. Contrary to frequentist confidence intervals, Bayesian credible intervals are much more straightforward to interpret. Using the Bayesian framework created in this article, from Figure 3 we can simply state that, given the observed data, the estimated effect of DTI on default has an 89% probability of falling within [8.300, 8.875]. Instead of the conventional 95%, we used 89% following suggestions in [28,29]; it is just as arbitrary as any other convention.
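Computing such an interval from posterior draws is a one-liner. In the sketch below the draws are simulated from a normal as a stand-in for the DTI coefficient's variational posterior; with the real output one would use draws from q_β(β) in (27).

```python
import numpy as np

# Sketch: an equal-tailed 89% credible interval from posterior draws of a coefficient.
rng = np.random.default_rng(2)
dti_draws = rng.normal(loc=8.6, scale=0.18, size=10_000)    # hypothetical posterior draws

lower, upper = np.percentile(dti_draws, [5.5, 94.5])        # central 89% of the draws
print(f"89% credible interval: [{lower:.3f}, {upper:.3f}]")
```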
One of the main advantages of using variational inference over MCMC is that variational inference is much faster. Comparisons were made between the two approximation frameworks on a 64-bit Windows 10 laptop, with 32.0 GB RAM. Using the data set introduced in Section 6.2, we have that
  • with a conjugate prior and following the Gibbs sampling scheme proposed by [17], it took 89.86 s to finish 100 simulations for the Gibbs sampler;
  • following the method proposed in Section 5.2, it took 58.38 s to obtain the approximated posterior distribution and to sample 10,000 times from that posterior.

6.4. Model Comparison

Following the procedure proposed in Section 5.3, we compare the following series of nested models. From the data set introduced in Section 6.2, 2000 records were sampled to estimate the likelihood p(x | β^(i)), where β^(i) is one of 2500 draws sampled directly from the approximated posterior distribution q_β(β), which serves as the importance function used to estimate the marginal likelihood p(x).
  • M 2 : FICO + Term 36 Indicator
  • M 3 : FICO + Term 36 Indicator + Loan Amount
  • M 4 : FICO + Term 36 Indicator + Loan Amount + Annual Income
  • M 5 : FICO + Term 36 Indicator + Loan Amount + Annual Income + Mortgage Indicator
The estimated log marginal likelihood for each model is plotted in Figure 4. We can see that the model evidence increases when the predictive features Loan Amount and Annual Income are added sequentially. However, further adding home ownership information, i.e., the Mortgage Indicator, as a predictive feature decreases the model evidence. We have the Bayes factor
$$BF_{45} = \frac{p(\mathbf{x} \mid M_4)}{p(\mathbf{x} \mid M_5)} = e^{-1014.78 - (-1016.42)} = 5.16,$$
which suggests substantial evidence for model M_4, indicating that home ownership information may be irrelevant for predicting the probability of default given that all the other predictive features are included.

7. Further Work

The authors thank the reviewers for pointing out that mean-field variational Bayes underestimates the posterior variance. This could be an interesting topic for our future research. We plan to study the linear response variational Bayes (LRVB) method proposed in [30] to see if it can be applied to the framework we proposed in this article. To see whether we can get the approximated posterior variance close enough to the true variance using our proposed method, comparisons should be made among the normal conjugate prior with the MCMC procedure, the normal conjugate prior with LRVB, and the intrinsic prior with LRVB.

Author Contributions

Methodology, A.L., L.P. and K.W.; software, A.L.; writing–original draft preparation, A.L., L.P. and K.W.; writing–review and editing, A.L. and L.P.; visualization, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work of L.R.Pericchi was partially funded by NIH grants U54CA096300, P20GM103475 and R25MD010399.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Density Function

Suppose X ∼ N(μ, σ²) has a normal distribution and lies within the interval (a, b), with −∞ ≤ a < b ≤ ∞. Then X conditional on a < X < b has a truncated normal distribution. Its probability density function f, for a ≤ x ≤ b, is given by
$$f(x \mid \mu, \sigma, a, b) = \frac{\frac{1}{\sigma}\, \phi\!\left( \frac{x - \mu}{\sigma} \right)}{\Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)},$$
and by f = 0 otherwise. Here
$$\phi(\xi) = \frac{1}{\sqrt{2\pi}} \exp\big( -\tfrac{1}{2} \xi^2 \big)$$
is the probability density function of the standard normal distribution and Φ(·) is its cumulative distribution function. If b = ∞, then Φ((b − μ)/σ) = 1, and similarly, if a = −∞, then Φ((a − μ)/σ) = 0. The cumulative distribution function of the truncated normal distribution is
$$F(x \mid \mu, \sigma, a, b) = \frac{\Phi(\xi) - \Phi(\alpha)}{Z},$$
where ξ = (x − μ)/σ, α = (a − μ)/σ, β = (b − μ)/σ, and Z = Φ(β) − Φ(α).

Appendix B. Moments and Entropy

Let α = (a − μ)/σ and β = (b − μ)/σ. For two-sided truncation:
$$E(X \mid a < X < b) = \mu + \sigma\, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \qquad Var(X \mid a < X < b) = \sigma^2 \left[ 1 + \frac{\alpha \phi(\alpha) - \beta \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \left( \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} \right)^{\!2}\, \right].$$
For one-sided truncation (upper tail):
$$E(X \mid X > a) = \mu + \sigma \lambda(\alpha), \qquad Var(X \mid X > a) = \sigma^2 \big[ 1 - \delta(\alpha) \big],$$
where α = (a − μ)/σ, λ(α) = φ(α)/(1 − Φ(α)), and δ(α) = λ(α)[λ(α) − α].
For one-sided truncation (lower tail):
$$E(X \mid X < b) = \mu - \sigma\, \frac{\phi(\beta)}{\Phi(\beta)}, \qquad Var(X \mid X < b) = \sigma^2 \left[ 1 - \beta\, \frac{\phi(\beta)}{\Phi(\beta)} - \left( \frac{\phi(\beta)}{\Phi(\beta)} \right)^{\!2}\, \right].$$
More generally, the moment generating function of the truncated normal distribution is
$$e^{\mu t + \sigma^2 t^2 / 2} \cdot \frac{\Phi(\beta - \sigma t) - \Phi(\alpha - \sigma t)}{\Phi(\beta) - \Phi(\alpha)}.$$
For a density f ( x ) defined over a continuous variable, the entropy is given by
$$H[x] = -\int f(x) \log f(x)\, dx.$$
The entropy of a truncated normal density is
$$\log\big( \sqrt{2\pi e}\, \sigma Z \big) + \frac{\alpha \phi(\alpha) - \beta \phi(\beta)}{2Z}.$$
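These moment formulas are easy to check numerically; the sketch below compares the upper-tail formulas with scipy's truncnorm for hypothetical values of μ, σ, and a.

```python
import numpy as np
from scipy.stats import truncnorm, norm

# Checking the one-sided (upper tail) truncation formulas of Appendix B numerically.
# Hypothetical values: X ~ N(mu, sigma^2) truncated to X > a.
mu, sigma, a = 1.0, 2.0, 0.0
alpha = (a - mu) / sigma
lam = norm.pdf(alpha) / (1 - norm.cdf(alpha))           # lambda(alpha)
delta = lam * (lam - alpha)                             # delta(alpha)

# Closed forms from Appendix B.
mean_formula = mu + sigma * lam
var_formula = sigma**2 * (1 - delta)

# scipy parameterizes truncnorm by the standardized bounds (a - mu)/sigma and (b - mu)/sigma.
dist = truncnorm(alpha, np.inf, loc=mu, scale=sigma)
print(mean_formula, dist.mean())                        # should agree
print(var_formula, dist.var())                          # should agree
```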

References

  1. Salmeron, D.; Cano, J.A.; Robert, C.P. Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Stat. Sin. 2015, 25, 1009–1023.
  2. Leon-Novelo, L.; Moreno, E.; Casella, G. Objective Bayes model selection in probit models. Stat. Med. 2012, 31, 353–365.
  3. Jaakkola, T.S.; Jordan, M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000, 10, 25–37.
  4. Girolami, M.; Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput. 2006, 18, 1790–1817.
  5. Consonni, G.; Marin, J.M. Mean-field variational approximate Bayesian inference for latent variable models. Comput. Stat. Data Anal. 2007, 52, 790–798.
  6. Ormerod, J.T.; Wand, M.P. Explaining variational approximations. Am. Stat. 2010, 64, 140–153.
  7. Grimmer, J. An introduction to Bayesian inference via variational approximations. Political Anal. 2010, 19, 32–47.
  8. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
  9. Pérez, M.E.; Pericchi, L.R.; Ramírez, I.C. The Scaled Beta2 distribution as a robust prior for scales. Bayesian Anal. 2017, 12, 615–637.
  10. Mulder, J.; Pericchi, L.R. The matrix-F prior for estimating and testing covariance matrices. Bayesian Anal. 2018, 13, 1193–1214.
  11. Berger, J.O.; Pericchi, L.R. Objective Bayesian Methods for Model Selection: Introduction and Comparison. In Model Selection; Institute of Mathematical Statistics: Beachwood, OH, USA, 2001; pp. 135–207.
  12. Pericchi, L.R. Model selection and hypothesis testing based on objective probabilities and Bayes factors. Handb. Stat. 2005, 25, 115–149.
  13. Scott, J.G.; Berger, J.O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Stat. 2010, 38, 2587–2619.
  14. Jeffreys, H. The Theory of Probability; OUP: Oxford, UK, 1961.
  15. Berger, J.O.; Pericchi, L.R. The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. 1996, 91, 109–122.
  16. Leamer, E.E. Specification Searches: Ad Hoc Inference with Nonexperimental Data; Wiley: New York, NY, USA, 1978; Volume 53.
  17. Albert, J.H.; Chib, S. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 1993, 88, 669–679.
  18. Tanner, M.A.; Wong, W.H. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987, 82, 528–540.
  19. Berger, J.O.; Pericchi, L.R. The intrinsic Bayes factor for linear models. Bayesian Stat. 1996, 5, 25–44.
  20. Casella, G.; Moreno, E. Objective Bayesian variable selection. J. Am. Stat. Assoc. 2006, 101, 157–167.
  21. Moreno, E.; Bertolino, F.; Racugno, W. An intrinsic limiting procedure for model selection and hypotheses testing. J. Am. Stat. Assoc. 1998, 93, 1451–1460.
  22. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  23. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233.
  24. Parisi, G.; Shankar, R. Statistical field theory. Phys. Today 1988, 41, 110.
  25. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  26. Berger, J.; Pericchi, L. Training samples in objective Bayesian model selection. Ann. Stat. 2004, 32, 841–869.
  27. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference; University College London: London, UK, 2003.
  28. Kruschke, J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan; Academic Press: Cambridge, MA, USA, 2014.
  29. McElreath, R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018.
  30. Giordano, R.J.; Broderick, T.; Jordan, M.I. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1441–1449.
Figure 1. Intrinsic Prior.
Figure 2. Effect of term months and other covariates on probability of default.
Figure 3. Credible intervals for estimated coefficients.
Figure 4. Log marginal likelihood comparison.
