Article

Objective Bayesian Inference in Probit Models with Intrinsic Priors Using Variational Approximations

Río Piedras Campus, University of Puerto Rico, 00925 San Juan, Puerto Rico
* Author to whom correspondence should be addressed.
14 Ave. Universidad Ste. 1401, San Juan, PR 00925, USA.
These authors contributed equally to this work.
Entropy 2020, 22(5), 513; https://doi.org/10.3390/e22050513
Submission received: 3 March 2020 / Revised: 14 April 2020 / Accepted: 26 April 2020 / Published: 30 April 2020
(This article belongs to the Special Issue Data Science: Measuring Uncertainties)

Abstract

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. On the other hand, variational inference methods have been employed to solve classification problems using probit regression and logistic regression with normal priors. In this article, we propose to apply variational approximations to probit regression models with an intrinsic prior. We review the mean-field variational method and the procedure for developing the intrinsic prior for the probit regression model. We then present our work on implementing the variational Bayesian probit regression model with the intrinsic prior. Publicly available data from the world's largest peer-to-peer lending platform, LendingClub, is used to illustrate how model output uncertainties are addressed through the framework we propose. With the LendingClub data, the target variable is the final status of a loan, either charged-off or fully paid. Investors may very well be interested in how predictive features such as FICO score, amount financed, and income affect the final loan status.

1. Introduction

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. So far, only two articles have explored intrinsic prior related methods for classification problems. Reference [1] implements integral priors in generalized linear models with various link functions, and reference [2] considers intrinsic priors for probit models. On the other hand, variational inference methods have been employed to solve classification problems with logistic regression ([3]) and probit regression ([4,5]) under normal priors. Variational approximation methods are reviewed in [6,7] and, more recently, in [8].
In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. In Section 4, we review the mean-field variational method used in this article. In Section 3, the procedure for developing intrinsic priors for probit models is introduced following [2]. Our own work is presented in Section 5. Our motivations for combining the intrinsic prior methodology with variational inference are as follows:
  • Avoiding manually specified, ad hoc plug-in priors by automatically generating a family of non-informative priors that are less sensitive to subjective choices.
  • References [1,2] do not consider inference on the posterior distributions of the parameters; their focus is on model comparison. Although the development of intrinsic priors comes from a model selection background, we thought it would be interesting to apply intrinsic priors to inference problems. In fact, some recently developed priors proposed to solve inference or estimation problems turned out to also be intrinsic priors, for example, the Scaled Beta2 prior [9] and the Matrix-F prior [10].
  • Intrinsic priors concentrate probability near the null hypothesis, a condition that is widely accepted and should be required of a prior for testing a hypothesis.
  • Intrinsic priors also have flat tails that prevent finite sample inconsistency [11].
  • For inference problems with large data sets, variational approximation methods are much faster than MCMC-based methods.
As for model comparison, because the output of variational inference methods cannot be used directly to compare models, we propose in Section 5.3 to simply use the variational approximation of the posterior distribution as an importance function and obtain a Monte Carlo estimate of the marginal likelihood by importance sampling.

2. Background and Development of Intrinsic Prior Methodology

2.1. Bayes Factor

The Bayesian framework of model selection coherently involves the use of probability to express all uncertainty in the choice of model, including uncertainty about the unknown parameters of a model. Suppose that models M 1 , M 2 , . . . , M q are under consideration. We shall assume that the observed data x = ( x 1 , x 2 , . . . , x n ) is generated from one of these models but we do not know which one it is. We express our uncertainty through prior probability P ( M j ) , j = 1 , 2 , . . . , q . Under model M i , x has density f i ( x | θ i , M i ) , where θ i are unknown model parameters, and the prior distribution for θ i is π i ( θ i | M i ) . Given observed data and prior probabilities, we can then evaluate the posterior probability of M i using Bayes’ rule
$$P(M_i \mid \mathbf{x}) = \frac{p_i(\mathbf{x} \mid M_i)\, P(M_i)}{\sum_{j=1}^{q} p_j(\mathbf{x} \mid M_j)\, P(M_j)},$$
where
$$p_i(\mathbf{x} \mid M_i) = \int f_i(\mathbf{x} \mid \theta_i, M_i)\, \pi_i(\theta_i \mid M_i)\, d\theta_i$$
is the marginal likelihood of x under M_i, also called the evidence for M_i [12]. A common choice of prior model probabilities is P(M_j) = 1/q, so that each model has the same initial probability. However, there are other ways of assigning prior probabilities that correct for multiple comparisons (see [13]). From (1), the posterior odds are therefore the prior odds multiplied by the Bayes factor
$$\frac{P(M_j \mid \mathbf{x})}{P(M_i \mid \mathbf{x})} = \frac{P(M_j)\, p_j(\mathbf{x})}{P(M_i)\, p_i(\mathbf{x})} = \frac{P(M_j)}{P(M_i)} \times B_{ji},$$
where the Bayes factor of M_j to M_i is defined by
$$B_{ji} = \frac{p_j(\mathbf{x})}{p_i(\mathbf{x})} = \frac{\int f_j(\mathbf{x} \mid \theta_j)\, \pi_j(\theta_j)\, d\theta_j}{\int f_i(\mathbf{x} \mid \theta_i)\, \pi_i(\theta_i)\, d\theta_i}.$$
Here we omit the dependence on models M j , M i to keep the notation simple. The marginal likelihood, p i ( x ) expresses the preference shown by the observed data for different models. When B j i > 1 , the data favor M j over M i , and when B j i < 1 the data favor M i over M j . A scale for interpretation of B j i is given by [14].
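To make the role of the marginal likelihoods concrete, here is a minimal, self-contained sketch (not from the paper) that computes a Bayes factor for a toy binomial comparison; the models, data, and all numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import integrate, stats

# Toy illustration: Bayes factor for a binomial observation under
# M1: theta = 0.5 (point null) versus M2: theta ~ Uniform(0, 1).
y, n = 7, 10                                   # hypothetical data: 7 successes in 10 trials

# Marginal likelihood under M1 is just the likelihood at theta = 0.5.
p1 = stats.binom.pmf(y, n, 0.5)

# Marginal likelihood under M2 integrates the likelihood against the uniform prior.
p2, _ = integrate.quad(lambda t: stats.binom.pmf(y, n, t), 0.0, 1.0)

B21 = p2 / p1                                  # Bayes factor of M2 against M1
print(f"B21 = {B21:.3f}")                      # > 1 favours M2, < 1 favours M1
```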

2.2. Motivation and Development of Intrinsic Prior

Computing B_ji requires specification of π_i(θ_i) and π_j(θ_j). Often in Bayesian analysis, when prior information is weak, one can use non-informative (or default) priors π_i^N(θ_i). Common choices for non-informative priors are the uniform prior, π_i^U(θ_i) ∝ 1, and the Jeffreys prior, π_i^J(θ_i) ∝ det(I_i(θ_i))^{1/2}, where I_i(θ_i) is the expected Fisher information matrix corresponding to M_i.
Using any of the π i N in (4) would yield
$$B_{ji}^N = \frac{p_j^N(\mathbf{x})}{p_i^N(\mathbf{x})} = \frac{\int f_j(\mathbf{x} \mid \theta_j)\, \pi_j^N(\theta_j)\, d\theta_j}{\int f_i(\mathbf{x} \mid \theta_i)\, \pi_i^N(\theta_i)\, d\theta_i}.$$
The difficulty with (5) is that the π_i^N are typically improper and hence are defined only up to an unspecified constant c_i, so B_ji^N is defined only up to the ratio c_j/c_i of two unspecified constants.
An attempt to circumvent the ill definition of Bayes factors under improper non-informative priors is the intrinsic Bayes factor introduced by [15], which is a modification of the partial Bayes factor [16]. To define the intrinsic Bayes factor, we consider the subsamples x(l) of the data x of minimal size l such that 0 < p_i^N(x(l)) < ∞. These subsamples are called training samples (not to be confused with training samples in machine learning), and there are L such subsamples in total.
The main idea here is that training sample x ( l ) will be used to convert the improper π i N ( θ i ) to proper posterior
$$\pi_i^N(\theta_i \mid \mathbf{x}(l)) = \frac{f_i(\mathbf{x}(l) \mid \theta_i)\, \pi_i^N(\theta_i)}{p_i^N(\mathbf{x}(l))},$$
where p_i^N(x(l)) = ∫ f_i(x(l) | θ_i) π_i^N(θ_i) dθ_i. Then the Bayes factor for the remainder of the data x(n−l), where x(l) ∪ x(n−l) = x, using π_i^N(θ_i | x(l)) as the prior, is called a "partial" Bayes factor,
$$B_{ji}^N(\mathbf{x}(n-l) \mid \mathbf{x}(l)) = \frac{\int f_j(\mathbf{x}(n-l) \mid \theta_j)\, \pi_j^N(\theta_j \mid \mathbf{x}(l))\, d\theta_j}{\int f_i(\mathbf{x}(n-l) \mid \theta_i)\, \pi_i^N(\theta_i \mid \mathbf{x}(l))\, d\theta_i}.$$
This partial Bayes factor is a well-defined Bayes factor and can be written as B_ji^N(x(n−l) | x(l)) = B_ji^N(x) B_ij^N(x(l)), where B_ji^N(x) = p_j^N(x)/p_i^N(x) and B_ij^N(x(l)) = p_i^N(x(l))/p_j^N(x(l)). Clearly, B_ji^N(x(n−l) | x(l)) depends on the choice of the training sample x(l). To eliminate this arbitrariness and increase stability, reference [15] suggests averaging over all training samples, which yields the arithmetic intrinsic Bayes factor (AIBF)
$$B_{ji}^{AIBF}(\mathbf{x}) = B_{ji}^N(\mathbf{x}) \cdot \frac{1}{L} \sum_{l=1}^{L} B_{ij}^N(\mathbf{x}(l)).$$
The strongest justification of the arithmetic IBF is its asymptotic equivalence with a proper Bayes factor arising from Intrinsic priors. These intrinsic priors were identified through an asymptotic analysis (see [15]). For the case where M i is nested in M j , it can be shown that the intrinsic priors are given by
$$\pi_i^I(\theta_i) = \pi_i^N(\theta_i) \quad \text{and} \quad \pi_j^I(\theta_j) = \pi_j^N(\theta_j)\, E^{M_j}\!\left[ \frac{m_i^N(\mathbf{x}(l))}{m_j^N(\mathbf{x}(l))} \,\Big|\, \theta_j \right].$$

3. Objective Bayesian Probit Regression Models

3.1. Bayesian Probit Model and the Use of Auxiliary Variables

Consider a sample y = (y_1, ..., y_n), where Y_i, i = 1, ..., n, is a 0–1 random variable such that, under model M_j, it follows a probit regression model with a (j+1)-dimensional vector of covariates x_i, where j ≤ p. Here, p is the total number of covariate variables under consideration. The probit model M_j has the form
$$Y_i \mid \beta_0, \ldots, \beta_j, M_j \sim \text{Bernoulli}\big(\Phi(\beta_0 x_{0i} + \beta_1 x_{1i} + \ldots + \beta_j x_{ji})\big), \quad 1 \le i \le n,$$
where Φ denotes the standard normal cumulative distribution function and β j = ( β 0 , . . . , β j ) is a vector of dimension j + 1 . The first component of the vector x i is set equal to 1 so that when considering models of the form (10), the intercept is in any submodel. The maximum length of the vector of covariates is p + 1 . Let π ( β ) , proper or improper, summarize our prior information about β . Then the posterior density of β is given by
$$\pi(\beta \mid \mathbf{y}) = \frac{\pi(\beta) \prod_{i=1}^n \Phi(\mathbf{x}_i^\top \beta)^{y_i} \big(1 - \Phi(\mathbf{x}_i^\top \beta)\big)^{1 - y_i}}{\int \pi(\beta) \prod_{i=1}^n \Phi(\mathbf{x}_i^\top \beta)^{y_i} \big(1 - \Phi(\mathbf{x}_i^\top \beta)\big)^{1 - y_i}\, d\beta},$$
which is largely intractable.
As shown by [17], the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. Following the data augmentation approach of [18], we introduce n latent variables Z_1, ..., Z_n, where
$$Z_i \mid \beta \sim N(\mathbf{x}_i^\top \beta, 1).$$
The probit model (10) can be thought of as a regression model with incomplete sampling information by considering that only the sign of z i is observed. More specifically, define Y i = 1 if Z i > 0 and Y i = 0 otherwise. This allows us to write the probability density of y i given z i
p ( y i | z i ) = I ( z i > 0 ) I ( y i = 1 ) + I ( z i 0 ) I ( y i = 0 ) .
Expansion of the parameter set from { β } to { β , Z } is the key to achieving a tractable solution for variational approximation.
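To make the augmentation concrete, the following minimal Python sketch simulates from the latent-variable representation and checks that thresholding Z at zero reproduces the probit success probabilities; the coefficient vector and design matrix below are hypothetical, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Latent-variable (data augmentation) view of the probit model; beta and X are hypothetical.
n, p = 10_000, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # first column is the intercept
beta = np.array([-0.5, 1.0, 0.3])

# Latent regression Z_i | beta ~ N(x_i' beta, 1); only the sign of Z_i is observed.
Z = X @ beta + rng.normal(size=n)
Y = (Z > 0).astype(int)                        # Y_i = 1 exactly when Z_i > 0

# The implied success probability matches Phi(x_i' beta) of the probit model (10).
print(Y.mean(), norm.cdf(X @ beta).mean())     # the two numbers should be close
```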

3.2. Development of Intrinsic Prior for Probit Models

For the sample z = ( z 1 , . . . , z n ) , the null normal model is
M 1 : { N n ( z | α 1 n , I n ) , π ( α ) } .
For a generic model M j with j + 1 regressors, the alternative model is
M j : { N n ( z | X j β j , I n ) , π ( β j ) } ,
where the design matrix X j has dimensions n × ( j + 1 ) . Intrinsic prior methodology for the linear model was first developed by [19], and was further developed in [20] by using the methods of [21]. This intrinsic methodology gives us an automatic specification of the priors π ( α ) and π ( β ) , starting with the non-informative priors π N ( α ) and π N ( β ) for α and β , which are both improper and proportional to 1.
The marginal distributions for the sample z under the null model, and under the alternative model with intrinsic prior, are formally written as
$$p_1(\mathbf{z}) = \int N_n(\mathbf{z} \mid \alpha \mathbf{1}_n, \mathbf{I}_n)\, \pi^N(\alpha)\, d\alpha, \qquad p_j(\mathbf{z}) = \int\!\!\int N_n(\mathbf{z} \mid \mathbf{X}_j \beta_j, \mathbf{I}_n)\, \pi^I(\beta \mid \alpha)\, \pi^N(\alpha)\, d\alpha\, d\beta.$$
However, these are marginals of the sample z , but our selection procedure requires us to compute the Bayes factor of model M j versus the reference model M 1 for the sample y = ( y 1 , . . . , y n ) . To solve this problem, reference [2] proposed to transform the marginal p j ( z ) into the marginal p j ( y ) by using the probit transformations y i = 1 ( z i > 0 ) , i = 1 , . . . , n . These latter marginals are given by
$$p_j(\mathbf{y}) = \int_{A_1 \times \ldots \times A_n} p_j(\mathbf{z})\, d\mathbf{z},$$
where
$$A_i = \begin{cases} (0, \infty) & \text{if } y_i = 1, \\ (-\infty, 0) & \text{if } y_i = 0. \end{cases}$$

4. Variational Inference

4.1. Overview of Variational Methods

Variational methods have their origins in the 18th century with the work of Euler, Lagrange, and others on the calculus of variations (The derivation in this section is standard in the literature on variational approximation and will at times follow the arguments in [22,23]). Variational inference is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. Variational approximations are a much faster alternative to Markov Chain Monte Carlo (MCMC), especially for large models, and are a richer class of methods than the Laplace approximation [6].
Suppose we have a Bayesian model with a prior distribution for the parameters; the model may also have latent variables. We shall denote the set of all latent variables and parameters by θ, and the set of all observed variables by X. Given a set of n independent, identically distributed observations, for which X = {x_1, ..., x_n} and θ = {θ_1, ..., θ_n}, our probabilistic model (e.g., a probit regression model) specifies the joint distribution p(X, θ), and our goal is to find an approximation for the posterior distribution p(θ | X) as well as for the marginal likelihood p(X). For any probability distribution q(θ), we have the following decomposition of the log marginal likelihood
ln p ( X ) = L ( q ) + KL ( q | | p )
where we have defined
$$\mathcal{L}(q) = \int q(\theta)\, \ln\left\{ \frac{p(\mathbf{X}, \theta)}{q(\theta)} \right\} d\theta,$$
$$\text{KL}(q \,\|\, p) = -\int q(\theta)\, \ln\left\{ \frac{p(\theta \mid \mathbf{X})}{q(\theta)} \right\} d\theta.$$
We refer to (14) as the lower bound of the log marginal likelihood with respect to the density q, and (15) is by definition the Kullback–Leibler divergence of the posterior p(θ | X) from the density q. Based on this decomposition, we can maximize the lower bound L(q) by optimizing with respect to the distribution q(θ), which is equivalent to minimizing the KL divergence. The lower bound is attained when the KL divergence is zero, which happens when q(θ) equals the posterior distribution p(θ | X). In general it is hard to find such a density, since the true posterior distribution is intractable.
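The decomposition (13) is easy to verify numerically. The sketch below does so for a toy model with a discrete parameter taking two values; the joint probabilities and the choice of q are made up purely for illustration.

```python
import numpy as np

# Numerical check of the decomposition log p(X) = L(q) + KL(q || p(theta | X))
# for a toy model with theta in {0, 1}; the joint p(X, theta) is hypothetical.
p_joint = np.array([0.10, 0.30])               # p(X, theta=0), p(X, theta=1)
p_X = p_joint.sum()                            # marginal likelihood p(X)
post = p_joint / p_X                           # exact posterior p(theta | X)

q = np.array([0.6, 0.4])                       # an arbitrary approximating density q(theta)

L_q = np.sum(q * np.log(p_joint / q))          # lower bound (14)
KL = np.sum(q * np.log(q / post))              # KL divergence (15)

print(L_q + KL, np.log(p_X))                   # identical up to rounding error
```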

4.2. Factorized Distributions

The essence of the variational inference approach is to approximate the posterior distribution p(θ | X) by a density q(θ) for which the q-dependent lower bound L(q) is more tractable than the original model evidence. Tractability is achieved by restricting q to a more manageable class of distributions and then maximizing L(q) over that class.
Suppose we partition elements of θ into disjoint groups { θ i } where i = 1 , . . . , M . We then assume that the q density factorizes with respect to this partition, i.e.,
q ( θ ) = i = 1 M q i ( θ i ) .
The product form is the only assumption we make about the distribution. Restriction (16) is also known as the mean-field approximation and has its roots in physics [24].
For all distributions q ( θ ) with the form (16), we need to find the distribution for which the lower bound L ( q ) is largest. Restriction of q to a subclass of product densities like (16) gives rise to explicit solutions for each product component in terms of the others. This fact, in turn, leads to an iterative scheme for obtaining the solutions. To achieve this, we first substitute (16) into (14) and then separate out the dependence on one of the factors q j ( θ j ) . Denoting q j ( θ j ) by q j to keep the notation clear, we obtain
$$\begin{aligned}
\mathcal{L}(q) &= \int \prod_{i=1}^M q_i \left\{ \ln p(\mathbf{X}, \theta) - \sum_{i=1}^M \ln q_i \right\} d\theta \\
&= \int q_j \left\{ \int \ln p(\mathbf{X}, \theta) \prod_{i \ne j} q_i \, d\theta_i \right\} d\theta_j - \int q_j \ln q_j \, d\theta_j + \text{constant} \\
&= \int q_j \ln \tilde{p}(\mathbf{X}, \theta_j)\, d\theta_j - \int q_j \ln q_j \, d\theta_j + \text{constant},
\end{aligned}$$
where p ˜ ( X , θ j ) is given by
ln p ˜ ( X , θ j ) = E i j [ ln p ( X , θ ) ] + constant .
The notation E_{i≠j}[·] denotes an expectation with respect to the q distributions over all variables θ_i for i ≠ j, so that
E i j [ ln p ( X , θ ) ] = ln p ( X , θ ) i j q i d θ i .
Now suppose we keep the {q_{i≠j}} fixed and maximize L(q) in (17) with respect to all possible forms for the density q_j(θ_j). Recognizing that (17) is the negative KL divergence between p̃(X, θ_j) and q_j(θ_j), we see that maximizing (17) is equivalent to minimizing this KL divergence, and the minimum occurs when q_j(θ_j) = p̃(X, θ_j). The optimal q_j*(θ_j) is then
ln q j * ( θ j ) = E i j [ ln p ( X , θ ) ] + constant .
The above solution says that the log of the optimal q_j is obtained simply by considering the log of the joint distribution over all parameters, latent variables, and observed variables, and then taking the expectation with respect to all of the other factors q_i for i ≠ j. Normalizing the exponential of (19), we have
$$q_j^*(\theta_j) = \frac{\exp\big( E_{i \ne j}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne j}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_j}.$$
The set of equations in (19) for j = 1 , . . . , M are not an explicit solution because the expression on the right hand side of (19) for the optimal q j * depends on expectations taken with respect to the other factors q i for i j . We will need to first initialize all of the factors q i ( θ i ) and then cycle through the factors one by one and replace each in turn with an updated estimate given by the right hand side of (19) evaluated using the current estimates for all of the other factors. Convexity properties can be used to show that convergence to at least local optima is guaranteed [25]. The iterative procedure is described in Algorithm 1.
Algorithm 1 Iterative procedure for obtaining the optimal densities under factorized density restriction (16). The updates are based on the solutions given by (19).
1:
Initialize q 2 ( θ 2 ) , , q M ( θ M ) .
2:
Cycle through
$$q_1(\theta_1) \leftarrow \frac{\exp\big( E_{i \ne 1}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne 1}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_1}, \quad \ldots, \quad q_M(\theta_M) \leftarrow \frac{\exp\big( E_{i \ne M}[\ln p(\mathbf{X}, \theta)] \big)}{\int \exp\big( E_{i \ne M}[\ln p(\mathbf{X}, \theta)] \big)\, d\theta_M},$$
until the increase in L ( q ) is negligible.
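For illustration, the sketch below runs the coordinate updates of Algorithm 1 on a standard textbook example, namely a factorized Gaussian approximation to a bivariate Gaussian target; the target mean and precision are hypothetical, and this example is not the probit model of this paper. For a Gaussian target the mean-field updates recover the exact mean.

```python
import numpy as np

# Minimal sketch of Algorithm 1 (mean-field coordinate updates) for a toy target:
# a bivariate Gaussian "posterior" with known mean mu and precision Lam.
# The factors q_1, q_2 are univariate Gaussians; only their means need updating here.
mu = np.array([1.0, -2.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])                   # hypothetical precision matrix

m = np.zeros(2)                                # initialize the factor means (step 1)
for _ in range(50):                            # cycle through the factors (step 2)
    m[0] = mu[0] - Lam[0, 1] / Lam[0, 0] * (m[1] - mu[1])
    m[1] = mu[1] - Lam[1, 0] / Lam[1, 1] * (m[0] - mu[0])

print(m, mu)                                   # the factorized means converge to mu
```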

5. Incorporate Intrinsic Prior with Variational Approximation to Bayesian Probit Models

5.1. Derivation of Intrinsic Prior to Be Used in Variational Inference

Let X l be the design matrix of a minimal training sample (mTS) of a normal regression model M j for the variable Z N ( X j β j , I j + 1 ) . We have, for the j + 1 -dimensional parameter β j ,
$$\int N_{j+1}(\mathbf{z}_l \mid \mathbf{X}_l \beta_j, \mathbf{I}_{j+1})\, d\beta_j = \begin{cases} |\mathbf{X}_l^\top \mathbf{X}_l|^{-1/2} & \text{if rank}(\mathbf{X}_l) = j+1, \\ \infty & \text{otherwise.} \end{cases}$$
Therefore, it follows that the mTS size is j + 1 [2]. Given that priors for α and β are proportional to 1, the intrinsic prior for β conditional on α could be derived. Let β 0 denote the vector with the first component equal to α and the others equal to zero. Based on Formula (9), we have
$$\begin{aligned}
\pi^I(\beta \mid \alpha) &= \pi_j^N(\beta)\, E_{\mathbf{z}_l \mid \beta}^{M_j}\!\left[ \frac{p_1(\mathbf{z}_l \mid \alpha)}{\int p_j(\mathbf{z}_l \mid \beta)\, \pi_j^N(\beta)\, d\beta} \right] \\
&= E_{\mathbf{z}_l \mid \beta}^{M_j}\!\left[ \frac{\exp\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta_0)^\top (\mathbf{z}_l - \mathbf{X}_l \beta_0) \}}{\int \exp\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta)^\top (\mathbf{z}_l - \mathbf{X}_l \beta) \}\, d\beta} \right] \\
&= (2\pi)^{-\frac{j+1}{2}}\, \big|(\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big|^{-\frac{1}{2}} \times E_{\mathbf{z}_l \mid \beta}^{M_j}\!\Big[ \exp\big\{ -\tfrac{1}{2} (\mathbf{z}_l - \mathbf{X}_l \beta_0)^\top (\mathbf{z}_l - \mathbf{X}_l \beta_0) \big\} \Big] \\
&= (2\pi)^{-\frac{j+1}{2}}\, \big|2 (\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\beta - \beta_0)^\top \tfrac{\mathbf{X}_l^\top \mathbf{X}_l}{2} (\beta - \beta_0) \Big\}.
\end{aligned}$$
Therefore,
$$\pi^I(\beta \mid \alpha) = N_{j+1}\big(\beta \mid \beta_0,\, 2 (\mathbf{X}_l^\top \mathbf{X}_l)^{-1}\big), \quad \text{where } \beta_0 = (\alpha, 0, \ldots, 0)^\top \in \mathbb{R}^{j+1}.$$
Notice that X_l^⊤ X_l is unknown because it is a theoretical design matrix corresponding to the training sample z_l. It can be estimated by averaging over all submatrices containing j+1 rows of the n × (j+1) design matrix X_j. This average is ((j+1)/n) X_j^⊤ X_j (see [26] and Appendix A in [2]), and therefore
$$\pi^I(\beta \mid \alpha) = N_{j+1}\Big(\beta \,\Big|\, \beta_0,\, \frac{2n}{j+1}\, (\mathbf{X}_j^\top \mathbf{X}_j)^{-1}\Big).$$
Next, based on π I ( β | α ) , the intrinsic prior for β can be obtained by
π I ( β ) = π I ( β | α ) π I ( α ) d α .
Since we assume that π^I(α) = π^N(α) is proportional to one, set π^N(α) = c, where c is an arbitrary positive constant. Denoting (2n/(j+1)) (X_j^⊤ X_j)^{-1} by Σ_{β|α}, we obtain
$$\begin{aligned}
\pi^I(\beta) &= \int \pi^I(\beta \mid \alpha)\, \pi^I(\alpha)\, d\alpha \\
&= \int c \cdot (2\pi)^{-\frac{j+1}{2}} |\Sigma_{\beta|\alpha}|^{-\frac{1}{2}} \exp\Big\{ -\tfrac{1}{2} (\beta - \beta_0)^\top \Sigma_{\beta|\alpha}^{-1} (\beta - \beta_0) \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \big[ \beta_0^\top \Sigma_{\beta|\alpha}^{-1} \beta_0 - 2 \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta_0 \big] \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \big( \Sigma_{\beta|\alpha(1,1)}^{-1}\, \alpha^2 - 2 \beta^\top \Sigma_{\beta|\alpha(\cdot 1)}^{-1}\, \alpha \big) \Big\}\, d\alpha,
\end{aligned}$$
where Σ_{β|α(1,1)}^{-1} is the element of Σ_{β|α}^{-1} in row 1, column 1, and Σ_{β|α(·1)}^{-1} is the first column of Σ_{β|α}^{-1}. Denoting Σ_{β|α(1,1)}^{-1} by σ_11 and Σ_{β|α(·1)}^{-1} by γ_1, we then obtain
$$\begin{aligned}
\pi^I(\beta) &\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta \Big\} \times \int \exp\Big\{ -\tfrac{1}{2} \sigma_{11} \Big( \alpha - \frac{\beta^\top \gamma_1}{\sigma_{11}} \Big)^{\!2} + \tfrac{1}{2} \frac{(\beta^\top \gamma_1)^2}{\sigma_{11}} \Big\}\, d\alpha \\
&\propto \exp\Big\{ -\tfrac{1}{2} \Big( \beta^\top \Sigma_{\beta|\alpha}^{-1} \beta - \beta^\top \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \beta \Big) \Big\} \times \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \\
&\propto \exp\Big\{ -\tfrac{1}{2} \beta^\top \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big) \beta \Big\}.
\end{aligned}$$
Therefore, we have derived that
$$\pi^I(\beta) \propto N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big).$$
For model comparison, the specific form of the intrinsic prior may be needed, including the constant factor. Therefore, by following (21) and (22) we have
$$\begin{aligned}
\pi^I(\beta) &= c \cdot (2\pi)^{-\frac{j+1}{2}} |\Sigma_{\beta|\alpha}|^{-\frac{1}{2}}\, (2\pi)^{\frac{j+1}{2}} \Big| \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big|^{\frac{1}{2}} \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big) \\
&= c \cdot \Big| \Sigma_{\beta|\alpha} \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big) \Big|^{-\frac{1}{2}} \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big) \\
&= c \cdot \Big( \frac{2\pi}{\sigma_{11}} \Big)^{1/2} \Big| \mathbf{I} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Sigma_{\beta|\alpha} \Big|^{-\frac{1}{2}} \times N_{j+1}\Big( \mathbf{0},\, \Big( \Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1 \gamma_1^\top}{\sigma_{11}} \Big)^{-1} \Big).
\end{aligned}$$
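As a small numerical illustration of the construction above, the sketch below (Python with numpy; the design matrix is hypothetical) builds the conditional covariance Σ_{β|α} = (2n/(j+1)) (X_j^⊤ X_j)^{-1} and the marginal intrinsic prior precision Σ_{β|α}^{-1} − γ_1 γ_1^⊤/σ_11.

```python
import numpy as np

# Sketch of the intrinsic prior covariance structure derived above, for a hypothetical
# design matrix with an intercept column plus j covariates (so beta has dimension j + 1).
rng = np.random.default_rng(1)
n, j = 500, 3
Xj = np.column_stack([np.ones(n), rng.normal(size=(n, j))])

# Conditional intrinsic prior: pi^I(beta | alpha) = N(beta_0, 2n/(j+1) (Xj' Xj)^{-1}).
Sigma_cond = 2 * n / (j + 1) * np.linalg.inv(Xj.T @ Xj)

# Marginalizing alpha out gives the precision Sigma_cond^{-1} - gamma_1 gamma_1' / sigma_11.
P = np.linalg.inv(Sigma_cond)          # Sigma_{beta|alpha}^{-1}
gamma_1 = P[:, 0]                      # first column of the precision
sigma_11 = P[0, 0]                     # its (1, 1) element
Prec_beta = P - np.outer(gamma_1, gamma_1) / sigma_11

# This precision has a zero first row/column (flat in the intercept direction, inherited
# from the flat prior on alpha); the updates in Section 5.2 only need Sigma_beta^{-1},
# so the precision is used directly rather than inverted.
print(np.round(Prec_beta[0], 8))
```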

5.2. Variational Inference for Probit Model with Intrinsic Prior

5.2.1. Iterative Updates for Factorized Distributions

We have that
$$Z_i \mid \beta \sim N(\mathbf{x}_i^\top \beta, 1) \quad \text{and} \quad p(y_i \mid z_i) = I(z_i > 0) I(y_i = 1) + I(z_i \le 0) I(y_i = 0)$$
in Section 3.1. We have shown in Section 5.1 that
$$\pi^I(\beta) \propto N_{j+1}(\mu_\beta, \Sigma_\beta),$$
where μ β = 0 and Σ β = ( Σ β | α 1 γ 1 γ 1 σ 11 ) 1 . Since y is independent of β given z , we have
p ( y , z , β ) = p ( y | z , β ) p ( z | β ) p ( β ) = p ( y | z ) p ( z | β ) p ( β ) .
To apply the variational approximation to probit regression model, unobservable variables are considered in two separate groups, coefficient parameter β and auxiliary variable Z . To approximate the posterior distribution of β , consider the product form
q ( Z , β ) = q Z ( Z ) q β ( β ) .
We proceed by first describing the distribution for each factor of the approximation, q Z ( Z ) and q β ( β ) . Then variational approximation is accomplished by iteratively updating the parameters of each factor distribution.
Starting with q_Z(Z): when y_i = 1, we have
$$\log p(\mathbf{y}, \mathbf{z}, \beta) = \log \left[ \prod_i \frac{1}{\sqrt{2\pi}} \exp\Big\{ -\frac{(z_i - \mathbf{x}_i^\top \beta)^2}{2} \Big\} \times \pi^I(\beta) \right], \quad \text{where } z_i > 0.$$
Now, according to (19) and Algorithm 1, the optimal q Z is proportional to
$$\exp\big\{ E_\beta[\log p(\mathbf{y}, \mathbf{z}, \beta)] \big\} = \exp\Big\{ E_\beta\Big[ -\tfrac{1}{2} \sum_{i=1}^n (z_i - \mathbf{x}_i^\top \beta)^2 \Big] + \text{constant} \Big\}, \quad \text{subject to } z_i > 0 \text{ if } y_i = 1 \text{ and } z_i \le 0 \text{ if } y_i = 0.$$
So, we have the optimal q Z ,
$$q_Z^*(\mathbf{Z}) \propto \exp\Big\{ -\tfrac{1}{2} \mathbf{z}^\top \mathbf{z} + E_\beta[\beta]^\top \mathbf{X}^\top \mathbf{z} + \text{constant} \Big\} \propto \exp\Big\{ -\tfrac{1}{2} \big( \mathbf{z} - \mathbf{X} E_\beta[\beta] \big)^\top \big( \mathbf{z} - \mathbf{X} E_\beta[\beta] \big) \Big\}.$$
A similar derivation applies when y_i = 0. Therefore, the optimal approximation for each component of q_Z is a truncated normal distribution,
$$q_Z^*(Z_i) = \begin{cases} N_{[0, +\infty)}\big( (\mathbf{X} E_\beta[\beta])_i,\, 1 \big) & \text{if } y_i = 1, \\ N_{(-\infty, 0]}\big( (\mathbf{X} E_\beta[\beta])_i,\, 1 \big) & \text{if } y_i = 0. \end{cases}$$
Denote X E_β[β] by μ_z, the location of the distribution q_Z^*(Z). The expectation E_β is taken with respect to the density q_β(β), which we derive next.
For q β ( β ) , given the joint form in (25), we have
$$\log p(\mathbf{y}, \mathbf{z}, \beta) = -\tfrac{1}{2} (\mathbf{z} - \mathbf{X}\beta)^\top (\mathbf{z} - \mathbf{X}\beta) - \tfrac{1}{2} (\beta - \mu_\beta)^\top \Sigma_\beta^{-1} (\beta - \mu_\beta) + \text{constant}.$$
Taking expectation with respect to q Z ( z ) , we have
$$E_Z[\log p(\mathbf{y}, \mathbf{z}, \beta)] = -\tfrac{1}{2} \beta^\top \mathbf{X}^\top \mathbf{X} \beta + E_Z[\mathbf{Z}]^\top \mathbf{X} \beta - \tfrac{1}{2} (\beta - \mu_\beta)^\top \Sigma_\beta^{-1} (\beta - \mu_\beta) + \text{constant}.$$
Again, based on (19) and Algorithm 1, the log of the optimal q_β(β) equals E_Z[log p(y, z, β)] up to an additive constant, so
$$\log q_\beta^*(\beta) = -\tfrac{1}{2} \beta^\top \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big) \beta + \big( E_Z[\mathbf{Z}]^\top \mathbf{X} + \mu_\beta^\top \Sigma_\beta^{-1} \big) \beta + \text{constant}.$$
Notice that all constant terms, including the constant factor in the intrinsic prior, cancel out upon normalization. Recognizing the quadratic form in the above expression, we have
q β * ( β ) = N ( μ q β , Σ q β ) ,
where
$$\Sigma_{q_\beta} = \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1}, \qquad \mu_{q_\beta} = \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1} \big( \mathbf{X}^\top E_Z[\mathbf{Z}] + \Sigma_\beta^{-1} \mu_\beta \big).$$
Notice that μ q β , i.e., E β [ β ] , depends on E Z [ Z ] . In addition, from our previous derivation, we found that the update for E Z [ Z ] depends on E β [ β ] . Given that the density form of q Z is truncated normal, we have
$$E_Z[Z_i] = \begin{cases} (\mathbf{X} E_\beta[\beta])_i + \dfrac{\phi\big( (\mathbf{X} E_\beta[\beta])_i \big)}{\Phi\big( (\mathbf{X} E_\beta[\beta])_i \big)} & \text{if } y_i = 1, \\[2ex] (\mathbf{X} E_\beta[\beta])_i - \dfrac{\phi\big( (\mathbf{X} E_\beta[\beta])_i \big)}{1 - \Phi\big( (\mathbf{X} E_\beta[\beta])_i \big)} & \text{if } y_i = 0, \end{cases}$$
where φ is the standard normal density and Φ is the standard normal cumulative distribution function. Denote E_Z[Z] by μ_{q_Z}. See properties of the truncated normal distribution in Appendix A and Appendix B. Updating procedures for the parameters μ_{q_β} and μ_{q_Z} of each factor distribution are summarized in Algorithm 2.
Algorithm 2 Iterative procedure for updating parameters to reach optimal factor densities q β and q Z in Bayesian probit regression model. The updates are based on the solutions given by (26) and (27).
1:
Initialize μ q Z .
2:
Cycle through
$$\mu_{q_\beta} \leftarrow \big( \mathbf{X}^\top \mathbf{X} + \Sigma_\beta^{-1} \big)^{-1} \big( \mathbf{X}^\top \mu_{q_Z} + \Sigma_\beta^{-1} \mu_\beta \big), \qquad \mu_{q_Z} \leftarrow \mathbf{X} \mu_{q_\beta} + \frac{\phi(\mathbf{X} \mu_{q_\beta})}{\Phi(\mathbf{X} \mu_{q_\beta})^{\mathbf{y}} \big[ \Phi(\mathbf{X} \mu_{q_\beta}) - \mathbf{1} \big]^{\mathbf{1} - \mathbf{y}}},$$
until the increase in L ( q ) is negligible.
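A minimal Python sketch of the iterative updates in Algorithm 2 follows, assuming μ_β = 0 as derived above. The function name, the zero initialization, and the convergence check on μ_{q_β} (instead of monitoring L(q)) are our own choices for the sketch; with the intrinsic prior, Prec_beta would be the (singular) precision Σ_{β|α}^{-1} − γ_1 γ_1^⊤/σ_11 from Section 5.1.

```python
import numpy as np
from scipy.stats import norm

def vb_probit_intrinsic(X, y, Prec_beta, n_iter=100, tol=1e-8):
    """Sketch of Algorithm 2: coordinate updates for q(beta, Z) = q_beta(beta) q_Z(Z).
    Prec_beta is the prior precision Sigma_beta^{-1} (the prior mean mu_beta is 0)."""
    n, d = X.shape
    Sigma_q = np.linalg.inv(X.T @ X + Prec_beta)        # Sigma_{q_beta}, fixed across iterations
    mu_qz = np.zeros(n)                                 # initialize mu_{q_Z} (step 1)
    mu_qb = np.zeros(d)
    sign = 2 * y - 1                                    # +1 if y_i = 1, -1 if y_i = 0
    for _ in range(n_iter):                             # cycle through the updates (step 2)
        mu_qb_new = Sigma_q @ (X.T @ mu_qz)             # update for mu_{q_beta} (prior mean is 0)
        eta = X @ mu_qb_new                             # linear predictor X mu_{q_beta}
        # truncated-normal mean: eta + phi(eta)/Phi(eta) if y=1, eta - phi(eta)/(1-Phi(eta)) if y=0
        mu_qz = eta + sign * norm.pdf(eta) / norm.cdf(sign * eta)
        if np.max(np.abs(mu_qb_new - mu_qb)) < tol:     # the paper monitors L(q); we track mu_{q_beta}
            mu_qb = mu_qb_new
            break
        mu_qb = mu_qb_new
    return mu_qb, Sigma_q, mu_qz
```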

5.2.2. Evaluation of the Lower Bound L ( q )

During the optimization of the variational approximation densities, the lower bound on the log marginal likelihood needs to be evaluated and monitored to determine when the iterative updating process has converged. Based on the derivations in the previous section, we now have the exact form of the variational inference density,
q ( β , Z ) = q β ( β ) q Z ( Z ) .
According to (14), we can write down the lower bound L ( q ) with respect to q ( β , Z ) .
$$\begin{aligned}
\mathcal{L}(q) &= \int\!\!\int q(\beta, \mathbf{Z}) \log \frac{p(\mathbf{Y}, \beta, \mathbf{Z})}{q(\beta, \mathbf{Z})}\, d\beta\, d\mathbf{Z} = \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log \frac{p(\mathbf{Y}, \beta, \mathbf{Z})}{q_\beta(\beta)\, q_Z(\mathbf{Z})}\, d\beta\, d\mathbf{Z} \\
&= \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log\{ p(\mathbf{Y}, \beta, \mathbf{Z}) \}\, d\beta\, d\mathbf{Z} - \int\!\!\int q_\beta(\beta)\, q_Z(\mathbf{Z}) \log\{ q_\beta(\beta)\, q_Z(\mathbf{Z}) \}\, d\beta\, d\mathbf{Z} \\
&= E_{\beta, Z}\big[ \log\{ p(\mathbf{Y}, \mathbf{Z} \mid \beta) \} \big] + E_{\beta, Z}\big[ \log \pi^I(\beta) \big] - E_{\beta, Z}\big[ \log\{ q_\beta(\beta) \} \big] - E_{\beta, Z}\big[ \log\{ q_Z(\mathbf{Z}) \} \big].
\end{aligned}$$
As we can see in (28), L ( q ) has been divided into four different parts with expectation taken over the variational approximation density q ( β , Z ) = q β ( β ) q Z ( Z ) . We now find the expression of these expectations one by one.

Part 1: E β , Z [ log { p ( Y , Z | β ) } ]

= log ( 2 π ) n 2 + q β ( β ) q Z ( Z ) { 1 2 ( z X β ) ( z X β ) } d β d z = log ( 2 π ) n 2 + q Z ( Z ) q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β d z
Dealing with the inner integral first, we have
q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β = 1 2 q β ( β ) [ β X X β ] d β + z X E β [ β ] 1 2 z z = 1 2 q β ( β ) [ β X X β ] d β + z X μ q β 1 2 z z
where
1 2 q β ( β ) [ β X X β ] d β = 1 2 q β ( β ) [ ( β μ q β + μ q β ) X X ( β μ q β + μ q β ) ] d β = 1 2 trace ( X X E β [ ( β μ q β ) ( β μ q β ) ] ) 1 2 μ q β X X μ q β = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) .
Substituting (31) into (30), we get
q β ( β ) { 1 2 ( β X X β 2 z X β + z z ) } d β = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + z X μ q β 1 2 z z .
Substituting (32) back into (29) gives
E β , Z [ log { p ( Y , Z | β ) } ] = log ( 2 π ) n 2 + q Z ( z ) { 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + z X μ q β 1 2 z z } d z = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) 1 2 E Z [ z z ] + μ q z μ z = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + μ q z μ z 1 2 i = 1 n [ 1 + μ z i 2 μ z i ϕ ( μ z i ) Φ ( μ z i ) ] I ( y i = 0 ) [ 1 + μ z i 2 + μ z i ϕ ( μ z i ) 1 Φ ( μ z i ) ] I ( y i = 1 ) = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + μ q z μ z 1 2 i = 1 n [ 1 + μ q z i μ z i ] I ( y i = 0 ) [ 1 + μ q z i μ z i ] I ( y i = 1 ) = log ( 2 π ) n 2 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ q z μ z n 2 .
We applied properties of truncated normal distribution in Appendix B to find the expression of the second moment E Z [ z z ] .

Part 2: E β , Z [ log q Z ( z ) ]

= q β ( β ) q Z ( z ) log q Z ( z ) d β d Z = q Z ( z ) log q Z ( z ) d Z = n 2 ( log ( 2 π ) + 1 ) + i = 1 n { [ log ( Φ ( μ z i ) ) + μ z i ϕ ( μ z i ) 2 Φ ( μ z i ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) μ z i ϕ ( μ z i ) 2 ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } = n 2 ( log ( 2 π ) + 1 ) 1 2 μ z μ z + 1 2 μ q z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) }
Again, see Appendix B for well-known properties of the truncated normal distribution. Subtracting (34) from (33), we get
E β , Z [ log { p ( Y , Z | β ) } ] E β , Z [ log q Z ( z ) ] = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } .
Based on the exact expression of the intrinsic prior π I ( β ) , denoting all constant terms by C, we have

Part 3: E β , Z [ log p β ( β ) ]

= q Z ( z ) q β ( β ) log π I ( β ) d β d z = log C ( j + 1 ) 2 log ( 2 π ) 1 2 log | Σ β | 1 2 q β ( β ) [ β Σ β 1 β ] d β
To find the expression for the integral, we have
q β ( β ) [ β Σ β 1 β ] d β = q β ( β ) ( β μ q β + μ q β ) Σ β 1 ( β μ q β + μ q β ) d β = E [ trace ( Σ β 1 ( β μ q β ) ( β μ q β ) ) ] + μ q β Σ β 1 μ q β = trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β
Substituting (37) back into (36), we obtain
E β , Z [ log p β ( β ) ] = log C ( j + 1 ) 2 log ( 2 π ) 1 2 log | Σ β | 1 2 [ trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β ] .

Part 4: E β , Z [ log q β ( β ) ]

= q Z ( z ) q β ( β ) log q β ( β ) d β = j + 1 2 log ( 2 π ) 1 2 log | Σ q β | 1 2 q β ( β ) ( β μ q β ) Σ q β 1 ( β μ q β ) d β = j + 1 2 log ( 2 π ) 1 2 log | Σ q β | 1 2 trace ( Σ β 1 Σ β ) = j + 1 2 ( log ( 2 π ) + 1 ) 1 2 log | Σ q β |
Combining all four parts together, we get
L ( q ) = E β , Z [ log { p ( Y , Z | β ) } ] + E β , Z [ π I ( β ) ] E β , Z [ log { q β ( β ) } ] E β , Z [ log { q Z ( Z ) } ] = 1 2 trace ( X X [ μ q β μ q β + Σ q β ] ) + 1 2 μ z μ z + i = 1 n { [ log ( Φ ( μ z i ) ) ] I ( y i = 0 ) [ log ( 1 Φ ( μ z i ) ) ] I ( y i = 1 ) } E β , Z [ log { p ( Y , Z | β ) } ] E β , Z [ log { q Z ( Z ) } ] + log C 1 2 log | Σ β | 1 2 [ trace ( Σ β 1 Σ q β ) + μ q β Σ β 1 μ q β ] + j + 1 2 + 1 2 log | Σ q β | E β , Z [ log p β ( β ) ] E β , Z [ log q β ( β ) ] .

5.3. Model Comparison Based on Variational Approximation

Suppose we want to compare two models, M_1 and M_0, where M_0 is the simpler model. An intuitive way of comparing two models with variational approximation methods is simply to compare the lower bounds L(q_1) and L(q_0). However, by comparing the lower bounds we are implicitly assuming that the KL divergences in the two approximations are the same, so that the lower bounds alone can serve as a guide. Unfortunately, it is not easy to measure how tight any particular bound is; if this could be accomplished, we could estimate the log marginal likelihood more accurately from the beginning. As clarified in [27], when comparing two exact log marginal likelihoods, we have
$$\log p_1(\mathbf{X}) - \log p_0(\mathbf{X}) = \big[ \mathcal{L}(q_1) + KL(q_1 \| p_1) \big] - \big[ \mathcal{L}(q_0) + KL(q_0 \| p_0) \big]$$
$$= \mathcal{L}(q_1) - \mathcal{L}(q_0) + \big[ KL(q_1 \| p_1) - KL(q_0 \| p_0) \big]$$
$$\approx \mathcal{L}(q_1) - \mathcal{L}(q_0).$$
The difference in log marginal likelihoods, log p_1(X) − log p_0(X), is the quantity we wish to estimate. However, if we base the comparison on the difference of lower bounds, we are basing our model comparison on the approximation in the last display rather than on (41). Therefore, there is a systematic bias towards the simpler model whenever KL(q_1‖p_1) − KL(q_0‖p_0) is not zero.
Realizing that we have a variational approximation for the posterior distribution of β , we propose the following method to estimate p ( X ) based on our variational approximation q β ( β ) (27). First, writing the marginal likelihood as
$$p(\mathbf{x}) = \int \frac{p(\mathbf{x} \mid \beta)\, \pi^I(\beta)}{q_\beta(\beta)}\, q_\beta(\beta)\, d\beta,$$
we can interpret it as the expectation
$$p(\mathbf{x}) = E\left[ \frac{p(\mathbf{x} \mid \beta)\, \pi^I(\beta)}{q_\beta(\beta)} \right]$$
with respect to q_β(β). Next, draw samples β^(1), ..., β^(n) from q_β(β) and obtain the estimated marginal likelihood
$$\hat{p}(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} \frac{p(\mathbf{x} \mid \beta^{(i)})\, \pi^I(\beta^{(i)})}{q_\beta(\beta^{(i)})}.$$
Note that the proposed method is equivalent to importance sampling with q_β(β) as the importance function; we know its exact form, and generating the random draws β^(i) is easy and inexpensive.
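The estimator above is straightforward to code. The sketch below (Python; the function name log_marginal_importance is our own) works on the log scale for numerical stability and, since the intrinsic prior is specified only up to the arbitrary constant c, substitutes a proper normal prior as a stand-in; with the intrinsic prior one would plug in its exact expression from Section 5.1, with c cancelling in Bayes factors.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_marginal_importance(X, y, prior_mean, prior_cov, mu_q, Sigma_q, n_draws=2500, seed=0):
    """Importance-sampling estimate of the log marginal likelihood, using the variational
    posterior q_beta = N(mu_q, Sigma_q) as the importance function (Section 5.3).
    A proper normal prior N(prior_mean, prior_cov) stands in for the prior here."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu_q, Sigma_q, size=n_draws)       # beta^(1..n) ~ q_beta
    # log p(x | beta) for the probit likelihood, evaluated at each draw
    eta = draws @ X.T                                                  # shape (n_draws, n)
    log_lik = np.where(y == 1, norm.logcdf(eta), norm.logcdf(-eta)).sum(axis=1)
    log_prior = multivariate_normal.logpdf(draws, mean=prior_mean, cov=prior_cov)
    log_q = multivariate_normal.logpdf(draws, mean=mu_q, cov=Sigma_q)
    log_w = log_lik + log_prior - log_q                                # log importance weights
    m = log_w.max()                                                    # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```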

6. Modeling Probability of Default Using Lending Club Data

6.1. Introduction

LendingClub (https://www.lendingclub.com/) is the world’s largest peer-to-peer lending platform. LendingClub enables borrowers to create unsecured personal loans between $1000 and $40,000. The standard loan period is three or five years. Investors can search and browse the loan listings on LendingClub website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee. To attract lenders, LendingClub publishes most of the information available in borrowers’ credit reports as well as information reported by borrowers for almost every loan issued through its website.

6.2. Modeling Probability of Default—Target Variable and Predictive Features

Publicly available LendingClub data, from June 2007 to the fourth quarter of 2018, contain a total of 2,260,668 issued loans. Each loan has a status: Paid-off, Charged-off, or Ongoing. We only kept loans with a final status, i.e., either paid-off or charged-off, and that final loan status is the target variable. We then selected the following loan features as our predictive covariates.
  • Loan term in months (either 36 or 60)
  • FICO
  • Issued loan amount
  • DTI (Debt to income ratio, i.e., customer’s total debt divided by income)
  • Number of credit lines opened in past 24 months
  • Employment length in years
  • Annual income
  • Home ownership type (own, mortgage, or rent)
We took a sample from the original data set with customer yearly income between $15,000 and $60,000, ending up with a data set of 520,947 rows.

6.3. Addressing Uncertainty of Estimated Probit Model Using Variational Inference with Intrinsic Prior

Using the process developed in Section 5, we can update the intrinsic prior for the parameters of the probit model (see Figure 1) using variational inference and obtain the posterior distribution of the estimated parameters. Based on the derived parameter distributions, questions of interest can be explored with model uncertainty taken into account.
Investors will be interested in understanding how each loan feature affects the probability of default, given a certain loan term, either 36 or 60 months. To answer this question, we sampled 6000 cases from the original data set and drew from the derived posterior distribution 100 times. We end up with 6000 × 100 calculated probabilities of default, where each of the 6000 sampled cases yields 100 different probit estimates based on 100 different posterior draws. We summarize some of our findings in Figure 2, where red represents 36-month loans and green represents 60-month loans.
  • In general, 60-month loans have a higher risk of default.
  • Given the loan term, there is a clear trend showing that a higher FICO score means lower risk.
  • Given the loan term, there is a trend showing that a higher DTI indicates higher risk.
  • Given the loan term, there is a trend showing that more credit lines opened in the past 24 months indicates higher risk.
  • There is no clear pattern regarding income. This is probably because we only included customers with income between $15,000 and $60,000 in our training data, which may not represent the true income level of the whole population.
Model uncertainty can also be measured through credible intervals. Again, with the derived posterior distribution, a credible interval is just the range containing a particular percentage of the estimated effect/parameter values. For instance, the 95% credible interval for the estimated parameter of FICO is simply the central portion of the posterior distribution that contains 95% of the estimated values. Contrary to frequentist confidence intervals, Bayesian credible intervals are much more straightforward to interpret. Using the Bayesian framework created in this article, from Figure 3 we can simply state that, given the observed data, the estimated effect of DTI on default has an 89% probability of falling within [8.300, 8.875]. Instead of the conventional 95%, we used 89% following suggestions in [28,29]; it is just as arbitrary as any other convention.
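Computing such an interval from posterior draws is a one-liner. In the sketch below the draws are simulated from a normal as a stand-in for the DTI coefficient's variational posterior; with the real output one would use draws from q_β(β) in (27).

```python
import numpy as np

# Sketch: an equal-tailed 89% credible interval from posterior draws of a coefficient.
rng = np.random.default_rng(2)
dti_draws = rng.normal(loc=8.6, scale=0.18, size=10_000)    # hypothetical posterior draws

lower, upper = np.percentile(dti_draws, [5.5, 94.5])        # central 89% of the draws
print(f"89% credible interval: [{lower:.3f}, {upper:.3f}]")
```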
One of the main advantages of using variational inference over MCMC is that variational inference is much faster. Comparisons were made between the two approximation frameworks on a 64-bit Windows 10 laptop, with 32.0 GB RAM. Using the data set introduced in Section 6.2, we have that
  • with a conjugate prior and following the Gibbs sampling scheme proposed by [17], it took 89.86 s to finish 100 simulations for the Gibbs sampler;
  • following the method proposed in Section 5.2, it took 58.38 s to obtain the approximated posterior distribution and to sample 10,000 times from that posterior.

6.4. Model Comparison

Following the procedure proposed in Section 5.3, we compare the following series of nested models. From the data set introduced in Section 6.2, 2000 records were sampled to estimate the likelihood p(x | β^(i)), where β^(i) is one of 2500 draws sampled directly from the approximated posterior distribution q_β(β), which serves as the importance function used to estimate the marginal likelihood p(x).
  • M 2 : FICO + Term 36 Indicator
  • M 3 : FICO + Term 36 Indicator + Loan Amount
  • M 4 : FICO + Term 36 Indicator + Loan Amount + Annual Income
  • M 5 : FICO + Term 36 Indicator + Loan Amount + Annual Income + Mortgage Indicator
The estimated log marginal likelihood for each model is plotted in Figure 4. We can see that the model evidence increases when the predictive features Loan Amount and Annual Income are added sequentially. However, further adding home ownership information, i.e., the Mortgage Indicator, as a predictive feature decreases the model evidence. We have the Bayes factor
$$BF_{45} = \frac{p(\mathbf{x} \mid M_4)}{p(\mathbf{x} \mid M_5)} = e^{-1014.78 - (-1016.42)} = 5.16,$$
which suggests substantial evidence for model M_4, indicating that home ownership information may be irrelevant for predicting the probability of default given that all the other predictive features are included.

7. Further Work

The authors thank the reviewers for pointing out that mean-field variational Bayes underestimates the posterior variance. This could be an interesting topic for our future research. We plan to study the linear response variational Bayes (LRVB) method proposed in [30] to see if it can be applied to the framework we proposed in this article. To see whether we can get the approximated posterior variance close enough to the true variance using our proposed method, comparisons should be made among the normal conjugate prior with the MCMC procedure, the normal conjugate prior with LRVB, and the intrinsic prior with LRVB.

Author Contributions

Methodology, A.L., L.P. and K.W.; software, A.L.; writing–original draft preparation, A.L., L.P. and K.W.; writing–review and editing, A.L. and L.P.; visualization, A.L. All authors have read and agreed to the published version of the manuscript.

Funding

The work of L.R.Pericchi was partially funded by NIH grants U54CA096300, P20GM103475 and R25MD010399.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Density Function

Suppose X ∼ N(μ, σ²) has a normal distribution and lies within the interval (a, b), with −∞ ≤ a < b ≤ ∞. Then X conditional on a < X < b has a truncated normal distribution. Its probability density function f, for a ≤ x ≤ b, is given by
$$f(x \mid \mu, \sigma, a, b) = \frac{\frac{1}{\sigma}\, \phi\!\left( \frac{x - \mu}{\sigma} \right)}{\Phi\!\left( \frac{b - \mu}{\sigma} \right) - \Phi\!\left( \frac{a - \mu}{\sigma} \right)},$$
and by f = 0 otherwise. Here
$$\phi(\xi) = \frac{1}{\sqrt{2\pi}} \exp\big( -\tfrac{1}{2} \xi^2 \big)$$
is the probability density function of the standard normal distribution and Φ(·) is its cumulative distribution function. If b = ∞, then Φ((b − μ)/σ) = 1, and similarly, if a = −∞, then Φ((a − μ)/σ) = 0. The cumulative distribution function of the truncated normal distribution is
$$F(x \mid \mu, \sigma, a, b) = \frac{\Phi(\xi) - \Phi(\alpha)}{Z},$$
where ξ = (x − μ)/σ, α = (a − μ)/σ, β = (b − μ)/σ, and Z = Φ(β) − Φ(α).

Appendix B. Moments and Entropy

Let α = (a − μ)/σ and β = (b − μ)/σ. For two-sided truncation:
$$E(X \mid a < X < b) = \mu + \sigma\, \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \qquad Var(X \mid a < X < b) = \sigma^2 \left[ 1 + \frac{\alpha \phi(\alpha) - \beta \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \left( \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} \right)^{\!2}\, \right].$$
For one-sided truncation (upper tail):
$$E(X \mid X > a) = \mu + \sigma \lambda(\alpha), \qquad Var(X \mid X > a) = \sigma^2 \big[ 1 - \delta(\alpha) \big],$$
where α = (a − μ)/σ, λ(α) = φ(α)/(1 − Φ(α)), and δ(α) = λ(α)[λ(α) − α].
For one-sided truncation (lower tail):
$$E(X \mid X < b) = \mu - \sigma\, \frac{\phi(\beta)}{\Phi(\beta)}, \qquad Var(X \mid X < b) = \sigma^2 \left[ 1 - \beta\, \frac{\phi(\beta)}{\Phi(\beta)} - \left( \frac{\phi(\beta)}{\Phi(\beta)} \right)^{\!2}\, \right].$$
More generally, the moment generating function of the truncated normal distribution is
$$e^{\mu t + \sigma^2 t^2 / 2} \cdot \frac{\Phi(\beta - \sigma t) - \Phi(\alpha - \sigma t)}{\Phi(\beta) - \Phi(\alpha)}.$$
For a density f ( x ) defined over a continuous variable, the entropy is given by
$$H[x] = -\int f(x) \log f(x)\, dx.$$
The entropy of a truncated normal density is
$$\log\big( \sqrt{2\pi e}\, \sigma Z \big) + \frac{\alpha \phi(\alpha) - \beta \phi(\beta)}{2Z}.$$
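These moment formulas are easy to check numerically; the sketch below compares the upper-tail formulas with scipy's truncnorm for hypothetical values of μ, σ, and a.

```python
import numpy as np
from scipy.stats import truncnorm, norm

# Checking the one-sided (upper tail) truncation formulas of Appendix B numerically.
# Hypothetical values: X ~ N(mu, sigma^2) truncated to X > a.
mu, sigma, a = 1.0, 2.0, 0.0
alpha = (a - mu) / sigma
lam = norm.pdf(alpha) / (1 - norm.cdf(alpha))           # lambda(alpha)
delta = lam * (lam - alpha)                             # delta(alpha)

# Closed forms from Appendix B.
mean_formula = mu + sigma * lam
var_formula = sigma**2 * (1 - delta)

# scipy parameterizes truncnorm by the standardized bounds (a - mu)/sigma and (b - mu)/sigma.
dist = truncnorm(alpha, np.inf, loc=mu, scale=sigma)
print(mean_formula, dist.mean())                        # should agree
print(var_formula, dist.var())                          # should agree
```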

References

  1. Salmeron, D.; Cano, J.A.; Robert, C.P. Objective Bayesian hypothesis testing in binomial regression models with integral prior distributions. Stat. Sin. 2015, 25, 1009–1023.
  2. Leon-Novelo, L.; Moreno, E.; Casella, G. Objective Bayes model selection in probit models. Stat. Med. 2012, 31, 353–365.
  3. Jaakkola, T.S.; Jordan, M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000, 10, 25–37.
  4. Girolami, M.; Rogers, S. Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput. 2006, 18, 1790–1817.
  5. Consonni, G.; Marin, J.M. Mean-field variational approximate Bayesian inference for latent variable models. Comput. Stat. Data Anal. 2007, 52, 790–798.
  6. Ormerod, J.T.; Wand, M.P. Explaining variational approximations. Am. Stat. 2010, 64, 140–153.
  7. Grimmer, J. An introduction to Bayesian inference via variational approximations. Political Anal. 2010, 19, 32–47.
  8. Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877.
  9. Pérez, M.E.; Pericchi, L.R.; Ramírez, I.C. The Scaled Beta2 distribution as a robust prior for scales. Bayesian Anal. 2017, 12, 615–637.
  10. Mulder, J.; Pericchi, L.R. The matrix-F prior for estimating and testing covariance matrices. Bayesian Anal. 2018, 13, 1193–1214.
  11. Berger, J.O.; Pericchi, L.R. Objective Bayesian Methods for Model Selection: Introduction and Comparison. In Model Selection; Institute of Mathematical Statistics: Beachwood, OH, USA, 2001; pp. 135–207.
  12. Pericchi, L.R. Model selection and hypothesis testing based on objective probabilities and Bayes factors. Handb. Stat. 2005, 25, 115–149.
  13. Scott, J.G.; Berger, J.O. Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Ann. Stat. 2010, 38, 2587–2619.
  14. Jeffreys, H. The Theory of Probability; OUP: Oxford, UK, 1961.
  15. Berger, J.O.; Pericchi, L.R. The intrinsic Bayes factor for model selection and prediction. J. Am. Stat. Assoc. 1996, 91, 109–122.
  16. Leamer, E.E. Specification Searches: Ad Hoc Inference with Nonexperimental Data; Wiley: New York, NY, USA, 1978; Volume 53.
  17. Albert, J.H.; Chib, S. Bayesian analysis of binary and polychotomous response data. J. Am. Stat. Assoc. 1993, 88, 669–679.
  18. Tanner, M.A.; Wong, W.H. The calculation of posterior distributions by data augmentation. J. Am. Stat. Assoc. 1987, 82, 528–540.
  19. Berger, J.O.; Pericchi, L.R. The intrinsic Bayes factor for linear models. Bayesian Stat. 1996, 5, 25–44.
  20. Casella, G.; Moreno, E. Objective Bayesian variable selection. J. Am. Stat. Assoc. 2006, 101, 157–167.
  21. Moreno, E.; Bertolino, F.; Racugno, W. An intrinsic limiting procedure for model selection and hypotheses testing. J. Am. Stat. Assoc. 1998, 93, 1451–1460.
  22. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006.
  23. Jordan, M.I.; Ghahramani, Z.; Jaakkola, T.S.; Saul, L.K. An introduction to variational methods for graphical models. Mach. Learn. 1999, 37, 183–233.
  24. Parisi, G.; Shankar, R. Statistical field theory. Phys. Today 1988, 41, 110.
  25. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004.
  26. Berger, J.; Pericchi, L. Training samples in objective Bayesian model selection. Ann. Stat. 2004, 32, 841–869.
  27. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference; University College London: London, UK, 2003.
  28. Kruschke, J. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan; Academic Press: Cambridge, MA, USA, 2014.
  29. McElreath, R. Statistical Rethinking: A Bayesian Course with Examples in R and Stan; Chapman and Hall/CRC: Boca Raton, FL, USA, 2018.
  30. Giordano, R.J.; Broderick, T.; Jordan, M.I. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1441–1449.
Figure 1. Intrinsic Prior.
Figure 2. Effect of term months and other covariates on probability of default.
Figure 3. Credible intervals for estimated coefficients.
Figure 4. Log marginal likelihood comparison.
