Article

A Parametric Bayesian Approach in Density Ratio Estimation

by Abdolnasser Sadeghkhani 1,*, Yingwei Peng 2 and Chunfang Devon Lin 3
1 Department of Mathematics & Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
2 Department of Public Health Sciences, Queen’s University, Kingston, ON K7L 3N6, Canada
3 Department of Mathematics & Statistics, Queen’s University, Kingston, ON K7L 3N6, Canada
* Author to whom correspondence should be addressed.
Stats 2019, 2(2), 189-201; https://doi.org/10.3390/stats2020014
Submission received: 7 March 2019 / Revised: 25 March 2019 / Accepted: 26 March 2019 / Published: 30 March 2019

Abstract:
This paper is concerned with estimating the ratio of two distributions with different parameters and common supports. We consider a Bayesian approach based on the log–Huber loss function, which is resistant to outliers and useful for finding robust M-estimators. We propose two different types of Bayesian density ratio estimators and compare their performance in terms of frequentist risk function. Some applications, such as classification and divergence function estimation, are addressed.

1. Introduction

The problem of estimating the ratio of two densities appears in many areas of statistics and computer science. Density ratio estimation (DRE) plays an important role in machine learning and information theory (Sugiyama et al. [1], and Kanamori et al. [2]). In a series of papers, Sugiyama et al. [3,4] developed DRE for different statistical data analysis problems. Some useful applications of DRE are as follows: non-stationary adaptation (Sugiyama and Müller [5] and Quiñonero-Candela et al. [6]), variable selection (Suzuki et al. [7]), dimension reduction (Suzuki and Sugiyama [8]), conditional density estimation (Sugiyama et al. [9]), outlier detection (Hido et al. [10]), and posterior density estimation (Thomas et al. [11]), among others. This paper addresses the case where the densities belong to a parametric family (e.g., the exponential family of distributions). Parametric methods are usually favorable thanks to the existence of closed forms (mostly simple and explicit formulas), which enhance computational efficiency.
Several estimators of the ratio of two densities have recently been proposed. One of the simplest approaches to estimating the density ratio p/q, where p and q are two probability density (or mass) functions (PDFs or PMFs), is the “plug-in” approach, in which each density is estimated separately and the ratio of the estimated densities is computed. Alternatively, one can estimate the ratio of the two densities directly. Several approaches have recently been explored to estimate the ratio directly, including the moment matching approach (Gretton et al. [12]) and the density matching approach (Sugiyama et al. [13]). There are other works, such as Nguyen et al. [14] and Deledalle [15], which studied the application of DRE in estimating the Kullback–Leibler (KL) divergence (or, more generally, the α-divergence, a member of the family of f-divergences) and vice versa. The main objective of this paper is to address Bayesian parametric estimation of the ratio p/q under some commonly used loss functions and to compare the proposed estimators with other estimators in the literature.
The remainder of the paper is organized as follows. In Section 2, we discuss the DRE methodology and introduce some useful definitions and related examples. Section 3 discusses how to find a Bayesian DRE under several loss functions for an arbitrary prior density on the parameter, and provides some interesting examples from the exponential families. In Section 4, we study some applications of DRE. Some numerical illustrations of the efficiency of the proposed DREs are given in Section 5. Finally, we make some concluding remarks in Section 6.

2. Density Ratio Estimation for the Exponential Distribution Family

Let $X\mid\eta$ and $Y\mid\gamma$ be conditionally independent multivariate random variables with densities
$$p(x \mid \eta) = h(x)\,\exp\{\eta^\top s(x) - \kappa(\eta)\}, \qquad\qquad (1)$$
$$q(y \mid \gamma) = h(y)\,\exp\{\gamma^\top s(y) - \kappa(\gamma)\}, \qquad\qquad (2)$$
where $\eta, \gamma \in \mathbb{R}^d$ are natural parameters, $s(\cdot)$ is a sufficient statistic, and $\kappa(\cdot) = \log c(\cdot)$ is the log-normalizer, which ensures that each distribution integrates to one, that is, $c(a) = \int h(x)\exp\{a^\top s(x)\}\,dx$ for $a = \eta$ and $\gamma$.
Consider the problem of estimating the density ratio
$$r(t; \eta, \gamma) = \frac{p(t \mid \eta)}{q(t \mid \gamma)}, \qquad\qquad (3)$$
which is obviously proportional to $\exp\{(\eta - \gamma)^\top s(t)\}$. One can therefore merge the two natural parameters $\eta$ and $\gamma$ into the single (possibly vector-valued) parameter $\beta = \eta - \gamma$ and write the density ratio in (3) as
$$r(t; \theta) = \exp\{\alpha + \beta^\top s(t)\}, \qquad\qquad (4)$$
where $\alpha = \kappa(\gamma) - \kappa(\eta)$ and $\theta = (\alpha, \beta)$ are the parameters of interest. Note that, since $r(t;\theta)$ itself belongs to the exponential family, the normalization term $\alpha$ can be written as $-\log N(\beta)$, where $N(\beta) = \int q(t)\exp\{\beta^\top s(t)\}\,dt$, which guarantees $\int q(t \mid \gamma)\, r(t;\theta)\,dt = 1$ and hence that $q(t)\,r(t;\theta)$ is a valid PDF (PMF).
For instance, suppose $p(t \mid \mu_1, \sigma_1^2) = N(t \mid \mu_1, \sigma_1^2)$ and $q(t \mid \mu_2, \sigma_2^2) = N(t \mid \mu_2, \sigma_2^2)$ are two normal densities corresponding to (1) and (2), respectively. One can easily verify that $\eta = \big(\tfrac{\mu_1}{\sigma_1^2}, -\tfrac{1}{2\sigma_1^2}\big)$, $\gamma = \big(\tfrac{\mu_2}{\sigma_2^2}, -\tfrac{1}{2\sigma_2^2}\big)$, $s(t) = (t, t^2)$, $\kappa(\eta) = \tfrac{\mu_1^2}{2\sigma_1^2} + \log\sigma_1$, and $\kappa(\gamma) = \tfrac{\mu_2^2}{2\sigma_2^2} + \log\sigma_2$, so that, according to (4), $\alpha = \log\tfrac{\sigma_2}{\sigma_1} + \tfrac{\mu_2^2}{2\sigma_2^2} - \tfrac{\mu_1^2}{2\sigma_1^2}$ and $\beta = \big(\tfrac{\mu_1}{\sigma_1^2} - \tfrac{\mu_2}{\sigma_2^2},\; \tfrac{1}{2\sigma_2^2} - \tfrac{1}{2\sigma_1^2}\big)$.
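As a quick sanity check of the form (4), the following sketch (not from the paper; parameter values are arbitrary and SciPy is assumed to be available) verifies numerically that the ratio of two normal densities coincides with $\exp\{\alpha + \beta^\top s(t)\}$ for the $\alpha$ and $\beta$ given above.

```python
# Illustrative check (not from the paper): for two normal densities, the ratio
# p/q collapses to the exponential-family form exp{alpha + beta^T s(t)} of (4).
# Parameter values are arbitrary; SciPy is assumed to be available.
import numpy as np
from scipy.stats import norm

mu1, s1 = 0.5, 1.0   # parameters of p
mu2, s2 = -0.3, 1.5  # parameters of q

# alpha and beta as derived in the text; beta multiplies s(t) = (t, t^2)
alpha = np.log(s2 / s1) + mu2**2 / (2 * s2**2) - mu1**2 / (2 * s1**2)
beta = np.array([mu1 / s1**2 - mu2 / s2**2,
                 1.0 / (2 * s2**2) - 1.0 / (2 * s1**2)])

t = np.linspace(-3, 3, 7)
ratio_direct = norm.pdf(t, mu1, s1) / norm.pdf(t, mu2, s2)
ratio_expfam = np.exp(alpha + beta[0] * t + beta[1] * t**2)
print(np.allclose(ratio_direct, ratio_expfam))  # True
```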
Another interesting point is the relationship between DRE and the probabilistic classification problem (Sugiyama et al. [16]). Suppose $Z \in \{0, 1\}$ is a binary response variable indicating whether an observation $t$ is drawn from $p(t)$ (say $Z = 1$) or from $q(t)$ (say $Z = 0$); in other words, $p(t) = P(t \mid Z = 1)$ and $q(t) = P(t \mid Z = 0)$. The standard logistic regression model is
$$P(Z = 1 \mid t) = \frac{\exp(\beta_0 + \boldsymbol\beta^\top t)}{1 + \exp(\beta_0 + \boldsymbol\beta^\top t)}, \qquad\qquad (5)$$
and consequently $P(Z = 0 \mid t) = \frac{1}{1 + \exp(\beta_0 + \boldsymbol\beta^\top t)}$ and
$$\frac{P(Z = 1 \mid t)}{P(Z = 0 \mid t)} = \exp(\beta_0 + \boldsymbol\beta^\top t).$$
Letting $\pi = P(Z = 1) = \int P(Z = 1 \mid t)\, m(t)\,dt$, where $m(t)$ denotes the marginal density of $t$, and applying Bayes’ rule, we have
$$p(t) = \frac{\exp(\beta_0 + \boldsymbol\beta^\top t)}{\big(1 + \exp(\beta_0 + \boldsymbol\beta^\top t)\big)\,\pi}\, m(t), \qquad q(t) = \frac{m(t)}{\big(1 + \exp(\beta_0 + \boldsymbol\beta^\top t)\big)(1 - \pi)},$$
and hence
$$\frac{p(t)}{q(t)} = \frac{1 - \pi}{\pi}\exp(\beta_0 + \boldsymbol\beta^\top t) = \exp\Big(\beta_0 + \log\frac{1 - \pi}{\pi} + \boldsymbol\beta^\top t\Big),$$
which indeed has the form $\exp(\alpha + \boldsymbol\beta^\top t)$ with $\alpha = \beta_0 + \log\frac{1 - \pi}{\pi}$, and hence is a density ratio model of the form (4).
If, instead, the response takes values $Z \in \{0, 1, \ldots, m-1\}$, we can extend Equation (5) to the multinomial logit model,
$$\frac{P(Z = k \mid t)}{P(Z = 0 \mid t)} = \exp(\beta_{0k} + \boldsymbol\beta_k^\top t), \qquad k = 1, \ldots, m-1.$$
Analogously, letting $P_k(t) = P(t \mid Z = k)$ and $\pi_k = P(Z = k)$ for $k = 1, \ldots, m-1$, we have $\frac{P_k(t)}{P_0(t)} = \exp(\alpha_k + \boldsymbol\beta_k^\top t)$, with $\alpha_k = \beta_{0k} + \log\frac{1 - \sum_{j=1}^{m-1}\pi_j}{\pi_k}$.
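The classification link above suggests a simple practical recipe: label draws from p and q, fit a logistic regression, and convert the fitted odds into a density ratio. The sketch below (not from the paper; it assumes scikit-learn and uses hypothetical normal samples with equal variances so that the linear logit is well specified) illustrates this route.

```python
# A sketch of the probabilistic-classification route to DRE: label draws from p as
# Z=1 and draws from q as Z=0, fit logistic regression, and convert the fitted odds
# into a density-ratio estimate. Assumes scikit-learn; data settings are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(0.5, 1.0, size=500)   # sample from p (labelled Z = 1)
y = rng.normal(-0.3, 1.0, size=800)  # sample from q (labelled Z = 0); equal variances

T = np.r_[x, y].reshape(-1, 1)
Z = np.r_[np.ones_like(x), np.zeros_like(y)]

clf = LogisticRegression().fit(T, Z)

def ratio_hat(t):
    """Estimate p(t)/q(t) = (1 - pi)/pi * P(Z=1|t)/P(Z=0|t), with pi = P(Z=1)."""
    t = np.asarray(t, dtype=float).reshape(-1, 1)
    prob1 = clf.predict_proba(t)[:, 1]
    odds = prob1 / (1.0 - prob1)
    pi = len(x) / (len(x) + len(y))
    return (1.0 - pi) / pi * odds

print(ratio_hat([0.0, 1.0]))
```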
In fact, a conceptually simple solution to the DRE problem is to estimate the two densities separately and then take their ratio, i.e., to plug estimates of the parameters into each density. This approach is known as plug-in density ratio estimation, defined by
$$\hat r_{plug}(t) = \frac{p(t \mid \hat\eta(t))}{q(t \mid \hat\gamma(t))}, \qquad\qquad (7)$$
where $\hat\eta(\cdot)$ and $\hat\gamma(\cdot)$ are estimates of the parameters of $p(\cdot \mid \eta)$ and $q(\cdot \mid \gamma)$ based on samples of sizes $n_p$ and $n_q$, respectively.
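A minimal sketch of the plug-in estimator (7), assuming both samples are normal and using maximum likelihood estimates of the parameters (sample sizes and true values are illustrative only):

```python
# Plug-in DRE sketch (7): estimate the parameters of p and q separately by maximum
# likelihood and take the ratio of the fitted densities. Settings are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=200)    # sample of size n_p from p
y = rng.normal(-0.3, 1.5, size=300)   # sample of size n_q from q

mu1_hat, s1_hat = x.mean(), x.std()   # MLEs for p
mu2_hat, s2_hat = y.mean(), y.std()   # MLEs for q

def r_plug(t):
    """Plug-in density ratio estimate p(t | eta_hat) / q(t | gamma_hat)."""
    return norm.pdf(t, mu1_hat, s1_hat) / norm.pdf(t, mu2_hat, s2_hat)

print(r_plug(np.array([0.0, 1.0, 2.0])))
```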

3. Bayesian DRE

Consider a loss function $\ell(\theta, \delta(t))$ and its average under long-term (repeated) use of the estimator $\delta(t)$. In estimating $\theta$, this average is called the frequentist risk and is given by
$$R(\theta, \delta) = E^{T\mid\theta}\big[\ell(\theta, \delta(T))\big] = \int_{\mathcal{T}} \ell(\theta, \delta(t))\, P_\theta(dt),$$
where $E^{T\mid\theta}(\cdot)$ denotes expectation with respect to the cumulative distribution function (CDF) of $T \sim P(t \mid \theta)$. Given any prior distribution $\pi(\cdot)$ with CDF $\Pi(\cdot)$, one can define the integrated risk (Bayes risk), which is the frequentist risk averaged over the values of $\theta$ according to the prior distribution $\pi(\theta)$:
$$E^{\theta}\big[R(\theta, \delta)\big] = \int_{\Theta}\int_{\mathcal{T}} \ell(\theta, \delta(t))\, P_\theta(dt)\, \Pi(d\theta). \qquad\qquad (10)$$
Finally, the Bayes estimator is the minimizer of the Bayes risk (10). It can be shown that the minimizer of this expression also minimizes the posterior risk $E^{\theta\mid t}\big[\ell(\theta, \delta(t))\big]$, and hence is the Bayes estimator (see Lehmann and Casella [17]).
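The following small numerical illustration (not from the paper) shows the posterior-risk characterization in action for a beta–Bernoulli model with an assumed Beta(2, 2) prior: under squared error loss, numerically minimizing the posterior risk recovers the posterior mean.

```python
# Numerical illustration (hypothetical setup): the minimizer of the posterior risk
# is the Bayes estimator; under squared-error loss it equals the posterior mean.
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

a0, b0, n, successes = 2.0, 2.0, 10, 7
post = beta(a0 + successes, b0 + n - successes)   # conjugate posterior

def posterior_risk(d, n_draws=100_000, seed=0):
    theta = post.rvs(n_draws, random_state=seed)
    return np.mean((theta - d) ** 2)              # posterior expected loss

d_star = minimize_scalar(posterior_risk, bounds=(0, 1), method="bounded").x
print(d_star, post.mean())   # both close to (a0 + successes)/(a0 + b0 + n)
```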
Next, we introduce the log–Huber loss, explain the reasoning behind choosing such a loss function, and use it to study the efficiency of the DRE.

3.1. log–Huber’s Robust Loss

One of the weaknesses of the squared error loss function $z^2$, with $z = \delta - \theta$, is that it can over-emphasize outliers. Other loss functions can be used to avoid this issue. For instance, one can use the absolute loss function $|z|$, but since it is not differentiable at the origin, Huber [18] proposed the following function instead:
$$h(z) = \begin{cases} z^2, & |z| \le 1, \\ 2|z| - 1, & |z| > 1, \end{cases} \qquad\qquad (11)$$
which is less sensitive to outliers and hence more robust: it is equivalent to the squared error loss ($L_2$ loss) for errors smaller than 1 and to the absolute error loss ($L_1$ loss) for larger errors (see Figure 1). This loss combines the advantages of the $L_1$ and $L_2$ losses without their disadvantages: it is not sensitive to outliers (as opposed to $L_2$) and it is differentiable everywhere (as opposed to $L_1$). In practice, optimizing the Huber loss (and consequently the log–Huber loss) is much faster because it allows standard smooth optimization methods (e.g., quasi-Newton) instead of linear programming. Thanks to its robustness, the Huber loss is widely used in robust regression problems (e.g., Huber and Ronchetti [19], Chapter 7).
In the context of DRE, we propose taking $z = \log\frac{\hat r(t)}{r(t; \theta)}$ in (11), which yields the log–Huber loss function $\ell(\hat r, r) = \big(\log\frac{\hat r}{r}\big)^2$ for $e^{-1} \le \frac{\hat r}{r} \le e$, and $\ell(\hat r, r) = 2\big|\log\frac{\hat r}{r}\big| - 1$ otherwise.
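For concreteness, a direct transcription of this loss, under our reading of (11) and (12), is given below (NumPy assumed):

```python
# log-Huber loss sketch: squared log-error when r_hat/r lies in [1/e, e], and the
# linear Huber branch 2|log(r_hat/r)| - 1 otherwise.
import numpy as np

def log_huber(r_hat, r):
    z = np.log(np.asarray(r_hat, dtype=float) / np.asarray(r, dtype=float))
    return np.where(np.abs(z) <= 1.0, z**2, 2.0 * np.abs(z) - 1.0)

print(log_huber([1.2, 5.0], [1.0, 1.0]))  # quadratic branch, then linear branch
```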
The frequentist risk function corresponding to the log–Huber loss is given by
$$R(\hat r, \theta) = \begin{cases} \displaystyle\int\!\!\int \Big(\log\frac{\hat r}{r}\Big)^2\, dP(x)\, dP(y), & e^{-1} \le \frac{\hat r}{r} \le e, \\[2mm] \displaystyle\int\!\!\int \Big(2\Big|\log\frac{\hat r}{r}\Big| - 1\Big)\, dP(x)\, dP(y), & 0 < \frac{\hat r}{r} \le e^{-1} \ \text{or}\ \frac{\hat r}{r} \ge e, \end{cases} \qquad\qquad (12)$$
where $\theta = (\eta, \gamma)$.

3.1.1. Log–$L_2$ Loss Function

The log–$L_2$ loss $(\log z)^2$, with $z = \frac{\hat r}{r}$, puts little weight on cases where the density ratio estimator $\hat r$ and the true density ratio $r$ are close, and proportionally more weight on cases where they differ substantially. Theorem 1 below provides the Bayesian DRE of the density ratio $r$ in (3) under this loss and shows that it can be expressed in terms of the plug-in DRE in (7); Lemma 2 provides a related representation in terms of the Kullback–Leibler divergence.
Theorem 1.
For any PDFs p and q belonging to (1) and (2), respectively, the Bayesian DRE of r under the log–$L_2$ loss $L(\hat r, r) = \big(\log\frac{\hat r}{r}\big)^2$ and prior distribution $\pi(\theta)$, for $\theta = (\eta, \gamma)$, is given by
$$\hat r^{\pi}(t) = \hat r_{plug}(t)\, H(t), \qquad\qquad (13)$$
where $\hat r_{plug}$ is obtained by plugging the posterior expectations of the unknown parameters given t into the two densities,
$$\hat r_{plug}(t) = \frac{h_1(t)\exp\big\{\langle s(t),\, E[\eta \mid X=t, Y=t]\rangle - \log c_1\big(E[\eta \mid X=t, Y=t]\big)\big\}}{h_2(t)\exp\big\{\langle s(t),\, E[\gamma \mid X=t, Y=t]\rangle - \log c_2\big(E[\gamma \mid X=t, Y=t]\big)\big\}}, \qquad\qquad (14)$$
and the correction factor is given by
$$H(t) = \frac{\exp\big\{\log c_1\big(E[\eta \mid X=t, Y=t]\big) - E[\log c_1(\eta) \mid X=t, Y=t]\big\}}{\exp\big\{\log c_2\big(E[\gamma \mid X=t, Y=t]\big) - E[\log c_2(\gamma) \mid X=t, Y=t]\big\}}. \qquad\qquad (15)$$
Proof. 
The Bayesian DRE $\hat r^{\pi}(t)$ is the minimizer of the posterior risk, and
$$\frac{\partial}{\partial \log\hat r^{\pi}}\, E\big[(\log\hat r^{\pi} - \log r)^2 \mid X=t, Y=t\big] = 0$$
implies $\log\hat r^{\pi}(t) = E[\log r \mid X=t, Y=t]$, or equivalently,
$$\hat r^{\pi}(t) = \exp\big\{E[\log r(t;\theta) \mid X=t, Y=t]\big\},$$
with $\frac{\partial^2}{\partial(\log\hat r^{\pi})^2}\, E\big[(\log\hat r^{\pi} - \log r)^2 \mid X=t, Y=t\big] \ge 0$. Therefore, writing $\hat\eta(t) = E[\eta \mid X=t]$ and $\hat\gamma(t) = E[\gamma \mid Y=t]$,
$$\begin{aligned}
\hat r^{\pi}(t) &= \exp\big\{E[\log p_{\eta}(t) \mid X=t] - E[\log q_{\gamma}(t) \mid Y=t]\big\} = \frac{\exp\big\{E[\log p_{\eta}(t) \mid X=t]\big\}}{\exp\big\{E[\log q_{\gamma}(t) \mid Y=t]\big\}} \\
&= \frac{h_1(t)\exp\big\{\hat\eta(t)^\top s_1(t) - \log c_1(\hat\eta(t))\big\}}{h_2(t)\exp\big\{\hat\gamma(t)^\top s_2(t) - \log c_2(\hat\gamma(t))\big\}} \times \frac{\exp\big\{\log c_1(\hat\eta(t)) - E[\log c_1(\eta) \mid X=t]\big\}}{\exp\big\{\log c_2(\hat\gamma(t)) - E[\log c_2(\gamma) \mid Y=t]\big\}} \\
&= \hat r_{plug}(t)\, H(t).
\end{aligned}$$
This completes the proof. ☐
In Table 1 we present the correction factor $H(\cdot)$ in (15) for several densities belonging to the exponential family. The likelihood functions below, with subscripts $i = 1, 2$ indicating the underlying distributions, are drawn from p and q according to (1) and (2), respectively. Note that imposing constraints on the hyper-parameters can make $H(t)$ constant in t; in particular, $H(t) = 1$ makes the Bayesian DRE $\hat r^{\pi}$ coincide with the plug-in DRE $\hat r_{plug}$.
Note that $\psi(\cdot)$ below denotes the digamma function, $\psi(\delta) = \frac{\partial\Gamma(\delta)/\partial\delta}{\Gamma(\delta)}$. Also, in Table 1 and Table 2, Gam, P, Pa, W, Bin, Bet, $\chi^2$, Ray, U, and Ge stand for the gamma, Poisson, Pareto, Weibull, binomial, beta, chi-squared, Rayleigh, uniform, and geometric distributions, respectively.
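Before turning to the Bregman-divergence representation, the following sketch checks the decomposition (13)–(15) numerically, under our reading of (14) and (15), for normal likelihoods with known variances and conjugate normal priors on the means; all parameter values are hypothetical.

```python
# Numerical check (under our reading of (13)-(15); hypothetical parameter values)
# of r_pi = r_plug * H for normal likelihoods with known variances and conjugate
# normal priors on the means.
import numpy as np
from scipy.stats import norm

sig1, sig2 = 1.0, 1.5                        # known sampling standard deviations
xi1, tau1, xi2, tau2 = 0.0, 1.0, 0.5, 2.0    # prior means / sds of theta_1, theta_2
t = 0.7                                      # common evaluation point (X = t, Y = t)

def posterior(t, xi, tau, sig):
    v = (tau**2 * sig**2) / (tau**2 + sig**2)
    m = (xi * sig**2 + t * tau**2) / (tau**2 + sig**2)
    return m, v

m1, v1 = posterior(t, xi1, tau1, sig1)
m2, v2 = posterior(t, xi2, tau2, sig2)

# Bayes DRE under log-L2: exp{ E[log p(t|theta1)|t] - E[log q(t|theta2)|t] },
# using E[(t - theta)^2 | t] = (t - m)^2 + v.
log_r_pi = (-np.log(sig1) - ((t - m1)**2 + v1) / (2 * sig1**2)) \
         - (-np.log(sig2) - ((t - m2)**2 + v2) / (2 * sig2**2))

# Plug-in DRE at the posterior means, and the correction factor H(t) of (15):
# here log c corresponds to kappa(theta) = theta^2 / (2 sigma^2), so
# E[kappa | t] = (m^2 + v) / (2 sigma^2) and kappa(E[theta | t]) = m^2 / (2 sigma^2).
log_r_plug = np.log(norm.pdf(t, m1, sig1) / norm.pdf(t, m2, sig2))
log_H = (m1**2 / (2 * sig1**2) - (m1**2 + v1) / (2 * sig1**2)) \
      - (m2**2 / (2 * sig2**2) - (m2**2 + v2) / (2 * sig2**2))

print(np.isclose(log_r_pi, log_r_plug + log_H))  # True
```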
Definition 1.
For the exponential family in (1), the Bregman divergence associated with the real-valued, strictly convex, and differentiable function $\kappa(\cdot)$ is defined by
$$B(\eta, \gamma) = \kappa(\eta) - \kappa(\gamma) - \langle \eta - \gamma,\, \nabla\kappa(\gamma) \rangle,$$
where $\langle\cdot,\cdot\rangle$ is the inner product. It can be shown that for all regular exponential families (see Brown [20]) $\kappa(\cdot)$ is strictly convex, and furthermore $B(\eta, \gamma) = \log\frac{c(\eta)}{c(\gamma)} - \langle \eta - \gamma,\, E_q(s(T)) \rangle$.
Lemma 1.
The Kullback–Leibler (KL) divergence between $p_{\eta}$ and $q_{\gamma}$ in model (1) is equal to the Bregman divergence between the natural parameters. That is,
$$KL\big(p(\cdot \mid \eta),\, q(\cdot \mid \gamma)\big) = B(\eta, \gamma),$$
where the KL divergence (Kullback and Leibler [21]) between densities $f_1$ and $f_2$ is $KL(f_1, f_2) = E_{f_1}\big[\log(f_1/f_2)\big]$.
Proof. 
See Nielsen and Nock [22]. ☐
Lemma 2.
Consider $E_p\big[(\log\hat r(T) - \log r(T))^2\big]$ as a loss function, where the random variable T follows p as in (1), and let $\pi(\theta)$ be a prior on $\theta = (\eta, \gamma)$. Then the Bayesian DRE of $r(t;\theta)$ is given by
$$\hat r^{\pi}(t) = \exp\big\{E\big[KL(p, q) \mid X=t, Y=t\big]\big\}, \qquad\qquad (17)$$
and, in addition, for the natural exponential family model (1) it can also be expressed as
$$\hat r^{\pi}(t) = \exp\big\{E\big[B(\eta, \gamma) \mid X=t, Y=t\big]\big\}. \qquad\qquad (18)$$
Proof. 
For simplicity, write $\hat r$ and $r$ for $\hat r^{\pi}(t)$ and $r(t; \eta, \gamma)$, respectively. The Bayesian estimator $\hat r$ of $r$ is the minimizer of $E\big(\ell(\hat r, r) \mid X=t, Y=t\big)$ with $\ell(\hat r, r) = E_p\big[(\log\hat r - \log r)^2\big]$; therefore, $\frac{\partial}{\partial\log\hat r}\, E\big(E_p(\log\hat r - \log r)^2 \mid X=t, Y=t\big) = 0$ implies $\log\hat r^{\pi}(t) = E\big(E_p[\log r] \mid X=t, Y=t\big) = E\big(KL(p, q) \mid X=t, Y=t\big)$. Applying Lemma 1, together with the fact that $\frac{\partial^2}{\partial(\log\hat r)^2}\, E\big(E_p(\log\hat r - \log r)^2 \mid X=t, Y=t\big) \ge 0$, completes the proof. ☐

3.1.2. Log–$L_1$ Loss Function

For larger errors in the loss function (12), one needs to consider the log–$L_1$ branch of the loss, $\ell(\tilde r, r) = 2\big|\log\frac{\tilde r}{r}\big| - 1$. Let $\tilde r$ denote the corresponding Bayesian DRE. As in Theorem 1, we can express $\tilde r^{\pi}(t)$ as the product of the plug-in DRE and a correction factor. That is, under the log–$L_1$ loss $L(\tilde r, r) = \big|\log\frac{\tilde r}{r}\big|$ and prior distribution $\pi(\theta)$, for $\theta = (\eta, \gamma)$, we have
$$\tilde r^{\pi}(t) = \tilde r_{plug}(t)\, H(t),$$
where $\tilde r_{plug}(t)$ and $H(t)$ are obtained in the same fashion as in (14) and (15), except that the posterior median $M(\cdot \mid X=t, Y=t)$ replaces the posterior expectation $E(\cdot \mid X=t, Y=t)$. Notice that, for instance, the results in Table 1 continue to hold whenever the posterior densities are symmetric about their means (equivalently, their medians).
A calculation similar to that of Lemma 2, with loss function $E_p\big|\log\hat r(T) - \log r(T)\big|$, gives
$$\tilde r^{\pi}(t) = \exp\big\{M\big(B(\eta, \gamma) \mid X=t, Y=t\big)\big\}.$$
Similarly to (17) and (18), we can also write $\tilde r^{\pi}(t) = \exp\big\{M\big(KL(p, q) \mid X=t, Y=t\big)\big\}$; that is, the expectations are replaced by medians.

3.2. Examples of the Bayesian DRE and Some Applications

We first consider the problem of finding the Bayesian DRE under the log–$L_1$ and log–$L_2$ losses for two different normal densities.
Example 1.
Let $X \mid \theta_1 \sim N(\theta_1, \sigma_1^2)$ be independent of $Y \mid \theta_2 \sim N(\theta_2, \sigma_2^2)$, with known variances and independent prior distributions $\theta_i \sim N(\xi_i, \tau_i^2)$ for $i = 1, 2$. Hence, the posterior densities are given by
$$\theta_1 \mid X = t \sim N\Big(\frac{\xi_1\sigma_1^2 + t\,\tau_1^2}{\tau_1^2 + \sigma_1^2},\; \frac{\tau_1^2\sigma_1^2}{\tau_1^2 + \sigma_1^2}\Big), \qquad \theta_2 \mid Y = t \sim N\Big(\frac{\xi_2\sigma_2^2 + t\,\tau_2^2}{\tau_2^2 + \sigma_2^2},\; \frac{\tau_2^2\sigma_2^2}{\tau_2^2 + \sigma_2^2}\Big),$$
yielding the Bayesian DRE under either loss function, $L(\hat r, r) = \big(\log\frac{\hat r}{r}\big)^2$ or $L(\hat r, r) = \big|\log\frac{\hat r}{r}\big|$, as
$$\tilde r^{\pi}(t) = \hat r^{\pi}(t) = \frac{\sigma_2}{\sigma_1}\exp\Bigg\{\frac{1}{2\sigma_2^2}\Bigg[\Big(\frac{\xi_1\sigma_1^2 + t\,\tau_1^2}{\sigma_1^2 + \tau_1^2} - \frac{\xi_2\sigma_2^2 + t\,\tau_2^2}{\sigma_2^2 + \tau_2^2}\Big)^2 + \frac{\sigma_1^2\tau_1^2}{\sigma_1^2 + \tau_1^2} + \frac{\sigma_2^2\tau_2^2}{\sigma_2^2 + \tau_2^2} + \sigma_1^2 - \sigma_2^2\Bigg]\Bigg\}.$$
In the multivariate case, that is, $X \mid \theta_1 \sim N_p(\theta_1, \Sigma_1)$ and $Y \mid \theta_2 \sim N_p(\theta_2, \Sigma_2)$ with both covariance matrices known, the conjugate prior for each mean is $p$-variate normal, $\theta_i \sim N_p(\xi_i, V_i)$ for $i = 1, 2$, and hence each posterior is $p$-variate normal. The results are analogous to the univariate case, with $\theta_1 \mid t \sim N_p\big(W_1(\Sigma_1^{-1}t + V_1^{-1}\xi_1),\, W_1\big)$ and $\theta_2 \mid t \sim N_p\big(W_2(\Sigma_2^{-1}t + V_2^{-1}\xi_2),\, W_2\big)$, where $W_i = (\Sigma_i^{-1} + V_i^{-1})^{-1}$. Therefore, writing $m_i(t) = W_i(\Sigma_i^{-1}t + V_i^{-1}\xi_i)$, the Bayesian DRE for the ratio of the two multivariate normal densities equals the exponential of
$$\frac12\Big[\big(m_1(t) - m_2(t)\big)^\top\Sigma_2^{-1}\big(m_1(t) - m_2(t)\big) + \mathrm{tr}\big(\Sigma_2^{-1}(\Sigma_1 + W_1 + W_2)\big) - p - \log\frac{\det(\Sigma_1)}{\det(\Sigma_2)}\Big].$$
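A short Monte Carlo sketch of Example 1 (hypothetical hyper-parameter values) computes the same estimator by averaging the normal–normal KL divergence from Table 2 over independent posterior draws, which avoids relying on the closed form above:

```python
# Monte Carlo sketch of Example 1 (hypothetical hyper-parameters): the Bayesian DRE
# exp{ E[ KL(p, q) | X=t, Y=t ] }, averaging the normal-normal KL divergence of
# Table 2 over independent posterior draws of theta_1 and theta_2.
import numpy as np

rng = np.random.default_rng(0)
sig1, sig2 = 1.0, 1.5
xi1, tau1, xi2, tau2 = 0.0, 1.0, 0.5, 2.0
t = 0.7

def posterior_draws(t, xi, tau, sig, n):
    v = (tau**2 * sig**2) / (tau**2 + sig**2)
    m = (xi * sig**2 + t * tau**2) / (tau**2 + sig**2)
    return rng.normal(m, np.sqrt(v), size=n)

th1 = posterior_draws(t, xi1, tau1, sig1, 200_000)
th2 = posterior_draws(t, xi2, tau2, sig2, 200_000)

kl = (th1 - th2)**2 / (2 * sig2**2) + 0.5 * (sig1**2 / sig2**2 - 1
                                             - np.log(sig1**2 / sig2**2))
print(np.exp(kl.mean()))   # Bayesian DRE r_hat_pi(t)
```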
As we saw above, the key step is to find the KL divergence between the two densities p and q (or, equivalently, the Bregman divergence when the distributions belong to an exponential family), for which a closed form is available. Table 2 reports the KL divergence between p and q for several common families.
Example 2.
Let $X \mid \sigma_1^2 \sim Ray(\sigma_1^2)$ and $Y \mid \sigma_2^2 \sim Ray(\sigma_2^2)$ be two independent random variables from Rayleigh distributions with PDF $p(x \mid \sigma^2) = \frac{x}{\sigma^2}\exp\big(-\frac{x^2}{2\sigma^2}\big)$ for $x > 0$ and $\sigma^2 > 0$. The KL divergence between $p(x \mid \sigma_1^2)$ and $p(y \mid \sigma_2^2)$ is given in Table 2. Setting $\lambda_i = 2\sigma_i^2$ and choosing the conjugate inverse gamma (IG) prior with parameters $\alpha_i$ and $\beta_i$, with PDF $\pi(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\lambda^{-(\alpha+1)}\exp\big(-\frac{\beta}{\lambda}\big)$ for $\lambda > 0$, yields the posterior distribution
$$\lambda_i \mid t \sim IG\big(1 + \alpha_i,\; \beta_i + t^2\big).$$
Therefore, the Bayesian DRE for the ratio of two Rayleigh densities is the exponential of the posterior expectation of their KL divergence. That is,
$$\exp\Big\{E_{\lambda_2 \mid t}[\log\lambda_2] - E_{\lambda_1 \mid t}[\log\lambda_1] + E_{\lambda_1 \mid t}[\lambda_1]\, E_{\lambda_2 \mid t}\big[\lambda_2^{-1}\big] - 1\Big\}.$$
By making use of properties of $\lambda \sim IG(\alpha, \beta)$, such as $E[\log\lambda] = \log\beta - \psi(\alpha)$, $E[\lambda] = \frac{\beta}{\alpha - 1}$ (for $\alpha > 1$), and $E[\lambda^{-1}] = \frac{\alpha}{\beta}$, where $\psi(\cdot)$ is the digamma function, the Bayesian DRE under the loss $E_p\big[(\log\hat r(T) - \log r(T))^2\big]$ is given by
$$\hat r^{\pi}(t) = \frac{\beta_2 + t^2}{\beta_1 + t^2}\,\exp\Big\{\frac{(1 + \alpha_2)(\beta_1 + t^2)}{\alpha_1(\beta_2 + t^2)} + \psi(1 + \alpha_1) - \psi(1 + \alpha_2) - 1\Big\}.$$
Since the median of an inverse gamma distribution does not have a closed form, we cannot give an explicit formula for the Bayesian DRE under the loss $E_p\big|\log\hat r(T) - \log r(T)\big|$, nor consequently under the log–Huber loss; these estimators must be computed numerically.
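The following sketch implements Example 2 numerically: the closed form above (our derivation from the stated inverse gamma posteriors) is cross-checked against a Monte Carlo evaluation of the posterior expectation of the KL divergence; prior parameters are hypothetical.

```python
# Example 2 in code (a sketch): the closed form derived above is cross-checked by
# Monte Carlo; prior parameters alpha_i, beta_i are hypothetical.
import numpy as np
from scipy.stats import invgamma
from scipy.special import digamma

a1, b1, a2, b2 = 3.0, 2.0, 4.0, 1.5
t = 1.2

# Posterior parameters: lambda_i | t ~ IG(1 + alpha_i, beta_i + t^2)
A1, B1 = 1 + a1, b1 + t**2
A2, B2 = 1 + a2, b2 + t**2

# Closed form of r_hat_pi(t) from exp{ E[KL | t] }, with
# KL = log(lambda_2 / lambda_1) + lambda_1 / lambda_2 - 1
r_closed = (B2 / B1) * np.exp(A2 * B1 / ((A1 - 1) * B2)
                              + digamma(A1) - digamma(A2) - 1.0)

# Monte Carlo cross-check using independent posterior draws
rng = np.random.default_rng(0)
l1 = invgamma.rvs(A1, scale=B1, size=400_000, random_state=rng)
l2 = invgamma.rvs(A2, scale=B2, size=400_000, random_state=rng)
r_mc = np.exp(np.mean(np.log(l2 / l1) + l1 / l2 - 1.0))

print(r_closed, r_mc)   # should agree to Monte Carlo accuracy
```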
The following remarks clarify some points related to Table 2 and the Bayes estimators obtained in the examples of Section 3.2.
Remark 1.
The Bayes DRE $\hat r^{\pi}(t)$ is connected to the samples X and Y through the posterior densities of $\eta \mid X = t$ and $\gamma \mid Y = t$.
Remark 2.
From Table 2 it can be seen that the KL divergence between two log-normal PDFs is the same as that between the corresponding normal distributions, since the KL divergence is invariant under invertible transformations of the variables.
Example 3.
Suppose $X \sim Uniform(0, \theta_1)$ is independent of $Y \sim Uniform(0, \theta_2)$, with $\theta_1 < \theta_2$. It is easy to check that the Pareto distribution is conjugate to the uniform. Hence, assume that $\pi(\theta_i) = \alpha_i\,\beta_i^{\alpha_i}\,\theta_i^{-(\alpha_i + 1)}$ for $\theta_i \ge \beta_i$ and $i = 1, 2$. Therefore,
$$\theta_i \mid t \sim Pareto\big(1 + \alpha_i,\; \max(\beta_i, t)\big).$$
Moreover, with p and q denoting the PDFs of X and Y, respectively, $KL(p, q) = \log\theta_2 - \log\theta_1$ (see Table 2). Employing the fact that, for $Z \sim Pareto(a, b)$, the transformation $\log(Z/b)$ has an exponential distribution with mean $1/a$, and hence $E_Z[\log Z] = \frac{1}{a} + \log b$, after some calculation we obtain the Bayes DRE associated with the loss $E_p\big[(\log\hat r(T) - \log r(T))^2\big]$:
$$\hat r^{\pi}(t) = \frac{\max(\beta_2, t)}{\max(\beta_1, t)}\,\exp\Big\{\frac{1}{1 + \alpha_2} - \frac{1}{1 + \alpha_1}\Big\}.$$
Similar to Example 2, the Bayes DRE under E p | log r ^ ( t ) log r ( t ) | must be calculated numerically.

4. Other Applications

Here, we discuss some other applications of the DRE method.
1. Estimating the α-divergence function between two probability densities:
A discrepancy measure between the densities $p_{\eta}$ and $q_{\gamma}$ belonging to the class of Ali–Silvey [23] distances, also known as the α-divergence (Csiszàr [24]), is given by
$$d_{\alpha}\big(p(\cdot \mid \eta),\, q(\cdot \mid \gamma)\big) = \int_{\mathbb{R}^d} h_{\alpha}\Big(\frac{p(t \mid \eta)}{q(t \mid \gamma)}\Big)\, dP(t \mid \gamma),$$
where
$$h_{\alpha}(z) = \begin{cases} \dfrac{4}{1 - \alpha^2}\big(1 - z^{(1+\alpha)/2}\big), & |\alpha| \ne 1, \\[2mm] z\log(z), & \alpha = 1, \\ -\log(z), & \alpha = -1. \end{cases}$$
Note that some notable divergence functions, such as the Kullback–Leibler (KL), reverse Kullback–Leibler (RKL), and Hellinger divergences, correspond to $\alpha = 1$, $-1$, and $0$, respectively, and belong to this class. So, if $r(t)$ is estimated by $\hat r^{\pi}(t)$ under the log–$L_2$ loss (or by $\tilde r^{\pi}(t)$ under the log–$L_1$ loss), then, applying the Monte Carlo approximation method, the α-divergence can be estimated by
$$\hat d_{\alpha} = \frac{1}{n}\sum_{i=1}^{n} h_{\alpha}\big(\hat r^{\pi}(t_i)\big), \qquad\qquad (23)$$
where the $t_i$ are drawn from $q(\cdot \mid \gamma)$. Note that there are other works on estimating the α-divergence (e.g., Póczos and Schneider [25]), but our estimator in (23) is a parametric Bayesian estimator based on the DRE; a small Monte Carlo sketch is given after this list. In the next section we examine the performance of the proposed estimator $\hat d_{\alpha}$.
2. Plug-in-type estimation of the density ratio under the KL loss function:
Consider the plug-in density estimator, say $p(t \mid \hat\eta(t))$, for estimating $p_{\eta}$ in the exponential family (1), based on the KL loss. We have
$$KL\big(p(t \mid \eta),\, p(t \mid \hat\eta)\big) = \int \log\frac{p(t \mid \eta)}{p(t \mid \hat\eta)}\, P(t \mid \eta)\,dt = \int p(t \mid \eta)\,\log\frac{\exp\{\langle\eta, s(t)\rangle - \kappa_1(\eta)\}}{\exp\{\langle\hat\eta(t), s(t)\rangle - \kappa_1(\hat\eta(t))\}}\,dt = \big\langle \eta - \hat\eta(t),\, E_p[s(T)]\big\rangle - \kappa_1(\eta) + \kappa_1(\hat\eta(t)) = \big\langle \eta - \hat\eta(t),\, \nabla\kappa_1(\eta)\big\rangle - \kappa_1(\eta) + \kappa_1(\hat\eta(t)). \qquad\qquad (24)$$
Next, finding the plug-in density estimator $p(t \mid \hat\eta(t))$ that minimizes the KL loss is equivalent to finding the point estimator $\hat\eta(t)$ that minimizes the posterior expectation of the loss function in (24). Therefore, $\frac{\partial}{\partial\hat\eta}\, E^{\eta\mid t}\big[KL\big(p(t \mid \eta),\, p(t \mid \hat\eta)\big)\big] = 0$ implies $\nabla\kappa_1(\hat\eta) = E^{\eta\mid t}\big[\nabla\kappa_1(\eta)\big]$, and hence the Bayes estimator of $\eta$ is given by
$$\hat\eta_1(t) = \big(\nabla\kappa_1\big)^{-1}\Big(E^{\eta\mid t}\big[\nabla\kappa_1(\eta)\big]\Big). \qquad\qquad (25)$$
The same argument applied to $q(t \mid \hat\gamma(t))$ for estimating $q_{\gamma}$ in the exponential family (2) gives
$$\hat\gamma_2(t) = \big(\nabla\kappa_2\big)^{-1}\Big(E^{\gamma\mid t}\big[\nabla\kappa_2(\gamma)\big]\Big). \qquad\qquad (26)$$
Specializing to the case where both densities have the same distributional form in (1) (e.g., the ratio of two normal or two Poisson densities), substituting the Bayes estimators obtained in (25) and (26) into the plug-in estimator of $r(t)$ gives
$$\hat r(t) = \exp\big\{\big(\hat\eta_1(t) - \hat\gamma_2(t)\big)^\top s(t) - \kappa_1\big(\hat\eta_1(t)\big) + \kappa_2\big(\hat\gamma_2(t)\big)\big\}.$$
Note that $\tilde r(t)$ can be obtained similarly by replacing the estimators $\hat\eta_1$ and $\hat\gamma_2$ with their posterior-median counterparts $\tilde\eta_1$ and $\tilde\gamma_2$.
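Returning to the α-divergence estimator (23) of item 1, the following Monte Carlo sketch (a hypothetical setup, not the authors' code) evaluates $h_\alpha$ at an available density-ratio estimate and averages over draws from q; for simplicity the true normal ratio stands in for $\hat r^{\pi}$.

```python
# Monte Carlo alpha-divergence estimator (23), sketched for two normals with the
# true ratio used as a stand-in for r_hat_pi; all settings are hypothetical.
import numpy as np
from scipy.stats import norm

def h_alpha(z, alpha):
    z = np.asarray(z, dtype=float)
    if alpha == 1.0:
        return z * np.log(z)
    if alpha == -1.0:
        return -np.log(z)
    return 4.0 / (1.0 - alpha**2) * (1.0 - z ** ((1.0 + alpha) / 2.0))

rng = np.random.default_rng(0)
mu1, s1, mu2, s2 = 0.5, 1.0, -0.3, 1.0
t = rng.normal(mu2, s2, size=200_000)            # draws from q
r = norm.pdf(t, mu1, s1) / norm.pdf(t, mu2, s2)  # stand-in for r_hat_pi(t_i)
d0_hat = np.mean(h_alpha(r, alpha=0.0))
print(d0_hat)   # Monte Carlo estimate of the alpha = 0 (Hellinger-type) divergence
```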

5. Numerical Illustrations

We conclude with some numerical illustrations of the log–Huber risk performance of the Bayesian and plug-in DREs when p and q are two normal models (belonging to models (1) and (2)) with a common location parameter θ. We show that the performances of the plug-in and Bayes estimators are quite similar, and that for suitable choices of the hyper-parameters the two density ratio estimators coincide and hence have the same frequentist risk. We first compare the risk performance under the log–$L_2$ loss and then extend the comparison to the log–Huber loss.
Figure 2 exhibits the frequentist risk of the plug-in DRE $\hat r_{plug}$ and the Bayes DRE $\hat r^{\pi}$ under the log–$L_2$ loss for all possible values of θ, using Theorem 1.
Figure 3 shows how the frequentist risk of the plug-in DRE $\hat r_{plug}$ and the Bayes DRE $\hat r^{\pi}$ under the log–$L_2$ loss changes as the ratio of variances $\sigma_1^2/\sigma_2^2$ varies (with known means), using Theorem 1. It can be seen that the plug-in estimator performs better for larger variance ratios.
Finally, Figure 4 and Figure 5 depict the risk functions of the Bayes and plug-in DREs under the log–Huber loss over the possible ranges of the mean and variance, respectively, in Example 1. It is clear from Figure 5, in contrast to Figure 3 (risk functions of the Bayes and plug-in DREs under the log–$L_2$ loss), that the Bayes DRE has lower risk than the plug-in DRE and hence is more robust to variation in the variances.
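A sketch of the kind of simulation that underlies Figure 2 is given below; the settings are assumed (p = N(θ, 1), q = N(θ, 2), priors N(0, 1)) and the code is illustrative rather than the authors' implementation.

```python
# Sketch of a frequentist-risk simulation under the log-L2 loss for the plug-in and
# Bayes DREs in p = N(theta, 1), q = N(theta, 2), with priors N(0, 1); settings assumed.
import numpy as np

rng = np.random.default_rng(0)
sig1, sig2 = 1.0, np.sqrt(2.0)
xi, tau = 0.0, 1.0                      # common prior hyper-parameters
thetas = np.linspace(-3, 3, 13)
n_rep = 20_000

def post_mean_var(t, sig):
    v = (tau**2 * sig**2) / (tau**2 + sig**2)
    m = (xi * sig**2 + t * tau**2) / (tau**2 + sig**2)
    return m, v

for theta in thetas:
    t = rng.normal(theta, 1.0, size=n_rep)        # simulated common observation
    m1, v1 = post_mean_var(t, sig1)
    m2, v2 = post_mean_var(t, sig2)
    log_r_true = np.log(sig2 / sig1) - (t - theta)**2 / (2*sig1**2) + (t - theta)**2 / (2*sig2**2)
    log_r_plug = np.log(sig2 / sig1) - (t - m1)**2 / (2*sig1**2) + (t - m2)**2 / (2*sig2**2)
    log_r_bayes = log_r_plug - v1 / (2*sig1**2) + v2 / (2*sig2**2)   # plug-in times H(t)
    risk_plug = np.mean((log_r_plug - log_r_true)**2)
    risk_bayes = np.mean((log_r_bayes - log_r_true)**2)
    print(f"theta={theta:+.2f}  plug-in={risk_plug:.4f}  Bayes={risk_bayes:.4f}")
```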

6. Concluding Remarks

Estimation of the ratio of two or more densities has received widespread attention in recent years. Until now, most methods have concentrated on solving this problem via nonparametric approaches. In this paper, we focused on a parametric Bayesian approach, where the distributions come from the canonical form of the exponential family. We applied the log–Huber loss function to investigate the utility of the Bayesian and plug-in DREs. Our results confirm that the Bayesian DRE and the plug-in DRE (based on posterior expectations) perform similarly under the log–Huber loss, and can be exactly equal when the correction factor H = 1. This is a somewhat different picture from the nonparametric setting, where plug-in estimators often perform poorly compared with nonparametric Bayesian methods, which typically involve stochastic processes such as the Gaussian process and the Dirichlet process. There are instances (see Krnjajic et al. [26] and the references therein) where, for certain types of count data, the nonparametric Bayesian methodology provides enhanced flexibility to fit the data, rich posterior inferences, and finer predictive inference under a set of carefully selected criteria. However, there is a major drawback: these processes are infinite-dimensional, and naive algorithmic approaches to computing posteriors are generally infeasible. Finally, we discussed the application of DRE to estimating the α-divergence between two PDFs.

Author Contributions

Methodology, calculations and writing the draft, A.S.; adding critical and constructive comments as well as editorial advice, Y.P. and C.D.L.

Funding

This research received no external funding.

Acknowledgments

The authors thank Eric Marchand (Université de Sherbrooke) for his useful suggestions. We also want to thank the referees for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density Ratio Estimation in Machine Learning; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  2. Kanamori, T.; Hido, S.; Sugiyama, M. A Least-squares Approach to Direct Importance Estimation. J. Mach. Learn. Res. 2009, 10, 1391–1445. [Google Scholar]
  3. Sugiyama, M.; Kanamori, T.; Suzuki, T.; Hido, S.; Sese, J.; Takeuchi, I.; Wang, L. A density-ratio framework for statistical data processing. IPSJ Trans. Comput. Vis. Appl. 2009, 1, 183–208. [Google Scholar] [CrossRef]
  4. Sugiyama, M.; Yamada, M.; Bunau, P.V.; Suzuki, T.; Kanamori, T.; Kawanabe, M. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Netw. 2011, 24, 183–198. [Google Scholar] [CrossRef] [PubMed]
  5. Sugiyama, M.; Müller, K.R. Input-dependent estimation of generalization error under covariate shift. Stat. Decis. 2005, 23, 249–279. [Google Scholar] [CrossRef]
  6. Quiñonero-Candela, J.; Sugiyama, M.; Schwaighofer, A.; Lawrence, N. Dataset Shift in Machine Learning; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
  7. Suzuki, T.; Sugiyama, M.; Kanamori, T.; Sese, J. Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinform. 2009, 10, S52. [Google Scholar] [CrossRef] [PubMed]
  8. Suzuki, T.; Sugiyama, M. Sufficient dimension reduction via squared-loss mutual information estimation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Chia Laguna, Italy, 13–15 May 2010; pp. 804–811. [Google Scholar]
  9. Sugiyama, M.; Hara, S.; von Bünau, P.; Suzuki, T.; Kanamori, T.; Kawanabe, M. Direct density ratio estimation with dimensionality reduction. In Proceedings of the SIAM International Conference on Data Mining, Columbus, OH, USA, 29 April–1 May 2010. [Google Scholar]
  10. Hido, S.; Tsuboi, Y.; Kashima, H.; Sugiyama, M.; Kanamori, T. Inlier–Based Outlier Detection via Direct Density Ratio Estimation. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 223–232. [Google Scholar]
  11. Thomas, O.; Dutta, R.; Corander, J.; Kaski, S.; Gutmann, M.U. Likelihood-free inference by ratio estimation. arXiv 2016, arXiv:1611.10242. [Google Scholar]
  12. Gretton, A.; Smola, A.; Huang, J.; Schmittfull, M.; Borgwardt, K.; Schölkopf, B. Covariate shift by kernel mean matching. Dataset Shift Mach. Learn. 2009, 3, 5. [Google Scholar]
  13. Sugiyama, M.; Suzuki, T.; Nakajima, S.; Kashima, H.; von Bunau, P.; Kawanabe, M. Direct importance estimation for covariate shift adaptation. Ann. Inst. Stat. Math. 2008, 60, 699–746. [Google Scholar] [CrossRef]
  14. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functional and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
  15. Deledalle, C.A. Estimation of Kullback-Leibler losses for noisy recovery problems within the exponential family. Electron. J. Stat. 2017, 11, 3141–3164. [Google Scholar] [CrossRef]
  16. Sugiyama, M.; Suzuki, T.; Kanamori, T. Density-ratio matching under the Bregman divergence: A unified framework of density-ratio estimation. Ann. Inst. Stat. Math. 2012, 64, 1009–1044. [Google Scholar] [CrossRef]
  17. Lehmann, E.L.; Casella, G. Theory of Point Estimation, 2nd ed.; Springer Texts in Statistics; Springer: New York, NY, USA, 1998. [Google Scholar]
  18. Huber, P. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 53, 73–101. [Google Scholar] [CrossRef]
  19. Huber, P.J.; Ronchetti, E.M. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 2. [Google Scholar]
  20. Brown, L.D. Fundamentals of Exponential Families; IMS: Hayward, CA, USA, 1986. [Google Scholar]
  21. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  22. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 3621–3624. [Google Scholar]
  23. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar] [CrossRef]
  24. Csiszàr, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 299–318. [Google Scholar]
  25. Póczos, B.; Schneider, J. On the estimation of alpha-divergences. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 11–13 April 2011; pp. 609–617. [Google Scholar]
  26. Krnjajic, M.; Kottas, A.; Draper, D. Parametric and nonparametric Bayesian model specification: A case study involving models for count data. Comput. Stat. Data Anal. 2008, 52, 2110–2128. [Google Scholar] [CrossRef]
Figure 1. Comparison of the Huber, $L_2$ (least squares), and $L_1$ (absolute error) loss functions in (11).
Figure 2. Risk function of the Bayes and plug-in density ratio estimation (DRE) under the log–$L_2$ loss in the normal models $p = N(\theta, 1)$, $q = N(\theta, 2)$ with hyper-parameters $\xi_1 = \xi_2 = 0$, $\tau_1 = \tau_2 = 1$, corresponding to Example 1.
Figure 3. Risk function of the Bayes and plug-in DRE under the log–$L_2$ loss in the normal models $p = N(1, \sigma_1^2)$, $q = N(1, \sigma_2^2)$ with hyper-parameters $\xi_1 = \xi_2 = 0$, $\tau_1 = \tau_2 = 1$ in Example 1.
Figure 4. Risk function of the Bayes and plug-in DRE under the log–Huber loss in the normal models $p = N(\theta, 1)$, $q = N(\theta, 2)$ with hyper-parameters $\xi_1 = \xi_2 = 0$, $\tau_1 = \tau_2 = 1$, over the whole possible range of θ in Example 1.
Figure 5. Risk function of the Bayes and plug-in DRE under the log–Huber loss in the normal models $p = N(1, \sigma_1^2)$, $q = N(1, \sigma_2^2)$ with hyper-parameters $\xi_1 = \xi_2 = 0$, $\tau_1 = \tau_2 = 1$, over the whole possible range of $\sigma_2$ in Example 1.
Table 1. Correction factor $H(t)$ and conditions under which $H(t) = 1$, associated with the log–$L_2$ loss. Bet: beta distribution; Bin: binomial distribution; Gam: gamma distribution; Ge: geometric distribution; IG: inverse gamma distribution; P: Poisson distribution; Pa: Pareto distribution; W: Weibull distribution.
Density | Conjugate Prior | Correction Factor $H(t)$ | Condition for $H = 1$
G a m ( α i , β i ) β i G a m ( a i , b i ) ( α 1 + a 1 ) α 1 ( α 2 + a 2 ) α 2 exp { α 1 ψ ( α 1 + a 1 ) α 2 ψ ( α 2 + a 2 ) } α 1 = α 2 , a 1 = a 2
N ( θ i , σ i 2 ) θ i N ( μ i , τ i 2 ) exp { 1 / 2 ( 1 b 1 ( 1 + b 1 ) 1 b 2 ( 1 + b 2 ) ) } b 1 b 2 = 1 + b 1 1 + b 2 , where b i = σ i 2 / τ i 2
N ( θ i , σ i 2 ) σ i 2 I G ( α i , β i ) α 1 1 / 2 α 2 1 / 2 exp { 1 / 2 ( ψ ( α 2 + 1 / 2 ) ψ ( α 1 + 1 / 2 ) ) } α 1 = α 2
P a ( a i , b i ) b i G a m ( α i , β i ) α 1 + 1 α 2 1 exp { α 1 α 2 α 1 α 2 + ψ ( α 2 ) ψ ( α 1 ) } α 1 = α 2
W ( λ i , k i ) λ i I G ( α i , β i ) α 1 k 1 / α 2 k 2 exp { k 2 ψ ( α 2 + 1 ) k 1 ψ ( α 1 + 1 ) } k 1 = k 2 , α 1 = α 2
B i n ( n i , p i ) p i B e t ( α i , β i ) α 2 + β 2 + n 2 α 1 + β 1 + n 1 β 1 + n 1 t α 2 + n 2 t × exp { ψ ( β 2 + n t ) ψ ( β 1 + n 1 t ) + ψ ( α 1 + β 1 + n 1 ) ψ ( α 2 + β 2 + n 2 ) } α 1 = α 2 , β 1 = β 2 , n 1 = n 2
P ( λ i ) λ i G a m ( α i , β i ) 1 α i , β i , and λ i
G e ( p i ) p i B e t ( α i , β i ) exp { ψ ( β 1 + t ) ψ ( α 1 + 1 ) + ψ ( β 2 + t ) ψ ( α 2 + 1 ) } β 2 + t β 1 + t α 1 + 1 α 2 + 1 α 1 = α 2 , β 1 = β 2
Table 2. KL divergence between p and q. $\chi^2$: chi-squared distribution; U: uniform distribution; Ray: Rayleigh distribution.
Density | KL
$N_p(\theta_i, \Sigma_i)$ | $\frac12\big[(\theta_1 - \theta_2)^\top\Sigma_2^{-1}(\theta_1 - \theta_2) + \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) - p - \log\frac{\det(\Sigma_1)}{\det(\Sigma_2)}\big]$
$N(\theta_i, \sigma_i^2)$ | $\frac{1}{2\sigma_2^2}(\theta_1 - \theta_2)^2 + \frac12\big(\frac{\sigma_1^2}{\sigma_2^2} - 1 - \log\frac{\sigma_1^2}{\sigma_2^2}\big)$
$Bet(a_i, b_i)$ | $\log\frac{Bet(a_2, b_2)}{Bet(a_1, b_1)} + \psi(a_1)(a_1 - a_2) + \psi(b_1)(b_1 - b_2) + \psi(a_1 + b_1)(a_2 - a_1 + b_2 - b_1)$
$Gam(k_i, \lambda_i)$ | $\big(\frac{\lambda_1}{\lambda_2} - 1\big)k_1 + (k_1 - k_2)\big(\log\lambda_1 + \psi(k_1)\big) - \log\frac{\Gamma(k_1)\,\lambda_1^{k_1}}{\Gamma(k_2)\,\lambda_2^{k_2}}$
$\chi^2(k_i)$ | $\log\frac{\Gamma(k_2/2)}{\Gamma(k_1/2)} + \frac12\,\psi\big(\tfrac{k_1}{2}\big)(k_1 - k_2)$
log-$N(\theta_i, \sigma_i^2)$ | $\frac{1}{2\sigma_2^2}(\theta_1 - \theta_2)^2 + \frac12\big(\frac{\sigma_1^2}{\sigma_2^2} - 1 - \log\frac{\sigma_1^2}{\sigma_2^2}\big)$
$Ray(\sigma_i)$ | $2\log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 - \sigma_2^2}{\sigma_2^2}$
$Pa(m_i, a_i)$ | $a_2\log\frac{m_1}{m_2} - \log\frac{a_2}{a_1} + \frac{a_2}{a_1} - 1$
$U(0, \theta_i)$ | $\log\frac{\theta_2}{\theta_1}$, provided $\theta_2 > \theta_1$
