Article

Update of Prior Probabilities by Minimal Divergence

Jan Naudts
Departement Fysica, Universiteit Antwerpen, 2610 Antwerpen, Belgium
Entropy 2021, 23(12), 1668; https://doi.org/10.3390/e23121668
Submission received: 18 November 2021 / Revised: 7 December 2021 / Accepted: 9 December 2021 / Published: 11 December 2021

Abstract

The present paper investigates the update of an empirical probability distribution with the results of a new set of observations. The update reproduces the new observations and interpolates using prior information. The optimal update is obtained by minimizing either the Hellinger distance or the quadratic Bregman divergence. The results obtained by the two methods differ. Updates with information about conditional probabilities are considered as well.

1. Introduction

The present work is inspired by the current practices in Information Geometry [1,2,3] where minimization of divergences is an important tool. In Statistical Physics a divergence is called a relative entropy. Its importance was noted rather late in the twentieth century, after the work of Jaynes on the maximal entropy principle [4]. Estimation in the presence of hidden variables by minimizing a divergence function is briefly discussed in Chapter 8 of [2].
Assume now that some observation or experiment yields new statistical data. The approach is then to look for a probability distribution that reproduces the newly observed probabilities and that interpolates the data with missing information coming from a prior.
No further model assumptions are imposed. Hence, the statistical model under consideration consists of all probability distributions that are consistent with the newly obtained empirical data. Internal consistency of the empirical data ensures that the model is not empty. The update is the model point that minimizes the chosen divergence function from the prior to the manifold of the model.
In the context of Maximum Likelihood Estimation (MLE) one usually adopts a parameterized model. The dimension of the model can be kept low and properties of the model can be used to ease the calculations. One assumes that the new data can lead to a more accurate estimation of the limited number of model parameters. It can then happen that the model is misspecified [5] and that the update is only a good approximation of the empirical data.
Here, the model is dictated by the newly acquired empirical data and the update is forced to reproduce the measured data. Finding the probability distribution is then an underdetermined problem. Minimization of the divergence from the prior probability distribution solves the underdetermination.
In Bayesian statistics, the update $q(B)$ of the probability $p(B)$ of an event $B$ equals
$$q(B) = p^{\mathrm{emp}}(A)\, p(B|A) + p^{\mathrm{emp}}(A^c)\, p(B|A^c). \tag{1}$$
The quantities $p^{\mathrm{emp}}(A)$ and $p^{\mathrm{emp}}(A^c)$ are the empirical probabilities obtained after repeated measurement of event $A$ and its complement $A^c$. Expression (1) has been called Jeffrey conditioning [6]. It implies the sufficiency conditions $q(B|A) = p(B|A)$ and $q(B|A^c) = p(B|A^c)$. It is an updating rule used in Radical Probabilism [7]. This expression is also obtained when minimizing the Hellinger distance between the prior and the model manifold. A proof of the latter follows in Section 4.
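As a minimal illustration of expression (1), the following Python sketch applies Jeffrey conditioning on a small finite sample space. The prior `prior`, the events `A` and `B`, and the function names are hypothetical choices made only for this example; they do not come from the paper.

```python
def prob(pmf, event):
    """Probability of an event (a set of outcomes) under a pmf given as a dict."""
    return sum(pmf[x] for x in event)

def jeffrey_update(pmf, A, B, p_emp_A):
    """Return q(B) = p_emp(A) p(B|A) + p_emp(A^c) p(B|A^c), as in expression (1)."""
    A = set(A)
    B = set(B)
    A_c = set(pmf) - A
    p_B_given_A = prob(pmf, A & B) / prob(pmf, A)
    p_B_given_Ac = prob(pmf, A_c & B) / prob(pmf, A_c)
    return p_emp_A * p_B_given_A + (1.0 - p_emp_A) * p_B_given_Ac

# Hypothetical prior on four outcomes
prior = {0: 0.1, 1: 0.2, 2: 0.3, 3: 0.4}
A = {0, 1}
B = {1, 2}

# Empirical probability of A differs from the prior value p(A) = 0.3
q_B = jeffrey_update(prior, A, B, p_emp_A=0.5)

# If the empirical probability equals the prior one, q(B) reduces to p(B)
q_B_unchanged = jeffrey_update(prior, A, B, p_emp_A=prob(prior, A))
```

When the empirical probability of `A` coincides with its prior probability, the update leaves the probability of `B` unchanged, as one expects from the law of total probability.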
The present approach is a special case of minimizing a divergence function in the presence of linear constraints. See the introduction of [8] for an overview of early applications of this technique. Two classes of generalized distance functions satisfy a natural set of axioms: the f-divergences of Csiszár and the generalized Bregman divergences. The squared Hellinger distance belongs to the former class. The other divergence function considered here is the square Bregman divergence. Both Hellinger and square Bregman have special properties that make it easy to work with them.
A broad class of generalized Bregman divergences satisfies the Pythagorean equality [8,9]. Pythagorean inequalities hold for an even larger class [10]. The Pythagorean relations derived in the present work make use of the specific properties of the Hellinger distance and of the quadratic Bregman divergence. It is unclear how to prove them for more general divergences.
One incentive for starting the present work is a paper of Banerjee, Guo, and Wang [11,12]. They consider the problem of predicting a random variable $Z_1$ given observations of a random variable $Z_2$. It is well-known that the conditional expectation, as defined by Kolmogorov, is the optimal predictor. They show that this statement remains true when the metric distance is replaced by a Bregman divergence. It is shown in Theorem 2 below that a proof in a more general context yields a deviating result.
The next Section fixes notations. Section 3 collects some results about the squared Hellinger distance and the quadratic Bregman divergence. Section 4 discusses the optimal choice and contains the Theorems 1 and 2. The proof of the theorems can be adapted to cover the situation that a subsequent measurement also yields information on conditional probabilities. This is shown in Section 4.3. Section 5 treats a simple example. A final section summarizes the results of the paper.

2. Empirical Data

Consider a probability space $\Omega, \mu$. A measurable subset $A$ of $\Omega$ is called an event. Its probability is denoted $p(A)$ and is given by
$$p(A) = \int_\Omega \mathbb{I}_A(x)\, \mathrm{d}\mu(x),$$
where $\mathbb{I}_A(x)$ equals 1 when $x \in A$ and 0 otherwise. The conditional expectation of a random variable $f$ given an event $A$ with non-vanishing probability $p(A)$ is given by
$$\mathbb{E}_\mu \left[ f \,|\, A \right] = \frac{1}{p(A)}\, \mathbb{E}_\mu \left[ f\, \mathbb{I}_A \right].$$
The probability space $\Omega, \mu$ reflects the prior knowledge of the system at hand. When new data become available an update procedure is used to select the posterior probability space. The latter is denoted $\Omega, \nu$ in what follows. The corresponding probability of an event $A$ is denoted $q(A)$.
The outcome of repeated experiments is the empirical probability distribution of the events, denoted $p^{\mathrm{emp}}(A)$. The question at hand is then to establish a criterion for finding the update $\nu$ of the probability distribution $\mu$ that is as close as possible to $\mu$ while reproducing the empirical results.
The event $A$ defines a partition $A, A^c$ of the probability space $\Omega, \mu$. As before, $A^c$ denotes the complement of $A$ in $\Omega$. In what follows a slightly more general situation is considered in which the event $A$ is replaced by a partition $(O_i)_{i=1}^n$ of the measure space $\Omega, \mu$ into subsets with non-vanishing probability. The notations $p_i$ and $\mu_i$ are used, with
$$p_i = p(O_i) \quad \text{and} \quad \mathrm{d}\mu_i(x) = \frac{1}{p_i}\, \mathbb{I}_{O_i}(x)\, \mathrm{d}\mu(x). \tag{2}$$
Introduce the random variable $g$ defined by $g(x) = i$ when $x \in O_i$. Repeated measurement of the random variable $g$ yields the empirical probabilities
$$p_i^{\mathrm{emp}} = \text{Emp Prob}\, \{ x : g(x) = i \}.$$
They may deviate from the prior probabilities $p_i$. In some cases one also measures the conditional probabilities
$$p^{\mathrm{emp}}(B|O_i) = \text{Emp Prob of } B \text{ given that } g(x) = i$$
of some other event $B$.
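To make the notion of empirical probabilities concrete, the following Python sketch counts how often each cell of a partition is observed. The partition, the list of observations, and the function names are hypothetical and serve only as an illustration under these assumptions.

```python
from collections import Counter

def cell_index(x, partition):
    """Return g(x): the 1-based index i of the cell O_i that contains x."""
    for i, cell in enumerate(partition, start=1):
        if x in cell:
            return i
    raise ValueError(f"outcome {x!r} is not covered by the partition")

def empirical_probabilities(samples, partition):
    """Relative frequency of each cell among the observed samples."""
    counts = Counter(cell_index(x, partition) for x in samples)
    return [counts[i] / len(samples) for i in range(1, len(partition) + 1)]

# Hypothetical partition of {0, 1, 2, 3} and hypothetical observations
partition = [{0, 1}, {2, 3}]
samples = [0, 1, 1, 2, 3, 3, 3, 0, 1, 1]

p_emp = empirical_probabilities(samples, partition)
```

Six of the ten observations fall in the first cell and four in the second, so the relative frequencies sum to one by construction.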

3. A Geometric Approach

In this section two divergences are reviewed, the squared Hellinger distance and the quadratic Bregman divergence.

3.1. Squared Hellinger Distance

For simplicity the present section is restricted to the case that the sample space Ω is the real line.
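Given two probability measures $\mu$ and $\sigma$, both absolutely continuous w.r.t. the Lebesgue measure, the squared Hellinger distance is the divergence $D_2(\sigma\|\mu)$ defined by
$$D_2(\sigma\|\mu) = \frac{1}{2} \int_{\mathbb{R}} \left( \sqrt{\frac{\mathrm{d}\sigma}{\mathrm{d}x}} - \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}x}} \right)^2 \mathrm{d}x.$$
It satisfies
$$D_2(\sigma\|\mu) = 1 - \int_{\mathbb{R}} \sqrt{\frac{\mathrm{d}\sigma}{\mathrm{d}x}\, \frac{\mathrm{d}\mu}{\mathrm{d}x}}\; \mathrm{d}x.$$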
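The two expressions for the squared Hellinger distance can be compared numerically. The sketch below uses a discrete analogue, with sums over probability mass functions in place of integrals over densities; this replacement, the example distributions, and the function names are assumptions of the illustration.

```python
import math

def hellinger_sq(sigma, mu):
    """Discrete analogue of the definition: 1/2 * sum_k (sqrt(sigma_k) - sqrt(mu_k))^2."""
    return 0.5 * sum((math.sqrt(s) - math.sqrt(m)) ** 2 for s, m in zip(sigma, mu))

def hellinger_sq_alt(sigma, mu):
    """Equivalent form 1 - sum_k sqrt(sigma_k * mu_k), valid for normalized distributions."""
    return 1.0 - sum(math.sqrt(s * m) for s, m in zip(sigma, mu))

# Hypothetical probability mass functions on three points
sigma = [0.5, 0.3, 0.2]
mu = [0.2, 0.2, 0.6]

d_def = hellinger_sq(sigma, mu)
d_alt = hellinger_sq_alt(sigma, mu)
```

The second form relies on both arguments being normalized; for unnormalized weights the two functions would differ.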
Let $(O_i)_i$ be a partition of $\Omega, \mu$ and let $g(x) = i$ when $x$ belongs to $O_i$, as before. Let $p_i$ and $\mu_i$ be defined by (2). Consider the following functions of $i$, with $i$ in $\{1, \dots, n\}$,
$$\tau^{(1)}(i) = \mu, \ \text{independent of } i, \qquad \tau^{(2)}(i) = \mu_i, \qquad \tau^{(3)}(i) = \sigma_i,$$
where each of the $\sigma_i$ is a probability distribution with support in $O_i$. The empirical expectation of a function $f(i)$ is given by $\mathbb{E}_{\mathrm{emp}} f = \sum_i p_i^{\mathrm{emp}} f(i)$.
Proposition 1.
If $p_i^{\mathrm{emp}} > 0$ for all $i$ and $\sum_i p_i^{\mathrm{emp}} = 1$ then one has
$$\mathbb{E}_{\mathrm{emp}} D_2(\tau^{(1)}\|\tau^{(3)}) \ge \mathbb{E}_{\mathrm{emp}} D_2(\tau^{(1)}\|\tau^{(2)})$$
with equality if and only if $\sigma_i = \mu_i$ for all $i$.
First prove the following two lemmas.
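Lemma 1.
Assume that the probability measure $\nu_i$ is absolutely continuous w.r.t. the measure $\mu_i$, with Radon-Nikodym derivative given by $\mathrm{d}\nu_i(x) = f_i(x)\, \mathrm{d}\mu_i(x)$. Then one has
$$D_2(\mu\|\sigma_i) - D_2(\mu\|\nu_i) = \sqrt{p_i}\, \left[ D_2(\mu_i\|\sigma_i) - D_2(\mu_i\|\nu_i) \right]$$
and
$$D_2(\mu_i\|\nu_i) = 1 - \int_{O_i} \sqrt{f_i(x)}\; \mathrm{d}\mu_i(x).$$
Proof. 
One calculates
$$\begin{aligned}
D_2(\mu\|\sigma_i) - D_2(\mu\|\nu_i)
&= \int_{\mathbb{R}} \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}x}} \left[ \sqrt{\frac{\mathrm{d}\nu_i}{\mathrm{d}x}} - \sqrt{\frac{\mathrm{d}\sigma_i}{\mathrm{d}x}} \right] \mathrm{d}x \\
&= \sqrt{p_i} \int_{O_i} \sqrt{\frac{\mathrm{d}\mu_i}{\mathrm{d}x}} \left[ \sqrt{\frac{\mathrm{d}\nu_i}{\mathrm{d}x}} - \sqrt{\frac{\mathrm{d}\sigma_i}{\mathrm{d}x}} \right] \mathrm{d}x \\
&= \sqrt{p_i} \left[ \int_{O_i} \sqrt{f_i(x)}\; \mathrm{d}\mu_i(x) - \int_{O_i} \left( \frac{\mathrm{d}\mu_i}{\mathrm{d}x}\, \frac{\mathrm{d}\sigma_i}{\mathrm{d}x} \right)^{1/2} \mathrm{d}x \right] \\
&= \sqrt{p_i} \left[ \int_{O_i} \sqrt{f_i(x)}\; \mathrm{d}\mu_i(x) - 1 + D_2(\mu_i\|\sigma_i) \right].
\end{aligned}$$
Taking $\sigma_i = \nu_i$ makes the left-hand side vanish, which gives the second identity. Substituting it back into the last line yields the first. □
Lemma 2.
(Pythagorean relation) For any $i$ one has
$$D_2(\mu\|\sigma_i) = D_2(\mu\|\mu_i) + \sqrt{p_i}\, D_2(\mu_i\|\sigma_i).$$
Proof. 
The proof follows by taking $\nu_i = \mu_i$ in the previous lemma. □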
Proof. 
(Proposition 1)
From the previous lemma it follows that $D_2(\tau^{(1)}\|\tau^{(3)}) \ge D_2(\tau^{(1)}\|\tau^{(2)})$. Note that $\sigma_i = \mu_i$ implies that $\tau^{(3)} = \tau^{(2)}$ and hence $D_2(\tau^{(1)}\|\tau^{(3)}) = D_2(\tau^{(1)}\|\tau^{(2)})$. Conversely, if
$$\mathbb{E}_{\mathrm{emp}} D_2(\tau^{(1)}\|\tau^{(3)}) = \mathbb{E}_{\mathrm{emp}} D_2(\tau^{(1)}\|\tau^{(2)})$$
then it follows from the previous lemma that $\mathbb{E}_{\mathrm{emp}} D_2(\tau^{(2)}\|\tau^{(3)}) = 0$. If in addition $p_i^{\mathrm{emp}} > 0$ for all $i$ then it follows that for all $i$
$$0 = D_2\left( \tau^{(2)}(i)\, \|\, \tau^{(3)}(i) \right).$$
Because the squared Hellinger distance is a divergence, this implies that $\tau^{(2)}(i) = \tau^{(3)}(i)$, which is equivalent to $\mu_i = \sigma_i$. □
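The Pythagorean relation of Lemma 2 can be checked numerically. The following Python sketch does so in a discrete setting, with sums replacing the integrals; the discrete setting, the prior `mu`, the cell, and the competitor `sigma_i` are assumptions chosen only for this check.

```python
import math

def hellinger_sq(a, b):
    """Squared Hellinger distance in the form 1 - sum_k sqrt(a_k * b_k)."""
    return 1.0 - sum(math.sqrt(x * y) for x, y in zip(a, b))

mu = [0.1, 0.2, 0.3, 0.4]        # hypothetical prior on {0, 1, 2, 3}
cell = [0, 1]                    # one cell O_i of the partition
p_i = sum(mu[k] for k in cell)   # prior probability of the cell

# Conditional prior mu_i: mu restricted to the cell and renormalized
mu_i = [mu[k] / p_i if k in cell else 0.0 for k in range(len(mu))]

# Arbitrary probability distribution with support in the cell
sigma_i = [0.7, 0.3, 0.0, 0.0]

lhs = hellinger_sq(mu, sigma_i)
rhs = hellinger_sq(mu, mu_i) + math.sqrt(p_i) * hellinger_sq(mu_i, sigma_i)
```

Both sides agree up to rounding, and the second term on the right vanishes only when `sigma_i` equals `mu_i`.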

3.2. Bregman Divergence

In the present section the squared Hellinger distance, which is an f-divergence, is replaced by a divergence of the Bregman type. In addition let $\Omega$ be a finite set equipped with the counting measure $\rho$. It assigns to each subset $A$ of $\Omega$ the number of elements in $A$. This number is denoted $|A|$. The expectation value $\mathbb{E}_\mu f$ of a random variable $f$ w.r.t. the probability measure $\mu$ is given by
$$\mathbb{E}_\mu f = \sum_{k \in \Omega} \mu(k) f(k).$$
Given a partition of $\Omega$ into sets $O_i$ one can define conditional probability measures with probability mass function $\rho_i$ given by
$$\rho_i(k) = \begin{cases} \dfrac{1}{|O_i|} & \text{if } k \in O_i, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$
Similarly, conditional probability measures with probability mass function $\mu_i$ are given by
$$\mu_i(k) = \begin{cases} \dfrac{\mu(k)}{\mu(O_i)} & \text{if } k \in O_i, \\ 0 & \text{otherwise.} \end{cases}$$
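Fix a strictly convex function $\phi : \mathbb{R} \to \mathbb{R}$. The Bregman divergence of the probability measures $\sigma$ and $\mu$ is defined by
$$D_\phi(\sigma\|\mu) = F(\sigma) - F(\mu) - \langle \nabla F(\mu), \sigma - \mu \rangle$$
with
$$F(\sigma) = \sum_k \phi\left( \sigma(k) \right) \quad \text{and} \quad \nabla_k F(\sigma) = \phi'\left( \sigma(k) \right).$$
In the case that $\phi(x) = x^2/2$, which is used below, it becomes
$$D_\phi(\sigma\|\mu) = \frac{1}{2} \sum_k \left( \sigma(k) - \mu(k) \right)^2. \tag{5}$$
For convenience, this case is referred to as the quadratic Bregman divergence.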
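The reduction of the general Bregman divergence to the quadratic form (5) can be verified numerically. The sketch below assumes that the gradient in the definition is evaluated at the second argument, which is the standard convention; the example distributions and function names are hypothetical.

```python
def bregman(sigma, mu, phi, dphi):
    """D_phi(sigma||mu) = F(sigma) - F(mu) - <grad F(mu), sigma - mu>, with F(s) = sum_k phi(s_k)."""
    F_sigma = sum(phi(s) for s in sigma)
    F_mu = sum(phi(m) for m in mu)
    inner = sum(dphi(m) * (s - m) for s, m in zip(sigma, mu))
    return F_sigma - F_mu - inner

def quadratic_bregman(sigma, mu):
    """Closed form for phi(x) = x^2 / 2: half the squared Euclidean distance."""
    return 0.5 * sum((s - m) ** 2 for s, m in zip(sigma, mu))

# Hypothetical probability mass functions on three points
sigma = [0.5, 0.3, 0.2]
mu = [0.2, 0.2, 0.6]

d_general = bregman(sigma, mu, phi=lambda x: 0.5 * x * x, dphi=lambda x: x)
d_quadratic = quadratic_bregman(sigma, mu)
```

For this choice of `phi` the divergence is symmetric in its arguments, which is not true for Bregman divergences in general.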
The following result, obtained with the quadratic Bregman divergence, is more elegant than the result of Lemma 2.
Proposition 2.
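Consider the quadratic Bregman divergence $D_\phi$ as given by (5). Let $\nu_i = p_i \mu_i + (1 - p_i) \rho_i$. Let $\sigma_i$ be any probability measure with support in $O_i$. Then the following Pythagorean relation holds.
$$D_\phi(\mu\|\sigma_i) = D_\phi(\mu\|\nu_i) + D_\phi(\nu_i\|\sigma_i).$$
Proof. 
One calculates
$$\begin{aligned}
D_\phi(\mu\|\sigma_i) - D_\phi(\mu\|\nu_i)
&= D_\phi(\nu_i\|\sigma_i) + \sum_x \left[ \mu(x) - \nu_i(x) \right] \left[ \phi'\left( \nu_i(x) \right) - \phi'\left( \sigma_i(x) \right) \right] \\
&= D_\phi(\nu_i\|\sigma_i) + \sum_{x \in O_i} \left[ p_i \mu_i(x) - \nu_i(x) \right] \left[ \phi'\left( \nu_i(x) \right) - \phi'\left( \sigma_i(x) \right) \right] \\
&= D_\phi(\nu_i\|\sigma_i) - (1 - p_i)\, \frac{1}{|O_i|} \sum_{x \in O_i} \left[ \phi'\left( \nu_i(x) \right) - \phi'\left( \sigma_i(x) \right) \right].
\end{aligned}$$
Use now that $\phi'(u) = u$ and the normalization of the probability measures $\nu_i$ and $\sigma_i$ to find the desired result. □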
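A short numerical check of Proposition 2 is sketched below in Python. The prior `mu`, the cell, and the competitor `sigma_i` are hypothetical values chosen only to exercise the identity.

```python
def quad_bregman(a, b):
    """Quadratic Bregman divergence 1/2 * sum_k (a_k - b_k)^2."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))

mu = [0.1, 0.2, 0.3, 0.4]        # hypothetical prior on {0, 1, 2, 3}
cell = [0, 1]                    # the cell O_i
size = len(mu)
p_i = sum(mu[k] for k in cell)   # prior probability of the cell

# Conditional prior mu_i and uniform distribution rho_i on the cell
mu_i = [mu[k] / p_i if k in cell else 0.0 for k in range(size)]
rho_i = [1.0 / len(cell) if k in cell else 0.0 for k in range(size)]

# nu_i = p_i mu_i + (1 - p_i) rho_i, as in Proposition 2
nu_i = [p_i * m + (1.0 - p_i) * r for m, r in zip(mu_i, rho_i)]

# Arbitrary probability distribution with support in the cell
sigma_i = [0.7, 0.3, 0.0, 0.0]

lhs = quad_bregman(mu, sigma_i)
rhs = quad_bregman(mu, nu_i) + quad_bregman(nu_i, sigma_i)
```

The cross term vanishes because the difference between `mu` and `nu_i` is constant on the cell while `nu_i` and `sigma_i` carry the same total mass there.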

4. The Optimal Choice

4.1. Updated Probabilities

The following result proves that the standard Kolmogorovian definition of the conditional probability minimizes the Hellinger distance between the prior probability measure $\mu$ and the updated probability measure $\nu$. The optimal choice of the updated probability measure $\nu$ is given by corresponding probabilities $q(B)$. They satisfy
$$q(B) = \sum_{i=1}^n p_i^{\mathrm{emp}}\, p(B|O_i) \quad \text{for any event } B.$$
Theorem 1.
Let $(O_i)_{i=1}^n$ be a partition of the probability space $\Omega, \mu$ with $\Omega = \mathbb{R}$. Let $\mu_i$ be given by (2). Let $p_i = p(O_i) > 0$ denote the probability of the event $O_i$ and let strictly positive empirical probabilities $p_i^{\mathrm{emp}}$, $i = 1, \dots, n$, be given. The squared Hellinger distance $D_2(\sigma\|\mu)$ as a function of $\sigma$ is minimal if and only if $\sigma_i = \mu_i$ for all $i$. Here, $\sigma$ is any probability measure on $\Omega$ satisfying
$$\sigma = \sum_{i=1}^n p_i^{\mathrm{emp}} \sigma_i,$$
and each of the $\sigma_i$ is a probability measure with support in $O_i$ and absolutely continuous w.r.t. $\mu_i$.
Note that the probability measure $\nu$ given by
$$\nu(x) = \sum_{i=1}^n p_i^{\mathrm{emp}} \mu_i(x)$$
uses the Kolmogorovian conditional probability as the predictor because the probabilities determined by the $\mu_i$ are obtained from the prior probability distribution $\mu$ by $p_i(x) = p(x|O_i)$. By the above theorem this predictor is the optimal one w.r.t. the squared Hellinger distance.
Proof. 
With the notations of the previous section one has
$$D_2(\sigma\|\mu) = \mathbb{E}_{\mathrm{emp}} D_2(\tau^{(1)}\|\tau^{(3)}).$$
Proposition 1 shows that it is minimal if and only if $\sigma_i = \mu_i$ for all $i$. □
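The statement of Theorem 1 can be explored numerically. The Python sketch below works on a finite sample space, which is an assumption of the illustration since the theorem is stated for the real line. It builds the update by rescaling the prior inside each cell and compares its squared Hellinger distance with that of randomly drawn competitors having the same cell weights. All names and numbers are hypothetical.

```python
import math
import random

def hellinger_sq(a, b):
    """Squared Hellinger distance in the form 1 - sum_k sqrt(a_k * b_k)."""
    return 1.0 - sum(math.sqrt(x * y) for x, y in zip(a, b))

def hellinger_update(mu, partition, p_emp):
    """Rescale the prior inside each cell so that the cell receives weight p_emp_i."""
    nu = [0.0] * len(mu)
    for cell, pe in zip(partition, p_emp):
        p_i = sum(mu[k] for k in cell)   # prior weight of the cell, assumed > 0
        for k in cell:
            nu[k] = pe * mu[k] / p_i
    return nu

def random_competitor(partition, p_emp, size, rng):
    """Random distribution with the prescribed cell weights and arbitrary shape inside each cell."""
    sigma = [0.0] * size
    for cell, pe in zip(partition, p_emp):
        weights = [rng.random() for _ in cell]
        total = sum(weights)
        for k, w in zip(cell, weights):
            sigma[k] = pe * w / total
    return sigma

mu = [0.1, 0.2, 0.3, 0.4]        # hypothetical prior
partition = [[0, 1], [2, 3]]
p_emp = [0.5, 0.5]               # hypothetical empirical cell probabilities

nu = hellinger_update(mu, partition, p_emp)
d_nu = hellinger_sq(nu, mu)

rng = random.Random(0)           # fixed seed for reproducibility
best_random = min(
    hellinger_sq(random_competitor(partition, p_emp, len(mu), rng), mu)
    for _ in range(1000)
)
```

A random search of this kind cannot prove optimality; it only illustrates that no sampled competitor comes closer to the prior than the rescaled update.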
Next, consider the use of the quadratic Bregman divergence in the context of a finite probability space.
Theorem 2.
Let $(O_i)_{i=1}^n$ be a partition of the finite probability space $\Omega, \mu$. Let $\rho_i$ be the counting measure on $O_i$ defined by (3). Let $\mu_i$ be given by (2). Let $p_i = p(O_i) > 0$ denote the probability of the event $O_i$ and let strictly positive empirical probabilities $p_i^{\mathrm{emp}}$, $i = 1, \dots, n$, summing up to 1, be given. Assume that
$$p_i^{\mathrm{emp}} \ge p_i \left[ 1 - |O_i|\, \mu_i(x) \right] \quad \text{for all } x \in O_i \text{ and for } i = 1, \dots, n. \tag{6}$$
Then the following hold.
(a)
A probability distribution $\nu$ is defined by $\nu = \sum_i p_i^{\mathrm{emp}} \nu_i$ with
$$\nu_i = \left( 1 - \frac{p_i}{p_i^{\mathrm{emp}}} \right) \rho_i + \frac{p_i}{p_i^{\mathrm{emp}}}\, \mu_i. \tag{7}$$
(b)
Let $\sigma$ be any probability measure on $\Omega$ satisfying $\sigma = \sum_{i=1}^n p_i^{\mathrm{emp}} \sigma_i$, where each of the $\sigma_i$ is a probability distribution with support in $O_i$. Then the quadratic Bregman divergence satisfies the Pythagorean relation
$$D_\phi(\sigma\|\mu) = D_\phi(\nu\|\mu) + \sum_{i=1}^n (p_i^{\mathrm{emp}})^2\, D_\phi(\sigma_i\|\nu_i). \tag{8}$$
(c)
The quadratic Bregman divergence $D_\phi(\sigma\|\mu)$ is minimal if and only if $\sigma = \nu$.
Proof. 
(a)
The assumption (6) guarantees that the $\nu_i(x)$ are probabilities.
(b)
One calculates
$$\begin{aligned}
D_\phi(\sigma\|\mu) - D_\phi(\nu\|\mu)
&= \frac{1}{2} \sum_x \left[ \sigma(x) - \nu(x) \right] \left[ \sigma(x) + \nu(x) - 2\mu(x) \right] \\
&= \sum_{i=1}^n p_i^{\mathrm{emp}}\, \frac{1}{2} \sum_{x \in O_i} \left[ \sigma_i(x) - \nu_i(x) \right] \times \left[ p_i^{\mathrm{emp}} \sigma_i(x) + p_i^{\mathrm{emp}} \nu_i(x) - 2 p_i \mu_i(x) \right] \\
&= \sum_{i=1}^n (p_i^{\mathrm{emp}})^2\, \frac{1}{2} \sum_{x \in O_i} \left[ \sigma_i(x) - \nu_i(x) \right]^2 + \sum_{i=1}^n p_i^{\mathrm{emp}} \sum_{x \in O_i} \left[ \sigma_i(x) - \nu_i(x) \right] (p_i^{\mathrm{emp}} - p_i)\, \rho_i(x) \\
&= \sum_{i=1}^n (p_i^{\mathrm{emp}})^2\, D_\phi(\sigma_i\|\nu_i).
\end{aligned}$$
In the above calculation the third line is obtained by eliminating $p_i \mu_i$ using the definition of $\nu_i$. This gives
$$\begin{aligned}
p_i^{\mathrm{emp}} \sigma_i(x) + p_i^{\mathrm{emp}} \nu_i(x) - 2 p_i \mu_i(x)
&= p_i^{\mathrm{emp}} \sigma_i(x) + p_i^{\mathrm{emp}} \nu_i(x) - 2 p_i^{\mathrm{emp}} \left[ \nu_i(x) - \left( 1 - \frac{p_i}{p_i^{\mathrm{emp}}} \right) \rho_i(x) \right] \\
&= p_i^{\mathrm{emp}} \left[ \sigma_i(x) - \nu_i(x) \right] + 2 (p_i^{\mathrm{emp}} - p_i)\, \rho_i(x).
\end{aligned}$$
The term
$$\sum_{i=1}^n p_i^{\mathrm{emp}} \sum_{x \in O_i} \left[ \sigma_i(x) - \nu_i(x) \right] (p_i^{\mathrm{emp}} - p_i)\, \rho_i(x)$$
vanishes because $\rho_i(x)$ is constant on the set $O_i$ and the probability measures $\nu_i$ and $\sigma_i$ have support in $O_i$.
(c)
From (b) it follows that $D_\phi(\sigma\|\mu) \ge D_\phi(\nu\|\mu)$, with equality when $\sigma = \nu$.
Conversely, when $D_\phi(\sigma\|\mu) = D_\phi(\nu\|\mu)$ then (8) implies that
$$\sum_{i=1}^n (p_i^{\mathrm{emp}})^2\, D_\phi(\sigma_i\|\nu_i) = 0.$$
The empirical probabilities are strictly positive by assumption. Hence, it follows that $D_\phi(\sigma_i\|\nu_i) = 0$ for all $i$ and hence that $\sigma_i = \nu_i$ for all $i$. The latter implies $\sigma = \nu$. □
The optimal update $\nu$ can be written as
$$\nu = \sum_i \left[ (p_i^{\mathrm{emp}} - p_i)\, \rho_i + p_i \mu_i \right] = \mu + \sum_i (p_i^{\mathrm{emp}} - p_i)\, \rho_i.$$
This result is in general quite different from the update proposed by Theorem 1, which is
$$\nu = \sum_i p_i^{\mathrm{emp}} \mu_i.$$
The updates proposed by the two theorems coincide only in the special cases that either $p_i^{\mathrm{emp}} = p_i$ for all $i$ or that $\mu_i = \rho_i$ for all $i$. In the latter case the prior distribution $\mu = \sum_i p_i \rho_i$ is replaced by the update $\nu = \sum_i p_i^{\mathrm{emp}} \rho_i$.
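The additive form of the update of Theorem 2 and the Pythagorean relation (8) can be checked with a few lines of Python. The sketch below is only an illustration: the prior, the partition, the empirical probabilities, and the competitor `sigma` are hypothetical, and the multiplicative update of Theorem 1 is included for comparison.

```python
def quad_bregman(a, b):
    """Quadratic Bregman divergence 1/2 * sum_k (a_k - b_k)^2."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))

def bregman_update(mu, partition, p_emp):
    """Additive update: spread the difference p_emp_i - p_i uniformly over the cell O_i."""
    nu = list(mu)
    for cell, pe in zip(partition, p_emp):
        p_i = sum(mu[k] for k in cell)
        shift = (pe - p_i) / len(cell)
        for k in cell:
            nu[k] = mu[k] + shift
    if min(nu) < 0.0:
        raise ValueError("condition (6) is violated: the result is not a probability distribution")
    return nu

def hellinger_update(mu, partition, p_emp):
    """Multiplicative update of Theorem 1, shown for comparison."""
    nu = [0.0] * len(mu)
    for cell, pe in zip(partition, p_emp):
        p_i = sum(mu[k] for k in cell)
        for k in cell:
            nu[k] = pe * mu[k] / p_i
    return nu

def restrict(dist, cell, weight):
    """Conditional distribution on a cell: restrict to the cell and divide by the cell weight."""
    return [dist[k] / weight for k in cell]

mu = [0.1, 0.2, 0.3, 0.4]            # hypothetical prior
partition = [[0, 1], [2, 3]]
p_emp = [0.5, 0.5]                   # hypothetical empirical cell probabilities

nu = bregman_update(mu, partition, p_emp)
nu_hellinger = hellinger_update(mu, partition, p_emp)

# Arbitrary competitor with the same cell weights as the empirical data
sigma = [0.4, 0.1, 0.25, 0.25]

lhs = quad_bregman(sigma, mu)
rhs = quad_bregman(nu, mu) + sum(
    pe ** 2 * quad_bregman(restrict(sigma, cell, pe), restrict(nu, cell, pe))
    for cell, pe in zip(partition, p_emp)
)
```

In this example the additive update shifts every point of a cell by the same amount, whereas the multiplicative update preserves the ratios of the prior probabilities inside each cell, so the two results differ.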
The entropy of the update when event $O_i$ is observed, according to Theorem 1, equals $S(\nu_i) = S(\mu_i)$. According to Theorem 2 it equals
$$S(\nu_i) = S\left( \left( 1 - \frac{p_i}{p_i^{\mathrm{emp}}} \right) \rho_i + \frac{p_i}{p_i^{\mathrm{emp}}}\, \mu_i \right).$$
If $p_i \le p_i^{\mathrm{emp}}$ then it follows that
$$S(\nu_i) \ge \left( 1 - \frac{p_i}{p_i^{\mathrm{emp}}} \right) S(\rho_i) + \frac{p_i}{p_i^{\mathrm{emp}}}\, S(\mu_i) \ge S(\mu_i).$$
The former inequality follows because the entropy is a concave function. The latter follows because entropy is maximal for the uniform distribution $\rho_i$. On the other hand, if $p_i > p_i^{\mathrm{emp}}$ then one has
$$S(\mu_i) = S\left( \left( 1 - \frac{p_i^{\mathrm{emp}}}{p_i} \right) \rho_i + \frac{p_i^{\mathrm{emp}}}{p_i}\, \nu_i \right) \ge \left( 1 - \frac{p_i^{\mathrm{emp}}}{p_i} \right) S(\rho_i) + \frac{p_i^{\mathrm{emp}}}{p_i}\, S(\nu_i) \ge S(\nu_i).$$
In the latter case the decrease of the entropy is stronger than in the case of the update based on the squared Hellinger distance. In conclusion, the update relying on the quadratic Bregman divergence loses details of the prior distribution by making a convex combination with a uniform distribution weighted with the probabilities of the observation. It does this more so for the events with observed probability larger than predicted, that is, when $p_i^{\mathrm{emp}} > p_i$.
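The two entropy inequalities can be illustrated numerically. The Python sketch below evaluates the Shannon entropy of the conditional distributions on a single cell, once for a cell whose empirical probability exceeds the prior one and once for the opposite case. The cell contents and probabilities are hypothetical, and the natural logarithm is an arbitrary choice of units.

```python
import math

def entropy(dist):
    """Shannon entropy -sum_k p_k log p_k, with the convention 0 log 0 = 0."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def bregman_conditional(mu_i, p_i, p_emp_i):
    """nu_i = (1 - p_i / p_emp_i) rho_i + (p_i / p_emp_i) mu_i on a single cell."""
    ratio = p_i / p_emp_i
    uniform = 1.0 / len(mu_i)
    return [(1.0 - ratio) * uniform + ratio * m for m in mu_i]

# Cell with p_i <= p_emp_i: mixing with the uniform distribution raises the entropy
mu_1 = [1 / 3, 2 / 3]
nu_1 = bregman_conditional(mu_1, p_i=0.3, p_emp_i=0.5)

# Cell with p_i > p_emp_i: the entropy of the update is lower than that of the prior
mu_2 = [3 / 7, 4 / 7]
nu_2 = bregman_conditional(mu_2, p_i=0.7, p_emp_i=0.5)

s_mu_1, s_nu_1 = entropy(mu_1), entropy(nu_1)
s_mu_2, s_nu_2 = entropy(mu_2), entropy(nu_2)
```

With the update of Theorem 1 the conditional distributions are unchanged, so the corresponding entropies would simply equal `s_mu_1` and `s_mu_2`.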
Note that Theorem 2 cannot always be applied because it contains restrictions on the empirical probabilities. In particular, if the prior probability $\mu(x)$ of some point $x$ in $\Omega$ vanishes then the condition (6) requires that the empirical probability $p_i^{\mathrm{emp}}$ of the set $O_i$ to which the point $x$ belongs is larger than or equal to the prior probability $p_i$.

4.2. Update of Conditional Probabilities

The two previous theorems assume that no empirical information is available about conditional probabilities. If such information is present then an optimal choice should make use of it. In one case the solution of the problem is straightforward. If the probabilities $p_i^{\mathrm{emp}}$ are available together with all conditional probabilities $p^{\mathrm{emp}}(B|O_i)$ and there exists an update $\nu$ which reproduces these results then it is unique. Two cases remain: (1) the information about the conditional probabilities is incomplete; (2) the information is internally inconsistent, i.e., no update exists which reproduces the data.
Let us tackle the problem by considering the case that the only information that is available besides the probabilities $p_i^{\mathrm{emp}}$ is the vector of conditional probabilities $p^{\mathrm{emp}}(B|O_i)$ of a fixed event $B$, given the outcome of the measurement of the random variable $g$ as introduced in Section 2.
The following result is independent of the choice of divergence function.
Proposition 3.
Fix an event $B$ in $\Omega$. Assume that the conditional probabilities $p(B|O_i)$, $i = 1, \dots, n$, are strictly positive and strictly less than 1. Assume in addition that $p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) \le 1$ for all $i$. Then there exists an update $\nu$ with corresponding probabilities $q(\cdot)$ such that $q(O_i) = p_i^{\mathrm{emp}}$ and $q(B|O_i) = p^{\mathrm{emp}}(B|O_i)$, $i = 1, \dots, n$.
Proof. 
An obvious choice is to take $\nu$ of the form $\nu = \sum_i p_i^{\mathrm{emp}} \nu_i$ with $\nu_i$ of the form
$$\mathrm{d}\nu_i(x) = \left[ a_i\, \mathbb{I}_{B \cap O_i}(x) + b_i\, \mathbb{I}_{B^c \cap O_i}(x) \right] \mathrm{d}\mu(x),$$
with $a_i \ge 0$ and $b_i \ge 0$. Normalization of the $\nu_i$ gives the conditions
$$1 = a_i\, p(B \cap O_i) + b_i\, p(B^c \cap O_i). \tag{9}$$
Reproduction of the conditional probabilities gives the conditions
$$p^{\mathrm{emp}}(B|O_i) = \frac{q(B \cap O_i)}{q(O_i)} = \frac{a_i\, p(B \cap O_i)}{p_i^{\mathrm{emp}}}.$$
The latter gives
$$a_i = \frac{p_i^{\mathrm{emp}}}{p_i}\, \frac{p^{\mathrm{emp}}(B|O_i)}{p(B|O_i)}.$$
The normalization condition (9) becomes
$$1 = p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) + b_i\, p(B^c \cap O_i).$$
It has a positive solution for $b_i$ because $p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) \le 1$ and $p(B^c \cap O_i) > 0$. □

4.3. The Hellinger Case

The optimal updates can be derived easily from Theorem 1. Double the partition by introduction of the following sets
$$O_i^+ = B \cap O_i \quad \text{and} \quad O_i^- = B^c \cap O_i.$$
They have prior probabilities $p_i^\pm = p(O_i^\pm)$. Corresponding prior measures $\mu_i^\pm$ are defined by
$$\mathrm{d}\mu_i^\pm(x) = \frac{1}{p_i^\pm}\, \mathbb{I}_{O_i^\pm}(x)\, \mathrm{d}\mu(x).$$
The empirical probability of the set $O_i^+$ is taken equal to $p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i)$, that of $O_i^-$ equals $p_i^{\mathrm{emp}} [1 - p^{\mathrm{emp}}(B|O_i)]$. The optimal update $\nu$ follows from Theorem 1 and is given by
$$\mathrm{d}\nu(x) = \sum_i p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i)\, \mathrm{d}\mu_i^+(x) + \sum_i p_i^{\mathrm{emp}} \left[ 1 - p^{\mathrm{emp}}(B|O_i) \right] \mathrm{d}\mu_i^-(x).$$
By construction one has
$$q(O_i^+) = p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) \quad \text{and} \quad q(O_i^-) = p_i^{\mathrm{emp}} \left[ 1 - p^{\mathrm{emp}}(B|O_i) \right].$$
One now verifies that $q(O_i) = p_i^{\mathrm{emp}}$ and $q(B|O_i) = p^{\mathrm{emp}}(B|O_i)$, which is the intended result.
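The doubling of the partition can be sketched in Python as follows. The example works on a finite sample space, and the prior, the event `B`, and the empirical values are hypothetical. It assumes that both halves of every cell have non-vanishing prior probability.

```python
def refine_partition(partition, B, p_emp, p_emp_B):
    """Split each cell into its part inside B and its part outside B, with empirical weights."""
    cells, weights = [], []
    for cell, pe, pb in zip(partition, p_emp, p_emp_B):
        cells.append([k for k in cell if k in B])        # O_i^+
        weights.append(pe * pb)
        cells.append([k for k in cell if k not in B])    # O_i^-
        weights.append(pe * (1.0 - pb))
    return cells, weights

def hellinger_update(mu, cells, weights):
    """Rescale the prior inside each cell so that the cell receives the given weight."""
    nu = [0.0] * len(mu)
    for cell, w in zip(cells, weights):
        p_cell = sum(mu[k] for k in cell)   # assumed strictly positive
        for k in cell:
            nu[k] = w * mu[k] / p_cell
    return nu

mu = [0.05, 0.15, 0.10, 0.20, 0.30, 0.20]   # hypothetical prior on {0, ..., 5}
partition = [[0, 1, 2], [3, 4, 5]]
B = {1, 2, 4}
p_emp = [0.5, 0.5]                          # empirical probabilities of the cells
p_emp_B = [0.6, 0.3]                        # empirical conditional probabilities of B

cells, weights = refine_partition(partition, B, p_emp, p_emp_B)
nu = hellinger_update(mu, cells, weights)

# Probabilities of the original cells and conditional probabilities of B under the update
q_O = [sum(nu[k] for k in cell) for cell in partition]
q_B_given_O = [
    sum(nu[k] for k in cell if k in B) / q for cell, q in zip(partition, q_O)
]
```

The update reproduces both the empirical cell probabilities and the empirical conditional probabilities of `B`.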

4.4. The Bregman Case

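Next consider the optimization with the quadratic Bregman divergence. Probability distributions $\rho_i^\pm$ are defined by
$$\rho_i^\pm(x) = \frac{1}{|O_i^\pm|}\, \mathbb{I}_{O_i^\pm}(x).$$
Introduce the notations
$$r_i^+ = \frac{p_i^+}{p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i)}, \qquad r_i^- = \frac{p_i^-}{p_i^{\mathrm{emp}} \left[ 1 - p^{\mathrm{emp}}(B|O_i) \right]}, \qquad \nu_i^\pm(x) = (1 - r_i^\pm)\, \rho_i^\pm + r_i^\pm\, \mu_i^\pm(x).$$
Then the condition for Theorem 2 to hold is that $\nu_i^\pm(x) \ge 0$ for all $x, i$. The optimal probability distribution $\nu$ is given by
$$\begin{aligned}
\nu(x) &= \sum_i p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i)\, \nu_i^+(x) + \sum_i p_i^{\mathrm{emp}} \left[ 1 - p^{\mathrm{emp}}(B|O_i) \right] \nu_i^-(x) \\
&= \sum_i \left[ p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) - p_i^+ \right] \rho_i^+ + \sum_i p_i^+ \mu_i^+ + \sum_i \left[ p_i^{\mathrm{emp}} \left[ 1 - p^{\mathrm{emp}}(B|O_i) \right] - p_i^- \right] \rho_i^- + \sum_i p_i^- \mu_i^- \\
&= \sum_i p_i^{\mathrm{emp}}\, p^{\mathrm{emp}}(B|O_i) \left[ \rho_i^+ - \rho_i^- \right] - \sum_i p_i^+ \rho_i^+ + \sum_i \left[ p_i^{\mathrm{emp}} - p_i^- \right] \rho_i^- + \mu.
\end{aligned}$$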
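A corresponding Python sketch for the quadratic Bregman case is given below. It applies the additive update to the doubled partition and checks that the result is non-negative. The prior, the event `B`, and the empirical values are hypothetical, and the list of refined cells is written out by hand for this example.

```python
def bregman_update(mu, cells, weights):
    """Additive update: shift all points of a cell by (weight - prior weight) / cell size."""
    nu = list(mu)
    for cell, w in zip(cells, weights):
        p_cell = sum(mu[k] for k in cell)
        shift = (w - p_cell) / len(cell)
        for k in cell:
            nu[k] = mu[k] + shift
    if min(nu) < 0.0:
        raise ValueError("the non-negativity condition of Theorem 2 is violated")
    return nu

mu = [0.05, 0.15, 0.10, 0.20, 0.30, 0.20]   # hypothetical prior on {0, ..., 5}
partition = [[0, 1, 2], [3, 4, 5]]
B = {1, 2, 4}
p_emp = [0.5, 0.5]                          # empirical probabilities of the cells
p_emp_B = [0.6, 0.3]                        # empirical conditional probabilities of B

# Doubled partition O_1^+, O_1^-, O_2^+, O_2^- with the matching empirical weights
cells = [[1, 2], [0], [4], [3, 5]]
weights = [
    p_emp[0] * p_emp_B[0],
    p_emp[0] * (1.0 - p_emp_B[0]),
    p_emp[1] * p_emp_B[1],
    p_emp[1] * (1.0 - p_emp_B[1]),
]

nu = bregman_update(mu, cells, weights)

q_O = [sum(nu[k] for k in cell) for cell in partition]
q_B_given_O = [
    sum(nu[k] for k in cell if k in B) / q for cell, q in zip(partition, q_O)
]
```

The update again reproduces the empirical data, but it does so by adding a constant on each refined cell instead of rescaling the prior.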

5. Example

Assume that the prior probability distribution is binomial with parameters $n, \lambda$, where $n$ is known with certainty. The probability mass function is given by
$$\mu(k) = \mathrm{Prob}(X = k) = \binom{n}{k} \lambda^k (1 - \lambda)^{n-k}, \qquad k = 0, 1, 2, \dots, n.$$
The probability distribution and the value of the parameter λ are for instance the result of theoretical modeling of the experiment. Or they are obtained from a different kind of experiment.
The experiment under consideration yields accurate values for the probability $p^{\mathrm{emp}}$ of the two events $X = 1$ and $X = 2$. The problem at hand is to predict by extrapolation the probability of the event $X = k$ for other values of $k$. A fit of the data with a binomial distribution is likely to fail because two accurate data points are given to determine a single parameter $\lambda$. The binomial model can be misspecified.
The geometric approach followed in the present paper yields an update from the binomial distribution to another distribution, one which is reproducing the data. The update is conducted in an unbiased manner. Quite often one is tempted to replace the model, in the case of the binomial model, by a model with one extra free parameter.
Let us see what the results of minimizing divergence functions are. The probability space $\Omega$ is the set of integers $0, 1, 2, \dots, n$ equipped with the uniform measure. Choose events
$$O_1 = \{1\}, \qquad O_2 = \{2\}, \qquad O_3 = \Omega \setminus (O_1 \cup O_2).$$
This gives for $p_i := \mathrm{Prob}(X \in O_i)$
$$p_1 = \mu(1) = n \lambda (1 - \lambda)^{n-1}, \qquad p_2 = \mu(2) = \frac{1}{2}\, n (n - 1) \lambda^2 (1 - \lambda)^{n-2}, \qquad p_3 = 1 - p_1 - p_2.$$
The optimal update according to Theorem 1, minimizing the Hellinger distance, is given by the probabilities
$$\nu(B) = \sum_i p_i^{\mathrm{emp}}\, \mu(B|O_i).$$
In particular, the probability mass function $\nu(k) := \nu(\{k\})$ becomes
$$\nu(1) = p_1^{\mathrm{emp}}, \qquad \nu(2) = p_2^{\mathrm{emp}}, \qquad \nu(k) = \frac{p_3^{\mathrm{emp}}}{p_3}\, \mu(k) \quad \text{otherwise}.$$
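The optimal update according to Theorem 2, minimizing the quadratic Bregman divergence, is given by (7). The auxiliary measures $\mu_i$, $\rho_i$, and $\nu_i$ have probability mass functions given by
$$\mu_i(k) = \rho_i(k) = \nu_i(k) = \delta_{k,i} \quad \text{for } i = 1, 2,$$
and
$$\begin{aligned}
\mu_3(k) &= (1 - \delta_{k,1})(1 - \delta_{k,2})\, \frac{\mu(k)}{p_3}, \\
\rho_3(k) &= (1 - \delta_{k,1})(1 - \delta_{k,2})\, \frac{1}{n - 1}, \\
\nu_3(k) &= (1 - \delta_{k,1})(1 - \delta_{k,2}) \left[ \left( 1 - \frac{p_3}{p_3^{\mathrm{emp}}} \right) \frac{1}{n - 1} + \frac{\mu(k)}{p_3^{\mathrm{emp}}} \right].
\end{aligned}$$
The probability mass function $\nu(k) := \nu(\{k\})$ becomes
$$\nu(k) = p_1^{\mathrm{emp}} \nu_1(k) + p_2^{\mathrm{emp}} \nu_2(k) + p_3^{\mathrm{emp}} \nu_3(k) = \begin{cases} p_1^{\mathrm{emp}} & \text{if } k = 1, \\ p_2^{\mathrm{emp}} & \text{if } k = 2, \\ \dfrac{p_3^{\mathrm{emp}} - p_3}{n - 1} + \mu(k) & \text{otherwise.} \end{cases}$$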
The condition (6) is the requirement that all $\nu(k)$ are non-negative. Because the probabilities $\mu(k)$ can become very small this essentially means that $p_3^{\mathrm{emp}}$ should be larger than $p_3$. The amount of probability missing in the empirical probabilities $p_1^{\mathrm{emp}}$ and $p_2^{\mathrm{emp}}$ is equally distributed over the remaining $n - 1$ points of $\Omega$. On the other hand, when minimizing the Hellinger distance the excess or shortage of probability is compensated by multiplying all remaining probabilities by a constant factor.
A numerical comparison with $n = 20$ and $\lambda = 1/8$ is found in Figure 1. The empirical values are $p_1^{\mathrm{emp}} = 0.15$ and $p_2^{\mathrm{emp}} = 0.25$. The difference with the prior values $p_1 \approx 0.19774$ and $p_2 \approx 0.26836$ is made large enough to amplify the effects of the update.
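The numbers of this example can be recomputed with the short Python sketch below. It uses the parameter values quoted in the text and evaluates both updates on the points 0 to 20. The variable names are choices made for this illustration, and the count of 19 points in the third event follows from the sample space having 21 points.

```python
from math import comb

n, lam = 20, 1 / 8

# Binomial prior on k = 0, 1, ..., n
mu = [comb(n, k) * lam**k * (1 - lam) ** (n - k) for k in range(n + 1)]

p1, p2 = mu[1], mu[2]
p3 = 1.0 - p1 - p2

# Empirical probabilities quoted in the text
p1_emp, p2_emp = 0.15, 0.25
p3_emp = 1.0 - p1_emp - p2_emp

def fixed_or(k, value):
    """Return the empirical value at k = 1, 2 and the supplied value elsewhere."""
    if k == 1:
        return p1_emp
    if k == 2:
        return p2_emp
    return value

# Update minimizing the Hellinger distance: rescale the remaining probabilities
nu_hellinger = [fixed_or(k, p3_emp / p3 * mu[k]) for k in range(n + 1)]

# Update minimizing the quadratic Bregman divergence: add a constant
# to each of the remaining n - 1 points
shift = (p3_emp - p3) / (n - 1)
nu_bregman = [fixed_or(k, mu[k] + shift) for k in range(n + 1)]
```

Here the third event gains probability, so the additive update stays non-negative. For large `k` the prior values are tiny, and the additive update then assigns those points essentially the constant `shift`, while the multiplicative update keeps them tiny.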

6. Summary

It is well known that the use of unmodified prior conditional probabilities is the optimal way for updating a probability distribution after new data become available. The update procedure minimizes the Hellinger distance between prior and posterior probability distributions. For the sake of completeness a proof is given in Theorem 1.
Alternatively, one can minimize the quadratic Bregman divergence instead of the Hellinger distance. The result is given in Theorem 2. The conservation of probability is handled in a different way in the two cases, either by multiplying prior probabilities with a suitable factor or by adding an appropriate term.
The example of Section 5 shows that the two update procedures have different effects and that neither of them may be satisfactory. This raises the question whether the present approach should be improved by choosing divergences other than Hellinger or Bregman.
In the present research, the work of Banerjee, Guo, and Wang [11] was considered as well. They prove that minimization of the Hellinger distance can be replaced by minimization of a Bregman divergence, without modifying the outcome. It is shown in Theorem 2 that, in a different context, the use of the Bregman divergence yields results quite distinct from those obtained by minimizing the Hellinger distance.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Amari, S.; Nagaoka, H. Methods of Information Geometry; Originally published in Japanese by Iwanami Shoten, Tokyo, Japan, 1993; Oxford University Press: Oxford, UK, 2000.
  2. Amari, S. Information Geometry and Its Applications; Springer Nature: Tokyo, Japan, 2016.
  3. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information Geometry; Springer Nature: Basel, Switzerland, 2017.
  4. Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
  5. White, H. Maximum Likelihood Estimation of Misspecified Models. Econometrica 1982, 50, 1–25.
  6. Jeffrey, R. Alias Smith and Jones: The Testimony of the Senses. Erkenntnis 1987, 26, 391–399.
  7. Skyrms, B. The structure of Radical Probabilism. Erkenntnis 1997, 45, 285–297.
  8. Csiszár, I. Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. Ann. Stat. 1991, 19, 2032–2066.
  9. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158.
  10. Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy and robust Bayesian Decision Theory. Ann. Stat. 2004, 32, 1367–1433.
  11. Banerjee, A.; Guo, X.; Wang, H. On the Optimality of Conditional Expectation as a Bregman Predictor. IEEE Trans. Inf. Theory 2005, 51, 2664–2669.
  12. Frigyik, B.A.; Srivastava, S.; Gupta, M.R. Functional Bregman Divergences and Bayesian Estimation of Distributions. IEEE Trans. Inf. Theory 2008, 54, 5130–5139.
Figure 1. Probability as a function of the integer $k$ running from 0 to 20, showing different updates of the binomial distribution with parameters $n = 20$ and $\lambda = 1/8$. The squares represent the binomial, the diamonds the update with the Hellinger distance, and the triangles the update with the square Bregman divergence. The empirical values are $p_1^{\mathrm{emp}} = 0.15$ and $p_2^{\mathrm{emp}} = 0.25$.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Naudts, J. Update of Prior Probabilities by Minimal Divergence. Entropy 2021, 23, 1668. https://doi.org/10.3390/e23121668