Update of Prior Probabilities by Minimal Divergence


Jan Naudts

Departement Fysica, Universiteit Antwerpen, 2610 Antwerpen, Belgium

Entropy 2021, 23(12), 1668; https://doi.org/10.3390/e23121668

Submission received: 18 November 2021 / Revised: 7 December 2021 / Accepted: 9 December 2021 / Published: 11 December 2021

(This article belongs to the Special Issue MaxEnt 2020/2021—The 40th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering)


Abstract

The present paper investigates the update of an empirical probability distribution with the results of a new set of observations. The update reproduces the new observations and interpolates using prior information. The optimal update is obtained by minimizing either the Hellinger distance or the quadratic Bregman divergence. The results obtained by the two methods differ. Updates with information about conditional probabilities are considered as well.

Keywords:

statistical update procedure; minimal divergence; Hellinger distance; Bregman divergence; Jeffrey conditioning

1. Introduction

The present work is inspired by the current practices in Information Geometry [1,2,3] where minimization of divergences is an important tool. In Statistical Physics a divergence is called a relative entropy. Its importance was noted rather late in the twentieth century, after the work of Jaynes on the maximal entropy principle [4]. Estimation in the presence of hidden variables by minimizing a divergence function is briefly discussed in Chapter 8 of [2].

Assume now that some observation or experiment yields new statistical data. The approach is then to look for a probability distribution that reproduces the newly observed probabilities and that interpolates the data with missing information coming from a prior.

No further model assumptions are imposed. Hence, the statistical model under consideration consists of all probability distributions that are consistent with the newly obtained empirical data. Internal consistency of the empirical data ensures that the model is not empty. The update is the model point that minimizes the chosen divergence function from the prior to the manifold of the model.

In the context of Maximum Likelihood Estimation (MLE) one usually adopts a parameterized model. The dimension of the model can be kept low and properties of the model can be used to ease the calculations. One assumes that the new data can lead to a more accurate estimation of the limited number of model parameters. It can then happen that the model is misspecified [5] and that the update is only a good approximation of the empirical data.

Here, the model is dictated by the newly acquired empirical data and the update is forced to reproduce the measured data. Finding the probability distribution is then an underdetermined problem. Minimization of the divergence from the prior probability distribution solves the underdetermination.

In Bayesian statistics, the update

q (B)

of the probability

p (B)

of an event B equals

\begin{matrix} q (B) & = & p^{emp} (A) p (B | A) + p^{emp} (A^{c}) p (B | A^{c}) . \end{matrix}

(1)

The quantities

p^{emp} (A)

and

p^{emp} (A^{c})

are the empirical probabilities obtained after repeated measurement of event A and its complement

A^{c}

. Expression (1) has been called Jeffrey conditioning [6]. It implies the sufficiency conditions

q (B | A) = p (B | A)

and

q (B | A^{c}) = p (B | A^{c})

. It is an updating rule used in Radical Probabilism [7]. This expression is also obtained when minimizing the Hellinger distance between the prior and the model manifold. A proof of the latter follows later on in Section 4.
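Expression (1) can be illustrated on a finite sample space. The following is a minimal sketch in Python, not part of the original derivation; the four-point prior, the events A and B, and the helper names `jeffrey_update` and `prob` are hypothetical choices made for the illustration.

```python
from fractions import Fraction


def jeffrey_update(prior, event_a, p_emp_a):
    """Jeffrey conditioning on the partition {A, A^c} of a finite sample space.

    prior   : dict mapping outcome -> prior probability p(x)
    event_a : set of outcomes forming the event A
    p_emp_a : empirical probability of A
    """
    p_a = sum(p for x, p in prior.items() if x in event_a)
    p_ac = 1 - p_a
    posterior = {}
    for x, p in prior.items():
        if x in event_a:
            posterior[x] = p_emp_a * p / p_a          # p_emp(A) p(x | A)
        else:
            posterior[x] = (1 - p_emp_a) * p / p_ac   # p_emp(A^c) p(x | A^c)
    return posterior


def prob(dist, event):
    """Probability of an event under a distribution given as a dict."""
    return sum(p for x, p in dist.items() if x in event)


# Hypothetical prior on four outcomes, in exact rational arithmetic.
prior = {0: Fraction(1, 10), 1: Fraction(2, 10), 2: Fraction(3, 10), 3: Fraction(4, 10)}
A = {0, 1}
B = {1, 2}
q = jeffrey_update(prior, A, Fraction(1, 2))

q_B = prob(q, B)
q_B_given_A = prob(q, A & B) / prob(q, A)
p_B_given_A = prob(prior, A & B) / prob(prior, A)
```

In this example the update reproduces the empirical probability of A while leaving the conditional probability of B given A unchanged, which is the sufficiency condition mentioned above.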

The present approach is a special case of minimizing a divergence function in the presence of linear constraints. See the introduction of [8] for an overview of early applications of this technique. Two classes of generalized distance functions satisfy a natural set of axioms: the f-divergences of Csiszár and the generalized Bregman divergences. The squared Hellinger distance belongs to the former class. The other divergence function considered here is the quadratic Bregman divergence, which belongs to the latter class. Both the squared Hellinger distance and the quadratic Bregman divergence have special properties that make them easy to work with.

A broad class of generalized Bregman divergences satisfies the Pythagorean equality [8,9]. Pythagorean inequalities hold for an even larger class [10]. The Pythagorean relations derived in the present work make use of the specific properties of the Hellinger distance and of the quadratic Bregman divergence. It is unclear how to prove them for more general divergences.

One incentive for starting the present work is a paper of Banerjee, Guo, and Wang [11,12]. They consider the problem of predicting a random variable

Z_{1}

given observations of a random variable

Z_{2}

. It is well-known that the conditional expectation, as defined by Kolmogorov, is the optimal predictor. They show that this statement remains true when the metric distance is replaced by a Bregman divergence. It is shown in Theorem 2 below that a proof in a more general context yields a deviating result.

The next section fixes notation. Section 3 collects some results about the squared Hellinger distance and the quadratic Bregman divergence. Section 4 discusses the optimal choice and contains Theorems 1 and 2. The proofs of the theorems can be adapted to cover the situation in which a subsequent measurement also yields information on conditional probabilities. This is shown in Sections 4.2 to 4.4. Section 5 treats a simple example. A final section summarizes the results of the paper.

2. Empirical Data

Consider a probability space

Ω, μ

. A measurable subset A of

Ω

is called an event. Its probability is denoted

p (A)

and is given by

\begin{matrix} p (A) & = & \int_{Ω} I_{A} (x) d μ (x), \end{matrix}

where

I_{A} (x)

equals 1 when

x \in A

and 0 otherwise. The conditional expectation of a random variable f given an event A with non-vanishing probability

p (A)

is given by

\begin{matrix} E_{μ} f | A & = & \frac{1}{p (A)} E_{μ} f I_{A} . \end{matrix}

The probability space

Ω, μ

reflects the prior knowledge of the system at hand. When new data become available an update procedure is used to select the posterior probability space. The latter is denoted

Ω, ν

in what follows. The corresponding probability of an event A is denoted

q (A)

.

The outcome of repeated experiments is the empirical probability distribution of the events, denoted

p^{emp} (A)

. The question at hand is then to establish a criterion for finding the update

ν

of the probability distribution

μ

that is as close as possible to

μ

while reproducing the empirical results.

The event A defines a partition

A, A^{c}

of the probability space

Ω, μ

. As before

A^{c}

denotes the complement of A in

Ω

. In what follows a slightly more general situation is considered in which the event A is replaced by a partition

{(O_{i})}_{i = 1}^{n}

of the measure space

Ω, μ

into subsets with non-vanishing probability. The notations

p_{i}

and

μ_{i}

are used, with

\begin{matrix} p_{i} = p (O_{i}) \quad \text{and} \quad d μ_{i} (x) = \frac{1}{p_{i}} I_{O_{i}} (x) d μ (x) . \end{matrix}

(2)

Introduce the random variable g defined by

g (x) = i

when

x \in O_{i}

. Repeated measurement of the random variable g yields the empirical probabilities

\begin{matrix} p_{i}^{emp} & = & \text{Emp Prob} \{ x : g (x) = i \} . \end{matrix}

They may deviate from the prior probabilities

p_{i}

. In some cases one also measures the conditional probabilities

\begin{matrix} p^{emp} (B | O_{i}) & = & \text{Emp Prob of } B \text{ given that } g (x) = i \end{matrix}

of some other event B.

3. A Geometric Approach

In this section two divergences are reviewed, the squared Hellinger distance and the quadratic Bregman divergence.

3.1. Squared Hellinger Distance

For simplicity the present section is restricted to the case that the sample space

Ω

is the real line.

Given two probability measures

μ

and

σ

, both absolutely continuous w.r.t. the Lebesgue measure, the squared Hellinger distance is the divergence

D_{2} (σ | | μ)

defined by

\begin{matrix} D_{2} (σ | | μ) & = & \frac{1}{2} \int_{R} {(\sqrt{\frac{d σ}{d x}} - \sqrt{\frac{d μ}{d x}})}^{2} d x . \end{matrix}

It satisfies

\begin{matrix} D_{2} (σ | | μ) & = & 1 - \int_{R} \sqrt{\frac{d σ}{d x} \frac{d μ}{d x}} d x . \end{matrix}
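The identity can be checked numerically. The sketch below assumes a discrete analogue in which the integral over the real line is replaced by a sum over a finite set; the two probability vectors and the function names are hypothetical and serve only as an illustration.

```python
import math


def hellinger_sq(sigma, mu):
    """Squared Hellinger distance, discrete analogue of the defining integral."""
    return 0.5 * sum((math.sqrt(s) - math.sqrt(m)) ** 2 for s, m in zip(sigma, mu))


def affinity(sigma, mu):
    """Sum of sqrt(sigma * mu), the discrete analogue of the overlap integral."""
    return sum(math.sqrt(s * m) for s, m in zip(sigma, mu))


# Two hypothetical probability vectors on a four-point set.
sigma = [0.1, 0.2, 0.3, 0.4]
mu = [0.25, 0.25, 0.25, 0.25]

lhs = hellinger_sq(sigma, mu)      # defining expression
rhs = 1.0 - affinity(sigma, mu)    # equivalent expression
```

Both expressions agree up to rounding because each probability vector sums to one, which is the only property used when expanding the square.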

Let

{(O_{i})}_{i}

be a partition of

Ω, μ

and let

g (x) = i

when x belongs to

O_{i}

, as before. Let

p_{i}

and

μ_{i}

be defined by (2). Consider the following functions of i, with i in

{1, \dots, n}

,

\begin{matrix} τ^{(1)} (i) & = & μ, \quad \text{independent of } i, \\ τ^{(2)} (i) & = & μ_{i}, \\ τ^{(3)} (i) & = & σ_{i}, \end{matrix}

where each of the

σ_{i}

is a probability distribution with support in

O_{i}

. The empirical expectation of a function

f (i)

is given by

E^{emp} f = \sum_{i} p_{i}^{emp} f (i)

.

Proposition 1.

If

p_{i}^{emp} > 0

for all i and

\sum_{i} p_{i}^{emp} = 1

then one has

\begin{matrix} E^{emp} D_{2} (τ^{(1)} | | τ^{(3)}) \geq E^{emp} D_{2} (τ^{(1)} | | τ^{(2)}) \end{matrix}

with equality if and only if

σ_{i} = μ_{i}

for all i.

First prove the following two lemmas.

Lemma 1.

Assume that the probability measure

ν_{i}

is absolutely continuous w.r.t. the measure

μ_{i}

, with Radon-Nikodym derivative given by

d ν_{i} (x) = f_{i} (x) d μ_{i}

. Then one has

\begin{matrix} D_{2} (μ | | σ_{i}) - D_{2} (μ | | ν_{i}) & = & \sqrt{p_{i}} [D_{2} (μ_{i} | | σ_{i}) - D_{2} (μ_{i} | | ν_{i})] \end{matrix}

and

\begin{matrix} D_{2} (μ_{i} | | ν_{i}) & = & 1 - \int_{O_{i}} \sqrt{f_{i} (x)} d μ_{i} (x) . \end{matrix}

Proof.

One calculates

\begin{matrix} D_{2} (μ | | σ_{i}) - D_{2} (μ | | ν_{i}) & = & \int_{R} \sqrt{\frac{d μ}{d x}} [\sqrt{\frac{d ν_{i}}{d x}} - \sqrt{\frac{d σ_{i}}{d x}}] d x \\ & = & \sqrt{p_{i}} \int_{O_{i}} \sqrt{\frac{d μ_{i}}{d x}} [\sqrt{\frac{d ν_{i}}{d x}} - \sqrt{\frac{d σ_{i}}{d x}}] d x \\ & = & \sqrt{p_{i}} [\int_{O_{i}} \sqrt{f_{i} (x)} d μ_{i} (x) - \int_{O_{i}} {[\frac{d μ_{i}}{d x} \frac{d σ_{i}}{d x}]}^{1 / 2} d x] \\ & = & \sqrt{p_{i}} [\int_{O_{i}} \sqrt{f_{i} (x)} d μ_{i} (x) - 1 + D_{2} (μ_{i} | | σ_{i})] . \end{matrix}

Now take

σ_{i} = ν_{i}

to obtain the desired results. □

Lemma 2.

(Pythagorean relation) For any i one has

\begin{matrix} D_{2} (μ | | σ_{i}) & = & D_{2} (μ | | μ_{i}) + \sqrt{p_{i}} D_{2} (μ_{i} | | σ_{i}) . \end{matrix}

Proof.

The proof follows by taking

ν_{i} = μ_{i}

in the previous lemma. □
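The Pythagorean relation of Lemma 2 can be verified numerically. The following sketch assumes a finite sample space as a discrete stand-in for the real line; the prior, the cell O_i and the distribution σ_i are hypothetical example values.

```python
import math


def hellinger_sq(a, b):
    """Squared Hellinger distance between two probability vectors."""
    return 0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))


mu = [0.1, 0.3, 0.2, 0.4]          # hypothetical prior on Omega = {0, 1, 2, 3}
cell = [0, 1]                      # the cell O_i of the partition
p_i = sum(mu[k] for k in cell)     # prior probability of O_i

# Prior conditioned on O_i, extended by zero outside the cell.
mu_i = [mu[k] / p_i if k in cell else 0.0 for k in range(len(mu))]

# An arbitrary probability distribution with support in O_i.
sigma_i = [0.7, 0.3, 0.0, 0.0]

lhs = hellinger_sq(mu, sigma_i)
rhs = hellinger_sq(mu, mu_i) + math.sqrt(p_i) * hellinger_sq(mu_i, sigma_i)
```

In this discrete setting the first term on the right-hand side equals 1 - sqrt(p_i), so the relation splits the distance into a part that depends only on the cell and a part that measures how far σ_i is from the conditioned prior.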

Proof.

(Proposition 1)

From the previous lemma it follows that

D_{2} (τ^{(1)} | | τ^{(3)}) \geq D_{2} (τ^{(1)} | | τ^{(2)})

. Note that

σ_{i} = μ_{i}

implies that

τ^{(3)} = τ^{(2)}

and hence

D_{2} (τ^{(1)} | | τ^{(3)}) = D_{2} (τ^{(1)} | | τ^{(2)})

. Conversely, if

\begin{matrix} E^{emp} D_{2} (τ^{(1)} | | τ^{(3)}) & = & E^{emp} D_{2} (τ^{(1)} | | τ^{(2)}) \end{matrix}

then it follows from the previous lemma that

E^{emp} D_{2} (τ^{(2)} | | τ^{(3)}) = 0

. If in addition

p_{i}^{emp} > 0

for all i then it follows that for all i

\begin{matrix} 0 & = & D_{2} (τ^{(2)} (i) | | τ^{(3)} (i)) . \end{matrix}

Because the squared Hellinger distance is a divergence, this implies that

τ^{(2)} (i) = τ^{(3)} (i)

, which is equivalent with

μ_{i} = σ_{i}

. □

3.2. Bregman Divergence

In the present section the squared Hellinger distance, which is an f-divergence, is replaced by a divergence of the Bregman type. In addition let

Ω

be a finite set equipped with the counting measure

ρ

. It assigns to each subset A of

Ω

the number of elements in A. This number is denoted

| A |

. The expectation value

E_{μ} f

of a random variable f w.r.t. the probability measure

μ

is given by

\begin{matrix} E_{μ} f & = & \sum_{k \in Ω} μ (k) f (k) . \end{matrix}

Given a partition of

Ω

into sets

O_{i}

one can define conditional probability measures with probability mass function

ρ_{i}

given by

\begin{matrix} ρ_{i} (k) & = & \frac{1}{| O_{i} |} & \text{if } k \in O_{i}, \\ & = & 0 & \text{otherwise} . \end{matrix}

(3)

Similarly, conditional probability measures with probability mass function

μ_{i}

are given by

\begin{matrix} μ_{i} (k) & = & \frac{μ (k)}{μ (O_{i})} & \text{if } k \in O_{i}, \\ & = & 0 & \text{otherwise} . \end{matrix}

(4)

Fix a strictly convex function

ϕ : R \mapsto R

. The Bregman divergence of the probability measures

σ

and

μ

is defined by

\begin{matrix} D_{ϕ} (σ | | μ) & = & F (σ) - F (μ) - \langle \nabla F (μ), σ - μ \rangle \end{matrix}

with

\begin{matrix} F (σ) = \sum_{k} ϕ (σ (k)) \quad \text{and} \quad \nabla_{k} F (σ) = ϕ^{'} (σ (k)) . \end{matrix}

In the case that

ϕ (x) = x^{2} / 2

, which is used below, it becomes

\begin{matrix} D_{ϕ} (σ | | μ) & = & \frac{1}{2} \sum_{k} {[σ (k) - μ (k)]}^{2} . \end{matrix}

(5)

For convenience, this case is referred to as the quadratic Bregman divergence.
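As an illustration, the general definition and its quadratic special case (5) can be compared numerically. The sketch below assumes that the gradient in the definition is evaluated at the second argument μ, which is the usual convention; the function names and the two probability vectors are hypothetical.

```python
def bregman(sigma, mu, phi, dphi):
    """D_phi(sigma || mu) = F(sigma) - F(mu) - <grad F(mu), sigma - mu>.

    phi  : the strictly convex generating function
    dphi : its derivative
    """
    f_sigma = sum(phi(s) for s in sigma)
    f_mu = sum(phi(m) for m in mu)
    inner = sum(dphi(m) * (s - m) for s, m in zip(sigma, mu))
    return f_sigma - f_mu - inner


def quadratic_bregman(sigma, mu):
    """Closed form (5), valid for phi(x) = x^2 / 2."""
    return 0.5 * sum((s - m) ** 2 for s, m in zip(sigma, mu))


# Two hypothetical probability vectors.
sigma = [0.1, 0.2, 0.3, 0.4]
mu = [0.4, 0.3, 0.2, 0.1]

general = bregman(sigma, mu, phi=lambda x: x * x / 2, dphi=lambda x: x)
special = quadratic_bregman(sigma, mu)
```

For the quadratic choice the divergence is symmetric in its arguments and equals half the squared Euclidean distance, which is what makes the calculations below tractable.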

The following result, obtained with the quadratic Bregman divergence, is more elegant than the result of Lemma 2.

Proposition 2.

Consider the quadratic Bregman divergence

D_{ϕ}

as given by (5). Let

ν_{i} = p_{i} μ_{i} + (1 - p_{i}) ρ_{i}

. Let

σ_{i}

be any probability measure with support in

O_{i}

. Then the following Pythagorean relation holds.

\begin{matrix} D_{ϕ} (μ | | σ_{i}) & = & D_{ϕ} (μ | | ν_{i}) + D_{ϕ} (ν_{i} | | σ_{i}) . \end{matrix}

Proof.

One calculates

\begin{matrix} D_{ϕ} (μ | | σ_{i}) - D_{ϕ} (μ | | ν_{i}) & = & D_{ϕ} (ν_{i} | | σ_{i}) + \sum_{x} [μ (x) - ν_{i} (x)] [ϕ^{'} (ν_{i} (x)) - ϕ^{'} (σ_{i} (x))] \\ & = & D_{ϕ} (ν_{i} | | σ_{i}) + \sum_{x \in O_{i}} [p_{i} μ_{i} (x) - ν_{i} (x)] [ϕ^{'} (ν_{i} (x)) - ϕ^{'} (σ_{i} (x))] \\ & = & D_{ϕ} (ν_{i} | | σ_{i}) - (1 - p_{i}) \frac{1}{| O_{i} |} \sum_{x \in O_{i}} [ϕ^{'} (ν_{i} (x)) - ϕ^{'} (σ_{i} (x))] . \end{matrix}

Use now that

ϕ^{'} (u) = u

and the normalization of the probability measures

ν_{i}

and

σ_{i}

to find the desired result. □
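Proposition 2 lends itself to a quick numerical check. The sketch below uses a hypothetical four-point prior and a hypothetical σ_i; it is meant only to illustrate the statement and does not replace the proof.

```python
def quad_bregman(a, b):
    """Quadratic Bregman divergence (5): half the squared Euclidean distance."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))


mu = [0.1, 0.3, 0.2, 0.4]          # hypothetical prior on Omega = {0, 1, 2, 3}
cell = [0, 1]                      # the cell O_i
n_points = len(mu)
p_i = sum(mu[k] for k in cell)

# Conditioned prior mu_i and uniform distribution rho_i on the cell.
mu_i = [mu[k] / p_i if k in cell else 0.0 for k in range(n_points)]
rho_i = [1.0 / len(cell) if k in cell else 0.0 for k in range(n_points)]

# nu_i = p_i mu_i + (1 - p_i) rho_i, as in the proposition.
nu_i = [p_i * m + (1 - p_i) * r for m, r in zip(mu_i, rho_i)]

# An arbitrary probability distribution with support in O_i.
sigma_i = [0.7, 0.3, 0.0, 0.0]

lhs = quad_bregman(mu, sigma_i)
rhs = quad_bregman(mu, nu_i) + quad_bregman(nu_i, sigma_i)
```

Geometrically, ν_i plays the role of the orthogonal projection of μ onto the set of probability distributions supported in O_i, which is why the squared distances add up.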

4. The Optimal Choice

4.1. Updated Probabilities

The following result proves that the standard Kolmogorovian definition of the conditional probability minimizes the Hellinger distance between the prior probability measure

μ

and the updated probability measure

ν

. The optimal choice of the updated probability measure

ν

is given by corresponding probabilities

q (B)

. They satisfy

\begin{matrix} q (B) & = & \sum_{i = 1}^{n} p_{i}^{emp} p (B | O_{i}) \quad \text{for any event } B . \end{matrix}

Theorem 1.

Consider a partition

{(O_{i})}_{i = 1}^{n}

of the probability space

Ω, μ

with

Ω = R

. Let

μ_{i}

be given by (2). Let

p_{i} = p (O_{i}) > 0

denote the probability of the event

O_{i}

and consider strictly positive empirical probabilities

p_{i}^{emp}

,

i = 1, \dots, n

. The squared Hellinger distance

D_{2} (σ | | μ)

as a function of σ is minimal if and only if

σ_{i} = μ_{i}

for all i. Here, σ is any probability measure on Ω satisfying

\begin{matrix} σ = \sum_{i = 1}^{n} p_{i}^{emp} σ_{i}, \end{matrix}

and each of the

σ_{i}

is a probability measure with support in

O_{i}

and absolutely continuous w.r.t.

μ_{i}

.

Note that the probability measure

ν

given by

\begin{matrix} ν (x) & = & \sum_{i = 1}^{n} p_{i}^{emp} μ_{i} (x) \end{matrix}

uses the Kolmogorovian conditional probability as the predictor because the probabilities determined by the

μ_{i}

are obtained from the prior probability distribution

μ

by

p_{i} (x) = p (x | O_{i})

. By the above theorem this predictor is the optimal one w.r.t. the squared Hellinger distance.

Proof.

With the notations of the previous section one has

\begin{matrix} D_{2} (σ | | μ) & = & E^{emp} D_{2} (τ^{(1)} | | τ^{(3)}) . \end{matrix}

Proposition 1 shows that it is minimal if and only if

σ_{i} = μ_{i}

for all i. □
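Theorem 1 is stated for the real line. The sketch below assumes a finite sample space as a discrete analogue and compares the update built from the conditioned priors with randomly drawn competitors that reproduce the same empirical probabilities. The prior, the partition, the empirical probabilities and the helper names are hypothetical.

```python
import math
import random


def hellinger_sq(a, b):
    """Squared Hellinger distance between two probability vectors."""
    return 0.5 * sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(a, b))


def hellinger_update(mu, cells, p_emp):
    """nu = sum_i p_emp[i] * mu_i, with mu_i the prior conditioned on cell O_i."""
    nu = [0.0] * len(mu)
    for cell, pe in zip(cells, p_emp):
        p_i = sum(mu[k] for k in cell)
        for k in cell:
            nu[k] = pe * mu[k] / p_i
    return nu


def random_candidate(n_points, cells, p_emp, rng):
    """A random sigma = sum_i p_emp[i] * sigma_i with sigma_i supported on O_i."""
    sigma = [0.0] * n_points
    for cell, pe in zip(cells, p_emp):
        weights = [rng.random() + 1e-9 for _ in cell]
        total = sum(weights)
        for k, w in zip(cell, weights):
            sigma[k] = pe * w / total
    return sigma


mu = [0.1, 0.3, 0.2, 0.15, 0.25]   # hypothetical prior
cells = [[0, 1], [2, 3, 4]]        # partition O_1, O_2
p_emp = [0.7, 0.3]                 # empirical probabilities of the cells

nu = hellinger_update(mu, cells, p_emp)
best = hellinger_sq(nu, mu)

rng = random.Random(0)
smallest_gap = min(
    hellinger_sq(random_candidate(len(mu), cells, p_emp, rng), mu) - best
    for _ in range(2000)
)
```

A random search of this kind cannot prove optimality, but a non-negative `smallest_gap` is consistent with the statement that no admissible competitor lies closer to the prior than ν.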

Next, consider the use of the quadratic Bregman divergence in the context of a finite probability space.

Theorem 2.

Consider a partition

{(O_{i})}_{i = 1}^{n}

of the finite probability space

Ω, μ

. Let

ρ_{i}

be the counting measure on

O_{i}

defined by (3). Let

μ_{i}

be given by (2). Let

p_{i} = p (O_{i}) > 0

denote the probability of the event

O_{i}

and consider strictly positive empirical probabilities

p_{i}^{emp}

,

i = 1, \dots, n

summing up to 1. Assume that

\begin{matrix} p_{i}^{emp} \geq p_{i} [1 - | O_{i} | μ_{i} (x)] \quad \text{for all } x \in O_{i} \text{ and for } i = 1, \dots, n . \end{matrix}

(6)

Then the following hold.

(a): A probability distribution ν is defined by $ν = \sum_{i} p_{i}^{emp} ν_{i}$ with

$\begin{matrix} ν_{i} & = & (1 - \frac{p_{i}}{p_{i}^{emp}}) ρ_{i} + \frac{p_{i}}{p_{i}^{emp}} μ_{i} . \end{matrix}$

(7)

(b): Let σ be any probability measure on Ω satisfying $σ = \sum_{i = 1}^{n} p_{i}^{emp} σ_{i}$, where each of the $σ_{i}$ is a probability distribution with support in $O_{i}$. Then the quadratic Bregman divergence satisfies the Pythagorean relation

$\begin{matrix} D_{ϕ} (σ | | μ) & = & D_{ϕ} (ν | | μ) + \sum_{i = 1}^{n} {(p_{i}^{emp})}^{2} D_{ϕ} (σ_{i} | | ν_{i}) . \end{matrix}$

(8)

(c): The quadratic Bregman divergence $D_{ϕ} (σ | | μ)$ is minimal if and only if $σ = ν$.

Proof.

(a)

The assumption (6) guarantees that the

ν_{i} (x)

are probabilities.

(b)

One calculates

\begin{matrix} D_{ϕ} (σ | | μ) - D_{ϕ} (ν | | μ) & = & \frac{1}{2} \sum_{x} [σ (x) - ν (x)] [σ (x) + ν (x) - 2 μ (x)] \\ & = & \sum_{i = 1}^{n} p_{i}^{emp} \frac{1}{2} \sum_{x \in O_{i}} [σ_{i} (x) - ν_{i} (x)] \\ & & \times [p_{i}^{emp} σ_{i} (x) + p_{i}^{emp} ν_{i} (x) - 2 p_{i} μ_{i} (x)] \\ & = & \sum_{i = 1}^{n} {(p_{i}^{emp})}^{2} \frac{1}{2} \sum_{x \in O_{i}} {[σ_{i} (x) - ν_{i} (x)]}^{2} \\ & & + \sum_{i = 1}^{n} p_{i}^{emp} \sum_{x \in O_{i}} [σ_{i} (x) - ν_{i} (x)] (p_{i}^{emp} - p_{i}) ρ_{i} (x) \\ & = & \sum_{i = 1}^{n} {(p_{i}^{emp})}^{2} D_{ϕ} (σ_{i} | | ν_{i}) . \end{matrix}

In the above calculation the third line is obtained by eliminating

p_{i} μ_{i}

using the definition of

ν_{i}

. This gives

\begin{matrix} p_{i}^{emp} σ_{i} (x) + p_{i}^{emp} ν_{i} (x) - 2 p_{i} μ_{i} (x) \\ & = & p_{i}^{emp} σ_{i} (x) + p_{i}^{emp} ν_{i} (x) - 2 p_{i}^{emp} [ν_{i} (x) - (1 - \frac{p_{i}}{p_{i}^{emp}}) ρ_{i} (x)] \\ & = & p_{i}^{emp} [σ_{i} (x) - ν_{i} (x)] + 2 (p_{i}^{emp} - p_{i}) ρ_{i} (x) . \end{matrix}

The term

\begin{matrix} \sum_{i = 1}^{n} p_{i}^{emp} \sum_{x \in O_{i}} [σ_{i} (x) - ν_{i} (x)] (p_{i}^{emp} - p_{i}) ρ_{i} (x) \end{matrix}

vanishes because

ρ_{i} (x)

is constant on the set

O_{i}

and the probability measures

ν_{i}

and

σ_{i}

have support in

O_{i}

.

(c)

From (b) it follows that

D_{ϕ} (σ | | μ) \geq D_{ϕ} (ν | | μ)

, with equality when

σ = ν

.

Conversely, when

D_{ϕ} (σ | | μ) = D_{ϕ} (ν | | μ)

then (8) implies that

\begin{matrix} \sum_{i = 1}^{n} {(p_{i}^{emp})}^{2} D_{ϕ} (σ_{i} | | ν_{i}) & = & 0 . \end{matrix}

The empirical probabilities are strictly positive by assumption. Hence, it follows that

D_{ϕ} (σ_{i} | | ν_{i}) = 0

for all i and hence, that

σ_{i} = ν_{i}

for all i. The latter implies

σ = ν

. □

The optimal update

ν

can be written as

\begin{matrix} ν = \sum_{i} [(p_{i}^{emp} - p_{i}) ρ_{i} + p_{i} μ_{i}] = μ + \sum_{i} (p_{i}^{emp} - p_{i}) ρ_{i} . \end{matrix}

This result is in general quite different from the update proposed by Theorem 1, which is

\begin{matrix} ν & = & \sum_{i} p_{i}^{emp} μ_{i} . \end{matrix}

The updates proposed by the two theorems coincide only in the special cases that either

p_{i}^{emp} = p_{i}

for all i or that

μ_{i} = ρ_{i}

for all i. In the latter case the prior distribution

μ = \sum_{i} p_{i} ρ_{i}

is replaced by the update

ν = \sum_{i} p_{i}^{emp} ρ_{i}

.
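The update of Theorem 2 and the Pythagorean relation (8) can be checked on a small example. The sketch below uses a hypothetical five-point prior, a hypothetical partition and hypothetical empirical probabilities; the function `bregman_update` is an illustrative helper that follows (7) and refuses inputs violating condition (6).

```python
def quad_bregman(a, b):
    """Quadratic Bregman divergence (5)."""
    return 0.5 * sum((x - y) ** 2 for x, y in zip(a, b))


def bregman_update(mu, cells, p_emp):
    """Return (nu, nu_cells) following (7); raise if condition (6) fails."""
    nu = [0.0] * len(mu)
    nu_cells = []
    for cell, pe in zip(cells, p_emp):
        p_i = sum(mu[k] for k in cell)
        ratio = p_i / pe
        nu_i = [0.0] * len(mu)
        for k in cell:
            # (1 - p_i / p_i^emp) rho_i(k) + (p_i / p_i^emp) mu_i(k)
            nu_i[k] = (1 - ratio) / len(cell) + ratio * mu[k] / p_i
            if nu_i[k] < -1e-12:
                raise ValueError("condition (6) is violated")
            nu[k] = pe * nu_i[k]
        nu_cells.append(nu_i)
    return nu, nu_cells


mu = [0.1, 0.3, 0.2, 0.15, 0.25]   # hypothetical prior
cells = [[0, 1], [2, 3, 4]]        # partition O_1, O_2
p_emp = [0.5, 0.5]                 # empirical probabilities of the cells

nu, nu_cells = bregman_update(mu, cells, p_emp)

# An arbitrary competitor sigma = sum_i p_emp[i] * sigma_i.
sigma_cells = [[0.9, 0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.2, 0.5, 0.3]]
sigma = [sum(pe * s[k] for pe, s in zip(p_emp, sigma_cells)) for k in range(len(mu))]

lhs = quad_bregman(sigma, mu)
rhs = quad_bregman(nu, mu) + sum(
    pe ** 2 * quad_bregman(s, n) for pe, s, n in zip(p_emp, sigma_cells, nu_cells)
)

# Closed form: nu = mu + sum_i (p_i^emp - p_i) rho_i, an additive shift per cell.
shifts = [(0.5 - 0.4) / 2, (0.5 - 0.6) / 3]
```

The additive shift per cell is what distinguishes this update from the multiplicative rescaling obtained with the squared Hellinger distance.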

The entropy of the update when event

O_{i}

is observed, according to Theorem 1, equals

S (ν_{i}) = S (μ_{i})

. According to Theorem 2 it equals

\begin{matrix} S (ν_{i}) & = & S ([1 - \frac{p_{i}}{p_{i}^{emp}}] ρ_{i} + \frac{p_{i}}{p_{i}^{emp}} μ_{i}) . \end{matrix}

If

p_{i} \leq p_{i}^{emp}

then it follows that

\begin{matrix} S (ν_{i}) & \geq & [1 - \frac{p_{i}}{p_{i}^{emp}}] S (ρ_{i}) + \frac{p_{i}}{p_{i}^{emp}} S (μ_{i}) \\ & \geq & S (μ_{i}) . \end{matrix}

The former inequality follows because the entropy is a concave function. The latter follows because entropy is maximal for the uniform distribution

ρ_{i}

. On the other hand, if

p_{i} > p_{i}^{emp}

then one has

\begin{matrix} S (μ_{i}) & = & S ([1 - \frac{p_{i}^{emp}}{p_{i}}] ρ_{i} + \frac{p_{i}^{emp}}{p_{i}} ν_{i}) \\ & \geq & [1 - \frac{p_{i}^{emp}}{p_{i}}] S (ρ_{i}) + \frac{p_{i}^{emp}}{p_{i}} S (ν_{i}) \\ & \geq & S (ν_{i}) . \end{matrix}

In the latter case the decrease of the entropy is stronger than in the case of the update based on the squared Hellinger distance. In conclusion, the update relying on the quadratic Bregman divergence loses details of the prior distribution by making a convex combination with a uniform distribution weighted with the probabilities of the observation. It does this more so for the events with observed probability larger than predicted, that is, when

p_{i}^{emp} > p_{i}

.
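The two entropy inequalities can be illustrated on a single cell with two points. The sketch below assumes the Shannon entropy with natural logarithm; the conditioned prior and the two values of the ratio p_i / p_i^emp are hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (natural logarithm) of a probability vector."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def bregman_cell_update(mu_i, ratio):
    """nu_i = (1 - ratio) rho_i + ratio mu_i, with ratio = p_i / p_i^emp."""
    uniform = 1.0 / len(mu_i)
    return [(1 - ratio) * uniform + ratio * m for m in mu_i]


mu_i = [0.25, 0.75]                          # hypothetical conditioned prior on O_i

flatter = bregman_cell_update(mu_i, 0.8)     # case p_i < p_i^emp
sharper = bregman_cell_update(mu_i, 1.2)     # case p_i > p_i^emp
```

When the observed probability of the cell exceeds the prior one, the update moves towards the uniform distribution and the entropy increases; in the opposite case it moves away from the uniform distribution and the entropy decreases.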

Note that Theorem 2 cannot always be applied because it contains restrictions on the empirical probabilities. In particular, if the prior probability

μ (x)

of some point x in

Ω

vanishes then the condition (6) requires that the empirical probability

p_{i}^{emp}

of the partition

O_{i}

to which the point x belongs is larger than or equal to the prior probability

p_{i}

.

4.2. Update of Conditional Probabilities

The two previous theorems assume that no empirical information is available about conditional probabilities. If such information is present then an optimal choice should make use of it. In one case the solution of the problem is straightforward. If the probabilities

p_{i}^{emp}

are available together with all conditional probabilities

p^{emp} (B | O_{i})

and there exists an update

ν

which reproduces these results then it is unique. Two cases remain: (1) The information about the conditional probabilities is incomplete; (2) the information is internally inconsistent – no update exists which reproduces the data.

Let us tackle the problem by considering the case that the only information that is available besides the probabilities

p_{i}^{emp}

is the vector of conditional probabilities

p^{emp} (B | O_{i})

of a fixed event B, given the outcome of the measurement of the random variable g as introduced in Section 2.

The following result is independent of the choice of divergence function.

Proposition 3.

Fix an event B in Ω. Assume that the conditional probabilities

p (B | O_{i})

,

i = 1, \dots, n

, are strictly positive and strictly less than 1. Assume in addition that

p_{i}^{emp} p^{emp} (B | O_{i}) \leq 1

for all i. Then there exists an update ν with corresponding probabilities

q (\cdot)

such that

q (O_{i}) = p_{i}^{emp}

and

q (B | O_{i}) = p^{emp} (B | O_{i})

,

i = 1, \dots, n

.

Proof.

An obvious choice is to take

ν

of the form

ν = \sum_{i} p_{i}^{emp} ν_{i}

with

ν_{i}

of the form

\begin{matrix} d ν_{i} (x) = [a_{i} I_{B \cap O_{i}} (x) + b_{i} I_{B^{c} \cap O_{i}} (x)] d μ (x), \end{matrix}

with

a_{i} \geq 0

and

b_{i} \geq 0

. Normalization of the

ν_{i}

gives the conditions

\begin{matrix} 1 & = & a_{i} p (B \cap O_{i}) + b_{i} p (B^{c} \cap O_{i}) . \end{matrix}

(9)

Reproduction of the conditional probabilities gives the conditions

\begin{matrix} p^{emp} (B | O_{i}) = \frac{q (B \cap O_{i})}{q (O_{i})} = a_{i} \frac{p (B \cap O_{i})}{p_{i}^{emp}} . \end{matrix}

The latter gives

\begin{matrix} a_{i} = \frac{p_{i}^{emp}}{p_{i}} \frac{p^{emp} (B | O_{i})}{p (B | O_{i})} . \end{matrix}

The normalization condition (9) becomes

\begin{matrix} 1 = p_{i}^{emp} p^{emp} (B | O_{i}) + b_{i} p (B^{c} \cap O_{i}) . \end{matrix}

It has a positive solution for

b_{i}

because

p_{i}^{emp} p^{emp} (B | O_{i}) \leq 1

and

p (B^{c} \cap O_{i}) > 0

. □

4.3. The Hellinger Case

The optimal updates can be derived easily from Theorem 1. Double the partition by introduction of the following sets

\begin{matrix} O_{i}^{+} = B \cap O_{i} \quad \text{and} \quad O_{i}^{-} = B^{c} \cap O_{i} . \end{matrix}

They have prior probabilities

p_{i}^{\pm} = p (O_{i}^{\pm})

. Corresponding prior measures

μ_{i}^{\pm}

are defined by

\begin{matrix} d μ_{i}^{\pm} (x) = \frac{1}{p_{i}^{\pm}} I_{O_{i}^{\pm}} (x) d μ (x) . \end{matrix}

The empirical probability of the set

O_{i}^{+}

is taken equal to

p_{i}^{emp} p^{emp} (B | O_{i})

, that of

O_{i}^{-}

equals

p_{i}^{emp} [1 - p^{emp} (B | O_{i})]

. The optimal update

ν

follows from Theorem 1 and is given by

\begin{matrix} d ν (x) & = & \sum_{i} p_{i}^{emp} p^{emp} (B | O_{i}) d μ_{i}^{+} (x) + \sum_{i} p_{i}^{emp} [1 - p^{emp} (B | O_{i})] d μ_{i}^{-} (x) . \end{matrix}

By construction one has

\begin{matrix} q (O_{i}^{+}) = p_{i}^{emp} p^{emp} (B | O_{i}) \quad \text{and} \quad q (O_{i}^{-}) = p_{i}^{emp} [1 - p^{emp} (B | O_{i})] . \end{matrix}

One now verifies that

q (O_{i}) = p_{i}^{emp}

and

q (B | O_{i}) = p^{emp} (B | O_{i})

, which is the intended result.
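The construction with the doubled partition can be sketched for a finite sample space. The code below is an illustration under that assumption; the prior, the event B, the empirical values and the function name `hellinger_update_with_conditionals` are hypothetical, and every set B ∩ O_i and B^c ∩ O_i is assumed to have non-vanishing prior probability.

```python
from fractions import Fraction as Fr


def hellinger_update_with_conditionals(mu, cells, event_b, p_emp, p_emp_b_given):
    """Apply the update of Theorem 1 to the doubled partition O_i^+, O_i^-."""
    nu = {}
    for cell, pe, pb in zip(cells, p_emp, p_emp_b_given):
        plus = cell & event_b        # O_i^+ = B ∩ O_i
        minus = cell - event_b       # O_i^- = B^c ∩ O_i
        p_plus = sum(mu[x] for x in plus)
        p_minus = sum(mu[x] for x in minus)
        for x in plus:
            nu[x] = pe * pb * mu[x] / p_plus
        for x in minus:
            nu[x] = pe * (1 - pb) * mu[x] / p_minus
    return nu


# Hypothetical prior on five outcomes, in exact rational arithmetic.
mu = {0: Fr(1, 10), 1: Fr(2, 10), 2: Fr(3, 10), 3: Fr(1, 10), 4: Fr(3, 10)}
cells = [{0, 1}, {2, 3, 4}]
B = {1, 2, 3}
p_emp = [Fr(1, 2), Fr(1, 2)]       # empirical probabilities of O_1, O_2
p_emp_b = [Fr(1, 4), Fr(3, 5)]     # empirical conditional probabilities of B

nu = hellinger_update_with_conditionals(mu, cells, B, p_emp, p_emp_b)

q_cells = [sum(nu[x] for x in cell) for cell in cells]
q_b_given = [sum(nu[x] for x in cell & B) / qc for cell, qc in zip(cells, q_cells)]
```

The update reproduces both the empirical probabilities of the cells and the empirical conditional probabilities of B, while inside each of the sets O_i^+ and O_i^- the prior is only rescaled.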

4.4. The Bregman Case

Next consider the optimization with the quadratic Bregman divergence. Probability distributions

ρ_{i}^{\pm}

are defined by

\begin{matrix} ρ_{i}^{\pm} (x) & = & \frac{1}{| O_{i}^{\pm} |} I_{O_{i}^{\pm}} (x) . \end{matrix}

Introduce the notations

\begin{matrix} r_{i}^{+} & = & \frac{p_{i}^{+}}{p_{i}^{emp} p^{emp} (B | O_{i})}, \\ r_{i}^{-} & = & \frac{p_{i}^{-}}{p_{i}^{emp} [1 - p^{emp} (B | O_{i})]}, \\ ν_{i}^{\pm} (x) & = & (1 - r_{i}^{\pm}) ρ_{i}^{\pm} + r_{i}^{\pm} μ_{i}^{\pm} (x) . \end{matrix}

Then the condition for Theorem 2 to hold is that

ν_{i}^{\pm} (x) \geq 0

for all

x, i

. The optimal probability distribution

ν

is given by

\begin{matrix} ν (x) & = & \sum_{i} p_{i}^{emp} p^{emp} (B | O_{i}) ν_{i}^{+} (x) + \sum_{i} p_{i}^{emp} [1 - p^{emp} (B | O_{i})] ν_{i}^{-} (x) \\ & = & \sum_{i} [p_{i}^{emp} p^{emp} (B | O_{i}) - p_{i}^{+}] ρ_{i}^{+} + \sum_{i} p_{i}^{+} μ_{i}^{+} \\ & & + \sum_{i} [p_{i}^{emp} [1 - p^{emp} (B | O_{i})] - p_{i}^{-}] ρ_{i}^{-} + \sum_{i} p_{i}^{-} μ_{i}^{-} \\ & = & \sum_{i} p_{i}^{emp} p^{emp} (B | O_{i}) [ρ_{i}^{+} - ρ_{i}^{-}] \\ & & - \sum_{i} p_{i}^{+} ρ_{i}^{+} + \sum_{i} [p_{i}^{emp} - p_{i}^{-}] ρ_{i}^{-} \\ & & + μ . \end{matrix}

5. Example

Assume that the prior probability distribution is binomial with parameters

n, λ

, where n is known with certainty. The probability mass function is given by

\begin{matrix} μ (k) = \text{Prob} (X = k) & = & \binom{n}{k} λ^{k} {(1 - λ)}^{n - k}, \quad k = 0, 1, 2, \dots, n . \end{matrix}

The probability distribution and the value of the parameter

λ

are for instance the result of theoretical modeling of the experiment. Or they are obtained from a different kind of experiment.

The experiment under consideration yields accurate values for the probability

p^{emp}

of the two events

X = 1

and

X = 2

. The problem at hand is to predict by extrapolation the probability of the event

X = k

for other values of k. A fit of the data with a binomial distribution is likely to fail because two accurate data points are given to determine a single parameter

λ

. The binomial model can be misspecified.

The geometric approach followed in the present paper yields an update from the binomial distribution to another distribution, one that reproduces the data. The update is conducted in an unbiased manner. Quite often one is tempted to replace the model, here the binomial model, by a model with one extra free parameter.

Let us see what the results of minimizing divergence functions are. The probability space

Ω

is the set of integers

0, 1, 2, \dots, n

equipped with the uniform measure. Choose events

\begin{matrix} O_{1} = {1}, O_{2} = {2}, O_{3} = Ω ∖ (O_{1} \cup O_{2}) . \end{matrix}

This gives for

p_{i} : = Prob (X \in O_{i})

\begin{matrix} p_{1} = μ (1) = n λ {(1 - λ)}^{n - 1}, \quad p_{2} = μ (2) = \frac{1}{2} n (n - 1) λ^{2} {(1 - λ)}^{n - 2}, \quad p_{3} = 1 - p_{1} - p_{2} . \end{matrix}

The optimal update according to Theorem 1, minimizing the Hellinger distance, is given by the probabilities

\begin{matrix} ν (B) & = & \sum_{i} p_{i}^{emp} μ (B | O_{i}) . \end{matrix}

In particular, the probability mass function

ν (k) : = ν ({k})

becomes

\begin{matrix} ν (1) & = & p_{1}^{emp}, \\ ν (2) & = & p_{2}^{emp}, \\ ν (k) & = & \frac{p_{3}^{emp}}{p_{3}} μ (k) \quad \text{otherwise} . \end{matrix}

The optimal update according to Theorem 2, minimizing the quadratic Bregman divergence, is given by (7). The auxiliary measures

μ_{i}

,

ρ_{i}

, and

ν_{i}

have probability mass functions given by

\begin{matrix} μ_{i} (k) = ρ_{i} (k) = ν_{i} (k) = δ_{k, i} \quad \text{for } i = 1, 2, \end{matrix}

and, because the set O_{3} contains n - 1 of the n + 1 points of Ω,

\begin{matrix} μ_{3} (k) & = & (1 - δ_{k, 1}) (1 - δ_{k, 2}) \frac{μ (k)}{p_{3}}, \\ ρ_{3} (k) & = & (1 - δ_{k, 1}) (1 - δ_{k, 2}) \frac{1}{n - 1}, \\ ν_{3} (k) & = & (1 - δ_{k, 1}) (1 - δ_{k, 2}) [(1 - \frac{p_{3}}{p_{3}^{emp}}) \frac{1}{n - 1} + \frac{μ (k)}{p_{3}^{emp}}] . \end{matrix}

The probability mass function

ν (k) : = ν ({k})

becomes

\begin{matrix} ν (k) & = & p_{1}^{emp} ν_{1} (k) + p_{2}^{emp} ν_{2} (k) + p_{3}^{emp} ν_{3} (k) \\ & = & p_{1}^{emp} \quad \text{if } k = 1, \\ & = & p_{2}^{emp} \quad \text{if } k = 2, \\ & = & \frac{p_{3}^{emp} - p_{3}}{n - 1} + μ (k) \quad \text{otherwise} . \end{matrix}

The condition (6) is the requirement that all

ν (k)

are non-negative. Because the probabilities

μ (k)

can become very small this essentially means that

p_{3}^{emp}

should be larger than

p_{3}

. The amount of probability missing in the empirical probabilities

p_{1}^{emp}

and

p_{2}^{emp}

is equally distributed over the remaining

n - 1

points of

Ω

. On the other hand, when minimizing the Hellinger distance the excess or shortage of probability is compensated by multiplying all remaining probabilities by a constant factor.

A numerical comparison with

n = 20

and

λ = 1 / 8

is found in Figure 1. The empirical values are

p_{1}^{emp} = 0.15

and

p_{2}^{emp} = 0.25

. The difference with the prior values

p_{1} ≃ 0.19774

and

p_{2} ≃ 0.26836

is made large enough to amplify the effects of the update.
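The prior values quoted above and the two updates can be recomputed with a few lines of Python. This is a minimal sketch, not the code used to produce Figure 1; it takes the cell O_3 to contain n - 1 points, since Ω has n + 1 points of which two are removed, and the variable names are hypothetical.

```python
from math import comb

n, lam = 20, 1 / 8

# Binomial prior mu(k) for k = 0, ..., n.
mu = [comb(n, k) * lam ** k * (1 - lam) ** (n - k) for k in range(n + 1)]

p1_emp, p2_emp = 0.15, 0.25
p3_emp = 1 - p1_emp - p2_emp

p1, p2 = mu[1], mu[2]
p3 = 1 - p1 - p2

# The cell O_3: all points of Omega except k = 1 and k = 2.
rest = [k for k in range(n + 1) if k not in (1, 2)]

hellinger = list(mu)
bregman = list(mu)
hellinger[1] = bregman[1] = p1_emp
hellinger[2] = bregman[2] = p2_emp
for k in rest:
    # Squared Hellinger distance: rescale the remaining prior probabilities.
    hellinger[k] = p3_emp / p3 * mu[k]
    # Quadratic Bregman divergence: shift them by a constant amount.
    bregman[k] = mu[k] + (p3_emp - p3) / len(rest)
```

With these values p_3^emp exceeds p_3, so the additive shift is positive and the Bregman update stays non-negative at every point; the Hellinger update multiplies the remaining prior probabilities by a factor larger than one.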

6. Summary

It is well known that the use of unmodified prior conditional probabilities is the optimal way for updating a probability distribution after new data become available. The update procedure minimizes the Hellinger distance between prior and posterior probability distributions. For the sake of completeness a proof is given in Theorem 1.

Alternatively, one can minimize the quadratic Bregman divergence instead of the Hellinger distance. The result is given in Theorem 2. The conservation of probability is handled in a different way in the two cases, either by multiplying prior probabilities with a suitable factor or by adding an appropriate term.

The example of Section 5 shows that the two update procedures have different effects and that neither of them may be satisfactory. This raises the question whether the present approach should be improved by choosing divergences other than Hellinger or Bregman.

In the present research, the work of Banerjee, Guo, and Wang [11] was considered as well. They prove that minimization of the Hellinger distance can be replaced by minimization of a Bregman divergence, without modifying the outcome. It is shown in Theorem 2 that, in a different context, the use of the Bregman divergence yields results quite distinct from those obtained by minimizing the Hellinger distance.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Amari, S.; Nagaoka, H. Methods of Information Geometry; Originally published in Japanese by Iwanami Shoten, Tokyo, Japan, 1993; Oxford University Press: Oxford, UK, 2000.
2. Amari, S. Information Geometry and Its Applications; Springer Nature: Tokyo, Japan, 2016.
3. Ay, N.; Jost, J.; Lê, H.V.; Schwachhöfer, L. Information Geometry; Springer Nature: Basel, Switzerland, 2017.
4. Jaynes, E. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
5. White, H. Maximum Likelihood Estimation of Misspecified Models. Econometrica 1982, 50, 1–25.
6. Jeffrey, R. Alias Smith and Jones: The Testimony of the Senses. Erkenntnis 1987, 26, 391–399.
7. Skyrms, B. The structure of Radical Probabilism. Erkenntnis 1997, 45, 285–297.
8. Csiszár, I. Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems. Ann. Stat. 1991, 19, 2032–2066.
9. Csiszár, I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975, 3, 146–158.
10. Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy and robust Bayesian Decision Theory. Ann. Stat. 2004, 32, 1367–1433.
11. Banerjee, A.; Guo, X.; Wang, H. On the Optimality of Conditional Expectation as a Bregman Predictor. IEEE Trans. Inf. Theory 2005, 51, 2664–2669.
12. Frigyik, B.A.; Srivastava, S.; Gupta, M.R. Functional Bregman Divergences and Bayesian Estimation of Distributions. IEEE Trans. Inf. Theory 2008, 54, 5130–5139.

Figure 1. Probability as a function of the integer k running from 0 to 20, showing different updates of the binomial distribution with parameters

n = 20

and

λ = 1 / 8

. The squares represent the binomial, the diamonds the update with the Hellinger distance, and the triangles the update with the square Bregman divergence. The empirical values are

p_{1}^{emp} = 0.15

and

p_{2}^{emp} = 0.25

.


Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
