Article

Variational Message Passing and Local Constraint Manipulation in Factor Graphs

1 Department of Electrical Engineering, Eindhoven University of Technology, 5600 MB Eindhoven, The Netherlands
2 GN Hearing, JF Kennedylaan 2, 5612 AB Eindhoven, The Netherlands
* Author to whom correspondence should be addressed.
Entropy 2021, 23(7), 807; https://doi.org/10.3390/e23070807
Submission received: 19 May 2021 / Revised: 18 June 2021 / Accepted: 22 June 2021 / Published: 24 June 2021
(This article belongs to the Special Issue Approximate Bayesian Inference)

Abstract:
Accurate evaluation of Bayesian model evidence for a given data set is a fundamental problem in model development. Since evidence evaluations are usually intractable, in practice variational free energy (VFE) minimization provides an attractive alternative, as the VFE is an upper bound on negative model log-evidence (NLE). In order to improve tractability of the VFE, it is common to manipulate the constraints in the search space for the posterior distribution of the latent variables. Unfortunately, constraint manipulation may also lead to a less accurate estimate of the NLE. Thus, constraint manipulation implies an engineering trade-off between tractability and accuracy of model evidence estimation. In this paper, we develop a unifying account of constraint manipulation for variational inference in models that can be represented by a (Forney-style) factor graph, for which we identify the Bethe Free Energy as an approximation to the VFE. We derive well-known message passing algorithms from first principles, as the result of minimizing the constrained Bethe Free Energy (BFE). The proposed method supports evaluation of the BFE in factor graphs for model scoring and development of new message passing-based inference algorithms that potentially improve evidence estimation accuracy.

1. Introduction

Building models from data is at the core of both science and engineering applications. The search for good models requires a performance measure that scores how well a particular model m captures the hidden patterns in a data set D. In a Bayesian framework, that measure is the Bayesian evidence p ( D | m ) , i.e., the probability that model m would generate D if we were to draw data from m. The art of modeling is then the iterative process of proposing new model specifications, evaluating the evidence for each model and retaining the model with the most evidence [1].
Unfortunately, Bayesian evidence is intractable for most interesting models. A popular solution to evidence evaluation is provided by variational inference, which describes the process of Bayesian evidence evaluation as a (free energy) minimization process, since the variational free energy (VFE) is a tractable upper bound on Bayesian (negative log-)evidence [2]. In practice, the model development process then consists of proposing various candidate models, minimizing VFE for each model and selecting the model with the lowest minimized VFE.
The difference between VFE and negative log-evidence (NLE) is equal to the Kullback–Leibler divergence (KLD) [3] from the (perfect) Bayesian posterior distribution to the variational distribution for the latent variables in the model. The KLD can be interpreted as the cost of conducting variational rather than Bayesian inference. Perfect (Bayesian) inference would lead to zero inference costs (KLD = 0 ), and the KLD increases as the variational posterior diverges further from the Bayesian posterior. As a result, model development in a variational inference context is a balancing act, where we search for models that have both large amounts of evidence for the data and small inference costs (small KLD). In other words, in a variational inference context, the researcher has two knobs to tune models. The first knob alters the model specification, which affects model evidence. The second knob relates to constraining the search space for the variational posterior, which may affect the inference costs.
In this paper, we are concerned with developing algorithms for tuning the second knob. How do we constrain the range of variational posteriors so as to make variational inferences both tractable and accurate (resulting in low KLD)? We present our framework in the context of a (Forney-style) factor graph representation of the model [4,5]. In that context, variational inference can be understood as an automatable and efficient message passing-based inference procedure [6,7,8].
Traditional constraints include mean-field [6] and Bethe approximations [9,10]. However, more recently it has become clear how alternative local constraints, such as posterior factorization [11], expectation and chance constraints [12,13], and local Laplace approximation [14], may impact both tractability and inference accuracy, and thereby potentially lead to lower VFE. The main contribution of the current work lies in unifying the various ideas on local posterior constraints into a principled method for deriving variational message passing-based inference algorithms. The proposed method derives existing message passing algorithms, but also supports the development of new message passing variants.
Section 2 reviews Forney-style Factor Graphs (FFGs) and variational inference by minimizing the Bethe Free Energy (BFE). This review is continued in Section 3, where we discuss BFE optimization from a Lagrangian optimization viewpoint. In Appendix A, we include an example to illustrate that Bayes' rule can be derived from Lagrangian optimization with data constraints. Our main contribution lies in Section 4, which provides a rigorous treatment of the effects of imposing local constraints on the BFE and the resulting message update rules. We build upon several previous works that describe how manipulation of (local) constraints and variational objectives can be employed to improve variational approximations in the context of message passing. For example, ref. [12] shows how inference algorithms can be unified in terms of hybrid message passing by Lagrangian constraint manipulation. We extend this view by bringing form (Section 4.2) and factorization constraints (Section 4.1) into a constrained optimization framework. In [15], a high-level recipe for generating message passing algorithms from divergence measures is described. We apply their general recipe in the current work, where we adhere to the view on local stationary points for region-based approximations on general graphs [16]. In Appendix B, we show that locally stationary solutions are also globally stationary solutions. In Section 5, we develop an algorithm for VFE evaluation in an FFG. In previous work, ref. [17] describes a factor softening approach to evaluate the VFE for models with deterministic factors. We extend this work in Section 5, and show how to avoid factor softening for both free energy evaluation and inference of posteriors. We show an example of how to compute VFE for a deterministic node in Appendix C. A more detailed comparison to related work is given in Section 7.
In the literature, proofs and descriptions of message passing-based inference algorithms are scattered across multiple papers and varying graphical representations, including Bayesian networks [6,18], Markov random fields [16], bi-partite (Tanner) factor graphs [12,17,19] and Forney-style factor graphs (FFGs) [5,11]. In Appendix D, we provide first-principle proofs for a large collection of familiar message passing algorithms in the context of Forney-style factor graphs, which is the preferred framework in the information and communication theory communities [4,20].

2. Factor Graphs and the Bethe Free Energy

2.1. Terminated Forney-Style Factor Graphs

A Forney-style factor graph (FFG) is an undirected graph $G = (V, E)$ with nodes $V$ and edges $E \subseteq V \times V$. We denote the neighboring edges of a node $a \in V$ by $E(a)$. Vice versa, for an edge $i \in E$, the notation $V(i)$ collects all neighboring nodes. As a notational convention, we index nodes by $a, b, c$ and edges by $i, j, k$, unless stated otherwise. We will mainly use $a$ and $i$ as summation indices and use the other indices to refer to a node or edge of interest.
In this paper, we will frequently refer to the notion of a subgraph. We define an edge-induced subgraph by $G(i) = (V(i), i)$, and a node-induced subgraph by $G(a) = (a, E(a))$. Furthermore, we denote a local subgraph by $G(a, i) = (V(i), E(a))$, which collects all local nodes and edges around $i$ and $a$, respectively.
An FFG can be used to represent a factorized function,
$$f(s) = \prod_{a \in V} f_a(s_a), \tag{1}$$
where $s_a$ collects the argument variables of factor $f_a$. We assume that all factors are positive. In an FFG, a node $a \in V$ corresponds to a factor $f_a$, and the neighboring edges $E(a)$ correspond to the variables $s_a$ that are the arguments of $f_a$.
As an example, consider the factorization (2), whose corresponding FFG is shown in Figure 1:
$$f(s_1, \ldots, s_5) = f_a(s_1)\, f_b(s_1, s_2, s_3)\, f_c(s_2)\, f_d(s_3, s_4, s_5)\, f_e(s_5). \tag{2}$$
The FFG of Figure 1 consists of five nodes $V = \{a, \ldots, e\}$, as annotated by their corresponding factor functions, and five edges $E = \{(a, b), \ldots, (d, e)\}$, as annotated by their corresponding variables. An edge that connects to only one node (e.g., the edge for $s_4$) is called a half-edge. In this example, the neighborhood $E(b) = \{(a, b), (b, c), (b, d)\}$ and $V((b, c)) = \{b, c\}$.
In the FFG representation, a node can be connected to an arbitrary number of edges, while an edge can only be connected to at most two nodes. Therefore, FFGs often contain “equality nodes” that constrain connected edges to carry identical beliefs, with the implication that these beliefs can be made available to more than two factors. An equality node has the factor function
$$f_a(s_i, s_j, s_k) = \delta(s_j - s_i)\, \delta(s_j - s_k), \tag{3}$$
for which the node-induced subgraph G ( a ) is drawn in Figure 2.
If every edge in the FFG has exactly two connected nodes (including equality nodes), then we designate the graph as a terminated FFG (TFFG). Since multiplication of a function f ( s ) by 1 does not alter the function, any FFG can be terminated by connecting any half-edge i to a node a that represents the unity factor f a ( s i ) = 1 .
In Section 4.2 we discuss form constraints on posterior distributions. If such a constraint takes on a Dirac-delta functional form, then we visualize the constraint on the FFG by a small circle in the middle of the edge. For example, the small shaded circle in Figure 11 indicates that the variable has been observed. In Section 4.3.2 we consider form constraints in the context of optimization, in which case the circle annotation will be left open (see, e.g., Figure 14).

2.2. Variational Free Energy

Given a model f ( s ) and a (normalized) probability distribution q ( s ) , we can define a Variational Free Energy (VFE) functional as
$$F[q, f] \triangleq \int q(s) \log \frac{q(s)}{f(s)}\, \mathrm{d}s. \tag{4}$$
Variational inference is concerned with finding solutions to the minimization problem
$$q^*(s) = \arg\min_{q \in \mathcal{Q}} F[q, f], \tag{5}$$
where Q imposes some constraints on q.
If $q$ is unconstrained, then the optimal solution is obtained for $q^*(s) = p(s)$, with $p(s) = \frac{1}{Z} f(s)$ being the exact posterior, and $Z = \int f(s)\, \mathrm{d}s$ a normalizing constant that is commonly referred to as the evidence. The minimum value of the free energy then follows as the negative log-evidence (NLE),
$$F[q^*, f] = -\log Z,$$
which is also known as the surprisal. The NLE can be interpreted as a measure of model performance, where low NLE is preferred.
As an unconstrained search space for q grows exponentially with the number of variables, the optimization of (5) quickly becomes intractable beyond the most basic models. Therefore, constraints and approximations to the variational free energy (4) are often utilized. As a result, the constrained variational free energy with q * Q bounds the NLE by
$$F[q^*, f] = -\log Z + \int q^*(s) \log \frac{q^*(s)}{p(s)}\, \mathrm{d}s,$$
where the latter term expresses the divergence from the (intractable) exact solution to the optimal variational belief.
In practice, the functional form of q ( s ) = q ( s ; θ ) is often parameterized, such that gradients of F can be derived w.r.t. the parameters θ . This effectively converts the variational optimization of F [ q , f ] to a parametric optimization of F ( θ ) as a function of θ . This problem can then be solved by a (stochastic) gradient descent procedure [21,22].
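As a small illustration of this parametric view (an added sketch, not one of the paper's examples), consider the toy model $f(s) = \mathcal{N}(x \mid s, 1)\, \mathcal{N}(s \mid 0, 1)$ with a single observed value $x$ and a Gaussian variational posterior $q(s) = \mathcal{N}(m, v)$. Here $F(m, v)$ is available in closed form, its gradients are elementary, and plain gradient descent recovers the exact posterior $\mathcal{N}(x/2, 1/2)$, at which point $F$ equals the negative log-evidence $-\log \mathcal{N}(x \mid 0, 2)$:

```julia
# Toy parametric VFE minimization for f(s) = N(x | s, 1) N(s | 0, 1),
# with a Gaussian variational posterior q(s) = N(m, v).
x = 1.5                                        # a single (illustrative) observation

F(m, v) = -0.5 * (log(2π * v) + 1) +           # -H[q]
          0.5 * (log(2π) + (x - m)^2 + v) +    # -E_q[log N(x | s, 1)]
          0.5 * (log(2π) + m^2 + v)            # -E_q[log N(s | 0, 1)]

function minimize_vfe(x; steps=2000, lr=0.05)  # plain gradient descent on F(m, v)
    m, v = 0.0, 1.0
    for _ in 1:steps
        m -= lr * (2m - x)                     # ∂F/∂m
        v -= lr * (1 - 0.5 / v)                # ∂F/∂v
    end
    return m, v
end

m, v = minimize_vfe(x)
nle = 0.5 * log(2π * 2) + x^2 / 4              # -log N(x | 0, 2)
# m ≈ 0.75, v ≈ 0.5, and F(m, v) ≈ nle ≈ 1.83
```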
In the context of variational calculus, while form constraints may lead to interesting properties (see Section 4.2), they are generally not required. Interestingly, in a variational optimization context, the functional form of q is often not an assumption, but rather a result of optimization (see Section 4.3.1). An example of variational inference is provided in Appendix A.

2.3. Bethe Free Energy

The Bethe approximation enjoys a unique place in the landscape of Q , because the Bethe free energy (BFE) defines the fundamental objective of the celebrated belief propagation (BP) algorithm [17,23]. The origin of the Bethe approximation is rooted in tree-like approximations to subgraphs (possibly containing cycles) by enforcing local consistency conditions on the beliefs associated with edges and nodes [24].
Given a TFFG $G = (V, E)$ for a factorized function $f(s) = \prod_{a \in V} f_a(s_a)$ (1), the Bethe free energy (BFE) is defined as [25]:
$$F[q, f] \triangleq \sum_{a \in V} \underbrace{\int q_a(s_a) \log \frac{q_a(s_a)}{f_a(s_a)}\, \mathrm{d}s_a}_{F[q_a, f_a]} + \sum_{i \in E} \underbrace{\int q_i(s_i) \log \frac{1}{q_i(s_i)}\, \mathrm{d}s_i}_{H[q_i]}, \tag{7}$$
such that the factorized beliefs
$$q(s) = \prod_{a \in V} q_a(s_a) \prod_{i \in E} q_i(s_i)^{-1} \tag{8}$$
satisfy the following constraints:
$$\int q_a(s_a)\, \mathrm{d}s_a = 1, \quad \text{for all } a \in V \tag{9a}$$
$$\int q_a(s_a)\, \mathrm{d}s_{a \setminus i} = q_i(s_i), \quad \text{for all } a \in V \text{ and all } i \in E(a). \tag{9b}$$
Together, the normalization constraint (9a) and marginalization constraint (9b) imply that the edge marginals are also normalized:
$$\int q_i(s_i)\, \mathrm{d}s_i = 1, \quad \text{for all } i \in E. \tag{10}$$
The Bethe free energy (7) includes a local free energy term $F[q_a, f_a]$ for each node $a \in V$, and an entropy term $H[q_i]$ for each edge $i \in E$. Note that the local free energy also depends on the node function $f_a$, as specified in the factorization of $f$ (1), whereas the entropy only depends on the local belief $q_i$.
The Bethe factorization (8) and constraints are summarized by the local polytope [26]
$$\mathcal{L}(G) = \left\{ q_a \text{ for all } a \in V \text{ s.t. (9a)}, \text{ and } q_i \text{ for all } i \in E(a) \text{ s.t. (9b)} \right\}, \tag{11}$$
which defines the constrained search space for the factorized variational distribution (8).
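As a quick consistency check (added here for illustration), consider the simplest terminated graph with two nodes $a$ and $b$ connected by a single edge $i$, so that $f(s_i) = f_a(s_i)\, f_b(s_i)$. The marginalization constraints (9b) force $q_a = q_b = q_i$, and the BFE (7) collapses to
$$F[q, f] = F[q_a, f_a] + F[q_b, f_b] + H[q_i] = \int q_i(s_i) \log \frac{q_i(s_i)}{f_a(s_i)\, f_b(s_i)}\, \mathrm{d}s_i = \int q_i(s_i) \log \frac{q_i(s_i)}{f(s_i)}\, \mathrm{d}s_i,$$
which is exactly the VFE (4); its constrained minimum is therefore the negative log-evidence $-\log Z$. On tree-structured graphs this argument generalizes, whereas on cyclic graphs the BFE is only an approximation to the VFE.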

2.4. Problem Statement

In this paper, the problem is to find the beliefs in the local polytope that minimize the Bethe free energy
$$q^*(s) = \arg\min_{q \in \mathcal{L}(G)} F[q, f], \tag{12}$$
where $q$ is defined by (8), and where $q \in \mathcal{L}(G)$ offers a shorthand notation for optimizing over the individual beliefs in the local polytope. In the following sections, we will follow the Lagrangian optimization approach to derive various message passing-based inference algorithms.

2.5. Sketch of Solution Approach

The problem statement of Section 2.4 defines a global minimization of the beliefs in the Bethe factorization. Instead of solving the global optimization problem directly, we employ the factorization of the variational posterior and local polytope to subdivide the global problem statement in multiple interdependent local objectives.
From the BFE objective (12) and local polytope of (11), we can construct the Lagrangian
$$\begin{aligned} L[q, f] = \sum_{a \in V} F[q_a, f_a] &+ \sum_{a \in V} \psi_a \left[ \int q_a(s_a)\, \mathrm{d}s_a - 1 \right] + \sum_{a \in V} \sum_{i \in E(a)} \int \lambda_{ia}(s_i) \left[ q_i(s_i) - \int q_a(s_a)\, \mathrm{d}s_{a \setminus i} \right] \mathrm{d}s_i \\ &+ \sum_{i \in E} H[q_i] + \sum_{i \in E} \psi_i \left[ \int q_i(s_i)\, \mathrm{d}s_i - 1 \right], \end{aligned} \tag{13}$$
where the Lagrange multipliers ψ a , ψ i and λ i a enforce the normalization and marginalization constraints of (9). It can be seen that this Lagrangian contains local beliefs q a and q i , which are coupled through the λ i a Lagrange multipliers. The Lagrange multipliers λ i a are doubly indexed, because there is a multiplier associated with each marginalization constraint. The Lagrangian method then converts a constrained optimization problem of F [ q , f ] to an unconstrained optimization problem of L [ q , f ] . The total variation of the Lagrangian (13) can then be approached from the perspective of variations of the individual (coupled) local beliefs.
More specifically, given a locally connected pair $b \in V$, $j \in E(b)$, we can rewrite the optimization of (12) in terms of the local beliefs $q_b$, $q_j$, and the constraints in the local polytope
$$\mathcal{L}(G(b, j)) = \left\{ q_b \text{ s.t. (9a)}, \text{ and } q_j \text{ s.t. (9b)} \right\}, \tag{14}$$
that pertains to these beliefs. The problem then becomes finding local stationary solutions
$$\{q_b^*, q_j^*\} = \arg\min_{\mathcal{L}(G(b, j))} F[q, f]. \tag{15}$$
Using (13), the optimization of (15) can then be written in the Lagrangian form
$$q_b^* = \arg\min_{q_b} L_b[q_b, f_b], \tag{16a}$$
$$q_j^* = \arg\min_{q_j} L_j[q_j], \tag{16b}$$
where the Lagrangians L b and L j include the local polytope of (14) to rewrite (13) as an explicit functional of beliefs q b and q j (see, e.g., Lemmas 1 and 2). The combined stationary solutions to the local objectives then also comprise a stationary solution to the global objective (Appendix B).
The current paper shows how to identify stationary solutions to local objectives of the form (15), with the use of variational calculus, under varying constraints as imposed by the local polytope (14). Interestingly, the resulting fixed-point equations can be interpreted as message passing updates on the underlying TFFG representation of the model. In the following Section 3 and Section 4, we derive the local stationary solutions under a selection of constraints and show how these relate to known message passing update rules (Table 1). It then becomes possible to derive novel message updates and algorithms by simply altering the local polytope.

3. Bethe Lagrangian Optimization by Message Passing

3.1. Stationary Points of the Bethe Lagrangian

We wish to minimize the Bethe free energy under variations of the variational density. As the Bethe free energy factorizes over factors and variables (7), we first consider variations on separate node- and edge-induced subgraphs.
Lemma 1.
Given a TFFG $G = (V, E)$, consider the node-induced subgraph $G(b)$ (Figure 3). The stationary points of the Lagrangian (16a) as a functional of $q_b$,
$$L_b[q_b, f_b] = F[q_b, f_b] + \psi_b \left[ \int q_b(s_b)\, \mathrm{d}s_b - 1 \right] + \sum_{i \in E(b)} \int \lambda_{ib}(s_i) \left[ q_i(s_i) - \int q_b(s_b)\, \mathrm{d}s_{b \setminus i} \right] \mathrm{d}s_i + C_b, \tag{17}$$
where $C_b$ collects all terms that are independent of $q_b$, are of the form
$$q_b(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)\, \mathrm{d}s_b}. \tag{18}$$
Proof. 
See Appendix D.1. □
The $\mu_{i \to b}(s_i)$ are any set of positive functions that make (18) satisfy (9b), and will be identified in Theorem 1.
Lemma 2.
Given a TFFG $G = (V, E)$, consider an edge-induced subgraph $G(j)$ (Figure 4). The stationary points of the Lagrangian (16b) as a functional of $q_j$,
$$L_j[q_j] = H[q_j] + \psi_j \left[ \int q_j(s_j)\, \mathrm{d}s_j - 1 \right] + \sum_{a \in V(j)} \int \lambda_{ja}(s_j) \left[ q_j(s_j) - \int q_a(s_a)\, \mathrm{d}s_{a \setminus j} \right] \mathrm{d}s_j + C_j, \tag{19}$$
where $C_j$ collects all terms that are independent of $q_j$, are of the form
$$q_j(s_j) = \frac{\mu_{j \to b}(s_j)\, \mu_{j \to c}(s_j)}{\int \mu_{j \to b}(s_j)\, \mu_{j \to c}(s_j)\, \mathrm{d}s_j}. \tag{20}$$
Proof. 
See Appendix D.2. □

3.2. Minimizing the Bethe Free Energy by Belief Propagation

We now combine Lemmas 1 and 2 to derive the sum-product message update.
Theorem 1
(Sum-Product Message Update). Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 5). Given the local polytope $\mathcal{L}(G(b, j))$ of (14), then the local stationary solutions to (15) are given by
$$q_b^*(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)\, \mathrm{d}s_b} \tag{21a}$$
$$q_j^*(s_j) = \frac{\mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)}{\int \mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)\, \mathrm{d}s_j}, \tag{21b}$$
with messages $\mu_{j \to c}^*(s_j)$ corresponding to the fixed points of
$$\mu_{j \to c}^{(k+1)}(s_j) = \int f_b(s_b) \prod_{\substack{i \in E(b) \\ i \neq j}} \mu_{i \to b}^{(k)}(s_i)\, \mathrm{d}s_{b \setminus j}, \tag{22}$$
with k representing an iteration index.
Proof. 
See Appendix D.3. □
The sum-product algorithm has proven to be useful in many engineering applications and disciplines. For example, it is widely used for decoding in communication systems [4,20,27]. Furthermore, for a linear Gaussian state space model, Kalman filtering and smoothing can be expressed in terms of sum-product message passing for state inference on a factor graph [28,29]. This equivalence has inspired applications ranging from localization [30] to estimation [31].
The sum-product algorithm with updates (22) obtains the exact Bayesian posterior when the underlying graph is a tree [24,25,32]. Application of the sum-product algorithm to cyclic graphs is not guaranteed to converge and might lead to oscillations in the BFE over iterations. Theorems 3.1 and 3.2 in [33] show that the BFE of a graph with a single cycle is convex, which implies that the sum-product algorithm will converge in this case. Moreover, ref. [19] shows that it is possible to obtain a double-loop message passing algorithm if the graph has a cycle such that the stable fixed points will correspond to local minima of the BFE.
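For intuition, the update (22) is easy to state for discrete variables, where the factors are tables and the integral becomes a sum. The following sketch (our own illustration, not code from the paper; the table sizes and messages are arbitrary) computes the message leaving a three-edge discrete factor toward its third edge:

```julia
# Sum-product message toward edge 3 of a discrete factor f_b(s1, s2, s3):
#   μ_{3→c}(s3) = Σ_{s1, s2} f_b(s1, s2, s3) μ_{1→b}(s1) μ_{2→b}(s2)
f_b = rand(2, 3, 4)           # factor table with |s1| = 2, |s2| = 3, |s3| = 4
μ1  = [0.4, 0.6]              # incoming message over edge 1
μ2  = [0.2, 0.5, 0.3]         # incoming message over edge 2

μ3 = [sum(f_b[x1, x2, x3] * μ1[x1] * μ2[x2]
          for x1 in 1:2, x2 in 1:3) for x3 in 1:4]
μ3 ./= sum(μ3)                # optional normalization of the outgoing message
```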
Example 1.
(A Linear Dynamical System). Consider a linear Gaussian state space model specified by the following factors:
$$g_0(x_0) = \mathcal{N}(x_0 \mid m_{x_0}, V_{x_0}) \tag{23a}$$
$$g_t(x_{t-1}, z_t, A_t) = \delta(z_t - A_t x_{t-1}) \tag{23b}$$
$$h_t(x_t, z_t, Q_t) = \mathcal{N}(x_t \mid z_t, Q_t^{-1}) \tag{23c}$$
$$n_t(x_t, x_t', x_t'') = \delta(x_t' - x_t)\, \delta(x_t'' - x_t) \tag{23d}$$
$$m_t(o_t, x_t', B_t) = \delta(o_t - B_t x_t') \tag{23e}$$
$$r_t(y_t, o_t, R_t) = \mathcal{N}(y_t \mid o_t, R_t^{-1}). \tag{23f}$$
The FFG corresponding to one time segment of the state space model is given in Figure 6. We assume that the following matrices, which are used to generate the data, are known:
$$\hat{A}_t = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix}, \quad \hat{Q}_t^{-1} = \begin{bmatrix} 3 & 0.1 \\ 0.1 & 2 \end{bmatrix}, \quad \hat{B}_t = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \hat{R}_t^{-1} = \begin{bmatrix} 10 & 2 \\ 2 & 20 \end{bmatrix}, \tag{24}$$
with $\theta = \pi/8$. Given a collection of observations $\hat{y} = \{\hat{y}_1, \ldots, \hat{y}_T\}$, we constrain the latent states $x = \{x_0, \ldots, x_T\}$ by local marginalization and normalization constraints (for brevity we omit writing the normalization constraints explicitly) in accordance with Theorem 1, i.e.,
$$\int q(x_{t-1}, z_t, A_t)\, \mathrm{d}x_{t-1}\, \mathrm{d}z_t = q(A_t), \quad \int q(x_{t-1}, z_t, A_t)\, \mathrm{d}A_t = q(z_t \mid x_{t-1})\, q(x_{t-1}) \tag{25a}$$
$$\int q(x_t, z_t, Q_t)\, \mathrm{d}x_t\, \mathrm{d}z_t = q(Q_t), \quad \int q(x_t, z_t, Q_t)\, \mathrm{d}z_t\, \mathrm{d}Q_t = q(x_t), \quad \int q(x_t, z_t, Q_t)\, \mathrm{d}x_t\, \mathrm{d}Q_t = q(z_t) \tag{25b}$$
$$q(x_t, x_t', x_t'') = q(x_t)\, \delta(x_t' - x_t)\, \delta(x_t'' - x_t) \tag{25c}$$
$$\int q(o_t, x_t', B_t)\, \mathrm{d}o_t\, \mathrm{d}x_t' = q(B_t), \quad \int q(o_t, x_t', B_t)\, \mathrm{d}B_t = q(o_t \mid x_t')\, q(x_t') \tag{25d}$$
$$\int q(o_t, y_t, R_t)\, \mathrm{d}o_t\, \mathrm{d}y_t = q(R_t), \quad \int q(o_t, y_t, R_t)\, \mathrm{d}R_t\, \mathrm{d}o_t = q(y_t), \quad \int q(o_t, y_t, R_t)\, \mathrm{d}R_t\, \mathrm{d}y_t = q(o_t) \tag{25e}$$
Moreover, we use data constraints in accordance with Theorem 3 (explained in Section 4.2.1) for the observations, state transition matrices and precision matrices, i.e.,
$$q(y_t) = \delta(y_t - \hat{y}_t), \quad q(A_t) = \delta(A_t - \hat{A}_t), \quad q(B_t) = \delta(B_t - \hat{B}_t), \quad q(Q_t) = \delta(Q_t - \hat{Q}_t), \quad q(R_t) = \delta(R_t - \hat{R}_t).$$
Computation of the sum-product messages by (22) is analytically tractable; detailed algebraic manipulations can be found in [31]. If backward messages are not passed, then the resulting sum-product message passing algorithm is equivalent to Kalman filtering, and if both forward and backward messages are propagated, then the Rauch–Tung–Striebel smoother is obtained [34] (Ch. 8).
We generated $T = 100$ observations $\hat{y}$ using the matrices specified in (24) and the initial condition $\hat{x}_0 = [5, 5]$. Due to (23a), we have $\mu_{x_0 \to g_1} = \mathcal{N}(m_{x_0}, V_{x_0})$. We chose $V_{x_0} = 100 \cdot I$ and $m_{x_0} = \hat{x}_0$. Under these constraints, the results of sum-product message passing and Bethe free energy evaluation are given in Figure 6. As the underlying graph is a tree, the sum-product message passing results are exact and the evaluated BFE corresponds to the negative log-evidence. In the follow-up Example 2, we will modify the constraints and give a comparative free energy plot for the examples in Figures 10 and 16.
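For reference, a minimal sketch of the forward (filtering) pass of this example is given below. It is our own illustration of the sum-product recursion for the model (23) with the matrices (24), not the ForneyLab implementation used in the paper, and the simulated initial state and noise draws are only indicative:

```julia
using LinearAlgebra

# Forward sum-product (Kalman filter) for the linear Gaussian model (23)-(24).
# Q_cov and R_cov denote the transition and observation covariances (Q̂_t⁻¹, R̂_t⁻¹).
θ     = π / 8
A     = [cos(θ) -sin(θ); sin(θ) cos(θ)]
B     = [1.0 0.0; 0.0 1.0]
Q_cov = [3.0 0.1; 0.1 2.0]
R_cov = [10.0 2.0; 2.0 20.0]

function kalman_filter(ys, m0, V0)
    m, V = m0, V0
    means = Vector{Vector{Float64}}()
    for y in ys
        # forward message through g_t and h_t (prediction)
        m_pred = A * m
        V_pred = A * V * A' + Q_cov
        # combine with the data-constrained branch through m_t and r_t (update)
        S = B * V_pred * B' + R_cov
        K = V_pred * B' / S
        m = m_pred + K * (y - B * m_pred)
        V = (I - K * B) * V_pred
        push!(means, m)
    end
    return means
end

# Simulate T = 100 observations and run the filter
x  = [5.0, 5.0]
ys = Vector{Vector{Float64}}()
for t in 1:100
    global x = A * x + cholesky(Q_cov).L * randn(2)
    push!(ys, B * x + cholesky(R_cov).L * randn(2))
end
filtered_means = kalman_filter(ys, [5.0, 5.0], 100.0 * Matrix(I, 2, 2))
```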

4. Message Passing Variations through Constraint Manipulation

For generic node functions with arbitrary connectivity, there is no guarantee that the sum-product updates can be solved analytically. When analytic solutions are not possible, there are two ways to proceed. One way is to try to solve the sum-product update equations numerically, e.g., by Monte Carlo methods. Alternatively, we can add additional constraints to the BFE that lead to simpler update equations at the cost of inference accuracy. In the remainder of the paper, we explore a variety of constraints that have proven to yield useful inference solutions.

4.1. Factorization Constraints

Additional factorizations of the variational density $q_a(s_a)$ are often assumed to ease computation. In particular, we assume a structured mean-field factorization such that
$$q_b(s_b) \triangleq \prod_{n \in l(b)} q_b^n(s_b^n), \tag{26}$$
where $n$ indicates a local cluster as a set of edges. To define a local cluster rigorously, let us first denote by $P(a)$ the power set of an edge set $E(a)$, where the power set is the set of all subsets of $E(a)$. Then, a mean-field factorization $l(a) \subseteq P(a)$ can be chosen such that all elements in $E(a)$ are included in $l(a)$ exactly once. Therefore, $l(a)$ is defined as a set of one or multiple sets of edges. For example, if $E(a) = \{i, j, k\}$, then $l(a) = \{\{i\}, \{j, k\}\}$ is allowed, as is $l(a) = \{\{i, j, k\}\}$ itself, but $l(a) = \{\{i, j\}, \{j, k\}\}$ is not allowed, since the element $j$ occurs twice. More formally, in (26), the intersection of the super- and subscript collects the required variables; see Figure 7 for an example. The special case of a fully factorized $l(b)$ over all edges $i \in E(b)$ is known as the naive mean-field factorization [11,24].
We will analyze the effect of a structured mean-field factorization (26) on the Bethe free energy (7) for a specific factor node $b \in V$. Substituting (26) in the local free energy for factor $b$ yields
$$F[q_b, f_b] = F[\{q_b^n\}, f_b] = \sum_{n \in l(b)} \int q_b^n(s_b^n) \log q_b^n(s_b^n)\, \mathrm{d}s_b^n - \int \prod_{n \in l(b)} q_b^n(s_b^n) \log f_b(s_b)\, \mathrm{d}s_b. \tag{27}$$
We are then interested in
$$q_b^{m,*} = \arg\min_{q_b^m} L_b^m[q_b^m, f_b], \tag{28}$$
where the Lagrangian $L_b^m$ (Lemma 3) enforces the normalization and marginalization constraints
$$\int q_b^m(s_b^m)\, \mathrm{d}s_b^m = 1, \tag{29a}$$
$$\int q_b^m(s_b^m)\, \mathrm{d}s_{b \setminus i}^m = q_i(s_i), \quad \text{for all } i \in m,\ m \in l(b). \tag{29b}$$
Lemma 3.
Given a terminated FFG $G = (V, E)$, consider a node-induced subgraph $G(b)$ with a structured mean-field factorization $l(b)$ (e.g., Figure 7). Then, local stationary solutions to the Lagrangian
$$\begin{aligned} L_b^m[q_b^m] = \int q_b^m(s_b^m) \log q_b^m(s_b^m)\, \mathrm{d}s_b^m &- \int \prod_{n \in l(b)} q_b^n(s_b^n) \log f_b(s_b)\, \mathrm{d}s_b + \psi_b^m \left[ \int q_b^m(s_b^m)\, \mathrm{d}s_b^m - 1 \right] \\ &+ \sum_{i \in m} \int \lambda_{ib}(s_i) \left[ q_i(s_i) - \int q_b^m(s_b^m)\, \mathrm{d}s_{b \setminus i}^m \right] \mathrm{d}s_i + C_b^m, \end{aligned} \tag{30}$$
where $C_b^m$ collects all terms independent of $q_b^m$, are of the form
$$q_b^m(s_b^m) = \frac{\tilde{f}_b^m(s_b^m) \prod_{i \in m} \mu_{i \to b}(s_i)}{\int \tilde{f}_b^m(s_b^m) \prod_{i \in m} \mu_{i \to b}(s_i)\, \mathrm{d}s_b^m}, \tag{31}$$
where
$$\tilde{f}_b^m(s_b^m) = \exp\left( \int \prod_{\substack{n \in l(b) \\ n \neq m}} q_b^n(s_b^n) \log f_b(s_b)\, \mathrm{d}s_b^{\setminus m} \right). \tag{32}$$
Proof. 
See Appendix D.4. □

4.1.1. Structured Variational Message Passing

We now combine Lemmas 2 and 3 to derive the structured variational message passing algorithm.
Theorem 2.
Structured variational message passing: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ with a structured mean-field factorization $l(b) \subseteq P(b)$, with local clusters $n \in l(b)$. Let $m \in l(b)$ be the cluster where $j \in m$ (see, e.g., Figure 8). Given the local polytope
$$\mathcal{L}(G(b, j)) = \left\{ q_b^n \text{ for all } n \in l(b) \text{ s.t. (29a)}, \text{ and } q_j \text{ s.t. (29b)} \right\}, \tag{33}$$
then local stationary solutions to
$$\{q_b^{m,*}, q_j^*\} = \arg\min_{\mathcal{L}(G(b, j))} F[q, f], \tag{34}$$
are given by
$$q_b^{m,*}(s_b^m) = \frac{\tilde{f}_b^{m,*}(s_b^m) \prod_{i \in m} \mu_{i \to b}^*(s_i)}{\int \tilde{f}_b^{m,*}(s_b^m) \prod_{i \in m} \mu_{i \to b}^*(s_i)\, \mathrm{d}s_b^m} \tag{35a}$$
$$q_j^*(s_j) = \frac{\mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)}{\int \mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)\, \mathrm{d}s_j}, \tag{35b}$$
with messages $\mu_{j \to c}^*(s_j)$ corresponding to the fixed points of
$$\mu_{j \to c}^{(k+1)}(s_j) = \int \tilde{f}_b^{m,(k)}(s_b^m) \prod_{\substack{i \in m \\ i \neq j}} \mu_{i \to b}^{(k)}(s_i)\, \mathrm{d}s_{b \setminus j}^m, \tag{36}$$
with iteration index $k$, and where
$$\tilde{f}_b^{m,(k)}(s_b^m) = \exp\left( \int \prod_{\substack{n \in l(b) \\ n \neq m}} q_b^{n,(k)}(s_b^n) \log f_b(s_b)\, \mathrm{d}s_b^{\setminus m} \right). \tag{37}$$
Proof. 
See Appendix D.5. □
The structured mean-field factorization applies the marginalization constraint only to the local cluster beliefs, as opposed to the joint node belief. As a result, computation for the local cluster beliefs might become tractable [24] (Ch. 5). The practical appeal of Variational Message Passing (VMP) based inference becomes evident when the underlying model is composed of conjugate factor pairs from the exponential family. When the underlying factors are conjugate exponential family distributions, the message passing updates (36) amount to adding natural parameters [35] of the underlying exponential family distributions. Structured variational message passing is popular in acoustic signal modelling, e.g., [36], as it allows one to keep track of correlations over time. In [37], a stochastic variant of structured variational inference is utilized for Latent Dirichlet Allocation. Structured approximations are also used to improve inference in auto-encoders. In [38], inference involving non-parametric Beta-Bernoulli process priors is improved by developing a structured approximation to variational auto-encoders. When the data being modelled are time series, structured approximations reflect the transition structure over time. In [39], an efficient structured black-box variational inference algorithm for fitting Gaussian variational models to latent time series is proposed.
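As a minimal illustration of the natural-parameter view (our own sketch, not code from the paper), combining two Gaussian messages over the same edge into the edge belief amounts to adding their natural parameters, i.e., their precisions and precision-weighted means:

```julia
using LinearAlgebra

# Combining Gaussian messages μ_{j→b} and μ_{j→c} over edge j into the belief q_j
# by adding natural parameters (W = precision, Wm = precision-weighted mean).
struct GaussianMessage
    W::Matrix{Float64}
    Wm::Vector{Float64}
end

function combine(μ1::GaussianMessage, μ2::GaussianMessage)
    W  = μ1.W + μ2.W            # natural parameters add
    Wm = μ1.Wm + μ2.Wm
    V  = inv(W)                 # covariance of the edge belief q_j
    return V * Wm, V            # (mean, covariance)
end

μ_jb = GaussianMessage([2.0 0.0; 0.0 2.0], [2.0, 0.0])
μ_jc = GaussianMessage([1.0 0.0; 0.0 4.0], [0.0, 4.0])
m_j, V_j = combine(μ_jb, μ_jc)
```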
Example 2.
Consider the linear Gaussian state space model of Example 1. Let us assume that the precision matrix for latent-state transitions $Q_t$ is not known and cannot be constrained by data. Then, we can augment the state space model by including a prior for $Q_t$ and try to infer a posterior over $Q_t$ from the observations. Since $Q_t$ is the precision of a normal factor, we choose a conjugate Wishart prior and assume that $Q_t$ is time-invariant by adding the following factors:
$$w_0(Q_0, V, \nu) = \mathcal{W}(Q_0 \mid V, \nu) \tag{38a}$$
$$w_t(Q_{t-1}, Q_t, Q_{t+1}) = \delta(Q_{t-1} - Q_t)\, \delta(Q_t - Q_{t+1}), \quad \text{for every } t = 1, \ldots, T. \tag{38b}$$
It is certainly possible to assume a time-varying structure for Q t ; however, our purpose is to illustrate a change in constraints rather than analyzing time-varying properties. This is why we assume time-invariance.
In this setting, the sum-product equations around the factor h t are not analytically tractable. Therefore, we changed the constraints associated with h t (25b) to those given in Theorem 2 as follows
$$\int q(x_t, z_t, Q_t)\, \mathrm{d}x_t\, \mathrm{d}z_t = q(Q_t), \quad \int q(x_t, z_t, Q_t)\, \mathrm{d}Q_t = q(x_t, z_t) \tag{39a}$$
$$\int q(Q_t)\, \mathrm{d}Q_t = 1, \quad \int q(x_t, z_t)\, \mathrm{d}x_t\, \mathrm{d}z_t = 1. \tag{39b}$$
We removed the data constraint on q ( Q t ) and instead included data constraints on the hyper-parameters
$$q(V) = \delta(V - \hat{V}), \quad q(\nu) = \delta(\nu - \hat{\nu}).$$
With the new set of constraints (39a) and (39b), we obtained a hybrid of the sum-product and structured VMP algorithms, where structured messages around the factor $h_t$ are computed by (36) and the rest of the messages are computed by the sum-product rule (22). One time segment of the modified FFG, along with the messages, is given in Figure 9. We used the same observations $\hat{y}$ that were generated in Example 1 and the same initialization for the hidden states. For the hyper-parameters of the Wishart prior, we chose $\hat{V} = 0.1 \cdot I$ and $\hat{\nu} = 2$. Under these constraints, the results of structured variational message passing, along with the Bethe free energy evaluation, are given in Figure 9.

4.1.2. Naive Variational Message Passing

As a corollary of Theorem 2, we can consider the special case of a naive mean-field factorization, which is defined for node b as
$$q_b(s_b) = \prod_{i \in E(b)} q_i(s_i). \tag{41}$$
The naive mean-field constraint (41) transforms the local free energy into
$$F[q_b, f_b] = F[\{q_i\}, f_b] = \sum_{i \in E(b)} \int q_i(s_i) \log q_i(s_i)\, \mathrm{d}s_i - \int \prod_{i \in E(b)} q_i(s_i) \log f_b(s_b)\, \mathrm{d}s_b.$$
Corollary 1.
Naive Variational Message Passing: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ with a naive mean-field factorization $l(b) = \{\{i\} \text{ for all } i \in E(b)\}$. Let $m \in l(b)$ be the cluster where $m = \{j\}$. Given the local polytope of (33), the local stationary solutions to (34) are given by
$$q_b^{m,*}(s_b^m) = q_j^*(s_j) = \frac{\mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)}{\int \mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)\, \mathrm{d}s_j},$$
where the messages μ j c * ( s j ) are the fixed points of the following iterations
$$\mu_{j \to c}^{(k+1)}(s_j) = \exp\left( \int \prod_{\substack{i \in E(b) \\ i \neq j}} q_i^{(k)}(s_i) \log f_b(s_b)\, \mathrm{d}s_{b \setminus j} \right), \tag{43}$$
where k is an iteration index.
Proof. 
See Appendix D.6. □
The naive mean-field factorization limits the search space of beliefs by imposing strict constraints on the variational posterior. As a result, the variational posterior also loses flexibility. To improve inference performance for sparse Bayesian learning, the authors of [40] propose a hybrid mechanism by augmenting naive mean-field VMP with sum-product updates. This hybrid scheme reduces the complexity of the sum-product algorithm, while improving the accuracy of the naive VMP approach. In [41], naive VMP is applied to semi-parametric regression and allows for scaling of regression models to large data sets.
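To make the naive update concrete, consider a single Gaussian node $f_b(x, \gamma) = \mathcal{N}(x \mid m_0, \gamma^{-1})$ with known mean $m_0$ and current marginal $q(x) = \mathcal{N}(m, v)$. The message of Corollary 1 toward the precision edge is $\exp\!\big(\mathrm{E}_{q(x)}[\log f_b]\big) \propto \gamma^{1/2} \exp\!\big(-\tfrac{\gamma}{2}[(m - m_0)^2 + v]\big)$, i.e., a Gamma-shaped message. The sketch below (our own, hedged illustration) simply evaluates its shape and rate parameters:

```julia
# Naive VMP message from f_b(x, γ) = N(x | m₀, γ⁻¹) toward its precision edge γ,
# given the current marginal q(x) = N(m, v). The message is Gamma(γ | 3/2, rate)
# with rate = ((m - m₀)² + v) / 2 (shape-rate parameterization).
function vmp_message_precision(m, v, m0)
    c = (m - m0)^2 + v
    return (shape = 3 / 2, rate = c / 2)
end

msg = vmp_message_precision(0.8, 0.3, 0.0)   # q(x) = N(0.8, 0.3), m₀ = 0
```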
Example 3.
As a follow-up on Example 2, we relaxed the constraints (39a) and (39b) to the following constraints presented in Corollary 1:
$$\begin{aligned} &\int q(x_t, z_t, Q_t)\, \mathrm{d}x_t\, \mathrm{d}z_t = q(Q_t), \quad \int q(x_t, z_t, Q_t)\, \mathrm{d}Q_t = q(x_t, z_t) = q(x_t)\, q(z_t) \\ &\int q(Q_t)\, \mathrm{d}Q_t = 1, \quad \int q(x_t)\, \mathrm{d}x_t = 1, \quad \int q(z_t)\, \mathrm{d}z_t = 1. \end{aligned} \tag{44}$$
The FFG remains the same and we use identical data constraints as in Example 2. Together with constraint (44), we obtained a hybrid of the naive variational message passing and sum-product message passing algorithms, where the messages around the factor $h_t$ are computed by (43) and the rest of the messages by the sum-product rule (22). Using the same data as in Example 1, the results for naive VMP are given in Figure 10, along with the evaluated Bethe free energy.

4.2. Form Constraints

Form constraints limit the functional form of the variational factors q a ( s a ) and q i ( s i ) . One of the most widely used form constraints, the data constraint, is also illustrated in Appendix A.

4.2.1. Data Constraints

A data constraint can be viewed as a special case of (9b), where the belief q j is constrained to be a Dirac-delta function [42], such that
$$\int q_a(s_a)\, \mathrm{d}s_{a \setminus j} = q_j(s_j) = \delta(s_j - \hat{s}_j), \tag{45}$$
where s ^ j is a known value, e.g., an observation.
Lemma 4.
Given a TFFG $G = (V, E)$, consider the node-induced subgraph $G(b)$ (Figure 3). Then local stationary solutions to the Lagrangian
$$\begin{aligned} L_b[q_b, f_b] = F[q_b, f_b] &+ \psi_b \left[ \int q_b(s_b)\, \mathrm{d}s_b - 1 \right] + \sum_{\substack{i \in E(b) \\ i \neq j}} \int \lambda_{ib}(s_i) \left[ q_i(s_i) - \int q_b(s_b)\, \mathrm{d}s_{b \setminus i} \right] \mathrm{d}s_i \\ &+ \int \lambda_{jb}(s_j) \left[ \delta(s_j - \hat{s}_j) - \int q_b(s_b)\, \mathrm{d}s_{b \setminus j} \right] \mathrm{d}s_j + C_b, \end{aligned}$$
where $C_b$ collects all terms that are independent of $q_b$, are of the form
$$q_b(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)\, \mathrm{d}s_b}.$$
Proof. 
See Appendix D.7. □
Theorem 3.
Data-Constrained Sum-Product: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 11). Given the local polytope
$$\mathcal{L}(G(b, j)) = \left\{ q_b \text{ s.t. (45)} \right\},$$
the local stationary solutions to
$$q_b^* = \arg\min_{\mathcal{L}(G(b, j))} F[q, f],$$
are of the form
$$q_b^*(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)\, \mathrm{d}s_b},$$
with message
$$\mu_{j \to b}^*(s_j) = \delta(s_j - \hat{s}_j).$$
Proof. 
See Appendix D.8. □
Note that the resulting message $\mu_{j \to b}^*(s_j)$ to node $b$ does not depend on messages from node $c$, as would be the case for a sum-product update. By the symmetry of Theorem 3 for the subgraph $\mathcal{L}(G(c, j))$, (A32) identifies
$$\mu_{c \to j}(s_j) = \int f_c(s_c) \prod_{\substack{i \in E(c) \\ i \neq j}} \mu_{i \to c}(s_i)\, \mathrm{d}s_{c \setminus j} \neq \delta(s_j - \hat{s}_j).$$
This implies that messages incoming to a data constraint (such as $\mu_{c \to j}$) are not further propagated through the data constraint. The data constraint thus effectively introduces a conditional independence between the variables of neighboring factors (conditioned on the shared constrained variable). Interestingly, this is similar to the notion of an intervention [43], where a decision variable is externally forced to a realization.
Data constraints allow information from data sets to be absorbed into the model. Essentially, (variational) Bayesian machine learning is an application of inference in a graph with data constraints. In our framework, data are a constraint, and machine learning via Bayes rule follows naturally from the minimization of the Bethe free energy (see also Appendix A).

4.2.2. Laplace Propagation

A second type of form constraint we consider is the Laplace constraint; see also [14]. Consider a second-order Taylor approximation of the local log-node function
$$L_a(s_a) = \log f_a(s_a),$$
around an approximation point $\hat{s}_a$, as
$$\tilde{L}_a(s_a; \hat{s}_a) = L_a(\hat{s}_a) + \nabla L_a(\hat{s}_a)^\top (s_a - \hat{s}_a) + \frac{1}{2} (s_a - \hat{s}_a)^\top \nabla^2 L_a(\hat{s}_a)\, (s_a - \hat{s}_a).$$
From this approximation, we define the Laplace-approximated node function as
$$\tilde{f}_a(s_a; \hat{s}_a) \triangleq \exp\left( \tilde{L}_a(s_a; \hat{s}_a) \right), \tag{53}$$
which is substituted in the local free energy to obtain the Laplace-encoded local free energy as
$$F[q_a, \tilde{f}_a; \hat{s}_a] = \int q_a(s_a) \log \frac{q_a(s_a)}{\tilde{f}_a(s_a; \hat{s}_a)}\, \mathrm{d}s_a. \tag{54}$$
It follows that the Laplace-encoded optimization of the local free energy becomes
$$q_a^* = \arg\min_{q_a} L_a[q_a, \tilde{f}_a; \hat{s}_a], \tag{55}$$
where the Lagrangian L a imposes the marginalization and normalization constraints of (9) on (54).
Lemma 5.
Given a TFFG $G = (V, E)$, consider the node-induced subgraph $G(b)$ (Figure 12). The stationary points of the Laplace-approximated Lagrangian (55) as a functional of $q_b$,
$$L_b[q_b, \tilde{f}_b; \hat{s}_b] = F[q_b, \tilde{f}_b; \hat{s}_b] + \psi_b \left[ \int q_b(s_b)\, \mathrm{d}s_b - 1 \right] + \sum_{i \in E(b)} \int \lambda_{ib}(s_i) \left[ q_i(s_i) - \int q_b(s_b)\, \mathrm{d}s_{b \setminus i} \right] \mathrm{d}s_i + C_b,$$
where $C_b$ collects all terms that are independent of $q_b$, are of the form
$$q_b(s_b) = \frac{\tilde{f}_b(s_b; \hat{s}_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)}{\int \tilde{f}_b(s_b; \hat{s}_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)\, \mathrm{d}s_b}.$$
Proof. 
See Appendix D.9. □
We can now formulate Laplace propagation as an iterative procedure, where the approximation point s ^ b is chosen as the mode of the belief q b ( s b ) .
Theorem 4.
Laplace Propagation: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 13) with the Laplace-encoded factor $\tilde{f}_b$ as per (53). We write the model (1), with the Laplace-encoded factor $\tilde{f}_b$ substituted for $f_b$, as $\tilde{f}$. Given the local polytope $\mathcal{L}(G(b, j))$ of (14), the local stationary solutions to
$$\{q_b^*, q_j^*\} = \arg\min_{\mathcal{L}(G(b, j))} F[q, \tilde{f}; \hat{s}_b],$$
are given by
$$q_b^*(s_b) = \frac{\tilde{f}_b(s_b; \hat{s}_b^*) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)}{\int \tilde{f}_b(s_b; \hat{s}_b^*) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)\, \mathrm{d}s_b}, \qquad q_j^*(s_j) = \frac{\mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)}{\int \mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)\, \mathrm{d}s_j},$$
with $\hat{s}_b^*$ and the messages $\mu_{j \to c}^*(s_j)$ the fixed points of
$$\begin{aligned} \hat{s}_b^{(k)} &= \arg\max_{s_b} \log q_b^{(k)}(s_b) \\ q_b^{(k+1)}(s_b) &= \frac{\tilde{f}_b(s_b; \hat{s}_b^{(k)}) \prod_{i \in E(b)} \mu_{i \to b}^{(k)}(s_i)}{\int \tilde{f}_b(s_b; \hat{s}_b^{(k)}) \prod_{i \in E(b)} \mu_{i \to b}^{(k)}(s_i)\, \mathrm{d}s_b} \\ \mu_{j \to c}^{(k+1)}(s_j) &= \int \tilde{f}_b(s_b; \hat{s}_b^{(k)}) \prod_{\substack{i \in E(b) \\ i \neq j}} \mu_{i \to b}^{(k)}(s_i)\, \mathrm{d}s_{b \setminus j}. \end{aligned}$$
Proof. 
See Appendix D.10. □
Laplace propagation was introduced in [14] as an algorithm that propagates mean and variance information when exact updates are expensive to compute. It has found applications in the context of Gaussian processes and support vector machines [14]. In the jointly normal case, Laplace propagation coincides with sum-product and expectation propagation [14,18].
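The local Laplace step itself is straightforward to sketch. The following is a hedged, one-dimensional illustration of the general idea (not the paper's implementation): find the mode of a log-factor by Newton's method and return the Gaussian defined by the curvature at the mode.

```julia
# Laplace approximation of a log-factor L_b(s) = log f_b(s): locate the mode by
# Newton's method and use the negative inverse curvature as the variance.
function laplace_approximation(dL, d2L, s0; iters=50)
    s_hat = s0
    for _ in 1:iters
        s_hat -= dL(s_hat) / d2L(s_hat)     # Newton update toward the mode
    end
    return s_hat, -1 / d2L(s_hat)           # (mode, variance)
end

# Example: f_b(s) ∝ exp(s*y - exp(s)), a Poisson log-likelihood term with y = 3
y = 3
dL(s)  = y - exp(s)
d2L(s) = -exp(s)
s_hat, v_hat = laplace_approximation(dL, d2L, 0.0)   # s_hat = log(3), v_hat = 1/3
```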

4.2.3. Expectation Propagation

Expectation propagation can be derived in terms of constraint manipulation by relaxing the marginalization constraints to expectation constraints. Expectation constraints are of the form
$$\int q_a(s_a)\, T_i(s_i)\, \mathrm{d}s_a = \int q_i(s_i)\, T_i(s_i)\, \mathrm{d}s_i, \tag{59}$$
for a given function (statistic) $T_i(s_i)$. Technically, the statistic $T_i(s_i)$ can be chosen arbitrarily. Nevertheless, it is often chosen as the sufficient statistics of an exponential family distribution. An exponential family distribution is defined by
$$q_i(s_i) = h(s_i) \exp\left( \eta_i^\top T_i(s_i) - \log Z(\eta_i) \right),$$
where $\eta_i$ is the natural parameter, $Z(\eta_i)$ is the partition function, $T_i(s_i)$ is the sufficient statistics and $h(s_i)$ is a base measure [24]. The statistic $T_i(s_i)$ is called sufficient because, given observed values of the random variable $s_i$, the parameter $\eta_i$ can be estimated using only the statistics $T_i(s_i)$; that is, the estimator of $\eta_i$ depends on the data only through $T_i(s_i)$.
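For example (an added, standard illustration), a univariate Gaussian belief can be written in this exponential family form with sufficient statistics $T_i(s_i) = [s_i, s_i^2]^\top$:
$$\mathcal{N}(s_i \mid \mu, \sigma^2) = \exp\left( \begin{bmatrix} \mu/\sigma^2 \\ -1/(2\sigma^2) \end{bmatrix}^\top \begin{bmatrix} s_i \\ s_i^2 \end{bmatrix} - \log Z(\eta_i) \right), \qquad \log Z(\eta_i) = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2),$$
with base measure $h(s_i) = 1$. Matching the first two moments $\mathrm{E}[s_i]$ and $\mathrm{E}[s_i^2]$ therefore determines $\eta_i$ uniquely, which is what the moment-matching step of expectation propagation (Theorem 5 below) exploits.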
The idea behind expectation propagation [18] is to relax the marginalization constraints with moment-matching constraints by choosing sufficient statistics from exponential family distributions [12]. Relaxation allows approximating the marginals of the sum-product algorithm with exponential family distributions. By keeping the marginals within the exponential family, the complexity of the resulting computations is reduced.
Lemma 6.
Given a TFFG G = ( V , E ) , consider the node-induced subgraph G ( b ) (Figure 3). The stationary points of the Lagrangian
$$\begin{aligned} L_b[q_b, f_b] = F[q_b, f_b] &+ \psi_b \left[ \int q_b(s_b)\, \mathrm{d}s_b - 1 \right] + \sum_{\substack{i \in E(b) \\ i \neq j}} \int \lambda_{ib}(s_i) \left[ q_i(s_i) - \int q_b(s_b)\, \mathrm{d}s_{b \setminus i} \right] \mathrm{d}s_i \\ &+ \eta_{jb}^\top \left[ \int q_j(s_j)\, T_j(s_j)\, \mathrm{d}s_j - \int q_b(s_b)\, T_j(s_j)\, \mathrm{d}s_b \right] + C_b, \end{aligned}$$
with sufficient statistics T j , and where C b collects all terms that are independent of q b , are of the form
$$q_b(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}(s_i)\, \mathrm{d}s_b},$$
with incoming exponential family message
$$\mu_{j \to b}(s_j) = \exp\left( \eta_{jb}^\top T_j(s_j) \right).$$
Proof. 
See Appendix D.11. □
Lemma 7.
Given a TFFG G = ( V , E ) , consider an edge-induced subgraph G ( j ) (Figure 4). The stationary solutions of the Lagrangian
$$L_j[q_j] = H[q_j] + \psi_j \left[ \int q_j(s_j)\, \mathrm{d}s_j - 1 \right] + \sum_{a \in V(j)} \eta_{ja}^\top \left[ \int q_j(s_j)\, T_j(s_j)\, \mathrm{d}s_j - \int q_a(s_a)\, T_j(s_j)\, \mathrm{d}s_a \right] + C_j,$$
with sufficient statistics T j ( s j ) , and where C j collects all terms that are independent of q j , are of the form
$$q_j(s_j) = \frac{\exp\left( [\eta_{jb} + \eta_{jc}]^\top T_j(s_j) \right)}{\int \exp\left( [\eta_{jb} + \eta_{jc}]^\top T_j(s_j) \right) \mathrm{d}s_j}.$$
Proof. 
See Appendix D.12. □
Theorem 5.
Expectation Propagation: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 5). Given the local polytope
$$\mathcal{L}(G(b, j)) = \left\{ q_b \text{ s.t. (9a)}, \text{ and } q_j \text{ s.t. (59) and (10)} \right\},$$
and $\mu_{j \to b}(s_j) = \exp\left( \eta_{jb}^\top T_j(s_j) \right)$ an exponential family message (from Lemma 6). Then, the local stationary solutions to (15) are given by
$$q_b^*(s_b) = \frac{f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)}{\int f_b(s_b) \prod_{i \in E(b)} \mu_{i \to b}^*(s_i)\, \mathrm{d}s_b}$$
$$q_j^*(s_j) = \frac{\exp\left( [\eta_{jb}^* + \eta_{jc}^*]^\top T_j(s_j) \right)}{\int \exp\left( [\eta_{jb}^* + \eta_{jc}^*]^\top T_j(s_j) \right) \mathrm{d}s_j},$$
with $\eta_{jb}^*$, $\eta_{jc}^*$ and $\mu_{j \to c}^*(s_j)$ being the fixed points of the iterations
$$\tilde{\mu}_{j \to c}^{(k)}(s_j) = \int f_b(s_b) \prod_{\substack{i \in E(b) \\ i \neq j}} \mu_{i \to b}^{(k)}(s_i)\, \mathrm{d}s_{b \setminus j}, \qquad \tilde{q}_j^{(k)}(s_j) = \frac{\mu_{j \to b}^{(k)}(s_j)\, \tilde{\mu}_{j \to c}^{(k)}(s_j)}{\int \mu_{j \to b}^{(k)}(s_j)\, \tilde{\mu}_{j \to c}^{(k)}(s_j)\, \mathrm{d}s_j}.$$
By moment-matching on $\tilde{q}_j^{(k)}(s_j)$, we obtain the natural parameter $\tilde{\eta}_j^{(k)}$. The message update then follows from
$$\eta_{jc}^{(k)} = \tilde{\eta}_j^{(k)} - \eta_{jb}^{(k)}, \qquad \mu_{j \to c}^{(k+1)}(s_j) = \exp\left( T_j(s_j)^\top \eta_{jc}^{(k)} \right).$$
Proof. 
See Appendix D.13. □
Moment-matching can be performed by solving [24] (Proposition 3.1)
$$\nabla_{\eta_j} \log Z_j(\eta_j) = \int \tilde{q}_j(s_j)\, T_j(s_j)\, \mathrm{d}s_j$$
for η j , where
$$Z_j(\eta_j) = \int \exp\left( \eta_j^\top T_j(s_j) \right) \mathrm{d}s_j.$$
In practice, for a Gaussian approximation, the natural parameters can be obtained by converting the matched mean and variance of $\tilde{q}_j(s_j)$ to the canonical form [18]. Computing the moments of $\tilde{q}_j(s_j)$ is often challenging due to the lack of closed-form solutions for the normalization constant. In order to address the computation of moments in EP, ref. [44] proposes to evaluate challenging moments by quadrature methods. For multivariate random variables, moment-matching by spherical radial cubature is advantageous, as it reduces the computational complexity [45]. Another popular way of evaluating the moments is through importance sampling [46] (Ch. 7) and [47].
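The following sketch (our own illustration with a logistic site; not code from the paper) performs one such moment-matching step numerically, integrating the tilted belief on a grid and recovering the outgoing message's natural parameters by subtraction, as in Theorem 5:

```julia
# One EP update with Gaussian statistics T(s) = [s, s²]: the incoming message
# μ_{j→b}(s) is N(m_cav, v_cav), the other branch contributes a logistic site
# σ(s), and the tilted belief q̃(s) ∝ N(s | m_cav, v_cav) σ(s) is integrated on a grid.
gauss(s, m, v) = exp(-(s - m)^2 / (2v)) / sqrt(2π * v)
logistic(s) = 1 / (1 + exp(-s))

function ep_site_update(m_cav, v_cav; grid=range(-20, 20, length=20001))
    w = step(grid)
    p = [gauss(s, m_cav, v_cav) * logistic(s) for s in grid]   # unnormalized q̃
    Z = sum(p) * w
    m_t = sum(p .* grid) * w / Z                               # matched mean
    v_t = sum(p .* (grid .- m_t).^2) * w / Z                   # matched variance
    # natural parameters η = (m/v, -1/(2v)); site message = tilted minus incoming
    η_tilted = (m_t / v_t, -1 / (2v_t))
    η_cavity = (m_cav / v_cav, -1 / (2v_cav))
    return η_tilted .- η_cavity
end

η_site = ep_site_update(0.0, 4.0)    # natural parameters of the outgoing message
```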
Expectation propagation has been utilized in various applications ranging from time series estimation with Gaussian processes [48] to Bayesian learning with stochastic natural gradients [49]. When the likelihood functions for Gaussian process classification are not Gaussian, EP is often utilized [50] (Chapter 3). In [51], a message passing-based expectation propagation algorithm is developed for models that involve both continuous and discrete random variables. Perhaps the most practical applications of EP are in the context of probabilistic programming [52], where it is heavily used in real-world applications.

4.3. Hybrid Constraints

In this section, we consider hybrid methods that combine factorization and form constraints, and formalize some well-known algorithms in terms of message passing.

4.3.1. Mean-Field Variational Laplace

Mean-field variational Laplace applies the mean-field factorization to the Laplace-approximated factor function. The appeal of this method is that all messages outbound from the Laplace-approximated factor can be represented by Gaussians.
Theorem 6.
Mean-field variational Laplace: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 13) with the Laplace-encoded factor $\tilde{f}_b$ as per (53). We write the model (1), with the Laplace-encoded factor $\tilde{f}_b$ substituted for $f_b$, as $\tilde{f}$. Furthermore, assume a naive mean-field factorization $l(b) = \{\{i\} \text{ for all } i \in E(b)\}$. Let $m \in l(b)$ be the cluster where $m = \{j\}$. Given the local polytope of (33), the local stationary solutions to
$$\{q_b^{m,*}, q_j^*\} = \arg\min_{\mathcal{L}(G(b, j))} F[q, \tilde{f}; \hat{s}_b],$$
are given by
$$q_b^{m,*}(s_b^m) = q_j^*(s_j) = \frac{\mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)}{\int \mu_{j \to b}^*(s_j)\, \mu_{j \to c}^*(s_j)\, \mathrm{d}s_j},$$
where $\mu_{j \to c}^*$ represents the fixed points of the following iterations
$$\mu_{j \to c}^{(k+1)}(s_j) = \exp\left( \int \prod_{\substack{i \in E(b) \\ i \neq j}} q_i^{(k)}(s_i) \log \tilde{f}_b(s_b; \hat{s}_b^{(k)})\, \mathrm{d}s_{b \setminus j} \right),$$
with
$$\hat{s}_b^{(k)} = \arg\max_{s_b} \log q_b^{(k)}(s_b).$$
Proof. 
See Appendix D.14. □
Conveniently, under these constraints, every outbound message from node b will be proportional to a Gaussian. Substituting the Laplace-approximated factor function, we obtain:
$$\log \mu_{j \to c}^{(k)}(s_j) = \int \prod_{\substack{i \in E(b) \\ i \neq j}} q_i^{(k)}(s_i)\, \tilde{L}_b(s_b; \hat{s}_b^{(k)})\, \mathrm{d}s_{b \setminus j} + C.$$
Resolving this expectation yields a quadratic form in $s_j$, which after completing the square leads to a message $\mu_{j \to c}(s_j)$ that is proportional to a Gaussian. This argument holds for any edge adjacent to $b$, and therefore for all outbound messages from node $b$. Moreover, if the incoming messages are represented by Gaussians as well (e.g., because these are also computed under the mean-field variational Laplace constraint), then all beliefs on the edges adjacent to $b$ will also be Gaussian. This significantly simplifies the procedure of computing the expectations, which illustrates the computational appeal of mean-field variational Laplace.
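Concretely (a small added clarification), because $\tilde{L}_b$ is a quadratic form in $s_b$, taking its expectation over all variables except $s_j$ leaves a quadratic in $s_j$,
$$\log \mu_{j \to c}^{(k)}(s_j) = -\tfrac{1}{2} s_j^\top W_j\, s_j + s_j^\top w_j + \text{const},$$
for some $W_j$ and $w_j$ that depend on the moments of the beliefs $q_i^{(k)}$, so that $\mu_{j \to c}^{(k)}(s_j) \propto \mathcal{N}(s_j \mid W_j^{-1} w_j,\, W_j^{-1})$ whenever $W_j$ is positive definite.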
Mean-field variational Laplace is widely used in dynamic causal modeling [53] and more generally in cognitive neuroscience, partly because the resulting computations are deemed neurologically plausible [54,55,56].

4.3.2. Expectation Maximization

Expectation Maximization (EM) can be viewed as a hybrid algorithm that combines a structured variational factorization with a Dirac-delta constraint, where the constrained value itself is optimized. Given a structured mean-field factorization $l(a) \subseteq P(a)$, with a single-edge cluster $m = \{j\}$, expectation maximization considers local factorizations of the form
$$q_a(s_a) = \delta(s_j - \theta_j) \prod_{\substack{n \in l(a) \\ n \neq m}} q_a^n(s_a^n), \tag{69}$$
where the belief for $s_j$ is constrained by a Dirac-delta distribution, similar to Section 4.2.1. In (69), however, the variable $s_j$ represents a random variable with (unknown) value $\theta_j \in \mathbb{R}^d$, where $d$ is the dimension of the random variable $s_j$. We explicitly use the notation $\theta_j$ (as opposed to $\hat{s}_j$ for the data constraint in Section 4.2.1) to clarify that this value is a parameter for the constrained belief over $s_j$ that will be optimized—that is, $\theta_j$ does not represent a model parameter in itself. To make this distinction even more explicit, in the context of optimization, we will refer to Dirac-delta constraints as point-mass constraints.
The factor-local free energy F [ q a , f a ; θ j ] then becomes a function of the θ j parameter.
Theorem 7.
Expectation maximization: Given a TFFG $G = (V, E)$, consider the induced subgraph $G(b, j)$ (Figure 14) with a structured mean-field factorization $l(b) \subseteq P(b)$, with local clusters $n \in l(b)$. Let $m \in l(b)$ be the cluster where $m = \{j\}$. Given the local polytope
$$\mathcal{L}(G(b, j)) = \left\{ q_b^n \text{ for all } n \in l(b) \text{ s.t. (29a)} \right\},$$
the local stationary solutions to
$$\theta_j^* = \arg\min_{\mathcal{L}(G(b, j))} F[q, f; \theta_j],$$
are given by the fixed points of
$$\mu_{b \to j}^{(k+1)}(s_j) = \exp\left( \int \prod_{\substack{n \in l(b) \\ n \neq m}} q_b^{n,(k)}(s_b^n) \log f_b(s_b)\, \mathrm{d}s_{b \setminus j} \right)$$
$$\theta_j^{(k+1)} = \arg\max_{s_j} \left[ \log \mu_{b \to j}^{(k+1)}(s_j) + \log \mu_{c \to j}^{(k+1)}(s_j) \right].$$
Proof. 
See Appendix D.15. □
Expectation maximization was formulated in [57] as an iterative method that optimizes expectations of log-likelihood functions, where each EM iteration is guaranteed to increase the expected log-likelihood. Moreover, under some differentiability conditions, the EM algorithm is guaranteed to converge [57] (Theorem 3). A detailed overview of EM for exponential families is available in [24] (Ch. 6). A formulation of EM in terms of message passing is given by [58], where message passing for EM is applied in a filtering and system identification context. In [58], derivations are based on [57] (Theorem 1), whereas our derivations directly follow from variational principles.
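As a standalone reminder of the E-step/M-step alternation that Theorem 7 expresses in message passing form (this classic example is added for illustration and is not one of the paper's experiments), consider estimating the two means of a one-dimensional Gaussian mixture with known unit variances and equal weights:

```julia
# Classic EM for a two-component Gaussian mixture with unit variances and equal
# weights: the E-step computes responsibilities (expected sufficient statistics),
# and the M-step maximizes the expected log-likelihood over the point estimates θ.
gausspdf(x, m) = exp(-(x - m)^2 / 2) / sqrt(2π)

function em_gmm(xs; iters=100)
    θ = (-1.0, 1.0)                           # initial component means
    for _ in 1:iters
        r = [gausspdf(x, θ[1]) / (gausspdf(x, θ[1]) + gausspdf(x, θ[2])) for x in xs]
        θ = (sum(r .* xs) / sum(r),           # M-step: weighted means
             sum((1 .- r) .* xs) / sum(1 .- r))
    end
    return θ
end

xs = vcat(2.0 .+ randn(200), -3.0 .+ randn(200))   # synthetic observations
θ_hat = em_gmm(xs)                                 # ≈ (-3.0, 2.0) or (2.0, -3.0)
```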
Example 4.
Now suppose we do not know the angle $\theta$ for the state transition matrix $A_t$ in Example 2 and would like to estimate the value of $\theta$. Moreover, suppose that we are interested in estimating the hyper-parameters for the prior $m_{x_0}$ and $V_{x_0}$, as well as the precision matrix for the state transitions $Q_t$. For this purpose, we changed the constraints of (25a) into EM constraints in accordance with Theorem 7:
$$q(x_{t-1}, z_t, A_t(\theta)) = \delta(A_t(\theta) - A_t(\hat{\theta}))\, q(z_t \mid x_{t-1}, A_t(\theta))\, q(x_{t-1})$$
$$q(x_0, m_{x_0}, V_{x_0}) = q(x_0)\, \delta(m_{x_0} - \hat{m}_{x_0})\, \delta(V_{x_0} - \hat{V}_{x_0}),$$
where we optimize $\hat{\theta}$, $\hat{V}_{x_0}$ and $\hat{m}_{x_0}$ with EM ($\hat{V}_{x_0}$ is further constrained to be positive definite during the optimization procedure). With the addition of the new EM constraints, the resulting FFG is given in Figure 15. The hybrid message passing algorithm consists of structured variational messages around the factor $h_t$, sum-product messages around $w_t$, $n_t$, $m_t$ and $r_t$, and EM messages around $g_0$ and $g_t$. We used identical observations as in the previous examples. The results for the hybrid SVMP-EM-SP algorithm are given in Figure 16, along with the evaluated Bethe free energy over all iterations.

4.4. Overview of Message Passing Algorithms

In Section 4.1, Section 4.2 and Section 4.3, following a high-level recipe pioneered by [15], we presented first-principle derivations of some of the popular message passing-based inference algorithms by manipulating the local constraints of the Bethe free energy. The results are summarized in Table 1.
Crucially, the method of constrained BFE minimization goes beyond the reviewed algorithms. Through creating a new set of local constraints and following similar derivations based on variational calculus, one can obtain new message passing-based inference algorithms that better match the specifics of the generative model or application.

5. Scoring Models by Minimized Variational Free Energy

As discussed in Section 2.2, the variational free energy is an important measure of model performance. In Section 5.1 and Section 5.2, we discuss some problems that occur when evaluating the BFE on a TFFG. In Section 5.3, we propose an algorithm that evaluates the constrained BFE as a summation of local contributions on the TFFG.

5.1. Evaluation of the Entropy of Dirac-Delta Constrained Beliefs

For continuous variables, data and point-mass constraints, as discussed in Section 4.2.1 and Section 4.3.2 and Appendix A, collapse the information density to infinity, which leads to singularities in entropy evaluation [59]. More specifically, for a continuous variable $s_j$, the entropies for beliefs of the form $q_j(s_j) = \delta(s_j - \hat{s}_j)$ and $q_a(s_a) = q_{a|j}(s_{a \setminus j} \mid s_j)\, \delta(s_j - \hat{s}_j)$ both evaluate to $-\infty$.
In variational inference, it is common to define the VFE only with respect to the latent (unobserved) variables [2] (Section 10.1). In contrast, in this paper, we explicitly define the BFE in terms of an iteration over all nodes and edges (7), which also includes non-latent beliefs in the BFE definition. Therefore, we define
$$q_j(s_j) = \delta(s_j - \hat{s}_j) \implies H[q_j] \triangleq 0, \qquad q_a(s_a) = q_{a|j}(s_{a \setminus j} \mid s_j)\, \delta(s_j - \hat{s}_j) \implies H[q_a] \triangleq H[q_{a \setminus j}],$$
where $q_{a|j}(s_{a \setminus j} \mid s_j)$ indicates the conditional belief and $q_{a \setminus j}(s_{a \setminus j})$ is the joint belief. These definitions effectively remove the entropies for observed variables from the BFE evaluation. Note that although $q_{a \setminus j}(s_{a \setminus j})$ is technically not a part of our belief set (7), it can be obtained by marginalization of $q_a(s_a)$ (9b).

5.2. Evaluation of Node-Local Free Energy for Deterministic Nodes

Another difficulty arises with the evaluation of the node-local free energy F [ q a ] for factors of the form
$$f_a(s_a) = \delta(h_a(s_a)).$$
This type of node function reflects deterministic operations; e.g., $h(x, y, z) = z - x - y$ corresponds to the summation $z = x + y$. In this case, directly evaluating $F[q_a]$ again leads to singularities.
There are (at least) two strategies available in the literature that resolve this issue. The first strategy “softens” the Dirac-delta by re-defining:
$$f_a(s_a) \approx \frac{1}{\sqrt{2\pi\epsilon}} \exp\left( -\frac{1}{2\epsilon} h_a(s_a)^2 \right),$$
with $0 < \epsilon \ll 1$ [17]. A drawback of this approach is that it may alter the model definition in a numerically unstable way, leading to a different inference solution and variational free energy than originally intended.
The second strategy combines the deterministic factor f a with a neighboring stochastic factor f b into a new composite factor f c , by marginalizing over a shared variable s j , leading to [60]
$$f_c(s_c) \triangleq \int \delta(h_a(s_a))\, f_b(s_b)\, \mathrm{d}s_j,$$
where $s_c = \{s_a \cup s_b\} \setminus s_j$. This procedure has drawbacks for models that involve many deterministic factors—namely, the convenient model modularity and resulting distributed compatibility are lost when large groups of factors are compacted into model-specific composite factors. We propose here a third strategy.
Theorem 8.
Let $f_a(s_a) = \delta(h_a(s_a))$, with $h_a(s_a) = s_j - g_a(s_{a \setminus j})$, and node-local belief $q_a(s_a) = q_{j|a}(s_j \mid s_{a \setminus j})\, q_{a \setminus j}(s_{a \setminus j})$. Then, the node-local free energy evaluates to
$$F[q_a, f_a] = \begin{cases} -H[q_{a \setminus j}] & \text{if } q_{j|a}(s_j \mid s_{a \setminus j}) = \delta(s_j - g_a(s_{a \setminus j})) \\ \infty & \text{otherwise.} \end{cases}$$
Proof. 
See Appendix D.16. □
An example that evaluates the node-local free energy for a non-trivial deterministic node can be found in Appendix C.
The equality node is a special case of a deterministic node, with a node function of the form (3). The argument of Theorem 8 does not directly apply to this node. As the equality node function comprises two Dirac-delta functions, it cannot be written in the form of Theorem 8. However, we can still reduce the node-local free energy contribution.
Theorem 9.
Let $f_a(s_a) = \delta(s_j - s_i)\, \delta(s_j - s_k)$, with node-local belief $q_a(s_a) = q_{ik|j}(s_i, s_k \mid s_j)\, q_j(s_j)$. Then, the node-local free energy evaluates to
$$F[q_a, f_a] = \begin{cases} -H[q_j] & \text{if } q_{ik|j}(s_i, s_k \mid s_j) = \delta(s_j - s_i)\, \delta(s_j - s_k) \\ \infty & \text{otherwise.} \end{cases}$$
Proof. 
See Appendix D.17. □

5.3. Evaluating the Variational Free Energy

We propose here an algorithm that evaluates the BFE on a TFFG representation of a factorized model. The algorithm is based on the following results:
  • The definitions for the computation of data-constrained entropies ensure that only variables with associated stochastic beliefs count towards the Bethe entropy. This makes the BFE evaluation consistent with Theorems 3 and 7, where the single-variable beliefs for observed variables are excluded from the BFE definition;
  • We assume that a local mean-field factorization l ( a ) is available for each a V (Section 4.1). If the mean-field factorization is not explicitly defined, we assume l ( a ) = { a } is the unfactored set;
  • Deterministic nodes are accounted for by Theorem 8, which reduces the joint entropy to an entropy over the “inbound” edges. Although the belief over the “inbounds” q a j ( s a j ) is not a term in the Bethe factorization (8), it can simply be obtained by marginalization of q a ( s a ) ;
  • The equality node is a special case, where we let the node entropy discount the degree of the associated variable in the original model definition. While the BFE definition on a TFFG (7) does not explicitly account for edge degrees, this mechanism implicitly corrects for “double-counting” [17]. In this case, edge selection for counting is arbitrary, because all associated edges are (by definition) constrained to share the same belief (Section 2.1, Theorem 9).
The decomposition of (7) shows that the BFE can be computed by an iteration over the nodes and edges of the graph. As some contributions to the BFE might cancel each other, the algorithm first tracks counting numbers $u_a$ for the average energies
$U_a[q_a] = -\int q_a(s_a)\log f_a(s_a)\,\mathrm{d}s_a,$
and counting numbers $h_k$ for the (joint) entropies
$H[q_k] = -\int q_k(s_k)\log q_k(s_k)\,\mathrm{d}s_k,$
which are ultimately combined and evaluated. We use the index $k$ to indicate that the entropy computation may range not only over single edges but over a generic set of variables; the set that $k$ belongs to is defined in Algorithm 1.
Algorithm 1 Evaluation of the Bethe free energy on a Terminated Forney-style factor graph.
  • given a TFFG $G = (V, E)$
  • given a local mean-field factorization $l(a)$ for all $a \in V$
  • define $q_j(s_j) = \delta(s_j - \hat{s}_j) \Rightarrow H[q_j] \triangleq 0$            ▹ Ignore entropy of Dirac-delta constrained beliefs
  • define $q_a(s_a) = q_{a|j}(s_{a\setminus j}\,|\,s_j)\,\delta(s_j - \hat{s}_j) \Rightarrow H[q_a] \triangleq H[q_{a\setminus j}]$  ▹ Reduce entropy of Dirac-delta constrained joint beliefs
  • define $K = \{a,\ a\!\setminus\! i,\ n,\ \text{for all } a \in V,\ i \in E(a),\ n \in l(a)\}$ the set of (joint) belief indices
  • initialize counting numbers $u_a = 0$ for all $a \in V$, $h_k = 0$ for all $k \in K$
  •  
  • for all nodes $a \in V$ do
  •     if $a$ is a stochastic node then
  •         u_a += 1                         ▹ Count the average energy
  •         for all clusters $n \in l(a)$ do
  •             h_n += 1                         ▹ Count the (joint) cluster entropy
  •         end for
  •     else if $a$ is an equality node then
  •         Select an edge $j \in E(a)$
  •         h_j += 1                         ▹ Count the variable entropy
  •     else                        ▹ Deterministic node $a$
  •         Obtain the node function $f_a(s_a) = \delta(s_j - g_a(s_{a\setminus j}))$
  •         h_{a∖j} += 1                         ▹ Count the (joint) entropy of the inbounds
  •     end if
  • end for
  •  
  • for all edges $i \in E$ do
  •     h_i −= 1                         ▹ Discount the variable entropy
  • end for
  •  
  • $U = \sum_{a\in V} u_a\, U_a[q_a]$
  • $H = \sum_{k\in K} h_k\, H[q_k]$
  • return $F = U - H$
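The following sketch (plain Julia; the node representation and field names are illustrative assumptions of our own, not ForneyLab's internal data structures) shows how the counting numbers of Algorithm 1 could be tallied before the average energies and entropies are evaluated and combined into $F = U - H$:

```julia
# Counting-number bookkeeping of Algorithm 1 for a toy graph representation.
struct Node
    id::Symbol
    kind::Symbol                       # :stochastic, :equality, or :deterministic
    edges::Vector{Symbol}              # E(a), the edges connected to node a
    clusters::Vector{Vector{Symbol}}   # local mean-field factorization l(a)
end

function counting_numbers(nodes::Vector{Node}, edges::Vector{Symbol})
    u = Dict{Symbol,Int}()                 # average-energy counts u_a
    h = Dict{Vector{Symbol},Int}()         # entropy counts h_k, keyed by variable sets
    bump!(d, k, v) = (d[k] = get(d, k, 0) + v)
    for a in nodes
        if a.kind == :stochastic
            bump!(u, a.id, 1)                          # count the average energy
            foreach(n -> bump!(h, n, 1), a.clusters)   # count each (joint) cluster entropy
        elseif a.kind == :equality
            bump!(h, [first(a.edges)], 1)              # arbitrary edge selection (Theorem 9)
        else                                           # deterministic node (Theorem 8)
            bump!(h, a.edges[2:end], 1)                # assumes the first edge carries the output s_j
        end
    end
    foreach(i -> bump!(h, [i], -1), edges)             # discount every single-variable entropy once
    return u, h
end

# Example usage: one stochastic node over (:x, :y) with a single joint cluster,
# and one equality node; the final F combines these counts with U_a and H_k values.
nodes = [Node(:f_b, :stochastic, [:x, :y], [[:x, :y]]),
         Node(:eq_x, :equality, [:x, :x1, :x2], Vector{Symbol}[])]
u, h = counting_numbers(nodes, [:x, :y, :x1, :x2])
```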

6. Implementation of Algorithms and Simulations

We have developed a probabilistic programming toolbox, ForneyLab.jl, in the Julia language [61,62]. The majority of the algorithms that are reviewed in Table 1 have been implemented in ForneyLab, along with a variety of demos (https://github.com/biaslab/ForneyLab.jl/tree/master/demo, accessed on 23 June 2021). ForneyLab is extendable and supports postulating new local constraints on the BFE for the creation of custom message passing-based inference algorithms.
In order to limit the length of this paper, we refer the reader to the demonstration folder of ForneyLab and to several of our previous papers with code. For instance, our previous work in [63] implemented a mean-field variational Laplace propagation for the hierarchical Gaussian filter (HGF) [64]. In the follow-up work [65], inference results were improved by changing to structured factorization and moment-matching local constraints. In that case, the modification of local constraints created a hybrid EP-VMP algorithm that better suited the model. Moreover, in [13], we formulated the idea of chance constraints in the form of violation probabilities, leading to a new message passing algorithm that supports goal-directed behavior within the context of active inference. A similar line of reasoning led to improved inference procedures for auto-regressive models [66].
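For orientation, a minimal model-specification sketch is shown below. It assumes the ForneyLab.jl API around the version used at the time of writing (macros and functions such as @RV, placeholder, messagePassingAlgorithm and algorithmSourceCode); exact names and signatures may differ between releases, so the demo folder remains the authoritative reference.

```julia
# Sketch only: assumes the ForneyLab.jl API around v0.11 (2021); names may differ.
using ForneyLab

g = FactorGraph()

@RV x ~ GaussianMeanVariance(0.0, 100.0)   # prior on the latent variable
@RV y ~ GaussianMeanVariance(x, 1.0)       # observation model
placeholder(y, :y)                          # data constraint on y

algo = messagePassingAlgorithm(x)           # derive a sum-product schedule for x
source = algorithmSourceCode(algo)          # Julia source implementing the message updates

# Evaluating `source` defines a step!-style function that can be run on observed data,
# e.g. on Dict(:y => 1.5); see the ForneyLab demos for the exact calling convention.
```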

7. Related Work

Our work is inspired by the seminal work [17], which discusses the equivalence between the fixed points of the belief propagation algorithm [32] and the stationary points of the Bethe free energy. This equivalence is established through a Lagrangian formalism, which allows for the derivation of Generalized Belief Propagation (GBP) algorithms by introducing region-based graphs and the region-based (Kikuchi) free energy [16].
Region graph-based methods allow for overlapping clusters (Section 4.1) and thus offer a more generic message passing approach. The selection of appropriate regions (clusters), however, proves to be difficult, and the resulting algorithms may grow prohibitively complex. In this context, Ref. [67] addresses how to manipulate regions and manage the complexity of GBP algorithms. Furthermore, Ref. [68] also establishes a connection between GBP and expectation propagation (EP) by introducing structured region graphs.
The inspirational work of [15] derives message passing algorithms by minimization of α -divergences. The stationary points of α -divergences are obtained by a fixed point projection scheme. This projection scheme is reminiscent of the minimization scheme of the expectation propagation (EP) algorithm [18]. Compared to [15], our work focuses on a single divergence objective (namely, the VFE). The work of [12] derives the EP algorithm by manipulating the marginalization and factorization constraints of the Bethe free energy objective (see also Section 4.2.3). The EP algorithm is, however, not guaranteed to converge to a minimum of the associated divergence metric.
To address the convergence properties of the algorithms that are obtained by region graph methods, the outstanding work of [33] derives conditions on the region counting numbers that guarantee the convexity of the underlying objective. In general, however, the constrained Bethe free energy is not guaranteed to be convex and therefore the derived message passing updates are not guaranteed to converge.

8. Discussion

The key message in this paper is that a (variational) Bayesian model designer may tune the tractability-accuracy trade-off for evidence and posterior evaluation through constraint manipulation. It is interesting to note that the technique to derive message passing algorithms is always the same. We followed the recipe pioneered in [15] to derive a large variety of message passing algorithms solely through minimizing constrained Bethe free energy. This minimization leads to local fixed-point equations, which we can interpret as message passing updates on a (terminated) FFG. The presented lemmas showed how the constraints affect the Lagrangians locally. The presented theorems determined the stationary solutions of the Lagrangians and obtained the message passing equations. Thus, if a designer proposes a new set of constraints, then the first place to start is to analyze the effect on the Lagrangian. Once the effect of the constraint on the Lagrangian is known, then variational optimization may result in stationary solutions that can be obtained by a fixed-point iteration scheme.
In this paper, we selected the Forney-style factor graph framework to illustrate our ideas. FFGs are mathematically comparable to the more common bi-partite factor graphs that associate round nodes with variables and square nodes with factors [20]. Bi-partite factor graphs require two distinct types of message updates (one leaving variable nodes and one leaving factor nodes), while message passing on a (T)FFG requires only a single type of message update [69]. The (T)FFG paradigm thus substantially simplifies the derivations and resulting message passing update equations.
The message passing update rules in this paper are presented without guarantees on convergence of the (local) minimization process. In practice, however, algorithm convergence can be easily checked by evaluating the BFE (Algorithm 1) after each belief update.
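For example, such a check could be wrapped around any of the presented schedules as follows (plain Julia; update_beliefs! and evaluate_bfe are hypothetical stand-ins for a concrete message passing sweep and for Algorithm 1, respectively):

```julia
# Iterate belief updates until the evaluated Bethe free energy stops decreasing.
function run_until_converged!(beliefs, update_beliefs!, evaluate_bfe; tol=1e-6, maxiter=100)
    F_prev = evaluate_bfe(beliefs)            # BFE of the initial beliefs (Algorithm 1)
    for iter in 1:maxiter
        update_beliefs!(beliefs)              # one sweep of message passing updates
        F = evaluate_bfe(beliefs)
        abs(F_prev - F) < tol && return (F, iter)
        F_prev = F
    end
    return (F_prev, maxiter)
end
```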
In future work, we plan on extending the treatment of constraints to formulate sampling-based algorithms such as importance sampling and Hamiltonian Monte Carlo in a message passing framework. While introducing SVMP, we have limited the discussion to local clusters that are not overlapping. We plan to extend variational algorithms to include local clusters that are overlapping without altering the underlying free-energy objective or the graph structure [16,67].

9. Conclusions

In this paper, we formulated a message-passing approach to probabilistic inference by identifying local stationary solutions of a constrained Bethe free energy objective (Section 3 and Section 4). The proposed framework constructs a graph for the generative model and specifies local constraints for variational optimization in a local polytope. The constraints are then imposed on the variational objective by a Lagrangian construction. Unconstrained optimization of the Lagrangian leads to local expressions for the stationary points, which can be obtained by iterating the resulting fixed-point equations; we identify these iterations with message passing updates.
Furthermore, we presented an approach to evaluate the BFE on a (terminated) Forney-style factor graph (Section 5). This procedure allows an algorithm designer to readily assess the performance of algorithms and models.
We have included detailed derivations of message passing updates (Appendix D) and hope that the presented formulation inspires the discovery of novel and customized message passing algorithms.

Author Contributions

Conceptualization: İ.Ş., T.v.d.L. and B.d.V.; methodology: İ.Ş. and T.v.d.L.; formal analysis, İ.Ş. and T.v.d.L.; investigation: İ.Ş., T.v.d.L. and B.d.V.; software, İ.Ş. and D.B.; validation, İ.Ş.; resources: İ.Ş., T.v.d.L. and B.d.V.; writing—original draft preparation: İ.Ş. and T.v.d.L.; writing—review and editing: T.v.d.L., D.B. and B.d.V.; visualizations: İ.Ş., T.v.d.L. and B.d.V.; supervision: T.v.d.L. and B.d.V.; project administration, B.d.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partly financed by GN Hearing A/S.

Acknowledgments

The authors would like to thank the BIASlab team members for many very interesting discussions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BFE    Bethe Free Energy
BP     Belief Propagation
DC     Data Constraint
EM     Expectation Maximization
EP     Expectation Propagation
FFG    Forney-style Factor Graph
GBP    Generalized Belief Propagation
LP     Laplace Propagation
MFVLP  Mean-Field Variational Laplace
MFVMP  Mean-Field Variational Message Passing
NLE    Negative Log-Evidence
TFFG   Terminated Forney-style Factor Graph
VFE    Variational Free Energy
VMP    Variational Message Passing
SVMP   Structured Variational Message Passing
SP     Sum-Product

Appendix A. Free Energy Minimization by Variational Inference

In this section, we present a pedagogical example of inductive inference. After establishing this intuition, we apply the same principles to a more general context in the main body of the paper. We follow Caticha [42,70], who showed that a constrained free energy functional can be interpreted as a principled objective for inductive reasoning; see also [71,72]. The calculus of variations offers a principled method for optimizing this free energy functional.
In this section, we assume an example model
$f(y, \theta) = f_y(y, \theta)\, f_\theta(\theta),$
with observed variables y and a single parameter θ .
We define the (variational) free energy (VFE) as
$F[q, f] = \int q(y, \theta)\,\log\frac{q(y, \theta)}{f(y, \theta)}\,\mathrm{d}y\,\mathrm{d}\theta.$
The goal is to find a posterior
$q^* = \arg\min_{q \in \mathcal{Q}} F[q, f]$
that minimizes the free energy subject to some pre-specified constraints. These constraints may include form or factorization constraints on q (to be discussed later) or relate to observations of a signal y .
As an example, assume that we obtained some measurements y = y ^ and wish to obtain a posterior marginal belief q * ( θ ) over the parameter. We can then incorporate the data in the form of a data constraint
$\int q(y, \theta)\,\mathrm{d}\theta = \delta(y - \hat{y}),$
where δ defines a Dirac-delta. The constrained free energy can be rewritten by including Lagrange multipliers as
$L[q, f] = F[q, f] + \gamma\left(\int q(y, \theta)\,\mathrm{d}y\,\mathrm{d}\theta - 1\right) + \int \lambda(y)\left(\int q(y, \theta)\,\mathrm{d}\theta - \delta(y - \hat{y})\right)\mathrm{d}y,$
where the first term specifies the (to be minimized) free energy objective, the second term a normalization constraint, and the third term the data constraint. Optimization of (A5) can be performed using variational calculus.
Variational calculus considers the impact of a variation in q ( y , θ ) on the Lagrangian L [ q , f ] . We define the variation as
$\delta q(y, \theta) \triangleq \epsilon\,\phi(y, \theta),$
where $\epsilon \to 0$, and $\phi$ is a continuous and differentiable “test” function. The fundamental theorem of variational calculus states that the stationary solutions $q^*$ are obtained by setting $\delta L/\delta q = 0$, where the functional derivative $\delta L/\delta q$ is implicitly defined as (cf. Appendix D in [2]):
$\left.\frac{\mathrm{d}L[q + \epsilon\phi, f]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \frac{\delta L}{\delta q}\,\phi(y, \theta)\,\mathrm{d}y\,\mathrm{d}\theta.$
Equation (A6) provides a way to derive the functional derivative through ordinary differentiation. For example, we take the Lagrangian defined by (A5) and work out the left hand side of (A6):
$\begin{aligned}
\left.\frac{\mathrm{d}L[q+\epsilon\phi, f]}{\mathrm{d}\epsilon}\right|_{\epsilon=0}
&= \left.\frac{\mathrm{d}F[q+\epsilon\phi, f]}{\mathrm{d}\epsilon}\right|_{\epsilon=0}
+ \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\,\gamma\!\int (q+\epsilon\phi)\,\mathrm{d}y\,\mathrm{d}\theta\right|_{\epsilon=0}
+ \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\int\lambda(y)\!\int (q+\epsilon\phi)\,\mathrm{d}\theta\,\mathrm{d}y\right|_{\epsilon=0}\\
&= \int \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}\,(q+\epsilon\phi)\log\frac{(q+\epsilon\phi)}{f}\right|_{\epsilon=0}\mathrm{d}y\,\mathrm{d}\theta
+ \gamma\int \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}(q+\epsilon\phi)\right|_{\epsilon=0}\mathrm{d}y\,\mathrm{d}\theta
+ \int\lambda(y)\int \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}(q+\epsilon\phi)\right|_{\epsilon=0}\mathrm{d}\theta\,\mathrm{d}y\\
&= \int \underbrace{\left[\log\frac{q(y,\theta)}{f(y,\theta)} + 1 + \gamma + \lambda(y)\right]}_{\delta L[q,f]/\delta q}\,\phi(y,\theta)\,\mathrm{d}y\,\mathrm{d}\theta.
\end{aligned}$
Note that, since (A7c) has been written in similar form as (A6), it is easy to identify the functional derivative. This procedure is one of many ways to obtain the functional derivatives [73].
Setting $\delta L[q, f]/\delta q = 0$, we find the stationary solution as
$q^*(y, \theta) = \exp(-1 - \gamma - \lambda(y))\, f(y, \theta)$
$= \frac{1}{Z}\,\exp(-\lambda(y))\, f(y, \theta),$
with $Z = \int \exp(-\lambda(y))\, f(y, \theta)\,\mathrm{d}y\,\mathrm{d}\theta = \exp(\gamma + 1)$. In order to determine the Lagrange multipliers $\gamma$ and $\lambda(y)$, we must substitute the stationary solution (A8b) back into the constraints. The normalization constraint evaluates to
$\frac{1}{Z}\int \exp(-\lambda(y))\, f(y, \theta)\,\mathrm{d}y\,\mathrm{d}\theta = 1.$
We find that (A9) is always satisfied since $Z = \int \exp(-\lambda(y))\, f(y, \theta)\,\mathrm{d}y\,\mathrm{d}\theta$ by definition. Note, however, that the computation of the normalization constant still depends on the undetermined Lagrange multiplier $\lambda(y)$.
The data constraint evaluates to
$\int q^*(y, \theta)\,\mathrm{d}\theta = \frac{1}{Z}\,\exp(-\lambda(y))\int f(y, \theta)\,\mathrm{d}\theta = \delta(y - \hat{y}),$
which can be rewritten as
$\frac{\exp(-\lambda(y))}{Z} = \frac{\delta(y - \hat{y})}{\int f(y, \theta)\,\mathrm{d}\theta}.$
Equation (A11) shows that λ ( y ) can satisfy this constraint only if it is proportional to δ ( y y ^ ) . Indeed, substitution of (A11) into (A8b) gives
$q^*(y, \theta) = \frac{f(y, \theta)}{\int f(y, \theta)\,\mathrm{d}\theta}\,\delta(y - \hat{y}),$
and the posterior for the parameters evaluates to
$q^*(\theta) = \int q^*(y, \theta)\,\mathrm{d}y = \int \frac{f(y, \theta)}{\int f(y, \theta)\,\mathrm{d}\theta}\,\delta(y - \hat{y})\,\mathrm{d}y = \frac{f(\hat{y}, \theta)}{\int f(\hat{y}, \theta)\,\mathrm{d}\theta} = \frac{f_y(\hat{y}, \theta)\, f_\theta(\theta)}{\int f_y(\hat{y}, \theta)\, f_\theta(\theta)\,\mathrm{d}\theta},$
which we recognize as Bayes' rule.
Note that Bayes' rule was derived here as a special case of constrained variational free energy minimization in the presence of data constraints. This derivation of Bayes' rule may seem unnecessarily tedious, but the value of this approach to inductive inference is that the same principle applies when constraints other than data constraints are imposed on q.
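As a quick numerical sanity check of this result (a sketch of our own, not part of the paper's codebase), one can discretize θ on a grid for a toy Gaussian model and verify that the data-constrained free-energy solution indeed coincides with Bayes' rule:

```julia
# Verify q*(θ) ∝ f_y(ŷ, θ) f_θ(θ) on a grid for a toy Gaussian model:
#   f_θ(θ) = N(θ | 0, 1),  f_y(y, θ) = N(y | θ, 0.5²),  observed ŷ = 1.2.
normal_pdf(x, m, s) = exp(-0.5 * ((x - m) / s)^2) / (s * sqrt(2π))

θ_grid = range(-5, 5; length=2001)
Δθ = step(θ_grid)
ŷ = 1.2

prior      = normal_pdf.(θ_grid, 0.0, 1.0)
likelihood = normal_pdf.(ŷ, θ_grid, 0.5)
posterior  = prior .* likelihood
posterior ./= sum(posterior) * Δθ            # grid version of the derived q*(θ)

# Closed-form Gaussian posterior for comparison (precision-weighted combination)
λ_post = 1 / 1.0^2 + 1 / 0.5^2
m_post = (ŷ / 0.5^2) / λ_post
maximum(abs.(posterior .- normal_pdf.(θ_grid, m_post, sqrt(1 / λ_post))))  # ≈ 0
```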

Appendix B. Lagrangian Optimization and the Dual Problem

With the addition of Lagrange multipliers to the Bethe functional, the resulting Lagrangian depends both on the variational distribution q ( s ) and the Lagrange multipliers Ψ ( s ) . Formally, the introduction of the Lagrange multipliers allows us to rewrite the constrained optimization on the local polytope as an unconstrained optimization. We follow [33], and write
$\min_{q \in \mathcal{L}(G)} F[q] = \min_q \max_{\Psi} L[q, \Psi].$
Weak duality [74] (Chapter 5) then states that
$\min_q \max_{\Psi} L[q, \Psi] \geq \max_{\Psi} \min_q L[q, \Psi].$
The minimization with respect to q then yields a solution that depends on the Lagrange multipliers, as
$q^*(s; \Psi) = \arg\min_q L[q, \Psi].$
For any given q the Lagrangian is concave in Ψ . Therefore, substituting q * in the Lagrangian, the maximization over L [ q * , Ψ ] yields the unique solution
$\Psi^*(s) = \arg\max_{\Psi} L[q^*, \Psi].$
Stationary solutions are then given by
$q^*(s; \Psi^*) = \arg\min_{q \in \mathcal{L}(G)} F[q].$
In the current paper, we consider factorized q’s (e.g., (8)), and consider variations with respect to the individual factors. We then need to show that the combined stationary points of the individual factors also constitute a stationary point of the total objective.
Consider a Lagrangian having multiple arguments, i.e.,
$L[q] = L[q_1, \ldots, q_n, \ldots, q_N]$
$q \triangleq [q_1, \ldots, q_N].$
We want to determine the first total variation of the Lagrangian given by
$\delta L = L[q + \epsilon\phi] - L[q]$
$\phi(s) \triangleq [\phi_1(s), \ldots, \phi_N(s)].$
By a Taylor series expansion on ϵ we obtain [73] (A.14) and [75] (Equation (23.2))
$L[q + \epsilon\phi] - L[q] = \sum_{k=1}^{K}\frac{1}{k!}\,\frac{\mathrm{d}^k}{\mathrm{d}\epsilon^k}L[q + \epsilon\phi]\,\epsilon^k + O(\epsilon^{K+1}).$
Omitting all terms higher than the first order, we obtain the first variation as
$\delta L = \frac{\mathrm{d}}{\mathrm{d}\epsilon}L[q + \epsilon\phi]\,\epsilon.$
Rearranging the terms and letting ϵ vanish, we obtain the following expression
$\lim_{\epsilon\to 0}\frac{\delta L}{\epsilon} = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}L[q + \epsilon\phi]\right|_{\epsilon=0}.$
Let us assume that the Fréchet derivative exists [73], such that we can obtain the following integral representation (note that this integral expression is not always available for a generic Lagrangian, which is why we need to assume that the Fréchet derivative exists):
$\left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}L[q + \epsilon\phi]\right|_{\epsilon=0} = \int \phi(s)\,\frac{\delta L}{\delta q}\,\mathrm{d}s,$
where δ L δ q is the variational derivative
$\frac{\delta L}{\delta q} \triangleq \left[\frac{\delta L}{\delta q_1}, \ldots, \frac{\delta L}{\delta q_N}\right]$
$\delta q_n = \epsilon\,\phi_n(s).$
This means that (A19) can be written as [75] (Equation (22.5)) (here, we use a more generic Lagrangian and our notation differs from [75]; however, the expression is again motivated by a Taylor series expansion in $\epsilon$)
$\lim_{\epsilon\to 0}\frac{\delta L}{\epsilon} = \left.\frac{\mathrm{d}}{\mathrm{d}\epsilon}L[q + \epsilon\phi]\right|_{\epsilon=0} = \sum_n \int \phi_n(s)\,\frac{\delta L}{\delta q_n}\,\mathrm{d}s.$
The fundamental theorem of variational calculus states that for a point to be stationary, the first variation must vanish. For the first variation to vanish, it is sufficient that the individual variational derivatives vanish:
$\frac{\delta L}{\delta q_n} = 0 \quad \text{for every } n = 1, \ldots, N.$
Vanishing of the individual variational derivatives thus implies that the combined local stationary points also constitute a stationary point of the total objective.

Appendix C. Local Free Energy Example for a Deterministic Node

Theorem 8 tells us how to evaluate the node-local free energy for a deterministic node. As an example, consider the node function $f_a(y, x) = \delta(y - \mathrm{sgn}(x))$, with $y \in \{-1, 1\}$ and $x \in \mathbb{R}$, as depicted in Figure A1.
Figure A1. Messages around a “sign” node.
Interestingly, there is information loss in this node because the “sign” mapping is not bijective. Given an incoming Bernoulli distributed message $\mu_{y\to a}(y) = \mathrm{Ber}(y\,|\,p)$, the backward outgoing message is derived as
$\mu_{a\to x}(x) = \int \mu_{y\to a}(y)\,\delta(y - \mathrm{sgn}(x))\,\mathrm{d}y = \begin{cases} p & \text{if } x \geq 0\\ 1 - p & \text{if } x < 0. \end{cases}$
Given a Gaussian distributed incoming message $\mu_{x\to a}(x) = \mathcal{N}(x\,|\,m, \vartheta)$, the resulting belief then becomes
$q_x(x) = \frac{\mu_{x\to a}(x)\,\mu_{a\to x}(x)}{\int \mu_{x\to a}(x)\,\mu_{a\to x}(x)\,\mathrm{d}x} = \begin{cases} \frac{p}{p + \Phi - 2p\Phi}\,\mathcal{N}(x\,|\,m, \vartheta) & \text{if } x \geq 0\\ \frac{1 - p}{p + \Phi - 2p\Phi}\,\mathcal{N}(x\,|\,m, \vartheta) & \text{if } x < 0, \end{cases}$
with $\Phi = \int_{-\infty}^{0}\mathcal{N}(x\,|\,m, \vartheta)\,\mathrm{d}x$. We define a truncated Gaussian distribution as
$\mathcal{T}(x\,|\,m, \vartheta, a, b) = \begin{cases} \frac{1}{\Phi(a, b;\, m, \vartheta)}\,\mathcal{N}(x\,|\,m, \vartheta) & \text{if } a \leq x \leq b,\\ 0 & \text{otherwise}, \end{cases}$
with $\Phi(a, b;\, m, \vartheta) = \int_a^b \mathcal{N}(x\,|\,m, \vartheta)\,\mathrm{d}x$. This leads to
$q_x(x) = \underbrace{\frac{p\,(1 - \Phi)}{p + \Phi - 2p\Phi}}_{K}\,\mathcal{T}(x\,|\,m, \vartheta, 0, \infty) + \underbrace{\frac{(1 - p)\,\Phi}{p + \Phi - 2p\Phi}}_{1 - K}\,\mathcal{T}(x\,|\,m, \vartheta, -\infty, 0),$
as a truncated Gaussian mixture.
The node-local free energy then evaluates to
$F[q_a, f_a] = -H[q_x] = \int_{-\infty}^{0} q_x(x)\log q_x(x)\,\mathrm{d}x + \int_{0}^{\infty} q_x(x)\log q_x(x)\,\mathrm{d}x = -K\,H[\mathcal{T}(m, \vartheta, 0, \infty)] + K\log K - (1 - K)\,H[\mathcal{T}(m, \vartheta, -\infty, 0)] + (1 - K)\log(1 - K) = -K\,H[\mathcal{T}(m, \vartheta, 0, \infty)] - (1 - K)\,H[\mathcal{T}(m, \vartheta, -\infty, 0)] - H[\mathrm{Ber}(K)],$
as a weighted sum of entropies, which can be computed analytically.
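The sketch below (our own illustration in plain Julia, using simple grid integration rather than the analytical expressions) evaluates this node-local free energy numerically for concrete values of p, m and ϑ:

```julia
# Grid evaluation of F[q_a, f_a] = -H[q_x] for the "sign" node example.
normal_pdf(x, m, v) = exp(-0.5 * (x - m)^2 / v) / sqrt(2π * v)

p, m, v = 0.7, 0.5, 2.0                        # Ber(y|p) and N(x|m, ϑ) parameters
xs = range(-20, 20; length=40_001); Δ = step(xs)

Φ = sum(normal_pdf.(xs[xs .< 0], m, v)) * Δ    # Φ = ∫_{-∞}^0 N(x|m, ϑ) dx
K = p * (1 - Φ) / (p + Φ - 2p * Φ)             # weight of the x ≥ 0 mixture component

qx = [x >= 0 ? p : 1 - p for x in xs] .* normal_pdf.(xs, m, v)
qx ./= sum(qx) * Δ                             # normalized belief q_x(x)

F = sum(q * log(q) for q in qx if q > 0) * Δ   # F[q_a, f_a] = -H[q_x]
```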

Appendix D. Proofs

Appendix D.1. Proof of Lemma 1

Proof. 
We apply the variation ϵ ϕ b to q b and, as discussed in Appendix A, we can identify the functional derivative δ L b / δ q b through ordinary differentiation as
$\left.\frac{\mathrm{d}L_b[q_b + \epsilon\phi_b, f_b]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \Big( \underbrace{\log\frac{q_b(s_b)}{f_b(s_b)} + 1 + \psi_b - \sum_{i\in E(b)}\lambda_{ib}(s_i)}_{\delta L_b/\delta q_b} \Big)\,\phi_b(s_b)\,\mathrm{d}s_b.$
Setting the functional derivative to zero and identifying
$\mu_{ib}(s_i) = \exp(\lambda_{ib}(s_i))$
$\psi_b = \log \int f_b(s_b)\prod_{i\in E(b)}\mu_{ib}(s_i)\,\mathrm{d}s_b - 1$
yields the stationary solutions (18) in terms of Lagrange multipliers that are to be determined. □

Appendix D.2. Proof of Lemma 2

Proof. 
We follow the same procedure as in Appendix D.1, where we apply a variation ϵ ϕ j to q j (instead of q b ), and identify the functional derivative δ L j / δ q j through
$\left.\frac{\mathrm{d}L_j[q_j + \epsilon\phi_j]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \Big( \underbrace{-\log q_j(s_j) - 1 + \psi_j + \sum_{a\in V(j)}\lambda_{ja}(s_j)}_{\delta L_j/\delta q_j} \Big)\,\phi_j(s_j)\,\mathrm{d}s_j.$
As the TFFG is terminated, each edge has degree 2, so the edge-induced subgraph contains exactly two factors, which we denote by $f_b$ and $f_c$. Then, setting the functional derivative to zero and identifying
$\mu_{ja}(s_j) = \exp(\lambda_{ja}(s_j))$
$\psi_j = \log \int \mu_{jb}(s_j)\,\mu_{jc}(s_j)\,\mathrm{d}s_j + 1$
yields the stationary solution of (20) in terms of the Lagrange multipliers. □

Appendix D.3. Proof of Theorem 1

Proof. 
The local polytope of (14) constructs the Lagrangians of (17) and (19). Substituting the stationary solutions from Lemmas 1 and 2 in the marginalization constraint,
$q_j(s_j) = \int q_b(s_b)\,\mathrm{d}s_{b\setminus j},$
we obtain the following relation
$\frac{\mu_{jb}(s_j)\,\mu_{jc}(s_j)}{Z_j} = \frac{1}{Z_b}\int f_b(s_b)\prod_{i\in E(b)}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j},$
where we defined the following normalization constants to ensure that the computed marginals are normalized:
$Z_j = \int \mu_{jb}(s_j)\,\mu_{jc}(s_j)\,\mathrm{d}s_j, \qquad Z_b = \int f_b(s_b)\prod_{i\in E(b)}\mu_{ib}(s_i)\,\mathrm{d}s_b.$
Extracting μ j b from the integral
$\frac{\mu_{jb}(s_j)\,\mu_{jc}(s_j)}{Z_j} = \frac{\mu_{jb}(s_j)}{Z_b}\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j}, \qquad \mu_{jc}(s_j) = \frac{Z_j}{Z_b}\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j},$
and cancelling μ j b on both sides then yields the condition on the functional form of the message μ j c .
We now need to show that the fixed points of (22) satisfy (A28). Let us assume that the fixed points exist, such that μ j c ( k ) = μ j c ( k + 1 ) for some k. Then, we want to show that at the fixed points the following equality holds:
$\mu_{jc}^{(k)}(s_j) = \frac{Z_j^{(k)}}{Z_b^{(k)}}\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}^{(k)}(s_i)\,\mathrm{d}s_{b\setminus j}.$
Substituting (22), we need to show that
$\mu_{jc}^{(k)}(s_j) = \frac{Z_j^{(k)}}{Z_b^{(k)}}\,\mu_{jc}^{(k+1)}(s_j).$
Since μ j c ( k ) = μ j c ( k + 1 ) , we can rearrange
$\mu_{jc}^{(k)}\left(1 - \frac{Z_j^{(k)}}{Z_b^{(k)}}\right) = 0.$
From Z b , we obtain
$Z_b^{(k)} = \int \mu_{jb}^{(k)}(s_j)\left[\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}^{(k)}(s_i)\,\mathrm{d}s_{b\setminus j}\right]\mathrm{d}s_j = \int \mu_{jb}^{(k)}(s_j)\,\mu_{jc}^{(k+1)}(s_j)\,\mathrm{d}s_j = \int \mu_{jb}^{(k)}(s_j)\,\mu_{jc}^{(k)}(s_j)\,\mathrm{d}s_j = Z_j^{(k)},$
which implies that the fixed points satisfy the desired condition. This proves that the stationary solutions to the BFE within the local polytope can be obtained as fixed points of the sum-product update equations. □

Appendix D.4. Proof of Lemma 3

Proof. 
Substituting the definition of (32), we can re-write the second term of Lagrangian (30) as
$\int \prod_{n\in l(b)} q_b^n(s_b^n)\,\log f_b(s_b)\,\mathrm{d}s_b = \int q_b^m(s_b^m)\left[\int \prod_{\substack{n\in l(b)\\ n\neq m}} q_b^n(s_b^n)\,\log f_b(s_b)\,\mathrm{d}s_{b\setminus m}\right]\mathrm{d}s_b^m = \int q_b^m(s_b^m)\,\log \tilde{f}_b^m(s_b^m)\,\mathrm{d}s_b^m.$
We apply the variation ϵ ϕ b m to q b m and identify the functional derivative δ L b m / δ q b m , as
$\left.\frac{\mathrm{d}L_b^m[q_b^m + \epsilon\phi_b^m]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \Big( \underbrace{\log\frac{q_b^m(s_b^m)}{\tilde{f}_b^m(s_b^m)} + 1 + \psi_b^m - \sum_{i\in m}\lambda_{ib}(s_i)}_{\delta L_b^m/\delta q_b^m} \Big)\,\phi_b^m(s_b^m)\,\mathrm{d}s_b^m,$
whose functional form we recognize from Appendix D.1. Setting the functional derivative to zero and again identifying μ i b ( s i ) = exp λ i b ( s i ) , yields the stationary solutions of (31). □

Appendix D.5. Proof of Theorem 2

Proof. 
The local polytope of (33) constructs the Lagrangians L b m and L j as (30) and (19), respectively. We substitute the stationary solutions of Lemmas 2 and 3 in the local marginalization constraint (29b), which yields
$q_j(s_j) = \int q_b^m(s_b^m)\,\mathrm{d}s_{b\setminus j}^m.$
Following the structure of the proof in Appendix D.3, we obtain the following condition for the stationary solutions in terms of messages:
$\frac{\mu_{jb}(s_j)\,\mu_{jc}(s_j)}{Z_j} = \frac{\mu_{jb}(s_j)}{Z_b^m}\int \tilde{f}_b^m(s_b^m)\prod_{\substack{i\in m\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j}^m \quad\Longrightarrow\quad \frac{\mu_{jc}(s_j)}{Z_j} = \frac{1}{Z_b^m}\int \tilde{f}_b^m(s_b^m)\prod_{\substack{i\in m\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j}^m.$
Now we want to show that the fixed points of the message updates (36) satisfy (A29). Let us assume that the fixed points exist for some $k$, such that $\mu_{jc}^{(k+1)} = \mu_{jc}^{(k)}$. Then, we will show that the fixed points satisfy
$\frac{\mu_{jc}^{(k)}(s_j)}{Z_j^{(k)}} = \frac{1}{Z_b^{m,(k)}}\int \tilde{f}_b^{m,(k)}(s_b^m)\prod_{\substack{i\in m\\ i\neq j}}\mu_{ib}^{(k)}(s_i)\,\mathrm{d}s_{b\setminus j}^m.$
Similar to Appendix D.3, it will suffice to show that Z b m , ( k ) = Z j ( k ) at the fixed points. Arranging the order of integration in normalization constant Z b m , ( k ) , we obtain
$Z_b^{m,(k)} = \int \mu_{jb}^{(k)}(s_j)\left[\int \tilde{f}_b^{m,(k)}(s_b^m)\prod_{\substack{i\in m\\ i\neq j}}\mu_{ib}^{(k)}(s_i)\,\mathrm{d}s_{b\setminus j}^m\right]\mathrm{d}s_j = \int \mu_{jb}^{(k)}(s_j)\,\mu_{jc}^{(k+1)}(s_j)\,\mathrm{d}s_j = \int \mu_{jb}^{(k)}(s_j)\,\mu_{jc}^{(k)}(s_j)\,\mathrm{d}s_j = Z_j^{(k)}.$
By the same line of reasoning as in Appendix D.3, this shows that the fixed points of the message updates (36) lead to stationary distributions of the Bethe free energy with structured factorization constraints. □

Appendix D.6. Proof of Corollary 1

Proof. 
For a fully factorized local variational distribution (41), the augmented node function f ˜ b m ( s b m ) of (32) reduces to
$\tilde{f}_j(s_j) = \exp\left(\int \prod_{\substack{i\in E(b)\\ i\neq j}} q_i(s_i)\,\log f_b(s_b)\,\mathrm{d}s_{b\setminus j}\right).$
The message of (36) then reduces to
μ j c ( s j ) = f ˜ j ( s j ) ,
which, after substitution, recovers (43). □

Appendix D.7. Proof of Lemma 4

Proof. 
When we apply the variation ϵ ϕ b to q b and identify the functional derivative δ L b / δ q b , we recover the result from Appendix D.1, which leads to a solution of the form (47). □

Appendix D.8. Proof of Theorem 3

Proof. 
We construct the Lagrangian of (46), which by Lemma 4 leads to a solution of the form (47). Substituting this solution in the constraint of (45) leads to
$\underbrace{\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j}}_{\mu_{bj}(s_j)}\;\mu_{jb}(s_j) = \delta(s_j - \hat{s}_j).$
This equation is then satisfied by (50), which proves the theorem. □

Appendix D.9. Proof of Lemma 5

Proof. 
The proof follows directly from Appendix D.1, with f ˜ b ( s b ; s ^ b ) substituted for f b ( s b ) . □

Appendix D.10. Proof of Theorem 4

Proof. 
Given the result of Lemma 5, the proof follows Appendix D.3, where Laplace propagation chooses the expansion point to be the fixed point s ^ b = arg max log q b ( s b ) .
For all second-order fixed points of the Laplace iterations, it holds that s ^ b is a fixed point if and only if it is a local optimum of q b . The proof is then concluded by Lemma 1 in [76]. □

Appendix D.11. Proof of Lemma 6

Proof. 
We note that the Lagrange multiplier η j b does not depend on s j because the expectation removes all the functional dependencies on s j . Furthermore, the expectations of T j ( s j ) have the same dimension as the function T j ( s j ) . This means that the dimension of η j b needs to be compatible with that of T j ( s j ) so that we can write the constraint as an inner product.
We apply the variation ϵ ϕ b to q b and identify the functional derivative δ L b / δ q b , as
$\left.\frac{\mathrm{d}L_b[q_b + \epsilon\phi_b, f_b]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \Big( \underbrace{\log\frac{q_b(s_b)}{f_b(s_b)} + 1 + \psi_b - \sum_{\substack{i\in E(b)\\ i\neq j}}\lambda_{ib}(s_i) - \eta_{jb}^\top T_j(s_j)}_{\delta L_b/\delta q_b} \Big)\,\phi_b(s_b)\,\mathrm{d}s_b.$
Setting the functional derivative to zero, identifying $\mu_{ib}(s_i) = \exp(\lambda_{ib}(s_i))$ for $i \neq j$, and identifying $\mu_{jb}(s_j) = \exp(\eta_{jb}^\top T_j(s_j))$ yields the functional form of the stationary solution as (62). □

Appendix D.12. Proof of Lemma 7

Proof. 
We follow a similar procedure as in Appendix D.11 and apply the variation ϵ ϕ j to q j , which identifies the functional derivative δ L j / δ q j , as
$\left.\frac{\mathrm{d}L[q_j + \epsilon\phi_j]}{\mathrm{d}\epsilon}\right|_{\epsilon=0} = \int \Big( \underbrace{-\log q_j(s_j) - 1 + \psi_j + \sum_{a\in V(j)}\eta_{ja}^\top T_j(s_j)}_{\delta L_j/\delta q_j} \Big)\,\phi_j(s_j)\,\mathrm{d}s_j.$
Setting the functional derivative to zero and following the same procedure as in Appendix D.2 yields (64). □

Appendix D.13. Proof of Theorem 5

Proof. 
By substituting the stationary solutions given by Lemmas 6 and 7 into the moment-matching constraint (59), we obtain the following condition:
$\begin{aligned}
\int T_j(s_j)\,q_j(s_j)\,\mathrm{d}s_j &= \int T_j(s_j)\,q_b(s_b)\,\mathrm{d}s_b\\
\frac{1}{Z_j}\int T_j(s_j)\,\exp\!\left([\eta_{jb} + \eta_{jc}]^\top T_j(s_j)\right)\mathrm{d}s_j &= \frac{1}{\tilde{Z}_j}\int T_j(s_j)\,\underbrace{\exp\!\left(\eta_{jb}^\top T_j(s_j)\right)}_{\mu_{jb}(s_j)}\,\underbrace{\int f_b(s_b)\prod_{\substack{i\in E(b)\\ i\neq j}}\mu_{ib}(s_i)\,\mathrm{d}s_{b\setminus j}}_{\tilde{\mu}_{jc}(s_j)}\,\mathrm{d}s_j = \int T_j(s_j)\,\tilde{q}_j(s_j)\,\mathrm{d}s_j,
\end{aligned}$
where we recognize the sum-product message μ ˜ j c ( s j ) , which we multiply by the incoming exponential family message μ j b ( s j ) and normalize to obtain q ˜ j ( s j ) . By defining η j = η j b + η j c , normalization constants are given by
$Z_j(\eta_j) = \int \exp\!\left(\eta_j^\top T_j(s_j)\right)\mathrm{d}s_j, \qquad \tilde{Z}_j = \int \exp\!\left(\eta_{jb}^\top T_j(s_j)\right)\tilde{\mu}_{jc}(s_j)\,\mathrm{d}s_j.$
Computing the moments allows us to determine the exponential family parameter by solving the following equation [24] (Proposition 3.1)
$\nabla_{\eta_j}\log Z_j(\eta_j) = \int \tilde{q}_j(s_j)\,T_j(s_j)\,\mathrm{d}s_j.$
Suppose we obtain a solution to this equation, denoted by $\tilde{\eta}_j$; this allows us to approximate the sum-product message $\tilde{\mu}_{jc}(s_j)$ by an exponential family message whose parameter is given by
$\eta_{jc} = \tilde{\eta}_j - \eta_{jb}.$
Now let us assume that the fixed points of the sum-product iterations μ ˜ j c ( k ) ( s j ) = μ ˜ j c ( k + 1 ) ( s j ) and the incoming exponential family messages μ j b ( k ) ( s j ) = μ j b ( k + 1 ) ( s j ) exist for some k. Then, we need to show that the existence of these fixed points implies the existence of the fixed points of μ j c ( k + 1 ) = μ j c ( k ) .
By moment-matching, we have
$\eta_{jc}^{(k+1)} = \tilde{\eta}_j^{(k+1)} - \eta_{jb}^{(k+1)} = \tilde{\eta}_j^{(k)} - \eta_{jb}^{(k)} = \eta_{jc}^{(k)},$
which proves the existence of the fixed point of μ j c if μ ˜ j c and μ j b ( s j ) have fixed points. □
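To illustrate the moment-matching step numerically (a sketch of our own in plain Julia, for a univariate Gaussian family; grid integration stands in for the analytic expectations), the tilted belief $\tilde{q}_j \propto \mu_{jb}\,\tilde{\mu}_{jc}$ is projected onto the Gaussian family, and the outgoing message parameter follows by subtracting natural parameters:

```julia
# Moment matching of a tilted belief onto a univariate Gaussian family (grid version).
# Natural parameters of N(m, v): eta = (m/v, -1/(2v)).
normal_pdf(x, m, v) = exp(-0.5 * (x - m)^2 / v) / sqrt(2π * v)
natural(m, v) = (m / v, -1 / (2v))

xs = range(-15, 15; length=30_001); Δ = step(xs)

mu_jb(x) = normal_pdf(x, 0.0, 4.0)            # incoming exponential-family message
mu_jc_sp(x) = x >= 0 ? 1.0 : 0.1              # a non-Gaussian sum-product message (toy)

q_tilde = mu_jb.(xs) .* mu_jc_sp.(xs)
q_tilde ./= sum(q_tilde) * Δ                  # normalized tilted belief q̃_j

m_match = sum(q_tilde .* xs) * Δ              # matched mean
v_match = sum(q_tilde .* (xs .- m_match).^2) * Δ   # matched variance

eta_tilde = natural(m_match, v_match)         # parameters of the projected q̃_j
eta_jb    = natural(0.0, 4.0)
eta_jc    = eta_tilde .- eta_jb               # outgoing (EP-style) message parameter
```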

Appendix D.14. Proof of Theorem 6

Proof. 
The proof follows directly from substituting the Laplace-approximated factor-function (53) in the naive mean-field result of Corollary 1. □

Appendix D.15. Proof of Theorem 7

Proof. 
In order to obtain the optimal parameter value θ j * , we view the free energy as a function of θ j . As there are two node-local free energies that depend upon θ j , this leads to
$\begin{aligned}
\theta_j^* &= \arg\min_{\theta_j}\ F[q_b, f_b; \theta_j] + F[q_c, f_c; \theta_j]\\
&= \arg\max_{\theta_j}\ \int \delta(s_j - \theta_j)\prod_{\substack{n\in l(b)\\ n\neq m}} q_b^n(s_b^n)\,\log f_b(s_b)\,\mathrm{d}s_b + \int \delta(s_j - \theta_j)\prod_{\substack{n\in l(c)\\ n\neq m}} q_c^n(s_c^n)\,\log f_c(s_c)\,\mathrm{d}s_c\\
&= \arg\max_{\theta_j}\ \int \prod_{\substack{n\in l(b)\\ n\neq m}} q_b^n(s_b^n)\,\log f_b(s_{b\setminus j}, \theta_j)\,\mathrm{d}s_{b\setminus j} + \int \prod_{\substack{n\in l(c)\\ n\neq m}} q_c^n(s_c^n)\,\log f_c(s_{c\setminus j}, \theta_j)\,\mathrm{d}s_{c\setminus j}\\
&= \arg\max_{s_j}\ \log \mu_{bj}(s_j) + \log \mu_{cj}(s_j),
\end{aligned}$
where in the last step we replaced θ j with s j for convenience. Here, we recognize μ b j and μ c j as the structured variational updates of Theorem 2. Identification of the fixed points can then be obtained by [57] (Corollary 2). For a rigorous discussion on convergence of the EM algorithm, we refer to [77] (Corollary 32), [24] (Chapter 6) and [57] (Section 3). □

Appendix D.16. Proof of Theorem 8

Proof. 
Substituting for q a ( s a ) , the node-local free energy becomes
$\begin{aligned}
F[q_a, f_a] &= \int q_a(s_a)\,\log\frac{q_a(s_a)}{f_a(s_a)}\,\mathrm{d}s_a\\
&= \int q_a(s_a)\,\log\frac{q_{j|a}(s_j\,|\,s_{a\setminus j})}{f_a(s_a)}\,\mathrm{d}s_a + \int q_a(s_a)\,\log q_{a\setminus j}(s_{a\setminus j})\,\mathrm{d}s_a\\
&= \int q_{a\setminus j}(s_{a\setminus j})\,q_{j|a}(s_j\,|\,s_{a\setminus j})\,\log\frac{q_{j|a}(s_j\,|\,s_{a\setminus j})}{f_a(s_a)}\,\mathrm{d}s_a + \int q_{a\setminus j}(s_{a\setminus j})\,q_{j|a}(s_j\,|\,s_{a\setminus j})\,\log q_{a\setminus j}(s_{a\setminus j})\,\mathrm{d}s_a\\
&= \int q_{a\setminus j}(s_{a\setminus j})\left[\int q_{j|a}(s_j\,|\,s_{a\setminus j})\,\log\frac{q_{j|a}(s_j\,|\,s_{a\setminus j})}{f_a(s_a)}\,\mathrm{d}s_j\right]\mathrm{d}s_{a\setminus j} + \int q_{a\setminus j}(s_{a\setminus j})\,\log q_{a\setminus j}(s_{a\setminus j})\,\mathrm{d}s_{a\setminus j}\\
&= \mathrm{E}_{q_{a\setminus j}}\!\left[\mathrm{D}\!\left[q_{j|a}\,\|\,f_a\right]\right] - H[q_{a\setminus j}],
\end{aligned}$
where the first term expresses an expected Kullback–Leibler divergence, and the second term is a negative entropy. The only possibility for the local free energy to remain finite is when $q_{j|a}(s_j\,|\,s_{a\setminus j}) = f_a(s_a) = \delta(s_j - g_a(s_{a\setminus j}))$. We then have:
$F[q_a, f_a] = \begin{cases} -H[q_{a\setminus j}] & \text{if } q_{j|a}(s_j\,|\,s_{a\setminus j}) = \delta(s_j - g_a(s_{a\setminus j}))\\ \infty & \text{otherwise.} \end{cases}$

Appendix D.17. Proof of Theorem 9

Proof. 
The proof is similar to Appendix D.16. Substituting for q a ( s a ) , the node-local free energy becomes
$\begin{aligned}
F[q_a, f_a] &= \int q_a(s_a)\,\log\frac{q_a(s_a)}{f_a(s_a)}\,\mathrm{d}s_a\\
&= \int q_a(s_i, s_j, s_k)\,\log\frac{q_{ik|j}(s_i, s_k\,|\,s_j)}{f_a(s_i, s_j, s_k)}\,\mathrm{d}s_i\,\mathrm{d}s_j\,\mathrm{d}s_k + \int q_j(s_j)\,\log q_j(s_j)\,\mathrm{d}s_j\\
&= \mathrm{E}_{q_j}\!\left[\mathrm{D}\!\left[q_{ik|j}\,\|\,f_a\right]\right] - H[q_j].
\end{aligned}$
In contrast to Appendix D.16, here we have a joint belief within the divergence with a single conditioning variable. Conditioning on s j (or by symmetry s i or s k ) determines the realization of the other variables. Therefore, we have:
$F[q_a, f_a] = \begin{cases} -H[q_j] & \text{if } q_{ik|j}(s_i, s_k\,|\,s_j) = \delta(s_j - s_i)\,\delta(s_j - s_k)\\ \infty & \text{otherwise.} \end{cases}$

References

  1. Blei, D.M. Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models. Annu. Rev. Stat. Appl. 2014, 1, 203–232.
  2. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, NY, USA, 2006.
  3. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  4. Forney, G. Codes on graphs: Normal realizations. IEEE Trans. Inf. Theory 2001, 47, 520–548.
  5. Loeliger, H.A. An introduction to factor graphs. IEEE Signal Process. Mag. 2004, 21, 28–41.
  6. Winn, J.; Bishop, C.M. Variational message passing. J. Mach. Learn. Res. 2005, 6, 661–694.
  7. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Understanding Belief Propagation and Its Generalizations; Mitsubishi Electric Research Laboratories, Inc.: Cambridge, MA, USA, 2001.
  8. Cox, M.; van de Laar, T.; de Vries, B. A factor graph approach to automated design of Bayesian signal processing algorithms. Int. J. Approx. Reason. 2019, 104, 185–204.
  9. Yedidia, J.S. An Idiosyncratic Journey beyond Mean Field Theory. In Advanced Mean Field Methods; The MIT Press: Cambridge, MA, USA, 2000; pp. 37–49.
  10. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. Bethe Free Energy, Kikuchi Approximations, and Belief Propagation Algorithms; Mitsubishi Electric Research Laboratories, Inc.: Cambridge, MA, USA, 2001; p. 24.
  11. Dauwels, J. On Variational Message Passing on Factor Graphs. In Proceedings of the IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 2546–2550.
  12. Zhang, D.; Wang, W.; Fettweis, G.; Gao, X. Unifying Message Passing Algorithms under the Framework of Constrained Bethe Free Energy Minimization. arXiv 2017, arXiv:1703.10932.
  13. van de Laar, T.; Şenöz, I.; Özçelikkale, A.; Wymeersch, H. Chance-Constrained Active Inference. arXiv 2021, arXiv:2102.08792.
  14. Smola, A.J.; Vishwanathan, S.V.N.; Eskin, E. Laplace propagation. In NIPS; The MIT Press: Cambridge, MA, USA, 2004; pp. 441–448.
  15. Minka, T. Divergence Measures and Message Passing. Available online: https://www.microsoft.com/en-us/research/publication/divergence-measures-and-message-passing/ (accessed on 24 June 2021).
  16. Yedidia, J.S. Generalized Belief Propagation and Free Energy Minimization. Available online: http://cba.mit.edu/events/03.11.ASE/docs/Yedidia.pdf (accessed on 24 June 2021).
  17. Yedidia, J.S.; Freeman, W.; Weiss, Y. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Inf. Theory 2005, 51, 2282–2312.
  18. Minka, T.P. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, Seattle, WA, USA, 2–5 August 2001; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 2001; pp. 362–369.
  19. Heskes, T. Stable fixed points of loopy belief propagation are local minima of the bethe free energy. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2003; pp. 359–366.
  20. Kschischang, F.R.; Frey, B.J.; Loeliger, H.A. Factor graphs and the sum-product algorithm. IEEE Trans. Inf. Theory 2001, 47, 498–519.
  21. Hoffman, M.; Blei, D.M.; Wang, C.; Paisley, J. Stochastic Variational Inference. arXiv 2012, arXiv:1206.7051.
  22. Archer, E.; Park, I.M.; Buesing, L.; Cunningham, J.; Paninski, L. Black box variational inference for state space models. arXiv 2015, arXiv:1511.07367.
  23. Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1988.
  24. Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2008, 1, 1–305.
  25. Chertkov, M.; Chernyak, V.Y. Loop Calculus in Statistical Physics and Information Science. Phys. Rev. E 2006, 73.
  26. Weller, A.; Tang, K.; Jebara, T.; Sontag, D.A. Understanding the Bethe approximation: When and how can it go wrong? In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, Quebec City, QC, Canada, 23–27 July 2014; pp. 868–877.
  27. Sibel, J.C. Region-Based Approximation to Solve Inference in Loopy Factor Graphs: Decoding LDPC Codes by Generalized Belief Propagation. Available online: https://hal.archives-ouvertes.fr/tel-00905668 (accessed on 24 June 2021).
  28. Minka, T. From Hidden Markov Models to Linear Dynamical Systems; Technical Report 531; Vision and Modeling Group, Media Lab, MIT: Cambridge, MA, USA, 1999.
  29. Loeliger, H.A.; Dauwels, J.; Hu, J.; Korl, S.; Ping, L.; Kschischang, F.R. The Factor Graph Approach to Model-Based Signal Processing. Proc. IEEE 2007, 95, 1295–1322.
  30. Loeliger, H.A.; Bolliger, L.; Reller, C.; Korl, S. Localizing, forgetting, and likelihood filtering in state-space models. In Proceedings of the 2009 Information Theory and Applications Workshop, La Jolla, CA, USA, 8–13 February 2009; pp. 184–186.
  31. Korl, S. A Factor Graph Approach to Signal Modelling, System Identification and Filtering. Ph.D. Thesis, Swiss Federal Institute of Technology, Zurich, Switzerland, 2005.
  32. Pearl, J. Reverend Bayes on Inference Engines: A Distributed Hierarchical Approach. In Proceedings of the Second AAAI Conference on Artificial Intelligence, Pittsburgh, PA, USA, 18–20 August 1982; AAAI Press: Pittsburgh, PA, USA, 1982; pp. 133–136.
  33. Heskes, T. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. J. Artif. Intell. Res. 2006, 26, 153–190.
  34. Särkkä, S. Bayesian Filtering and Smoothing; Cambridge University Press: London, UK; New York, NY, USA, 2013.
  35. Khan, M.E.; Lin, W. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. arXiv 2017, arXiv:1703.04265.
  36. Logan, B.; Moreno, P. Factorial HMMs for acoustic modeling. In Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, WA, USA, 15 May 1998; Volume 2, pp. 813–816.
  37. Hoffman, M.D.; Blei, D.M. Structured Stochastic Variational Inference. arXiv 2014, arXiv:1404.4114.
  38. Singh, R.; Ling, J.; Doshi-Velez, F. Structured Variational Autoencoders for the Beta-Bernoulli Process. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; p. 9.
  39. Bamler, R.; Mandt, S. Structured Black Box Variational Inference for Latent Time Series Models. arXiv 2017, arXiv:1707.01069.
  40. Zhang, C.; Yuan, Z.; Wang, Z.; Guo, Q. Low Complexity Sparse Bayesian Learning Using Combined BP and MF with a Stretched Factor Graph. Signal Process. 2017, 131, 344–349.
  41. Wand, M.P. Fast Approximate Inference for Arbitrarily Large Semiparametric Regression Models via Message Passing. J. Am. Stat. Assoc. 2017, 112, 137–168.
  42. Caticha, A. Entropic Inference and the Foundations of Physics. In Proceedings of the 11th Brazilian Meeting on Bayesian Statistics, Amparo, Brazil, 18–22 March 2012.
  43. Pearl, J. A Probabilistic Calculus of Actions. Available online: https://arxiv.org/ftp/arxiv/papers/1302/1302.6835.pdf (accessed on 24 June 2021).
  44. Zoeter, O.; Heskes, T. Gaussian Quadrature Based Expectation Propagation. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Bridgetown, Barbados, 6–8 January 2005; p. 9.
  45. Arasaratnam, I.; Haykin, S. Cubature Kalman Filters. IEEE Trans. Autom. Control 2009, 54, 1254–1269.
  46. Sarkka, S. Bayesian Estimation of Time-Varying Systems: Discrete-Time Systems. Available online: https://users.aalto.fi/~ssarkka/course_k2011/pdf/course_booklet_2011.pdf (accessed on 24 June 2021).
  47. Gelman, A.; Vehtari, A.; Jylänki, P.; Robert, C.; Chopin, N.; Cunningham, J.P. Expectation propagation as a way of life. arXiv 2014, arXiv:1412.4869.
  48. Deisenroth, M.P.; Mohamed, S. Expectation Propagation in Gaussian Process Dynamical Systems: Extended Version. arXiv 2012, arXiv:1207.2940.
  49. Teh, Y.W.; Hasenclever, L.; Lienart, T.; Vollmer, S.; Webb, S.; Lakshminarayanan, B.; Blundell, C. Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server. arXiv 2015, arXiv:1512.09327.
  50. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006.
  51. Cox, M. Robust Expectation Propagation in Factor Graphs Involving Both Continuous and Binary Variables. In Proceedings of the 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; p. 5.
  52. Minka, T.; Winn, J.; Guiver, J.; Webster, S.; Zaykov, Y.; Yangel, B.; Spengler, A.; Bronskill, J. Infer.NET 2.6. 2014. Available online: http://research.microsoft.com/infernet (accessed on 23 June 2021).
  53. Friston, K.J.; Harrison, L.; Penny, W. Dynamic causal modelling. Neuroimage 2003, 19, 1273–1302.
  54. Mathys, C.D.; Daunizeau, J.; Friston, K.J.; Klaas, S.E. A Bayesian foundation for individual learning under uncertainty. Front. Hum. Neurosci. 2011, 5.
  55. Friston, K.; Kilner, J.; Harrison, L. A free energy principle for the brain. J. Physiol. 2006, 100, 70–87.
  56. Friston, K. The free-energy principle: A rough guide to the brain? Trends Cogn. Sci. 2009, 13, 293–301.
  57. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–38.
  58. Dauwels, J.; Eckford, A.; Korl, S.; Loeliger, H.A. Expectation maximization as message passing—Part I: Principles and gaussian messages. arXiv 2009, arXiv:0910.2832.
  59. Bouvrie, P.; Angulo, J.; Dehesa, J. Entropy and complexity analysis of Dirac-delta-like quantum potentials. Phys. A Stat. Mech. Appl. 2011, 390, 2215–2228.
  60. Dauwels, J.; Korl, S.; Loeliger, H.A. Expectation maximization as message passing. In Proceedings of the International Symposium on Information Theory 2005 (ISIT 2005), Adelaide, Australia, 4–9 September 2005; pp. 583–586.
  61. Cox, M.; van de Laar, T.; de Vries, B. ForneyLab.jl: Fast and flexible automated inference through message passing in Julia. In Proceedings of the International Conference on Probabilistic Programming, Boston, MA, USA, 4–6 October 2018.
  62. Bezanson, J.; Edelman, A.; Karpinski, S.; Shah, V. Julia: A Fresh Approach to Numerical Computing. SIAM Rev. 2017, 59, 65–98.
  63. Şenöz, I.; de Vries, B. Online Variational Message Passing in the Hierarchical Gaussian Filter. In Proceedings of the 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), Aalborg, Denmark, 17–20 September 2018; pp. 1–6.
  64. Mathys, C.D. Uncertainty, Precision, and Prediction Errors; UCL Computational Psychiatry Course; UCL: London, UK, 2014.
  65. Şenöz, I.; de Vries, B. Online Message Passing-based Inference in the Hierarchical Gaussian Filter. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 2676–2681.
  66. Podusenko, A.; Kouw, W.M.; de Vries, B. Online Variational Message Passing in Hierarchical Autoregressive Models. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 1337–1342.
  67. Welling, M. On the Choice of Regions for Generalized Belief Propagation. arXiv 2012, arXiv:1207.4158.
  68. Welling, M.; Minka, T.P.; Teh, Y.W. Structured Region Graphs: Morphing EP into GBP. arXiv 2012, arXiv:1207.1426.
  69. Loeliger, H.A. Factor Graphs and Message Passing Algorithms—Part 1: Introduction. 2007. Available online: http://www.crm.sns.it/media/course/1524/Loeliger_A.pdf (accessed on 3 April 2019).
  70. Caticha, A. Relative Entropy and Inductive Inference. AIP Conf. Proc. 2004, 707, 75–96.
  71. Ortega, P.A.; Braun, D.A. A Minimum Relative Entropy Principle for Learning and Acting. J. Artif. Intell. Res. 2010, 38, 475–511.
  72. Shore, J.; Johnson, R. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inf. Theory 1980, 26, 26–37.
  73. Engel, E.; Dreizler, R.M. Density Functional Theory: An Advanced Course; Theoretical and Mathematical Physics; Springer: Berlin/Heidelberg, Germany, 2011.
  74. Boyd, S.P.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2004.
  75. Lanczos, C. The Variational Principles of Mechanics; Courier Corporation: North Chelmsford, MA, USA, 2012.
  76. Ahn, S.; Chertkov, M.; Shin, J. Gauging Variational Inference. Available online: https://dl.acm.org/doi/10.5555/3294996.3295048 (accessed on 24 June 2021).
  77. Tran, V.H. Copula Variational Bayes inference via information geometry. arXiv 2018, arXiv:1803.10998.
Figure 1. Example Forney-style factor graph for the model of (2).
Figure 2. Visualization of the node-induced subgraph for an equality node. If the node function f a is known, a symbol representing the node function is often substituted within the node (“=” in this case).
Figure 3. The subgraph around node b with indicated messages. Ellipses indicate an arbitrary (possibly zero) amount of edges.
Figure 4. An edge-induced subgraph G ( j ) with indicated messages.
Figure 5. Visualization of a subgraph with indicated sum-product messages.
Figure 6. (Left) One time segment of the FFG corresponding to the linear Gaussian state space model specified in Example 1, with the sum-product messages computed according to (22). The three small dots at both sides of the graph indicate identical continuation of the graph over time. (Right) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using parameter matrices as specified in (24). The posterior distribution for the hidden states are inferred by sum-product message passing and are drawn with shaded regions, indicating plus and minus the variance. The Bethe free energy evaluates to $F[q, f] = 580.698$.
Figure 7. A node-induced subgraph G ( b ) with shaded sections that enclose the edges of an exemplary structured mean-field factorization l ( b ) = { m , n , r } . Note that, in this example, the cluster n only encompasses the single edge j, such that q b n ( s b n ) = q j ( s j ) . In general, the assignment and number of edges in a cluster can be arbitrary.
Figure 8. An example subgraph corresponding to G ( b , j ) . Dashed ellipses enclose the edges of an exemplary exact cover l ( b ) = { m , n , r } . In general, the assignment and number of edges in a cluster can be arbitrary.
Figure 9. (Left) One time segment of the FFG corresponding to the linear Gaussian state space model specified in Example 2 with the sum-product messages computed according to (36). (Right) The small dots indicate the noisy observations that are synthetically generated by the linear state space model of (23) using matrices specified in (24). The posterior distribution of the hidden states inferred by structured variational message passing is depicted with shaded regions representing plus and minus one variances. The minimum of the evaluated Bethe free energy over all iterations is $F[q, f] = 586.178$ (compared to $F[q, f] = 580.698$ in Example 1). The posterior distribution for the precision matrix is given by $Q \sim \mathcal{W}\!\left(\begin{bmatrix} 0.00266 & 0.000334 \\ 0.00034 & 0.00670 \end{bmatrix},\ 102.0\right)$.
Figure 10. (Left) The small dots indicate the noisy observations that were synthetically generated by the linear state space model of (23) using matrices specified in (24). The posterior distribution for the hidden states inferred by naive variational message passing is depicted with shaded regions representing plus and minus one variances. The minimum of the evaluated Bethe free energy over all iterations is $F[q, f] = 617.468$, which is more than for the less-constrained Example 2 (with $F[q, f] = 586.178$) and Example 1 (with $F[q, f] = 580.698$). The posterior for the precision matrix is given by $Q \sim \mathcal{W}\!\left(\begin{bmatrix} 0.00141 & 6.00549\times 10^{-5} \\ 6.00549\times 10^{-5} & 0.00187 \end{bmatrix},\ 102.0\right)$. (Right) A comparison of the Bethe free energies for sum-product, structured and naive variational message passing algorithms for the data generated in Example 1.
Figure 11. Visualization of a subgraph G ( b , j ) with indicated messages, where the dark circled delta indicates a data constraint—i.e., the variable s j is constrained to have a distribution of the form δ ( s j s ^ j ) .
Figure 12. The subgraph around a Laplace-approximated node b with indicated messages.
Figure 13. Visualization of a subgraph with indicated Laplace propagation messages. The node function f b is denoted by f ˜ b according to (53).
Figure 14. Visualization of a subgraph G ( b , j ) with indicated messages. The open circle indicates a point-mass constraint of the form δ ( s j θ j ) , where the value θ j is optimized.
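The point-mass constraint of Figure 14 underlies EM-style estimation: the optimal $\theta_j$ maximizes the expected log node-function under the current beliefs over the remaining variables. The sketch below runs this type of M-step for a deliberately simple, hypothetical Gaussian case where the expectation is available in closed form; the node function and numbers are not from the paper.

```python
from scipy.optimize import minimize_scalar

# EM-style update under a point-mass constraint q(theta) = delta(theta - theta_hat):
# choose theta_hat to maximize E_q[log f(s, theta)] under the current belief q(s).
# Hypothetical setup: f(s, theta) = N(s | theta, 1) and q(s) = N(m_s, v_s), so
# E_q[log f] = -0.5 * ((m_s - theta)**2 + v_s) + const.
m_s, v_s = 1.3, 0.5  # assumed current belief over s

def neg_expected_log_f(theta):
    return 0.5 * ((m_s - theta) ** 2 + v_s)

theta_hat = minimize_scalar(neg_expected_log_f).x
print(theta_hat)  # equals m_s in this toy case, as expected
```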
Figure 15. The FFG of the linear Gaussian state space model augmented with the EM constraints in Example 4.
Figure 16. (Left) The small dots indicate the noisy observations synthetically generated by the linear state space model of (23) using the matrices specified in (24). The posterior distribution of the hidden states inferred by structured variational message passing is depicted with shaded regions representing plus and minus one variance. The minimum of the evaluated Bethe free energy over iterations is $F[q, f] = 583.683$. Moreover, the posterior distribution for the precision matrix is given by $q(W) = \mathcal{W}\!\left(\begin{bmatrix} 0.00286 & 0.00038 \\ 0.00038 & 0.00691 \end{bmatrix},\ 102.0\right)$. The EM estimates are $\theta = \pi / 7.821$, $\hat{m}_{x_0} = [7.23, 7.016]$ and $\hat{V}_{x_0} = \begin{bmatrix} 11.028 & 1.926 \\ 1.926 & 10.918 \end{bmatrix}$. (Right) Free energy plots of the four algorithms discussed in Examples 1–4 on the same data set.
Table 1. Relation between local constraints and derived message updates. The rows refer to different constraints that relate to factor–variable combinations, factors, and variables, respectively. Note that each message passing algorithm combines a set of constraints. Abbreviations: Sum-Product (SP), Structured Variational Message Passing (SVMP), Mean-Field Variational Message Passing (MFVMP), Data Constraint (DC), Laplace Propagation (LP), Mean-Field Variational Laplace (MFVLP), Expectation Maximization (EM), and Expectation Propagation (EP).
Local Constraint | SP | SVMP | MFVMP | DC | LP | MFVLP | EM | EP
Normalization
Marginalization
Moment-Matching
Structured Mean-Field
Naive Mean-Field
Laplace Approximation
Dirac-delta
Estimation