Proceeding Paper

Bayesian Inference and Deep Learning for Inverse Problems †

Ali Mohammad-Djafari, Ning Chu, Li Wang and Liang Yu

1 International Science Consulting and Training (ISCT), 91440 Bures sur Yvette, France
2 Zhejiang Shangfeng Special Blower Company, Shaoxing 312352, China
3 School of Mathematics and Statistics, Central South University, Changsha 410017, China
4 School of Civil Aviation, Northwestern Polytechnical University, Xi’an 710072, China
5 State Key Laboratory of Airliner Integration Technology and Flight Simulation, Shanghai 200126, China
* Author to whom correspondence should be addressed.
Presented at the 42nd International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, Garching, Germany, 3–7 July 2023.
Phys. Sci. Forum 2023, 9(1), 14; https://doi.org/10.3390/psf2023009014
Published: 1 December 2023

Abstract

Inverse problems arise anywhere we have an indirect measurement. In general, they are ill-posed, and obtaining satisfactory solutions requires prior knowledge. Classically, different regularization methods and Bayesian inference-based methods have been proposed. As these methods need a great number of forward and backward computations, they become computationally costly, particularly when the forward or generative models are complex and the evaluation of the likelihood is expensive. Using deep neural network surrogate models and approximate computation can then be very helpful. However, to account for the uncertainties, we first need to understand Bayesian deep learning, and then we can see how to use it for inverse problems. In this work, we focus on neural networks (NNs), deep learning (DL) and, more specifically, Bayesian DL particularly adapted for inverse problems. We first give details of Bayesian DL approximate computations with exponential families; then, we see how we can use them for inverse problems. We consider two cases: in the first, the forward operator is known and used as a physics constraint; the second examines more general data-driven DL methods.

1. Introduction

Inverse problems arise almost everywhere in science and engineering: whenever we have indirect measurements related to the quantity we really want to measure through some mathematical relation, called the forward model, we have to infer the desired unknown from the observed data using this forward model or a surrogate of it. In general, many inverse problems are ill-posed, and the methods for finding well-posed solutions are mainly based either on regularization theory or on Bayesian inference. We mention, in particular, those based on the optimization of a criterion with two parts: a data-model output matching term (the likelihood part in the Bayesian framework) and a regularization term (the prior model in the Bayesian framework). Different criteria for these two terms and a great number of standard and advanced optimization algorithms have been proposed and used with great success. When these two terms are distances, they have a Bayesian maximum a posteriori (MAP) interpretation, where they correspond, respectively, to the likelihood and prior probability models. The Bayesian approach gives more flexibility in choosing these terms via the likelihood and the prior probability distributions. This flexibility goes much further with hierarchical models and appropriate hidden variables [1]. Also, the possibility of estimating the hyperparameters gives much more flexibility for semisupervised methods.
However, full Bayesian computations can become very heavy computationally, in particular when the forward model is complex and the evaluation of the likelihood has a high computational cost [2]. In those cases, using simpler surrogate models can become very helpful to reduce the computational costs, but we then have to account for the uncertainty quantification (UQ) of the obtained results [3]. Neural networks (NNs), in their diverse forms such as convolutional neural networks (CNNs) and deep learning (DL), have become tools for building fast, low-cost surrogate forward models.
Over the last decades, machine learning (ML) methods and algorithms have achieved great success in many tasks, such as classification, clustering, segmentation, and object detection. There are many different NN structures, such as feed-forward NNs, convolutional NNs (CNNs), deep NNs, etc. [4]. Using these methods directly for inverse problems, as intermediate preprocessing, or as tools for performing fast approximate computation in different steps of regularization or Bayesian inference has also been successful, but not as much as could be possible. Recently, physics-informed neural networks have achieved great success in many inverse problems by proposing interactions between the Bayesian formulation of forward models, optimization algorithms, and ML-specific algorithms for intermediate hidden variables. These methods have become very helpful for obtaining approximate practical solutions to inverse problems in real-world applications [5,6,7,8,9,10,11].
In this paper, first, in Section 2, some mathematical notations for dealing with NNs are given. In Section 3, a detailed presentation of Bayesian inference and the approximate computation needed for Bayesian deep learning (BDL) is given. Then, in Section 4, we focus on NN and DL methods for inverse problems. First, we present the cases where we know the forward model and its adjoint. Then, we consider the case where we may not have this knowledge and want to propose directly data-driven DL methods [12,13].

2. Neural Networks, Deep Learning, and Bayesian DL

The main objective of the NN for a regression problem can be described as follows:
$$x_i \longrightarrow \text{NN}: f(x_i) \longrightarrow y_i.$$
The objective is to infer the function $f: \mathbb{R}^{M} \to \mathbb{R}$ from the observations $\mathbf{y} = (y_1, \ldots, y_N)^{T}$ at the locations given by $\mathbf{x} = (x_1, \ldots, x_N)$. The usual NN learning approach is to define a parametric family of functions $\phi_{\theta}: \mathbb{R}^{M} \times \Theta \to \mathbb{R}$ that is flexible enough so that there exists a $\theta$ such that $\phi(\cdot) \approx \phi_{\theta}(\cdot)$:
$$\{x_i, y_i\} \longrightarrow \text{NN Learning}: y_i = \phi(x_i) \approx \phi_{\theta}(x_i) \longrightarrow \theta.$$
Deep learning focuses on learning the optimal parameters $\theta$, which can then be used for predicting the output $\hat{y}_j$ for any input $x_j$:
$$x_j \longrightarrow \text{NN}: \phi_{\theta}(x_j) \longrightarrow \hat{y}_j.$$
In this approach, there is no uncertainty quantification.
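To make this concrete, here is a minimal sketch (not the authors' code) of the deterministic setting above: a one-hidden-layer network $\phi_{\theta}$ fitted to synthetic 1-D observations by plain gradient descent; the data, sizes, activation, and learning rate are all illustrative assumptions.

```python
import numpy as np

# Synthetic data {x_i, y_i} (hypothetical): noisy samples of a smooth function.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)[:, None]
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=x.shape)

# One-hidden-layer parametric family phi_theta(x) = W2 tanh(W1 x + b1) + b2.
n_hidden = 32
W1 = rng.normal(scale=0.5, size=(1, n_hidden)); b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.5, size=(n_hidden, 1)); b2 = np.zeros(1)

def phi(x, W1, b1, W2, b2):
    return np.tanh(x @ W1 + b1) @ W2 + b2

# "NN Learning": gradient descent on the mean squared error.
lr = 0.05
for _ in range(5000):
    z = np.tanh(x @ W1 + b1)
    r = z @ W2 + b2 - y                      # residual phi_theta(x_i) - y_i
    gW2 = z.T @ r / len(x); gb2 = r.mean(0)
    dz = (r @ W2.T) * (1 - z ** 2)
    gW1 = x.T @ dz / len(x); gb1 = dz.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Prediction: a single point estimate y_hat_j, with no uncertainty attached.
print(phi(np.array([[0.3]]), W1, b1, W2, b2))
```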
The Bayesian deep learning approach can be summarized as follows:
$$\{x_i, y_i\} \longrightarrow p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{p(\mathcal{D})} \longrightarrow p(\theta|\mathcal{D}),$$
$$x_j \longrightarrow p(y_j|x_j, \mathcal{D}) = \int p(y_j|x_j, \theta, \mathcal{D})\, p(\theta|\mathcal{D})\, \mathrm{d}\theta \longrightarrow \hat{y}_j.$$
As we can see, uncertainties are accounted for in both the parameter estimation and the prediction steps. However, as we will see, the computational costs are significant, and we need to find solutions to perform fast computation.
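As a hedged illustration of the prediction step, the following sketch assumes posterior samples of $\theta$ are already available (e.g., from one of the approximate methods of Section 3) and approximates the predictive distribution by Monte Carlo averaging; the function name and its arguments are hypothetical, not part of the paper.

```python
import numpy as np

def predictive(x_new, theta_samples, phi, sigma_eps=0.1):
    """Monte Carlo approximation of p(y | x_new, D): average the model
    phi(x_new, theta_s) over posterior samples theta_s and return the
    predictive mean and standard deviation (epistemic + noise)."""
    preds = np.array([phi(x_new, *theta_s) for theta_s in theta_samples])
    mean = preds.mean(axis=0)
    std = np.sqrt(preds.var(axis=0) + sigma_eps ** 2)
    return mean, std
```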

3. Bayesian Inference and Approximate Computation

In a general Bayesian framework for NNs and DL, the objective is to infer the parameters $\theta$ from the data $\mathcal{D} = \{x_i, y_i\}$ using the Bayes rule:
$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{p(\mathcal{D})},$$
where $p(\theta)$ is the prior, $\ell(\mathcal{D}|\theta) = p(\mathcal{D}|\theta)$ is the likelihood, $p(\theta|\mathcal{D})$ is the posterior, and $p(\mathcal{D}) = \int p(\mathcal{D}|\theta)\, p(\theta)\, \mathrm{d}\theta$ is called the evidence. We can also write $p(\theta|\mathcal{D}) \propto \ell(\mathcal{D}|\theta)\, p(\theta)$, where the classical maximum likelihood estimate (MLE) is defined as $\hat{\theta}_{MLE} = \arg\max_{\theta} \ell(\mathcal{D}|\theta)$.
Particular point summaries of the posterior are of high interest: we may be interested in the maximum a posteriori (MAP) estimate, $\hat{\theta}_{MAP} = \arg\max_{\theta} \ell(\mathcal{D}|\theta)\, p(\theta)$, or in the minimum mean squared error (MSE) estimate, which can be shown to be the posterior mean, $\hat{\theta} = \mathbb{E}_{p(\theta|\mathcal{D})}\left[\theta\right] = \int \theta\, p(\theta|\mathcal{D})\, \mathrm{d}\theta$.
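As a small numerical illustration (not taken from the paper), the following sketch computes both the MAP estimate and the posterior mean for a one-dimensional parameter on a grid, with a Gaussian likelihood and a double-exponential prior chosen only for the example:

```python
import numpy as np

data = np.array([1.8, 2.2, 1.9, 2.4])            # toy observations (hypothetical)
theta = np.linspace(-5, 5, 10001)                 # grid over the 1-D parameter
dtheta = theta[1] - theta[0]

log_like = -0.5 * ((data[:, None] - theta) ** 2).sum(axis=0)   # Gaussian likelihood
log_prior = -np.abs(theta)                                     # double-exponential prior
log_post = log_like + log_prior

post = np.exp(log_post - log_post.max())
post /= post.sum() * dtheta                       # normalized p(theta | D)

theta_map = theta[np.argmax(post)]                # MAP: arg max of the posterior
theta_mean = (theta * post).sum() * dtheta        # posterior mean (MSE estimate)
print(theta_map, theta_mean)
```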
The exact expression of the posterior and the computation of these integrals for high-dimensional problems may be very computationally costly. For this reason, we need to perform approximate computation. In the following subsections, we review a few solutions.

3.1. Laplace Approximation

Rewriting the general Bayes rule slightly differently gives us the following:
$$p(\theta|\mathcal{D}) = \frac{1}{p(\mathcal{D})}\, p(\mathcal{D}|\theta)\, p(\theta) = \frac{1}{Z} \exp\left[\mathcal{L}(\theta|\mathcal{D})\right].$$
The Laplace approximation uses a second-order expansion of $\mathcal{L}(\theta|\mathcal{D}) = \ln p(\mathcal{D}|\theta) + \ln p(\theta)$ around $\hat{\theta}_{MAP}$ to construct a Gaussian approximation of $p(\theta|\mathcal{D})$:
$$\mathcal{L}(\theta) \approx \mathcal{L}(\hat{\theta}_{MAP}) + \frac{1}{2} (\theta - \hat{\theta}_{MAP})^{T}\, \nabla_{\theta}^{2}\mathcal{L}(\theta)\big|_{\hat{\theta}_{MAP}}\, (\theta - \hat{\theta}_{MAP}),$$
where the first-order term vanishes at $\hat{\theta}_{MAP}$. This is equivalent to making the following approximation:
$$\text{Approximate } p(\theta|\mathcal{D}) \text{ by } q(\theta) = \mathcal{N}\left(\theta \,\Big|\, \hat{\theta}_{MAP},\, \mathbf{\Sigma} = \left[-\nabla_{\theta}^{2}\mathcal{L}(\theta)\big|_{\hat{\theta}_{MAP}}\right]^{-1}\right).$$
With this approximation, the evidence $p(\mathcal{D})$ is approximated by the following:
$$Z = p(\mathcal{D}) \approx (2\pi)^{d/2}\, |\mathbf{\Sigma}|^{1/2}\, \exp\left[\mathcal{L}(\hat{\theta}_{MAP})\right].$$
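A minimal sketch of the Laplace approximation on the same toy 1-D problem as above (Gaussian likelihood and Gaussian prior, purely illustrative): find $\hat{\theta}_{MAP}$, take the second derivative of $\mathcal{L}$ there, and use the resulting Gaussian both as $q(\theta)$ and to approximate the evidence $Z$.

```python
import numpy as np

data = np.array([1.8, 2.2, 1.9, 2.4])

def L(theta):
    """Log-joint L(theta) = log p(D | theta) + log p(theta), Gaussian/Gaussian."""
    return -0.5 * np.sum((data - theta) ** 2) - 0.5 * theta ** 2

grid = np.linspace(-5, 5, 10001)
theta_map = grid[np.argmax([L(t) for t in grid])]          # crude MAP search

h = 1e-3                                                    # finite-difference step
hess = (L(theta_map + h) - 2 * L(theta_map) + L(theta_map - h)) / h ** 2
Sigma = -1.0 / hess                                         # Sigma = [-L''(theta_MAP)]^{-1}

# Gaussian approximation q = N(theta_MAP, Sigma) and evidence estimate (d = 1).
Z = (2 * np.pi) ** 0.5 * Sigma ** 0.5 * np.exp(L(theta_map))
print(theta_map, Sigma, Z)                                  # here Sigma is exactly 1/5
```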
For high-dimensional problems such as BDL, the full computation of $\mathbf{\Sigma}$ is very costly, and further approximations are needed. The following are a few scalable approximations for BDL:
  • Work with the subnetwork or last layer (transfer learning);
  • Perform covariance matrix decomposition (low rank, Kronecker-factored approximate curvature (KFAC), Diag);
  • Conduct the computation during hyperparameter tuning using cross-validation;
  • Use approximate predictive computation.

3.2. Approximate Computation: Variational Inference

The main idea is then to perform approximate Bayesian computation (ABC) by approximating the posterior $p(\theta|\mathcal{D})$ with a simpler expression $q(\theta)$. When the approximation is obtained by minimizing
$$\mathrm{KL}\left[q(\theta) : p(\theta|\mathcal{D})\right] = \int q(\theta) \log \frac{q(\theta)}{p(\theta|\mathcal{D})}\, \mathrm{d}\theta,$$
the method is called the variational Bayesian approximation (VBA). When $q(\theta)$ is chosen to be separable in some components of the parameters, $q(\theta) = \prod_j q_j(\theta_j)$, the approximation is called the mean-field VBA (MFVBA).
Let us come back to the general VBA. Writing $\mathrm{KL}\left[q : p\right] = \mathbb{E}_q\left[\log \frac{q}{p}\right] = \mathbb{E}_q\left[\log q\right] - \mathbb{E}_q\left[\log p\right]$, and denoting the expected log-likelihood term by $\mathcal{L} = \mathbb{E}_q\left[\log p\right]$ and the entropy of $q$ by $\mathcal{S} = -\mathbb{E}_q\left[\log q\right]$, we have the following:
$$q^{*} = \arg\min_{q} \mathrm{KL}\left[q : p\right] = \arg\max_{q} \mathcal{E}, \qquad \mathcal{E} = \mathcal{S} + \mathcal{L}.$$
$\mathcal{E}$ is also called the evidence lower bound (ELBO):
$$\mathrm{ELBO} = -\mathbb{E}_q\left[\log q\right] + \mathbb{E}_q\left[\log p\right].$$
At this point, it is important to note one main property of the VBA: when the posterior probability law $p(\theta|\mathcal{D})$ and the approximate probability law $q$ are both in the exponential family, then $\mathbb{E}_{p}\left[\theta\right] = \mathbb{E}_{q}\left[\theta\right]$.
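To illustrate the VBA on the same toy problem as before (a deliberately naive sketch, not a practical algorithm): choose $q = \mathcal{N}(m, v)$, estimate the ELBO by Monte Carlo, and pick the variational parameters $(m, v)$ that maximize it.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.8, 2.2, 1.9, 2.4])

def log_joint(theta):
    """log p(D | theta) + log p(theta), Gaussian likelihood and Gaussian prior."""
    return -0.5 * np.sum((data[:, None] - theta) ** 2, axis=0) - 0.5 * theta ** 2

def elbo(m, v, n=4000):
    """ELBO = E_q[log p] + entropy(q) for q = N(m, v), estimated by sampling."""
    theta = m + np.sqrt(v) * rng.normal(size=n)
    entropy = 0.5 * np.log(2 * np.pi * np.e * v)
    return log_joint(theta).mean() + entropy

# Crude grid search over the variational parameters (m, v).
ms, vs = np.linspace(0, 4, 81), np.linspace(0.05, 1.0, 40)
scores = np.array([[elbo(m, v) for v in vs] for m in ms])
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print(ms[i], vs[j])          # the exact posterior here is N(1.66, 0.2)
```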

3.3. VBA and Natural Exponential Family

If $q$ is chosen to be in a natural exponential family, $q(\theta|\eta) = \exp\left[\eta^{T}\theta - A(\eta)\right]$, then it is entirely characterized by its mean $\mathbf{m} = \mathbb{E}_q\left[\theta\right]$, and if $q$ is conjugate to $p$, then $q^{*}(\theta|\eta^{*}) = \exp\left[\eta^{*T}\theta - A(\eta^{*})\right]$, which is entirely characterized by its mean $\mathbf{m}^{*} = \mathbb{E}_{q^{*}}\left[\theta\right]$. We can then express the objective $\mathcal{E}$ as a function of $\mathbf{m}$, and the first-order optimality condition is $\frac{\partial \mathcal{E}}{\partial \mathbf{m}}\big|_{\mathbf{m} = \mathbf{m}^{*}} = 0$. From this property, we can obtain a fixed-point algorithm to compute $\mathbf{m}^{*} = \mathbb{E}_{q^{*}}\left[\theta\right]$:
$$\frac{\partial \mathcal{E}}{\partial \mathbf{m}}\Big|_{\mathbf{m} = \mathbf{m}^{*}} = 0 \;\Longleftrightarrow\; \frac{\partial \mathcal{E}}{\partial \mathbf{m}}\Big|_{\mathbf{m} = \mathbf{m}^{*}} + \mathbf{m}^{*} = \mathbf{m}^{*} \;\Longleftrightarrow\; M(\mathbf{m}^{*}) = \mathbf{m}^{*}.$$
Iterating on this fixed-point relation gives us the following:
$$\mathbf{m}^{(k)} = M(\mathbf{m}^{(k-1)}), \qquad \text{with } M(\mathbf{m}) := \frac{\partial \mathcal{E}}{\partial \mathbf{m}} + \mathbf{m},$$
which converges to $\mathbf{m}^{*} = \mathbb{E}_{q^{*}}\left[\theta\right]$. This algorithm can be summarized as follows:
  • Having chosen the prior and the likelihood, find the expression of $p(\theta|\mathcal{D}) \propto p(\mathcal{D}|\theta)\, p(\theta)$;
  • Choose a family $q$ and find the expressions of $\mathcal{L} = \mathbb{E}_q\left[\ln p\right]$ and $\mathcal{S} = -\mathbb{E}_q\left[\ln q\right]$, which yield $\mathcal{E} = \mathcal{L} + \mathcal{S}$ as a function of $\mathbf{m} = \mathbb{E}_q\left[\theta\right]$;
  • Find the expression of the vector operator $M(\mathbf{m}) = \frac{\partial \mathcal{E}}{\partial \mathbf{m}} + \mathbf{m}$ and iterate $\mathbf{m}^{(k)} = M(\mathbf{m}^{(k-1)})$ until convergence, which results in $\mathbf{m}^{*} = \mathbb{E}_{q^{*}}\left[\theta\right] = \mathbb{E}_{p}\left[\theta\right]$.
At this point, it is important to note that, in this approach, even if the mean is well approximated, the variances or covariances are underestimated. Some authors have proposed solutions to better estimate the covariance; see [14] for one of them.

4. Deep Learning and Bayesian DL

As introduced before, in classical DL, the training and prediction steps can be summarized as follows:
$$\{x_i, y_i\} \longrightarrow \text{NN Learning}: \phi_{\theta}(\cdot) \longrightarrow \theta, \qquad x_j \longrightarrow \text{NN}: \phi_{\theta}(x_j) \longrightarrow \hat{y}_j.$$
In this approach, there is no uncertainty quantification. Bayesian deep learning can be summarized as follows:
$$\{x_i, y_i\} \longrightarrow p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\, p(\theta)}{p(\mathcal{D})} \longrightarrow p(\theta|\mathcal{D}) \text{ or } q(\theta|\mathcal{D}),$$
$$x_j \longrightarrow p(y_j|x_j, \mathcal{D}) = \int p(y_j|x_j, \theta, \mathcal{D})\, p(\theta|\mathcal{D})\, \mathrm{d}\theta \longrightarrow p(y_j|x_j, \mathcal{D}) \text{ or } q(y_j|x_j, \mathcal{D}).$$
As we can see, uncertainties are accounted for in both the parameter estimation and the prediction steps. However, the computational costs are significant, and we need solutions to perform fast computation. As mentioned before, the possible tools are the Laplace and Gaussian approximations, variational inference, and more controlled approximations designed to yield new deep learning algorithms that can scale up to practical situations.

Exponential Family Approximation

Let us consider the case of general exponential families, $q(\theta|\lambda) = \exp\left[\lambda^{T} t(\theta) - F(\lambda)\right]$, where $\theta$ represents the original parameters, $\lambda$ the natural parameters, $t(\theta)$ the sufficient statistics, and $F(\lambda)$ the log-partition function, and where we define the expectation parameters as $\mu := \mathbb{E}_q\left[t(\theta)\right]$. Let us also define the dual function $F^{*}(\eta)$ and the dual parameters $\eta$ via the Legendre transform:
$$G(\eta) = F^{*}(\eta) = \sup_{\lambda}\left\{\langle \lambda, \eta \rangle - F(\lambda)\right\}.$$
Then, we can show the triangular relation between $\theta \in \Theta$, $\lambda \in \Lambda$, and $\mu \in \mathcal{M}$ shown in Figure 1.
With these notations, applying the VBA rule $\min_{q \in \mathcal{Q}} \left\{\mathbb{E}_q\left[\mathcal{L}(\theta)\right] - \mathcal{H}(q)\right\}$ results in the following updating rule for the natural parameters:
$$\lambda \leftarrow \lambda - \rho\, \nabla_{\mu}\left\{\mathbb{E}_q\left[\mathcal{L}(\theta)\right] - \mathcal{H}(q)\right\}.$$
For example, by considering the Gaussian case
$$q(\theta) = \mathcal{N}(\theta|\mathbf{m}, \mathbf{S}^{-1}) \propto \exp\left[-\tfrac{1}{2}(\theta - \mathbf{m})^{T}\mathbf{S}(\theta - \mathbf{m})\right] \propto \exp\left[(\mathbf{S}\mathbf{m})^{T}\theta + \mathrm{Tr}\left(-\tfrac{\mathbf{S}}{2}\, \theta\theta^{T}\right)\right],$$
we can identify the natural parameters as $\lambda = \left[\mathbf{S}\mathbf{m}, -\mathbf{S}/2\right]$ and the expectation parameters as $\mu := \left[\mathbb{E}_q\{\theta\}, \mathbb{E}_q\{\theta\theta^{T}\}\right]$, and we easily obtain the following algorithm:
$$\mathbf{S}\mathbf{m} \leftarrow (1-\rho)\,\mathbf{S}\mathbf{m} - \rho\, \nabla_{\mathbb{E}_q[\theta]} \mathbb{E}_q\left[\mathcal{L}(\theta)\right], \qquad \mathbf{S} \leftarrow (1-\rho)\,\mathbf{S} + 2\rho\, \nabla_{\mathbb{E}_q[\theta\theta^{T}]} \mathbb{E}_q\left[\mathcal{L}(\theta)\right],$$
where
$$\nabla_{\mathbb{E}_q[\theta]} \mathbb{E}_q\left[\mathcal{L}(\theta)\right] = \mathbb{E}_q\left[\nabla_{\theta}\mathcal{L}(\theta)\right] - 2\, \nabla_{\mathbb{E}_q[\theta\theta^{T}]} \mathbb{E}_q\left[\mathcal{L}(\theta)\right]\, \mathbf{m}, \qquad \nabla_{\mathbb{E}_q[\theta\theta^{T}]} \mathbb{E}_q\left[\mathcal{L}(\theta)\right] = \tfrac{1}{2}\, \mathbb{E}_q\left[\mathbf{H}(\theta)\right],$$
which results, explicitly, in
$$\mathbf{m} \leftarrow \mathbf{m} - \rho\, \mathbf{S}^{-1} \nabla_{\mathbf{m}} \mathcal{L}(\mathbf{m}), \qquad \mathbf{S} \leftarrow (1-\rho)\,\mathbf{S} + \rho\, \mathbf{H}_{\mathbf{m}},$$
where $\mathbf{H}(\theta)$ denotes the Hessian of $\mathcal{L}$ and $\mathbf{H}_{\mathbf{m}}$ its value at $\mathbf{m}$.
For a linear model $\mathbf{y} = \mathbf{X}\theta$ and Gaussian priors, we have
$$\mathcal{L}(\theta) = (\mathbf{y} - \mathbf{X}\theta)^{T}(\mathbf{y} - \mathbf{X}\theta) + \gamma\, \theta^{T}\theta = -2\, \theta^{T}(\mathbf{X}^{T}\mathbf{y}) + \mathrm{Tr}\left[\theta\theta^{T}(\mathbf{X}^{T}\mathbf{X} + \gamma\mathbf{I})\right] + \text{const},$$
and $\mathbb{E}_q\left[\mathcal{L}(\theta)\right] = \tilde{\lambda}^{T}\mu + \text{const}$ with $\tilde{\lambda} = \left[-2\mathbf{X}^{T}\mathbf{y}, \mathbf{X}^{T}\mathbf{X} + \gamma\mathbf{I}\right]$, so that $\nabla_{\mu}\mathbb{E}_q\left[\mathcal{L}(\theta)\right] = \tilde{\lambda}$.
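A minimal numerical sketch of the explicit updates above for this linear Gaussian model (all sizes, data, and the step size $\rho$ are hypothetical): iterating $\mathbf{S} \leftarrow (1-\rho)\mathbf{S} + \rho\,\mathbf{H}_{\mathbf{m}}$ and $\mathbf{m} \leftarrow \mathbf{m} - \rho\,\mathbf{S}^{-1}\nabla\mathcal{L}(\mathbf{m})$ drives $q = \mathcal{N}(\mathbf{m}, \mathbf{S}^{-1})$ towards the Gaussian with mean $(\mathbf{X}^{T}\mathbf{X}+\gamma\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y}$ and precision equal to the Hessian of $\mathcal{L}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 5                                             # hypothetical sizes
X = rng.normal(size=(N, M))
y = X @ rng.normal(size=M) + 0.1 * rng.normal(size=N)
gamma, rho = 1.0, 0.3

grad = lambda m: 2 * (X.T @ (X @ m - y) + gamma * m)     # gradient of L(theta)
Hess = 2 * (X.T @ X + gamma * np.eye(M))                 # Hessian of L (constant here)

m, S = np.zeros(M), np.eye(M)                            # q(theta) = N(m, S^{-1})
for _ in range(200):
    S = (1 - rho) * S + rho * Hess                       # precision update
    m = m - rho * np.linalg.solve(S, grad(m))            # mean update

print(m)
print(np.linalg.solve(X.T @ X + gamma * np.eye(M), X.T @ y))   # same vector
```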
It is interesting to note that many classical algorithms for updating the parameters, such as forward–backward, sparse variational inference, and variational message passing, become special cases of this updating rule. A main remark here is that this linear generating-function case links us directly to linear inverse problems $\mathbf{g} = \mathbf{H}\mathbf{f} + \boldsymbol{\epsilon}$ if we replace $\mathbf{y}$ with $\mathbf{g}$, $\mathbf{X}$ with $\mathbf{H}$, and $\theta$ with $\mathbf{f}$.

5. NN, DL, and Bayesian Inference for Linear Inverse Problems

To show the possibilities of the interaction between inverse problem methods, neural networks, and deep learning, the best way is to give a few examples.

5.1. First Example: A Known Linear Forward Model

The first and easiest example is the case of a linear inverse problem $\mathbf{g} = \mathbf{H}\mathbf{f} + \boldsymbol{\epsilon}$, where the forward model $\mathbf{H}$ is known and we assume a Gaussian likelihood $p(\mathbf{g}|\mathbf{f}) = \mathcal{N}(\mathbf{g}|\mathbf{H}\mathbf{f}, \sigma_{\epsilon}^{2}\mathbf{I})$ and a Gaussian prior $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{0}, \sigma_{f}^{2}\mathbf{I})$. In this case, the posterior is Gaussian, $p(\mathbf{f}|\mathbf{g}) = \mathcal{N}(\mathbf{f}|\hat{\mathbf{f}}, \hat{\mathbf{\Sigma}})$, with
$$\hat{\mathbf{f}} = (\mathbf{H}^{t}\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^{t}\mathbf{g} = \mathbf{A}\mathbf{g} = \mathbf{B}\mathbf{H}^{t}\mathbf{g}, \quad \text{or equivalently,} \quad \hat{\mathbf{f}} = \mathbf{H}^{t}(\mathbf{H}\mathbf{H}^{t} + \lambda\mathbf{I})^{-1}\mathbf{g} = \mathbf{H}^{t}\mathbf{C}\mathbf{g},$$
where $\lambda = \sigma_{\epsilon}^{2}/\sigma_{f}^{2}$, $\mathbf{A} = (\mathbf{H}^{t}\mathbf{H} + \lambda\mathbf{I})^{-1}\mathbf{H}^{t}$, $\mathbf{B} = (\mathbf{H}^{t}\mathbf{H} + \lambda\mathbf{I})^{-1}$, $\mathbf{C} = (\mathbf{H}\mathbf{H}^{t} + \lambda\mathbf{I})^{-1}$, and $\hat{\mathbf{\Sigma}} = \sigma_{\epsilon}^{2}(\mathbf{H}^{t}\mathbf{H} + \lambda\mathbf{I})^{-1}$.
These relations can be presented schematically as
$$\mathbf{g} \longrightarrow \mathbf{A} \longrightarrow \hat{\mathbf{f}}, \qquad \mathbf{g} \longrightarrow \mathbf{H}^{t} \longrightarrow \mathbf{B} \longrightarrow \hat{\mathbf{f}}, \qquad \mathbf{g} \longrightarrow \mathbf{C} \longrightarrow \mathbf{H}^{t} \longrightarrow \hat{\mathbf{f}}.$$
We can then consider replacing $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ with appropriate deep neural networks and apply all the previous BDL methods to them. As we can see, these relations directly induce a linear feed-forward NN structure. In particular, if $\mathbf{H}$ represents a convolution operator, then so do $\mathbf{H}^{t}$, $\mathbf{H}^{t}\mathbf{H}$, and $\mathbf{H}\mathbf{H}^{t}$, as well as the operators $\mathbf{B}$ and $\mathbf{C}$. Thus, the whole inversion can be modeled using a CNN [15,16].
For the case of computed tomography (CT), the first operation is equivalent to an analytic inversion, the second corresponds to back projection followed by 2D filtering in the image domain, and the third corresponds to the famous filtered back projection (FBP), which is implemented on classical CT scanners. These cases are illustrated in Figure 2.
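The following sketch checks numerically, on a small random problem with hypothetical sizes, that the two forms of $\hat{\mathbf{f}}$ above coincide and computes the posterior covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 40                                   # unknowns, measurements (hypothetical)
H = rng.normal(size=(N, M))                     # known forward operator
sigma_eps, sigma_f = 0.1, 1.0
f_true = sigma_f * rng.normal(size=M)
g = H @ f_true + sigma_eps * rng.normal(size=N)

lam = sigma_eps ** 2 / sigma_f ** 2
B = np.linalg.inv(H.T @ H + lam * np.eye(M))    # B = (H^t H + lam I)^{-1}
C = np.linalg.inv(H @ H.T + lam * np.eye(N))    # C = (H H^t + lam I)^{-1}

f_hat1 = B @ (H.T @ g)                          # f_hat = B H^t g
f_hat2 = H.T @ (C @ g)                          # f_hat = H^t C g
print(np.allclose(f_hat1, f_hat2))              # True: the two forms are equivalent

Sigma_hat = sigma_eps ** 2 * B                  # posterior covariance
```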

5.2. Second Example: A Deep Learning Equivalence of Iterative Gradient-Based Algorithms

One of the classical iterative methods for linear inverse problems is based on the gradient descent method to optimize $J(\mathbf{f}) = \|\mathbf{g} - \mathbf{H}\mathbf{f}\|^{2}$:
$$\mathbf{f}^{(k+1)} = \mathbf{f}^{(k)} + \alpha\, \mathbf{H}^{t}(\mathbf{g} - \mathbf{H}\mathbf{f}^{(k)}) = \alpha\, \mathbf{H}^{t}\mathbf{g} + (\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H})\, \mathbf{f}^{(k)},$$
where the solution is obtained recursively. It is well known that when the forward operator $\mathbf{H}$ is singular or ill-conditioned, this iterative algorithm starts by converging but may then easily diverge. One practical way to obtain an acceptable approximate solution is simply to stop after $K$ iterations. This idea can be translated into a deep learning NN with $K$ layers, each layer representing one iteration of the algorithm. See Figure 3.
This DL structure can easily be extended to a regularized criterion, $J(\mathbf{f}) = \frac{1}{2}\|\mathbf{g} - \mathbf{H}\mathbf{f}\|^{2} + \lambda\|\mathbf{D}\mathbf{f}\|^{2}$, which can also be interpreted as the MAP or posterior mean solution with a Gaussian likelihood and prior. In this case, we have the following:
$$\mathbf{f}^{(k+1)} = \mathbf{f}^{(k)} + \alpha\left[\mathbf{H}^{t}(\mathbf{g} - \mathbf{H}\mathbf{f}^{(k)}) - \lambda\, \mathbf{D}^{t}\mathbf{D}\mathbf{f}^{(k)}\right] = \alpha\, \mathbf{H}^{t}\mathbf{g} + (\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H} - \alpha\lambda\, \mathbf{D}^{t}\mathbf{D})\, \mathbf{f}^{(k)}.$$
We just need to replace $(\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H})$ with $(\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H} - \alpha\lambda\, \mathbf{D}^{t}\mathbf{D})$; a sketch of this unrolling is given below.
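A minimal sketch of this "K iterations = K layers" idea for the regularized criterion (operators, sizes, and the stopping index $K$ are hypothetical); every pass through the loop plays the role of one layer with shared weights $\mathbf{W}$ and bias $\alpha\mathbf{H}^{t}\mathbf{g}$:

```python
import numpy as np

def unrolled_gradient_net(g, H, D, lam=0.1, K=30):
    """K unrolled layers of the update f <- alpha H^t g + (I - alpha H^tH - alpha lam D^tD) f."""
    A = H.T @ H + lam * (D.T @ D)
    alpha = 1.0 / np.linalg.norm(A, 2)          # step size from the spectral norm
    W = np.eye(H.shape[1]) - alpha * A          # shared layer weights
    b = alpha * (H.T @ g)                       # shared bias term alpha H^t g
    f = np.zeros(H.shape[1])
    for _ in range(K):                          # each pass = one layer of the network
        f = b + W @ f
    return f
```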
This structure can also be extended to all the sparsity-enforcing regularization terms, such as $\ell_1$ and the total variation (TV), using appropriate algorithms such as ISTA (iterative soft-thresholding algorithm) or its fast version, FISTA, by replacing the update expression and adding a nonlinear operation, much like in ordinary NNs. A simple example is given in the following subsection.

5.3. Third Example: $\ell_1$ Regularization and NN

The MAP solution with a Gaussian likelihood and a double-exponential prior is equivalent to the minimizer of the $\ell_1$ regularization criterion:
$$J(\mathbf{f}) = \|\mathbf{g} - \mathbf{H}\mathbf{f}\|_{2}^{2} + \lambda\, \|\mathbf{f}\|_{1},$$
where the solution can be obtained with an iterative optimization algorithm, such as ISTA:
$$\mathbf{f}^{(k+1)} = \mathrm{Prox}_{\ell_1}\left(\mathbf{f}^{(k)}, \lambda\right) = \mathcal{S}_{\lambda\alpha}\left(\alpha\, \mathbf{H}^{t}\mathbf{g} + (\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H})\, \mathbf{f}^{(k)}\right),$$
where $\mathcal{S}_{\theta}$ is a soft-thresholding operator and the step size satisfies $\alpha \le 1/L$, with $L = \max|\mathrm{eig}(\mathbf{H}^{t}\mathbf{H})|$ the Lipschitz constant of the normal operator. When $\mathbf{H}$ is a convolution operator, then
  • $(\mathbf{I} - \alpha\, \mathbf{H}^{t}\mathbf{H})\, \mathbf{f}^{(k)}$ can also be approximated by a convolution and can thus be considered as a filtering operator;
  • $\alpha\, \mathbf{H}^{t}\mathbf{g}$ can be considered as a bias term and is also obtained via a convolution operator;
  • $\mathcal{S}_{\theta}$ with threshold $\theta = \lambda\alpha$ is a nonlinear pointwise operator. In particular, when $\mathbf{f}$ is a positive quantity, this soft-thresholding operator can be compared to the ReLU activation function of an NN. See Figure 4 and the sketch below.
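A minimal sketch of $K$ unrolled ISTA layers (hypothetical sizes and $\lambda$; the soft threshold plays the role of the activation function):

```python
import numpy as np

def soft_threshold(u, t):
    """S_t(u): soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)

def unrolled_ista(g, H, lam=0.1, K=50):
    """K ISTA iterations seen as K layers: affine map + pointwise nonlinearity."""
    L_const = np.linalg.norm(H.T @ H, 2)        # Lipschitz constant of the normal operator
    alpha = 1.0 / L_const                       # step size
    W = np.eye(H.shape[1]) - alpha * (H.T @ H)  # "filtering" weights shared by all layers
    b = alpha * (H.T @ g)                       # bias term
    f = np.zeros(H.shape[1])
    for _ in range(K):
        f = soft_threshold(b + W @ f, lam * alpha)
    return f
```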
The iterative gradient-based algorithm with a fixed number of iterations, used for computing a generalized inverse (GI) or a regularized one as explained in the previous section, can thus be turned into a DL structure with $K$ layers, $K$ being the number of iterations before stopping. Figure 5 shows this structure for a quadratic regularization, which results in a linear NN, and Figure 6 shows the case of $\ell_1$ regularization.
In all these examples, we could directly obtain the structure of the NN from the forward model and the known parameters. However, there are some difficulties in determining the structure of the NN. For example, in the first example, the structure of $\mathbf{B}$ depends on the regularization parameter $\lambda$. The same difficulty arises in determining the shape and the threshold level of the thresholding block of the network in the second example. This dependence on the regularization parameter, as well as on many other hyperparameters, makes it necessary to learn the NN structure and weights. In practice, we can decide, for example, on the number and structure of the layers of a DL network, but as the corresponding weights depend on many unknown or difficult-to-fix parameters, ML may be of help. In the following, we first consider the training part of a general ML method. Then, we will see how to include the physics-based knowledge of the forward model in the structure of learning [4,10,17,18].

5.4. Decomposition of the NN Structure to Fixed and Trainable Parts

The first, easiest, and most understandable method consists of decomposing the structure of the network $\mathbf{W}$ into two parts: a fixed part and a learnable part. As the simplest example, we can consider the analytical expression of the quadratic regularization solution, $\hat{\mathbf{f}} = (\mathbf{H}^{t}\mathbf{H} + \lambda\, \mathbf{D}^{t}\mathbf{D})^{-1}\mathbf{H}^{t}\mathbf{g} = \mathbf{B}\mathbf{H}^{t}\mathbf{g}$, which suggests a two-layer network with a fixed part $\mathbf{H}^{t}$ and a trainable part $\mathbf{B} = (\mathbf{H}^{t}\mathbf{H} + \lambda\, \mathbf{D}^{t}\mathbf{D})^{-1}$. See Figure 6.
It is interesting to note that in X-ray computed tomography (CT), the forward operator $\mathbf{H}$ is called projection, the adjoint operator $\mathbf{H}^{t}$ is called back projection (BP), and the $\mathbf{B}$ operator corresponds to a 2D filtering (convolution).
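A minimal sketch of this decomposition (sizes, noise level, and the least-squares training of $\mathbf{B}$ are illustrative assumptions, not the authors' implementation): the first layer applies the known back projection $\mathbf{H}^{t}$, and the second layer is a linear map $\mathbf{B}$ learned from simulated training pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, n_train = 16, 32, 500                          # hypothetical sizes
H = rng.normal(size=(N, M))                          # known forward operator

F = rng.normal(size=(n_train, M))                    # simulated ground-truth objects f
G = F @ H.T + 0.05 * rng.normal(size=(n_train, N))   # simulated data g = H f + noise

Z = G @ H                                            # fixed layer applied row-wise: z = H^t g
Bt, *_ = np.linalg.lstsq(Z, F, rcond=None)           # trainable layer fitted by least squares
B = Bt.T                                             # so that f_hat = B z

# Inference on new data: f_hat = B H^t g.
f_test = rng.normal(size=M)
g_test = H @ f_test + 0.05 * rng.normal(size=N)
f_hat = B @ (H.T @ g_test)
```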

6. DL Structure and Deterministic or Bayesian Computation

To be able to analyze a DNN in either a deterministic or a Bayesian manner, let us come back to the general notation and consider the following NN with input $\mathbf{x}$, output $\mathbf{y}$, and intermediate hidden variables $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_{L-1}$:
$$\mathbf{x} \longrightarrow \text{first layer} \longrightarrow \mathbf{z}_1 \longrightarrow \text{second layer} \longrightarrow \mathbf{z}_2 \longrightarrow \cdots \longrightarrow \text{last layer} \longrightarrow \mathbf{y}.$$
In the deterministic case, each layer is defined by its parameters $(\mathbf{W}_l, \mathbf{b}_l)$:
$$\mathbf{x} \longrightarrow f_{\mathbf{W}_0, \mathbf{b}_0}(\mathbf{x}) \longrightarrow \mathbf{z}_1 \longrightarrow f_{\mathbf{W}_1, \mathbf{b}_1}(\mathbf{z}_1) \longrightarrow \mathbf{z}_2 \longrightarrow \cdots \longrightarrow f_{\mathbf{W}_L, \mathbf{b}_L}(\mathbf{z}_{L-1}) \longrightarrow \mathbf{y},$$
and we can write:
$$\mathbf{y} = f_{\theta}(\mathbf{x}) = \left(f_{(\mathbf{W}_L, \mathbf{b}_L)} \circ \cdots \circ f_{(\mathbf{W}_1, \mathbf{b}_1)} \circ f_{(\mathbf{W}_0, \mathbf{b}_0)}\right)(\mathbf{x}),$$
with
$$\theta = \left((\mathbf{W}_0, \mathbf{b}_0), (\mathbf{W}_1, \mathbf{b}_1), \ldots, (\mathbf{W}_L, \mathbf{b}_L)\right),$$
and
$$\mathbf{z}_0 = \mathbf{x}, \qquad \mathbf{z}_l = f_l(\mathbf{W}_l \mathbf{z}_{l-1} + \mathbf{b}_l), \quad l = 1, \ldots, L-1, \qquad \mathbf{y} = f_L(\mathbf{W}_L \mathbf{z}_{L-1} + \mathbf{b}_L).$$

6.1. Deterministic DL Computation

In general, during the training steps, the parameters θ are estimated via
$$\theta^{*} = \arg\min_{\theta} \left\{\sum_{i} \ell\left(\mathbf{y}_i, f_{\theta}(\mathbf{x}_i)\right) + \sum_{k} \lambda_k\, \phi_k(\mathbf{W}_k, \mathbf{b}_k)\right\}.$$
The main questions here are how to choose $\lambda_k$ and $\phi_k(\cdot)$, as well as which optimization algorithm to choose for better convergence.
When the parameters have been obtained (the model is trained), we can use it easily via
$$\mathbf{z}_0 = \mathbf{x}, \qquad \mathbf{z}_l = f_l(\mathbf{W}_l \mathbf{z}_{l-1} + \mathbf{b}_l), \qquad \mathbf{y} = f_L(\mathbf{W}_L \mathbf{z}_{L-1} + \mathbf{b}_L).$$
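For concreteness, the forward pass can be coded exactly as this recursion; the layer sizes and activations below are hypothetical:

```python
import numpy as np

def forward(x, layers):
    """Apply y = f_L(W_L z_{L-1} + b_L) with z_l = f_l(W_l z_{l-1} + b_l), z_0 = x.
    `layers` is a list of (W, b, activation) tuples."""
    z = x
    for W, b, f in layers:
        z = f(W @ z + b)
    return z

relu = lambda u: np.maximum(u, 0.0)
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(8, 4)), np.zeros(8), relu),          # hidden layer
          (rng.normal(size=(1, 8)), np.zeros(1), lambda u: u)]   # linear output layer
y = forward(rng.normal(size=4), layers)
```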

6.2. Bayesian Deep Neural Network

In Bayesian DL, the question of choosing $\lambda_k$ and $\phi_k(\cdot)$ in the previous case becomes the choice of the prior, $p(\theta)$, which can be assumed a priori separable in the components of $\theta$ or not. Then, we have to choose the expression of the likelihood (in general, Gaussian) and find the expression of the posterior $p(\theta|\mathcal{D})$. As explained extensively before, using this posterior directly is almost impossible. Fortunately, we have a great number of approximate computation methods, such as MCMC sampling, slice sampling, nested sampling, data augmentation, and variational inference, which can be used in practical situations. However, the training step in Bayesian DL remains very costly, particularly if we want to quantify the uncertainties.

Prediction Step

In the prediction step, we again have to choose a probability law $p(\mathbf{x})$ for the class of possible inputs and laws for all the outputs $\mathbf{z}_l$ conditional on their inputs $\mathbf{z}_{l-1}$. Then, we can consider, for example, a Gibbs sampling scheme. A comparison between deterministic and Bayesian DL is shown here:
$$\text{Deterministic:} \quad \mathbf{z}_0 = \mathbf{x}, \quad \mathbf{z}_l = f_l(\mathbf{W}_l \mathbf{z}_{l-1} + \mathbf{b}_l), \; l = 1, \ldots, L-1, \quad \mathbf{y} = f_L(\mathbf{W}_L \mathbf{z}_{L-1} + \mathbf{b}_L);$$
$$\text{Bayesian:} \quad \mathbf{z}_0 \sim p(\mathbf{x}), \quad \mathbf{z}_l \sim p(\mathbf{z}_l | \mathbf{z}_{l-1}), \; l = 1, \ldots, L-1, \quad \mathbf{y} \sim p(\mathbf{y} | \mathbf{z}_{L-1}).$$
If we consider Gaussian laws for the input and all the conditional variables, then we can write the following:
$$\mathbf{z}_0 \sim \mathcal{N}(\mathbf{z}_0 | \mathbf{x}, \tau_0 \mathbf{I}), \qquad \mathbf{z}_l \sim \mathcal{N}\left(\mathbf{z}_l \,\big|\, f_l(\mathbf{z}_{l-1}), \tau_l \mathbf{I}\right), \quad l = 1, \ldots, L-1, \qquad \mathbf{y} \sim p(\mathbf{y} | \mathbf{z}_{L-1}).$$
Here too, the main difficulty occurs when there are nonlinear activation functions, particularly in the last layer, where the Gaussian approximation may no longer be valid.
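A hedged sketch of this layer-wise Gaussian sampling scheme (the $\tau_l$ values, the layer format, and the sample size are assumptions; it reuses the `(W, b, activation)` layer list of the previous sketch): propagating samples instead of point values yields a sample-based approximation of $p(\mathbf{y}|\mathbf{x}, \mathcal{D})$.

```python
import numpy as np

def sample_forward(x, layers, taus, n_samples=1000, rng=None):
    """Draw samples of y by sampling z_0 ~ N(x, tau_0 I) and
    z_l ~ N(f_l(W_l z_{l-1} + b_l), tau_l I) at each layer;
    return the predictive mean and standard deviation."""
    rng = rng or np.random.default_rng(0)
    ys = []
    for _ in range(n_samples):
        z = x + np.sqrt(taus[0]) * rng.normal(size=np.shape(x))
        for (W, b, f), tau in zip(layers, taus[1:]):
            z = f(W @ z + b) + np.sqrt(tau) * rng.normal(size=W.shape[0])
        ys.append(z)
    ys = np.array(ys)
    return ys.mean(axis=0), ys.std(axis=0)
```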

7. Application: Infrared Imaging

Infrared (IR) imaging is used to diagnose and survey the temperature field distribution of sensitive objects in many industrial applications. These images are, in general, of low resolution and very noisy. The real temperature values also depend on many other parameters, such as emissivity, attenuation, and diffusion, due to the distance between the camera and the sources. To be really useful in practice, we need to reduce the noise, calibrate and increase the resolution, segment the images to detect the hot areas, and finally survey the temperature values of the different areas over time to be able to conduct preventive diagnosis and possible maintenance.
Reducing the noise can be accomplished by filtering using the Fourier transform, the wavelet transform, or other sparse representations of images. To increase the resolution, we may use deconvolution methods if we can obtain the point spread function (PSF) of the camera, or blind deconvolution techniques otherwise. The segmentation and detection of the hot areas and the estimation of the temperature value in each area are also very important steps in real applications. Each of these steps can be performed separately, but proposing a global processing pipeline using DL or BDL is necessary for real applications. As all of these steps are in fact different inverse problems, and it is difficult to fix the parameters of each step in a robust way, we propose a global process using BDL. See the global scheme in Figure 7.
In the first step, as the final objective is to segment the image into different levels of temperature (for example, four levels: background, normal, high, and very high), we propose to design an NN that takes as input a low-resolution and noisy image and outputs a segmented image with those four levels and, at the same time, a good estimate of the temperature in each segment. See Figure 8.
To train this NN, we can generate images with different known shapes to serve as ground truth and simulate the blurring effect of temperature diffusion via convolution with different appropriate PSFs. We can also add noise to generate realistic images; a simulation sketch is given below. In addition, we can use black-body thermal sources and acquire images under different conditions. All these images can be used for training the network. See an example of the obtained results in Figure 9.
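A minimal sketch of this simulation-based training-data generation (image size, PSF width, noise level, and the four temperature levels are all hypothetical choices): known shapes are blurred by a Gaussian PSF and corrupted by noise, while the unblurred level map serves as the ground-truth label.

```python
import numpy as np

def gaussian_psf(size=9, sigma=2.0):
    """Normalized 2-D Gaussian point spread function."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    psf = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return psf / psf.sum()

def blur(img, psf):
    """Plain 'same'-size 2-D correlation with zero padding (= convolution for a symmetric PSF)."""
    k = psf.shape[0] // 2
    padded = np.pad(img, k)
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + psf.shape[0], j:j + psf.shape[1]] * psf)
    return out

def simulate_ir_pair(shape=(64, 64), levels=(0.0, 1.0, 2.0, 3.0), rng=None):
    """Return (noisy blurred IR-like image, ground-truth label map with 4 levels)."""
    rng = rng or np.random.default_rng()
    labels = np.zeros(shape, dtype=int)
    for lvl in (1, 2, 3):                                # one random hot rectangle per level
        r, c = rng.integers(0, shape[0] - 16, size=2)
        labels[r:r + 12, c:c + 12] = lvl
    truth = np.array(levels)[labels]
    image = blur(truth, gaussian_psf()) + 0.1 * rng.normal(size=shape)
    return image, labels
```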

Author Contributions

Conceptualization and methodology, A.M.-D., N.C., L.W. and L.Y.; software, A.M.-D.; validation, A.M.-D., N.C., L.W. and L.Y.; formal analysis, A.M.-D.; investigation, resources, and data curation, N.C.; writing—original draft preparation, writing—review and editing, visualization and supervision, A.M.-D.; project administration and funding acquisition, N.C. and A.M.-D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data sharing is not applicable to this article.

Conflicts of Interest

The authors Ali Mohammad-Djafari and Ning Chu are scientific researchers at Zhejiang Shangfeng Special Blower Company. The authors declare no conflicts of interest.

References

  1. Ayasso, H.; Mohammad-Djafari, A. Joint NDT Image Restoration and Segmentation Using Gauss-Markov-Potts Prior Models and Variational Bayesian Computation. IEEE Trans. Image Process. 2010, 19, 2265–2277. [Google Scholar] [CrossRef] [PubMed]
  2. Tang, J.; Egiazarian, K.; Golbabaee, M.; Davies, M. The Practicality of Stochastic Optimization in Imaging Inverse Problems. IEEE Trans. Comput. Imaging 2020, 6, 1471–1485. [Google Scholar] [CrossRef]
  3. Zhu, Y.; Zabaras, N. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. J. Comput. Phys. 2018, 366, 415–447. [Google Scholar] [CrossRef]
  4. McCann, M.T.; Jin, K.H.; Unser, M. A Review of Convolutional Neural Networks for Inverse Problems in Imaging. Image Video Process. 2017, 34, 85–95. [Google Scholar] [CrossRef]
  5. Fang, Z. A High-Efficient Hybrid Physics-Informed Neural Networks Based on Convolutional Neural Network. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 5514–5526. [Google Scholar] [CrossRef] [PubMed]
  6. Gilton, D.; Ongie, G.; Willett, R. Neumann Networks for Linear Inverse Problems in Imaging. IEEE Trans. Comput. Imaging 2020, 6, 328–343. [Google Scholar] [CrossRef]
  7. Gong, D.; Zhang, Z.; Shi, Q.; van den Hengel, A.; Shen, C.; Zhang, Y. Learning Deep Gradient Descent Optimization for Image Deconvolution. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5468–5482. [Google Scholar] [CrossRef] [PubMed]
  8. De Haan, K.; Rivenson, Y.; Wu, Y.; Ozcan, A. Deep-Learning-Based Image Reconstruction and Enhancement in Optical Microscopy. Proc. IEEE 2020, 108, 30–50. [Google Scholar] [CrossRef]
  9. Aggarwal, H.K.; Mani, M.P.; Jacob, M. MoDL: Model-Based Deep Learning Architecture for Inverse Problems. IEEE Trans. Med. Imaging 2019, 38, 394–405. [Google Scholar] [CrossRef] [PubMed]
  10. Chen, Y.; Lu, L.; Karniadakis, G.E.; Negro, L.D. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Opt. Express 2020, 28, 11618–11633. [Google Scholar] [CrossRef] [PubMed]
  11. Raissi, M.; Perdikaris, P.; Karniadakis, G.E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 2019, 378, 686–707. [Google Scholar] [CrossRef]
  12. Mohammad-Djafari, A. Hierarchical Markov modeling for fusion of X-ray radiographic data and anatomical data in computed tomography. In Proceedings of the IEEE International Symposium on Biomedical Imaging, Washington, DC, USA, 7–10 July 2002; pp. 401–404. [Google Scholar] [CrossRef]
  13. Mohammad-djafari, A. Regularization, Bayesian Inference and Machine Learning methods for Inverse Problems. Entropy 2021, 23, 1673. [Google Scholar] [CrossRef] [PubMed]
  14. Giordano, R.; Broderick, T.; Jordan, M. Linear response methods for accurate covariance estimates from mean field variational Bayes. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015. [Google Scholar]
  15. Chun, I.Y.; Huang, Z.; Lim, H.; Fessler, J. Momentum-Net: Fast and convergent iterative neural network for inverse problems. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 45, 4915–4931. [Google Scholar] [CrossRef] [PubMed]
  16. Lucas, A.; Iliadis, M.; Molina, R.; Katsaggelos, A.K. Using Deep Neural Networks for Inverse Problems in Imaging: Beyond Analytical Methods. IEEE Signal Process. Mag. 2018, 35, 20–36. [Google Scholar] [CrossRef]
  17. Chang, J.H.R.; Li, C.L.; Poczos, B.; Kumar, B.V.K.V.; Sankaranarayanan, A.C. One Network to Solve Them All—Solving Linear Inverse Problems using Deep Projection Models. In Proceedings of the IEEE International Conference on Computer Vision 2017, Venice, Italy, 22–29 October 2017. [Google Scholar]
  18. Liang, D.; Cheng, J.; Ke, Z.; Ying, L. Deep Magnetic Resonance Image Reconstruction: Inverse Problems Meet Neural Networks. IEEE Signal Process. Mag. 2020, 37, 141–151. [Google Scholar] [CrossRef] [PubMed]
Figure 1. General exponential family, with original parameters $\theta$, natural parameters $\lambda$, expectation parameters $\mu := \mathbb{E}_q\left[t(\theta)\right]$, and their relations via the Legendre transform.
Figure 2. Three linear NN structures, derived directly from the quadratic regularization inversion method. The right part of this figure is adapted from [16].
Figure 3. A K-layer DL NN equivalent to K iterations of the basic optimization algorithm.
Figure 4. A K-layer DL NN equivalent to K iterations of a basic gradient-based optimization algorithm. A quadratic regularization results in a linear NN, while an $\ell_1$ regularization results in a classical NN with a nonlinear activation function. Left: supervised case. Right: unsupervised case. In both cases, all K layers have the same structure.
Figure 5. All K layers of a DL NN equivalent to K iterations of an iterative gradient-based optimization algorithm. The simplest solution is to choose $\mathbf{W}_0 = \alpha\mathbf{H}^{t}$ and $\mathbf{W}^{(k)} = \mathbf{W} = (\mathbf{I} - \alpha\mathbf{H}^{t}\mathbf{H})$, $k = 1, \ldots, K$. A more robust, but more costly, approach is to learn all the layers $\mathbf{W}^{(k)} = (\mathbf{I} - \alpha^{(k)}\mathbf{H}^{t}\mathbf{H})$, $k = 1, \ldots, K$.
Figure 6. Training (top) and testing (bottom) steps in the first use of physics-based ML approach.
Figure 7. The proposed four groups of layers of NN for denoising, deconvolution, and segmentation of IR images.
Figure 8. Example of expected results in deterministic methods. First row: a simulated IR image (left), its ground truth labels (middle), and the result of the deconvolution and segmentation (right). Second row: a real IR image (left) and the result of its deconvolution and segmentation (right).
Figure 9. Example of expected results in Bayesian methods. First row from left: (a) simulated IR image, (b) its ground truth labels, (c) the result of the deconvolution and segmentation, and (d) uncertainties. Second row: (e) a real IR image, (f) no ground truth, (g) the result of its deconvolution and segmentation, and (h) uncertainties.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
