Article

Accelerating Convergence of Langevin Dynamics via Adaptive Irreversible Perturbations

1 Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen 518055, China
2 Institute for Transport Studies, University of Leeds, Leeds LS2 9JT, UK
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(1), 118; https://doi.org/10.3390/math12010118
Submission received: 19 November 2023 / Revised: 20 December 2023 / Accepted: 26 December 2023 / Published: 29 December 2023

Abstract

Irreversible perturbations in Langevin dynamics have been widely recognized for their role in accelerating convergence in simulations of multi-modal distributions $\pi(\theta)$. A commonly used and easily computed standard irreversible perturbation is $J\nabla\log\pi(\theta)$, where $J$ is a skew-symmetric matrix. However, Langevin dynamics employing a fixed-scale standard irreversible perturbation face a trade-off between local exploitation and global exploration, associated with small and large perturbation scales, respectively. To address this trade-off, we introduce Langevin dynamics with adaptive irreversible perturbations, in which the scale of the standard irreversible perturbation changes adaptively. Through numerical examples, we demonstrate that adaptive irreversible perturbations can enhance the performance of Langevin dynamics compared to fixed-scale irreversible perturbations.

1. Introduction

Markov chain Monte Carlo (MCMC) algorithms represent the gold standard for Bayesian inference. One widely employed category of MCMC algorithms is based on Langevin dynamics (LD), which uses the gradient of the negative logarithm of the target distribution (the energy function) to define a stochastic differential equation (SDE). Recent research has focused on improving the efficiency of LD-based MCMC algorithms [1,2,3,4]. While LD-based MCMC algorithms have made notable strides in simulating high-dimensional systems to address big-data challenges, they mix rapidly only when the energy landscape is relatively simple. When simulating multi-modal distributions, however, the LD approach is susceptible to becoming trapped in local energy minima, leading to biased statistical inference. To improve convergence in complex energy landscapes, various strategies have been proposed. One widely recognized and effective approach introduces specific perturbations to LD, a technique that has demonstrated promising utility in the literature [5,6,7,8,9].
It is well established that specific perturbations to LD can accelerate the convergence of the dynamics toward the target distribution. Hwang et al. [5] demonstrate that including a weighted divergence-free drift in LD accelerates convergence to the target distribution. Rey-Bellet and Spiliopoulos [6] show that introducing appropriate reversible and irreversible perturbations to diffusion processes can have several beneficial effects, including enlarging the spectral gap of the generator, increasing the rate function for large deviations, and decreasing the asymptotic variance of estimators. One notable example of a reversible perturbation is the Riemannian manifold Langevin dynamics (RMLD) introduced by Girolami and Calderhead [10], which defines a Riemannian metric to modify the computation of distances and gradients. It is also well documented [6,7,11] that certain irreversible perturbations to the LD approach can enhance its convergence while still maintaining its invariant distribution. Given the individual successes of both irreversible and reversible perturbations, such as the RMLD, in enhancing Langevin samplers, Zhang et al. [9] introduced geometry-informed irreversible perturbations to accelerate the convergence of LD-based algorithms for Bayesian computation. Their work combines irreversible and reversible perturbations by introducing a form of irreversible perturbation for RMLD that is informed by the underlying geometry, and demonstrates that this perturbation can improve estimation performance compared to traditional irreversible perturbations that ignore geometric characteristics.
While perturbations that account for the geometric characteristics of the invariant distribution, such as RMLD and geometry-informed irreversible perturbations, notably accelerate the convergence of LD towards its invariant distribution, exploiting these geometric features often requires computing quantities like the expected Fisher information or Hessian matrices at each iteration, which substantially elevates the computational overhead per iteration. So, in many application scenarios, for computational convenience, one often chooses the standard irreversible perturbation, $\Gamma(\theta) = J\nabla\log\pi(\theta)$, with $J$ denoting a constant skew-symmetric matrix and $\pi(\theta)$ denoting the target distribution. Indeed, with a large magnitude of $J$, the flow $J\nabla\log\pi(\theta)$ produces rapid exploration of the energy function. However, this introduces additional constraints: arbitrarily increasing the magnitude of $J$ not only leads to significant discretization error but also weakens the ability of LD to exploit the local geometry. Therefore, we want the magnitude of $J$ to be adaptable when using an LD sampler with irreversible perturbations. In most cases, the magnitude of $J$ remains relatively small; increasing it adaptively enables LD to rapidly explore the energy function, and once LD reaches new regions, the magnitude of $J$ adapts to be small again, ensuring stable sampling. This gives the LD sampler with irreversible perturbations excellent exploration capabilities while maintaining stable local exploitation. So far, the concept of adaptively changing the magnitude of $J$ has not been implemented, because a naive alteration of the magnitude of $J$ would prevent LD from converging to the target distribution.
In this paper, we propose an adaptively irreversible replica-exchange Langevin dynamics (AIRELD) algorithm by combining the parallel tempering strategy of replica exchange Langevin dynamics (RELD) [1] with the perturbation strategy. Within AIRELD, the magnitude of the irreversible perturbation adjusts dynamically, correlating with the temperature changes in the parallel tempering process. At higher temperatures, the irreversible perturbation scale amplifies, resulting in rapid exploration of the energy function; this raises the probability of escaping local traps and fosters a comprehensive exploration of the entire domain. Conversely, at lower temperatures, the behavior of LD takes on the characteristics of a gradient flow, emphasizing the use of local geometry and thus producing stable local sampling. The adaptive irreversible perturbations accelerate the convergence of LD and, if needed, can be employed in conjunction with the stochastic gradient Langevin dynamics (SGLD) algorithm to leverage the computational advantages of stochastic gradient methods.
The organization of the paper is as follows: In Section 2, we offer an overview of the parallel tempering algorithm, RELD, and irreversible perturbations within LD, which have the potential to enhance sampling efficiency from the target distribution. In Section 3, we introduce our algorithm, AIRELD, and extend it to the stochastic gradient setting. Section 4 presents simulation studies comparing the AIRELD approach against standard irreversible perturbations and RELD. Lastly, Section 5 summarizes our findings and outlines potential directions for future research.

2. Improving the Performance of Langevin Dynamics

We begin by revisiting relevant background information on Langevin samplers, RELD, and irreversible perturbations in LD.
A frequently employed sampling scheme is the (overdamped) LD, characterized by the SDE
$$\mathrm{d}\theta_t = -\nabla U(\theta_t)\,\mathrm{d}t + \sqrt{2\tau}\,\mathrm{d}W_t, \tag{1}$$
where $\theta \in \mathbb{R}^d$, $\{W_t\}_{t \geq 0}$ represents a standard $d$-dimensional Brownian motion, $U(\cdot)$ is the energy function, and $\tau > 0$ is the temperature parameter. Under reasonable regularity assumptions, the process $\{\theta_t\}_{t \geq 0}$ possesses a unique invariant distribution, which is absolutely continuous with respect to the Lebesgue measure with density
$$\pi(\theta) = \frac{e^{-U(\theta)/\tau}}{\int_{\mathbb{R}^d} e^{-U(\theta)/\tau}\,\mathrm{d}\theta}. \tag{2}$$
In the context of Bayesian statistics, when dealing with independent data $y_1, \dots, y_N$, the posterior distribution is $\pi(\theta) \propto p(\theta)\prod_{i=1}^N f(y_i \mid \theta)$, where $p(\theta)$ represents the prior density and $f(y_i \mid \theta)$ represents the likelihood for the $i$th observation. In this framework, $U(\theta) = -\log p(\theta) - \sum_{i=1}^N \log f(y_i \mid \theta)$.
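To make the discretization concrete, the Euler–Maruyama scheme applied to (1) with step size $h$ gives the familiar unadjusted Langevin update $\theta_{k+1} = \theta_k - h\nabla U(\theta_k) + \sqrt{2\tau h}\,\xi_k$ with $\xi_k \sim \mathcal{N}(0, I)$. The following is a minimal sketch of this update, not the authors' implementation; the callable grad_U is an assumed interface returning $\nabla U(\theta)$.

```python
import numpy as np

def langevin_step(theta, grad_U, h, tau, rng):
    """One Euler-Maruyama step of d(theta_t) = -grad U(theta_t) dt + sqrt(2*tau) dW_t."""
    noise = rng.standard_normal(theta.shape)
    return theta - h * grad_U(theta) + np.sqrt(2.0 * tau * h) * noise

def langevin_sample(theta0, grad_U, h=1e-3, tau=1.0, n_steps=10_000, seed=0):
    """Run the chain and return the full trajectory."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    for t in range(n_steps):
        theta = langevin_step(theta, grad_U, h, tau, rng)
        samples[t] = theta
    return samples
```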

2.1. Replica Exchange Langevin Dynamics

The temperature parameter $\tau$ in LD (1) plays a crucial role in the sampling process for a high-dimensional energy function $U(\theta)$. In practice, the LD approach is often executed at a fixed temperature. A high temperature $\tau$ endows the LD with enhanced exploration capability and significantly accelerates convergence to the flattened invariant distribution across the entire domain. However, although the distribution of $\{\theta_t\}_{t \geq 0}$ then converges more rapidly to the global invariant distribution $\pi(\theta)$, it is less concentrated around the global minima of $U(\theta)$. On the other hand, when the temperature parameter $\tau$ is small, the convergence of $\{\theta_t\}_{t \geq 0}$ becomes slow, and particles may remain trapped in a local minimum under a limited computational budget without effectively exploring the entire domain; as a result, over a finite period, the obtained samples are distant from the invariant distribution $\pi(\theta)$. The effectiveness of a constant temperature is therefore quite limited. Hence, in tempering strategies, temperature is treated as an auxiliary variable: lower temperatures concentrate the distribution's mass around the global optimum, resulting in higher energy barriers and local traps, while higher temperatures flatten the distribution, reduce energy barriers, and facilitate transitions between different modes of the distribution. Two extensively researched tempering methods are worth noting. The first is parallel tempering, which has been investigated in the context of MCMC for sampling from multi-modal distributions, with RELD as an example; for further insights, refer to Earl and Deem [12], Sindhikara et al. [13], and Deng et al. [14], as well as their associated references. The second is simulated tempering, a technique akin to parallel tempering and likewise aimed at accelerating the convergence of LD, introduced by Marinari and Parisi [15] and further developed by Geyer and Thompson [16]. The primary distinction is that replica exchange tracks multiple diffusion dynamics driven by different temperatures and accelerates convergence through swapping, whereas simulated tempering follows a single diffusion process but treats temperature as a stochastic process.
A recent and potent parallel tempering algorithm, known as RELD, a specialized variant of the parallel tempering algorithm, has been introduced to accelerate the convergence of the SDE, as illustrated in Figure 1.
RELD bridges the gap between global exploration and local exploitation by employing a high-temperature particle for exploration and a low-temperature particle for exploitation, allowing them to swap positions. RELD consists of the following pair of LDs with temperatures $\tau_1 < \tau_2$:
$$\mathrm{d}\theta_t^{(1)} = -\nabla U(\theta_t^{(1)})\,\mathrm{d}t + \sqrt{2\tau_1}\,\mathrm{d}W_t^{(1)}, \qquad \mathrm{d}\theta_t^{(2)} = -\nabla U(\theta_t^{(2)})\,\mathrm{d}t + \sqrt{2\tau_2}\,\mathrm{d}W_t^{(2)}. \tag{3}$$
The invariant distribution of $(\theta_t^{(1)}, \theta_t^{(2)})$ is
$$\pi(\theta^{(1)}, \theta^{(2)}) \propto e^{-\frac{U(\theta^{(1)})}{\tau_1} - \frac{U(\theta^{(2)})}{\tau_2}}. \tag{4}$$
The primary concept behind the replica exchange is to enhance the low-temperature particle’s capacity for comprehensive global exploration through the exchange of its position with the high-temperature particle. Concurrently, the low-temperature particle possesses good local exploitation capability, making its trajectory the preferred choice for the sampling outcome.
By allowing the exchange of positions between the two particles, the state may transition from $(\theta_t^{(1)}, \theta_t^{(2)})$ to $(\theta_{t+\mathrm{d}t}^{(2)}, \theta_{t+\mathrm{d}t}^{(1)})$ at a swapping rate $r S(\theta_t^{(1)}, \theta_t^{(2)})\,\mathrm{d}t$. Here, the constant $r \geq 0$ represents the swapping intensity, and $S(\cdot,\cdot)$ is defined as
$$S(\theta_t^{(1)}, \theta_t^{(2)}) = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(U(\theta_t^{(1)}) - U(\theta_t^{(2)})\right)}. \tag{5}$$
In such scenarios, RELD operates as a Markov jump process, which is reversible, leading to the same invariant distribution (4).
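As a discrete-time illustration of RELD, the sketch below performs one Langevin update per replica and then attempts a swap with probability $\min(1, rSh)$, a common discretization of the continuous-time swap rate. The functions U and grad_U are assumed interfaces; this is a schematic sketch rather than the authors' code.

```python
import numpy as np

def reld_step(th1, th2, U, grad_U, h, tau1, tau2, r, rng):
    """One RELD step: a Langevin update for each replica, then a possible swap."""
    th1 = th1 - h * grad_U(th1) + np.sqrt(2 * tau1 * h) * rng.standard_normal(th1.shape)
    th2 = th2 - h * grad_U(th2) + np.sqrt(2 * tau2 * h) * rng.standard_normal(th2.shape)
    # Swap function (5): S = exp((1/tau1 - 1/tau2)(U(th1) - U(th2))).
    S = np.exp((1.0 / tau1 - 1.0 / tau2) * (U(th1) - U(th2)))
    if rng.random() < min(1.0, r * S * h):  # discretized swap rate r * S * dt
        th1, th2 = th2, th1
    return th1, th2
```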

2.2. Irreversible Perturbations

When faced with a multi-modal target distribution, characterized by numerous local minima of $U(\theta) = -\log\pi(\theta)$, the convergence of LD to its invariant distribution may be slow. In such circumstances, it can be beneficial to modify the dynamics through adjustments to the drift, facilitating the escape of the dynamics from the local minima of $U(\theta)$.
Consider the following LD:
$$\mathrm{d}\theta_t = [\nabla\log\pi(\theta_t) + \Gamma(\theta_t)]\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t. \tag{6}$$
When $\Gamma(\theta) \equiv 0$, the process is reversible and possesses $\pi(\theta)$ as its invariant distribution. However, if $\Gamma(\theta) \not\equiv 0$, the resulting process will generally lack time-reversibility, unless $\Gamma(\theta)$ can be expressed as a multiple of $\nabla\log\pi(\theta)$ [17].
However, even with an irreversible perturbation, it is still possible to preserve the invariant distribution of the system. By considering the Fokker–Planck equation, it can be shown that if $\Gamma(\theta)$ is selected such that $\nabla\cdot(\Gamma(\theta)\pi(\theta)) = 0$, then $\pi$ remains the invariant distribution. Clearly, numerous (in fact, infinitely many) choices can alter the reversible dynamics while maintaining the invariant distribution. A natural question arises: can the inclusion of an irreversible term enhance the rate of convergence to the invariant distribution, and if so, is there an optimal perturbation that maximizes this rate? To date, this problem has not been systematically resolved, and no general conclusions have been reached; only a small subset of problems has been addressed. For example, Lelièvre et al. [18] provide a comprehensive analysis of the best choice of irreversible perturbation, the one leading to the fastest convergence, in the context of a linear system. In practical applications, a frequently employed selection in the literature is the standard irreversible perturbation $\Gamma(\theta) = J\nabla\log\pi(\theta)$, with $J$ representing a constant skew-symmetric matrix, i.e., $J = -J^T$. The advantage of this choice is clear, as it requires only one additional matrix–vector multiplication per step.
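In code, the standard irreversible perturbation only changes the drift from $-\nabla U$ to $-(I+J)\nabla U$. A minimal sketch under the same assumed grad_U interface:

```python
import numpy as np

def irr_langevin_step(theta, grad_U, J, h, rng):
    """Langevin step with irreversible drift -(I + J) grad U(theta)."""
    d = len(theta)
    drift = -(np.eye(d) + J) @ grad_U(theta)
    return theta + h * drift + np.sqrt(2.0 * h) * rng.standard_normal(d)

# A 2D skew-symmetric J scaled by a magnitude parameter, as used later in Section 4.
J = 0.1 * np.array([[0.0, 1.0], [-1.0, 0.0]])
assert np.allclose(J, -J.T)  # skew-symmetry is what preserves pi as the invariant distribution
```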
The advantages of incorporating irreversible perturbations are widely acknowledged. Hwang et al. [5] present a significant finding, revealing that, under certain conditions, the spectral gap, representing the difference between the top two eigenvalues of the Markov semigroup generator, increases when $\Gamma(\theta) \not\equiv 0$. In [6,7], Rey-Bellet and Spiliopoulos introduce the large deviation rate function as a performance measure for sampling from the invariant distribution. By establishing a connection between the large deviation rate function and the asymptotic variance of the long-term average estimator, they show that an appropriate perturbation $\Gamma(\theta)$ not only increases the large deviation rate function but also diminishes the asymptotic variance of the estimator. The use of irreversible proposals in the Metropolis-adjusted Langevin algorithm (MALA) was studied by Ottobre et al. [19].

3. Adaptively Irreversible Replica-Exchange Langevin Dynamics

While perturbations that account for the geometric characteristics of the invariant distribution, such as RMLD and geometry-informed irreversible perturbations, notably accelerate the convergence of LD toward its invariant distribution, these geometric features often require computing quantities like the expected Fisher information and Hessian matrices during each iteration, which substantially elevates the computational overhead per iteration. As a result, in most practical scenarios, we frequently adopt the standard irreversible perturbation $\Gamma(\theta) = J\nabla\log\pi(\theta)$, with $J$ denoting a constant skew-symmetric matrix. Using $\Gamma(\theta) = J\nabla\log\pi(\theta)$ and the fact that $\pi(\theta) \propto \exp(-U(\theta))$, Equation (6) transforms into the following expression:
$$\mathrm{d}\theta_t = [\nabla\log\pi(\theta_t) + \Gamma(\theta_t)]\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t = -(I + J)\nabla U(\theta_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t. \tag{7}$$
From the discussion in Section 2.2, we know that introducing the standard irreversible perturbation $\Gamma(\theta) = J\nabla\log\pi(\theta)$ yields a certain improvement in the convergence speed of LD. In fact, we can give a rudimentary explanation from LD (7). With $J$ present, and without altering the invariant distribution, both the magnitude and direction of the drift $-(I+J)\nabla U(\theta)$ change. When the magnitude of $J$ is very small, LD (7) behaves much like standard LD. However, when the magnitude of $J$ is significantly larger, it enables LD to transition from one peak to another in a multi-modal invariant distribution, all without altering the invariant distribution. It is worth noting that throughout this process, $J$ remains constant.
During the implementation of LD (7), we notice that the magnitude of $J$ introduces a trade-off between two opposing effects, commonly referred to as global exploration and local exploitation. In more detail, a larger magnitude of $J$ results in a larger drift, thereby increasing the probability of escaping from local minima and facilitating exploration across the entire domain. Conversely, a smaller magnitude of $J$ causes the LD to behave more like a gradient flow, emphasizing the exploitation of local geometry and leading to more concentrated local sampling. Hence, a simple idea suggests itself: when the magnitude of $J$ is very small, it slightly improves the convergence speed while ensuring stable sampling; when the magnitude of $J$ is relatively large, it helps LD explore the various peaks of a multi-modal target distribution. Consequently, a naive approach is to alternate between small and large magnitudes of $J$ during LD execution, with the transitions occurring at manually defined intervals. We introduced these straightforward variations expecting to improve the operational efficiency of LD. Unfortunately, this artificial swapping causes LD to fail to converge to the invariant distribution. Given that manually determined swaps result in non-convergence, we sought an adaptive swapping strategy. One attempt utilizes the temperature swapping criterion defined in RELD, which can be used to adaptively adjust the magnitude of the skew-symmetric matrix $J$. Incorporating this adjustment, we introduce the adaptively irreversible replica-exchange Langevin dynamics (AIRELD) algorithm, outlined in Algorithm 1.
Algorithm 1: Adaptively irreversible replica-exchange Langevin dynamics
AIRELD is characterized by the following pair of LDs with temperatures $\tau_1 < \tau_2$:
$$\mathrm{d}\theta_t^{(1)} = -(I + \tau_1 J)\nabla U(\theta_t^{(1)})\,\mathrm{d}t + \sqrt{2\tau_1}\,\mathrm{d}W_t^{(1)}, \qquad \mathrm{d}\theta_t^{(2)} = -(I + \tau_2 J)\nabla U(\theta_t^{(2)})\,\mathrm{d}t + \sqrt{2\tau_2}\,\mathrm{d}W_t^{(2)}. \tag{8}$$
In contrast to the standard LD approach, each replica in Equation (8) includes an irreversible perturbation of the form $\Gamma(\theta) = J\nabla\log\pi(\theta)$ in its drift, with the scale of $J$ tied to the replica's temperature. The ensuing dynamics are described by
$$\mathrm{d}\theta_t = [\nabla\log\pi(\theta_t) + \Gamma(\theta_t)]\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t, \tag{9}$$
where $\Gamma(\cdot)$ is a smooth vector field. The stationary Fokker–Planck equation associated with (9) then reduces to the condition
$$\nabla\cdot(\Gamma(\theta)\pi(\theta)) = 0.$$
Consequently, any vector field perturbation that is divergence-free with respect to the target distribution can be employed to construct irreversible ergodic dynamics with $\pi(\theta)$ as the invariant distribution. Given that $\nabla\cdot\left(J\nabla\log\pi(\theta)\,\pi(\theta)\right) = 0$ for any skew-symmetric $J$, the invariant distribution of $(\theta_t^{(1)}, \theta_t^{(2)})$ in Equation (8) is
$$\pi(\theta^{(1)}, \theta^{(2)}) \propto e^{-\frac{U(\theta^{(1)})}{\tau_1} - \frac{U(\theta^{(2)})}{\tau_2}}. \tag{10}$$
AIRELD facilitates the exchange of positions between the two particles using the swapping rule of RELD, which preserves this invariance property. Therefore, the invariant distribution of AIRELD remains as expressed in (10). Our experiments also affirm that the AIRELD approach retains the same invariant distribution.
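The following sketch summarizes one AIRELD step as we read Algorithm 1: each replica runs the perturbed dynamics (8) with its temperature-scaled perturbation, and the RELD swap exchanges positions, so a particle's effective perturbation scale changes whenever a swap occurs. This is a schematic reconstruction under those assumptions, not the authors' published code.

```python
import numpy as np

def aireld_step(th1, th2, U, grad_U, J, h, tau1, tau2, r, rng):
    """One AIRELD step: temperature-scaled irreversible drifts plus a RELD swap."""
    I = np.eye(len(th1))
    th1 = th1 - h * (I + tau1 * J) @ grad_U(th1) \
          + np.sqrt(2 * tau1 * h) * rng.standard_normal(th1.shape)
    th2 = th2 - h * (I + tau2 * J) @ grad_U(th2) \
          + np.sqrt(2 * tau2 * h) * rng.standard_normal(th2.shape)
    S = np.exp((1.0 / tau1 - 1.0 / tau2) * (U(th1) - U(th2)))
    if rng.random() < min(1.0, r * S * h):
        th1, th2 = th2, th1  # after a swap, a particle feels the other scale of J
    return th1, th2
```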

Stochastic Gradient Langevin Dynamics

The SGLD approach stands out as the most effective approach for addressing computational challenges in Bayesian inference with large datasets. To reduce the computational requirements per iteration, Welling and Teh [20] introduced the SGLD method, which combines the strengths of LD and stochastic gradients. In each iteration of LD (1), a subset of $n$ samples is randomly selected from the entire dataset, and the gradient with respect to these $n$ samples is computed. Consequently, $\nabla U(\theta_t)$ in (1) is replaced by the unbiased estimator $\nabla\widehat{U}(\theta_t) = \frac{N}{n}\sum_{i \in S_n} \nabla U_i(\theta_t)$, where $U_i(\theta_t) = -\log p(y_i \mid \theta_t) - \frac{1}{N}\log p(\theta_t)$ and $S_n$ is a random sample from $\{1, \dots, N\}$ drawn without replacement. The detailed SGLD approach is outlined in Algorithm 2. When the size of the selected subset, $n$, is much smaller than the total number of samples, $N$, the SGLD approach clearly offers a significantly lower computational cost per iteration than MALA. Moreover, Welling and Teh provided an informal argument that the SGLD approach converges to the true posterior distribution as the step size approaches zero. Therefore, the SGLD approach offers clear advantages when computing resources are limited.
Algorithm 2: Stochastic Gradient Langevin Dynamics
The SGLD approach, when employed with a variable and diminishing step size, has been demonstrated to be consistent; that is, the invariant distribution of the discretized system matches that of the continuous system [21]. However, using a decreasing step size erodes the computational savings associated with stochastic gradients. To address this, a version of the SGLD approach with a fixed step size was introduced by Vollmer et al. [22], together with theoretical characterizations of asymptotic and finite-time bias and variance. In our numerical experiments, we use the stochastic gradient version of the Langevin algorithm with a fixed step size to illustrate that the SGLD approach can be combined with adaptive irreversible perturbations.
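A fixed-step-size SGLD update following the minibatch estimator above might look as follows; the per-datum gradients grad_logprior and grad_loglik are assumed interfaces, and this is an illustrative sketch rather than the reference implementation of [20,22].

```python
import numpy as np

def sgld_step(theta, data, grad_logprior, grad_loglik, h, n, rng):
    """SGLD step with the minibatch gradient estimator (N/n) * sum_{i in S_n} grad U_i."""
    N = len(data)
    idx = rng.choice(N, size=n, replace=False)  # random minibatch S_n
    # grad U_hat = -grad log p(theta) - (N/n) * sum of minibatch likelihood gradients
    grad_hat = -grad_logprior(theta) \
               - (N / n) * sum(grad_loglik(theta, data[i]) for i in idx)
    return theta - h * grad_hat + np.sqrt(2.0 * h) * rng.standard_normal(theta.shape)
```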
However, directly substituting the gradient $\nabla U(\theta_t)$ with its unbiased estimator, the stochastic gradient $\nabla\widehat{U}(\theta_t)$, in AIRELD is not feasible. The unbiased estimators $\widehat{U}(\theta_t^{(1)})$ and $\widehat{U}(\theta_t^{(2)})$ in $\widehat{S} = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\widehat{U}(\theta_t^{(1)}) - \widehat{U}(\theta_t^{(2)})\right)}$ do not yield an unbiased estimator of $S = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(U(\theta_t^{(1)}) - U(\theta_t^{(2)})\right)}$ after the non-linear (exponential) transformation, resulting in a large bias. Deng et al. [14] introduce replica exchange stochastic gradient Langevin dynamics (RESGLD) to mitigate this bias automatically by assuming normality of the stochastic energy, i.e., $\widehat{U}(\theta) \sim \mathcal{N}(U(\theta), \sigma^2)$. Consequently, it follows that
$$\widehat{U}(\theta^{(1)}) - \widehat{U}(\theta^{(2)}) = U(\theta^{(1)}) - U(\theta^{(2)}) + \sqrt{2}\,\sigma W_1,$$
where $W_1$ follows the standard normal distribution. Under this assumption, an unbiased estimator of $S$ is obtained, given by
$$\widehat{S} = e^{\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\left(\widehat{U}(\theta^{(1)}) - \widehat{U}(\theta^{(2)}) - \left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\sigma^2\right)},$$
and the correction term $\left(\frac{1}{\tau_1} - \frac{1}{\tau_2}\right)\sigma^2$ aims to eliminate the bias from the swaps. In practice, the exact variance $\sigma^2$ is rarely known and must be estimated; hence, stochastic approximation (SA) is employed. In each SA step, one obtains an unbiased sample variance $\tilde{\sigma}^2$ of the true $\sigma^2$ and updates the adaptive estimate $\hat{\sigma}^2_{t+1}$ through
$$\hat{\sigma}^2_{t+1} = (1 - \gamma_t)\,\hat{\sigma}^2_t + \gamma_t\,\tilde{\sigma}^2_{t+1},$$
where $\gamma_t$ is the smoothing factor at the $t$-th SA step. With this adjustment, Algorithm 3 outlines the adaptively irreversible replica-exchange stochastic gradient Langevin dynamics (AIRESGLD) algorithm.
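The bias-corrected swap test and the SA variance update translate directly into code. In the sketch below, sigma2_tilde, an unbiased sample variance of the stochastic energy, is treated as given, and the discretized swap probability min(1, r*S_hat*h) is our assumed convention rather than a detail taken from Algorithm 3.

```python
import numpy as np

def corrected_swap_prob(U1_hat, U2_hat, tau1, tau2, sigma2_hat, r, h):
    """Swap probability based on the variance-corrected estimator S_hat."""
    c = 1.0 / tau1 - 1.0 / tau2
    S_hat = np.exp(c * (U1_hat - U2_hat - c * sigma2_hat))
    return min(1.0, r * S_hat * h)

def update_sigma2(sigma2_hat, sigma2_tilde, gamma_t):
    """Stochastic-approximation update of the energy-variance estimate."""
    return (1.0 - gamma_t) * sigma2_hat + gamma_t * sigma2_tilde
```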
Algorithm 3: Adaptively irreversible replica-exchange stochastic gradient Langevin dynamics.

4. Experiments

4.1. Parameters of a Normal Distribution

This example is consistent with the one utilized by Girolami and Calderhead [10]. Consider a dataset of $\mathbb{R}$-valued data, $X = \{X_i\}_{i=1}^N \sim \mathcal{N}(\mu, \sigma^2)$, where our goal is to infer the parameters $\mu$ and $\sigma$. In this example, the state space is represented as $\theta = [\mu, \sigma]^T$. The prior distribution over $\mu$ and $\sigma$ is intentionally chosen to be uniform (flat) and, therefore, improper. The log-posterior distribution can be expressed as follows:
$$\log p(\mu, \sigma \mid X) = -\frac{N}{2}\log 2\pi - N\log\sigma - \sum_{i=1}^N \frac{(X_i - \mu)^2}{2\sigma^2}.$$
The log-posterior gradient can be expressed as follows:
$$\nabla \log p(\mu, \sigma \mid X) = \begin{pmatrix} M_1(\mu)/\sigma^2 \\ -N/\sigma + M_2(\mu)/\sigma^3 \end{pmatrix},$$
where $M_1(\mu) = \sum_{i=1}^N (X_i - \mu)$ and $M_2(\mu) = \sum_{i=1}^N (X_i - \mu)^2$.
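For reference, the log-posterior gradient above is a direct transcription into code (a sketch, not the authors' implementation):

```python
import numpy as np

def grad_log_post(theta, X):
    """Gradient of the flat-prior log-posterior with respect to (mu, sigma)."""
    mu, sigma = theta
    N = len(X)
    M1 = np.sum(X - mu)          # M1(mu) = sum_i (X_i - mu)
    M2 = np.sum((X - mu) ** 2)   # M2(mu) = sum_i (X_i - mu)^2
    return np.array([M1 / sigma**2, -N / sigma + M2 / sigma**3])
```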
In this experiment, we conduct a comparison involving LD, RELD, AIRELD, and LD with the standard irreversible perturbation through the performance of the estimators of $\phi_1(\mu, \sigma) = \mu + \sigma$ and $\phi_2(\mu, \sigma) = \mu^2 + \sigma^2$. Our assessment is based on several performance metrics: mean squared error (MSE), bias, and variance.
In our experimental configuration, we use a dataset comprising $N = 30$ data points, generated by sampling from a normal distribution with true parameters $\mu_{\mathrm{true}} = 0$ and $\sigma_{\mathrm{true}} = 10$. We adopt a constant step size of $h = 10^{-3}$. For RELD and AIRELD, we use two temperatures, a low temperature $\tau_1 = 0.1$ and a high temperature $\tau_2 = 1.5$, with a swapping intensity of 1. For the standard irreversible LD (7), we set $J = 0.1 \times \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$. Our simulations involve generating $M = 1000$ independent trajectories for each system, with each trajectory consisting of $T = 10^6$ steps. We initialize the system with $\mu = 5$ and $\sigma = 20$. To ensure the reliability of our results, we subject each trajectory to a burn-in period of 5000 iterations.
We visualize the MSE, squared bias, and variance of the resulting estimators for $\phi_1(\mu, \sigma) = \mu + \sigma$ and $\phi_2(\mu, \sigma) = \mu^2 + \sigma^2$ in Figure 2 and Figure 3, respectively, where Irr denotes the LD with the standard irreversible perturbation. The results show that the AIRELD approach outperforms RELD, which can be attributed to the introduction of perturbations. Additionally, the AIRELD approach surpasses the standard irreversible LD approach, thanks to the adaptive perturbations. In Figure 4 and Figure 5, we also provide results under the stochastic gradient scenario, which are highly consistent with the outcomes obtained using true gradients.

4.2. Mixture of 25 Two-Dimensional Gaussian Distributions

In this experiment, we examine the early-stage performance of various algorithms. We set the target distribution $\pi(\theta)$ as a mixture of 25 two-dimensional Gaussian probability density functions, defined as follows:
$$\pi(\theta_1, \theta_2) = \sum_{i=1}^{25} w_i \cdot \frac{1}{2\pi\kappa} \exp\left(-\frac{1}{2\kappa}\left[\left(\theta_1 - \xi_i^{(1)}\right)^2 + \left(\theta_2 - \xi_i^{(2)}\right)^2\right]\right).$$
Here, $\xi_i = \left(\xi_i^{(1)}, \xi_i^{(2)}\right)$ represents the center of the $i$-th Gaussian probability density function, and $w_i$ is the corresponding weight. All covariance matrices are set to $\kappa \cdot I_2$, where $I_2$ denotes the two-dimensional identity matrix and $\kappa$ is a positive constant. In our experiments, we choose the $\xi_i$ in a successive pattern, $(0,0), (0,1), \dots, (0,4), \dots, (4,0), (4,1), \dots, (4,4)$, with the corresponding weights set as $1/325, 2/325, \dots, 25/325$. With these settings, it is easily shown that $U(\theta_1, \theta_2) = -\log \pi(\theta_1, \theta_2)$ is non-convex and exhibits multiple local minima.
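The mixture energy $U = -\log\pi$ can be set up as below, with the $\kappa = 0.015$ used in the experiments that follow; the use of logsumexp for numerical stability is our choice, not something specified in the paper.

```python
import numpy as np
from scipy.special import logsumexp

kappa = 0.015
centers = np.array([(i, j) for i in range(5) for j in range(5)], dtype=float)  # 25 modes
weights = np.arange(1, 26) / 325.0  # 1/325, 2/325, ..., 25/325

def U(theta):
    """Energy U(theta) = -log pi(theta) for the 25-component Gaussian mixture."""
    sq = np.sum((centers - theta) ** 2, axis=1)            # squared distance to each center
    log_comp = np.log(weights) - np.log(2 * np.pi * kappa) - sq / (2 * kappa)
    return -logsumexp(log_comp)                            # -log sum_i w_i N(theta; xi_i, kappa I)
```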
In this experiment, with stochastic gradients employed, we choose $\kappa = 0.015$ and a fixed step size of $h = 10^{-4}$. For RESGLD and AIRESGLD, we incorporate two temperatures, a low temperature $\tau_1 = 1$ and a high temperature $\tau_2 = 3$, with a swapping intensity of 1. The target multi-modal density is shown in Figure 6a. Figure 6b–f displays the early-stage empirical performance of all the sampling methods with $T = 10^5$ steps. AIRESGLD maintains relatively stable local sampling capabilities while significantly increasing its exploration capability. AIRESGLD demonstrates superior empirical performance compared to RESGLD and to SGLD with standard irreversible perturbations (irr-small SGLD with $J = \tau_1 \times \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ and irr-large SGLD with $J = \tau_2 \times \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$). In Figure 7, we illustrate the KL divergence across the various sampling methods. Our method, AIRESGLD, demonstrates a marked improvement in sampling performance compared to RESGLD and SGLD with the standard irreversible perturbation, evident from the evolution of the KL divergence.

4.3. Simulations of Multi-Modal Distribution

This example is consistent with the one utilized in Deng et al. [4]. The target density function is given by $\pi(\theta) \propto \exp(-U(\theta))$ with $U(\theta) = 0.2\left(\theta_1^2 + \theta_2^2\right) - 2\left(\cos 2\pi\theta_1 + \cos 2\pi\theta_2\right)$. In this experiment, with stochastic gradients computed, we compare SGLD, irr-small SGLD, irr-large SGLD, RESGLD, and AIRESGLD. We opt for a constant step size of $h = 10^{-4}$. In RESGLD and AIRESGLD, we incorporate two temperatures, $\tau_1 = 1$ for the low temperature and $\tau_2 = 3$ for the high temperature, with a swapping intensity set at 1. For SGLD with the standard irreversible perturbation, we set $J = \tau_1 \times \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ and $J = \tau_2 \times \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}$ for irr-small SGLD and irr-large SGLD, respectively. We also show the performance of contour stochastic gradient Langevin dynamics (CSGLD) [23], specifically designed for simulating multi-modal distributions. In Figure 8, the trajectories of the various sampling algorithms are depicted. Relative to SGLD, irr-small SGLD, CSGLD, and irr-large SGLD, AIRESGLD displays better global exploration capability. Additionally, when compared to RESGLD, the AIRESGLD approach demonstrates an improved capacity for global exploration. Simultaneously, the AIRESGLD approach offers more focused sampling within local regions, demonstrating superior local exploitation ability.
The target multi-modal density is depicted in Figure 9a. Figure 9b–f illustrate the empirical performance of all the sampling algorithms. Based on the sampling results, the AIRESGLD approach demonstrates the most favorable performance. In Figure 10, we illustrate how the KL divergence changes across sampling iterations. The AIRESGLD approach notably minimizes the KL divergence early on, leveraging its robust global exploration capability. Additionally, thanks to its local exploitation ability, the algorithm consistently converges to the target distribution.

4.4. Bayesian Logistic Regression

Next, we consider a binary regression model, wherein $y = \{y_i\}_{i=1}^N$ denotes a vector containing $N$ binary responses and $X$ represents an $N \times d$ matrix of covariates. Let $\theta$ be a $d$-dimensional vector of model parameters; the likelihood function for the logistic regression model takes the form
$$p(y, X \mid \theta) = \prod_{i=1}^N \left(\frac{1}{1 + \exp\left(-\theta^\top x_i\right)}\right)^{y_i} \left(1 - \frac{1}{1 + \exp\left(-\theta^\top x_i\right)}\right)^{1 - y_i},$$
where $x_i$ represents the $d$-dimensional covariate vector for the $i$th observation. The prior distribution for $\theta$ is a zero-mean Gaussian with covariance matrix $\Sigma_\theta = 10 I_d$, where $I_d$ denotes the $d \times d$ identity matrix.
We simulate $N = 10^3$ data points with $x_i \sim \mathcal{N}(0, \Sigma_x)$, where $\Sigma_x(i,j) = \mathcal{U}[-\rho, \rho]^{|i-j|}$ with $\rho = 0.4$. After obtaining the $x_i$, we generate $y_i$ by randomly drawing $\theta$ and sampling from a binomial distribution with probability $1/\left(1 + \exp\left(-\theta^\top x_i\right)\right)$. We partition the dataset into a training set of 800 data points and a test set of 200 data points. Additionally, we employ the log-loss as the metric for assessing the predictive accuracy of the classifier on the test dataset $D_*$. In this case, the log-loss is expressed as
$$\ell(\theta, D_*) = -\frac{1}{T_*} \sum_{(y_*, x_*) \in D_*} \left[ y_* \log p(x_*, \theta) + (1 - y_*)\log\left(1 - p(x_*, \theta)\right) \right],$$
where $p(x_*, \theta) = \left(1 + \exp\left(-\theta^\top x_*\right)\right)^{-1}$ and $T_*$ denotes the size of the test set. In this experiment, we compare the performance of SGLD, irr-small SGLD, irr-large SGLD, RESGLD, and AIRESGLD using the log-loss. We fix the parameter dimension of $\theta$ at 4 and choose a constant step size of $h = 10^{-4}$. In RESGLD and AIRESGLD, we incorporate two temperatures, $\tau_1 = 1$ for the low temperature and $\tau_2 = 3$ for the high temperature, with a swapping intensity set to 1. We create the skew-symmetric matrix $J_{4 \times 4}$ by forming a lower triangular matrix with entries randomly drawn from $\{-1, 1\}$ and then subtracting its transpose. For SGLD with the standard irreversible perturbation, we set $J_1 = \tau_1 \times J_{4 \times 4}$ and $J_2 = \tau_2 \times J_{4 \times 4}$ for irr-small SGLD and irr-large SGLD, respectively.
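The log-loss evaluation is a direct transcription of the formula above; X_test and y_test are assumed arrays of test covariates and labels.

```python
import numpy as np

def log_loss(theta, X_test, y_test):
    """Average negative log-likelihood of the test labels under the logistic model."""
    p = 1.0 / (1.0 + np.exp(-X_test @ theta))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)  # guard against log(0)
    return -np.mean(y_test * np.log(p) + (1.0 - y_test) * np.log(1.0 - p))
```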
The outcomes are visualized in Figure 11. In the logistic regression scenario, AIRESGLD shows a clear improvement in predictive performance. Specifically, among the methods considered, including SGLD, irr-small SGLD, irr-large SGLD, and RESGLD, the AIRESGLD approach exhibits the most favorable predictive performance, consistent with the trade-off between exploitation and exploration discussed above.

5. Conclusions

The primary contribution of this paper lies in the design of the AIRELD algorithm, tailored for simulating multi-modal distributions. The algorithm capitalizes on the insight that constant irreversible perturbations can aid LD in escaping local high-density regions of the target distribution. Simultaneously, by incorporating the swapping mechanism from the RELD approach, which resolves the divergence issue encountered when artificially modifying the magnitude of irreversible perturbations, the AIRELD approach achieves adaptively changing perturbations. Consequently, the AIRELD approach harnesses the deep exploration capabilities of large constant irreversible perturbations, facilitating frequent transitions between distinct peaks of the target density. Conversely, as the dynamic process transitions to a new peak, small constant irreversible perturbations equip it with local exploitation abilities, enhancing its sampling efficiency in the vicinity of these local regions. This adaptive and versatile approach provides a robust solution for effectively sampling multi-modal distributions.
Across multiple numerical experiments, we have validated that adaptive irreversible perturbations do improve the performance of LD-based sampling algorithms. Our proposed algorithm, AIRELD, leverages the advantages of RELD and LD with standard irreversible perturbations. In all experiments, we compared the performance of AIRELD against RELD and LD with standard irreversible perturbations in terms of empirical sampling results, MSE, and KL divergence. The achieved results indicate that our proposed AIRELD algorithm demonstrates notable improvements compared to the other two methods. In Section 4.1, we calculate both real gradients and stochastic gradients. Then, in Section 4.2, Section 4.3 and Section 4.4, we continue with stochastic gradient computations. These findings highlight the applicability of using stochastic gradients alongside adaptive irreversible perturbations for practical computational purposes. Our numerical results affirm the convergence of AIRELD to the target distribution. Theoretically, a constant irreversible perturbation J can maintain the system’s invariant distribution [17]. Additionally, the swapping rule in the RELD approach [1] does not alter the invariant distribution of the dynamics. Hence, the AIRELD approach converges to the target distribution.
Future work can explore theoretical characterizations of the performance of adaptive irreversible perturbations and compare them with other approaches. Additionally, from a computational perspective, investigating parallel multi-chain AIRELD will be an interesting avenue for future research.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W.; software, Z.W., L.Z. and Z.Y.; validation, Z.H., S.W. and L.Y.; writing—original draft preparation, Z.W.; writing—review and editing, Z.H.; visualization, S.W., L.Z. and Z.Y.; supervision, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received partial support from the Shenzhen Science and Technology Program (grant no. ZDSYS20210623092007023, JCYJ20200109141218676) and the SUSTech International Center for Mathematics.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Acknowledgments

Our sincere appreciation goes out to Zhang B.J. and Deng W. for their invaluable contribution of sharing the code from their research paper, which greatly facilitated our study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Y.; Chen, J.; Dong, J.; Peng, J.; Wang, Z. Accelerating nonconvex learning via replica exchange Langevin diffusion. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019.
  2. Li, X.; Wu, Y.; Mackey, L.; Erdogdu, M.A. Stochastic Runge-Kutta accelerates Langevin Monte Carlo and beyond. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019.
  3. Zhang, R.; Li, C.; Zhang, J.; Chen, C.; Wilson, A.G. Cyclical stochastic gradient MCMC for Bayesian deep learning. In Proceedings of the 8th International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020.
  4. Deng, W.; Liang, S.; Hao, B.; Lin, G.; Liang, F. Interacting contour stochastic gradient Langevin dynamics. In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022.
  5. Hwang, C.R.; Hwang-Ma, S.Y.; Sheu, S.J. Accelerating diffusions. Ann. Appl. Probab. 2005, 15, 1433–1444.
  6. Rey-Bellet, L.; Spiliopoulos, K. Irreversible Langevin samplers and variance reduction: A large deviations approach. Nonlinearity 2015, 28, 2081.
  7. Rey-Bellet, L.; Spiliopoulos, K. Variance reduction for irreversible Langevin samplers and diffusion on graphs. Electron. Commun. Probab. 2015, 20, 1–16.
  8. Rey-Bellet, L.; Spiliopoulos, K. Improving the convergence of reversible samplers. J. Stat. Phys. 2016, 164, 472–494.
  9. Zhang, B.J.; Marzouk, Y.M.; Spiliopoulos, K. Geometry-informed irreversible perturbations for accelerated convergence of Langevin dynamics. Stat. Comput. 2022, 32, 78.
  10. Girolami, M.; Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. 2011, 73, 123–214.
  11. Duncan, A.B.; Lelièvre, T.; Pavliotis, G.A. Variance reduction using nonreversible Langevin samplers. J. Stat. Phys. 2016, 163, 457–491.
  12. Earl, D.J.; Deem, M.W. Parallel tempering: Theory, applications, and new perspectives. Phys. Chem. Chem. Phys. 2005, 7, 3910–3916.
  13. Sindhikara, D.; Meng, Y.; Roitberg, A.E. Exchange frequency in replica exchange molecular dynamics. J. Chem. Phys. 2008, 128, 024103.
  14. Deng, W.; Feng, Q.; Gao, L.; Liang, F.; Lin, G. Non-convex learning via replica exchange stochastic gradient MCMC. In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020.
  15. Marinari, E.; Parisi, G. Simulated tempering: A new Monte Carlo scheme. Europhys. Lett. 1992, 19, 451.
  16. Geyer, C.J.; Thompson, E.A. Annealing Markov chain Monte Carlo with applications to ancestral inference. J. Am. Stat. Assoc. 1995, 90, 909–920.
  17. Pavliotis, G.A. Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, 1st ed.; Springer: New York, NY, USA, 2014; pp. 122–124.
  18. Lelièvre, T.; Nier, F.; Pavliotis, G.A. Optimal non-reversible linear drift for the convergence to equilibrium of a diffusion. J. Stat. Phys. 2013, 152, 237–274.
  19. Ottobre, M.; Pillai, N.S.; Spiliopoulos, K. Optimal scaling of the MALA algorithm with irreversible proposals for Gaussian targets. Stochastics Partial Differ. Equ. Anal. Comput. 2020, 8, 311–361.
  20. Welling, M.; Teh, Y.W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011.
  21. Teh, Y.W.; Thiéry, A.; Vollmer, S. Consistency and fluctuations for stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 2016, 17, 1–33.
  22. Vollmer, S.J.; Zygalakis, K.C.; Teh, Y.W. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. J. Mach. Learn. Res. 2016, 17, 5504–5548.
  23. Deng, W.; Lin, G.; Liang, F. A contour stochastic gradient Langevin dynamics algorithm for simulations of multi-modal distributions. Adv. Neural Inf. Process. Syst. 2020, 33, 15725–15736.
Figure 1. The paths of a pair of Langevin dynamics, driven by two different temperatures, are depicted. The orange and red traces represent the lower and higher temperatures, respectively. The background corresponds to the contours of the objective function. It is worth noting that lines of the same color are separate from each other due to the swap operation, indicated by the dashed line.
Figure 2. MSE, squared bias, and variance of the resulting estimator for $\phi_1(\mu, \sigma) = \mu + \sigma$. Real gradients are computed.
Figure 3. MSE, squared bias, and variance of the resulting estimator for $\phi_2(\mu, \sigma) = \mu^2 + \sigma^2$. Real gradients are computed.
Figure 4. MSE, squared bias, and variance of the resulting estimator for $\phi_1(\mu, \sigma) = \mu + \sigma$. Stochastic gradients are computed.
Figure 5. MSE, squared bias, and variance of the resulting estimator for $\phi_2(\mu, \sigma) = \mu^2 + \sigma^2$. Stochastic gradients are computed.
Figure 6. Empirical behavior on 25 two-dimensional Gaussian distributions. Stochastic gradients are computed.
Figure 7. KL divergence of different sampling algorithms. Stochastic gradients are computed.
Figure 8. Trajectory of various sampling algorithms. Stochastic gradients are computed.
Figure 9. Empirical behavior on multi-modal distribution. Stochastic gradients are computed.
Figure 10. KL divergence of different sampling algorithms. Stochastic gradients are computed.
Figure 11. Log-loss calculated on test dataset for different algorithms. Stochastic gradients are computed.