“Exact” and Approximate Methods for Bayesian Inference: Stochastic Volatility Case Study

Shapovalova, Yuliya

doi:10.3390/e23040466

Open AccessArticle

“Exact” and Approximate Methods for Bayesian Inference: Stochastic Volatility Case Study

by

Yuliya Shapovalova

Institute for Computing and Information Sciences, Radboud University Nijmegen, Toernooiveld 212, 6525 EC Nijmegen, The Netherlands

Entropy 2021, 23(4), 466; https://doi.org/10.3390/e23040466

Submission received: 28 February 2021 / Revised: 11 April 2021 / Accepted: 12 April 2021 / Published: 15 April 2021

(This article belongs to the Special Issue Approximate Bayesian Inference)

Download

Browse Figures

Versions Notes

Abstract

:

We conduct a case study in which we empirically illustrate the performance of different classes of Bayesian inference methods to estimate stochastic volatility models. In particular, we consider how different particle filtering methods affect the variance of the estimated likelihood. We review and compare particle Markov Chain Monte Carlo (MCMC), RMHMC, fixed-form variational Bayes, and integrated nested Laplace approximation to estimate the posterior distribution of the parameters. Additionally, we conduct the review from the point of view of whether these methods are (1) easily adaptable to different model specifications; (2) adaptable to higher dimensions of the model in a straightforward way; (3) feasible in the multivariate case. We show that when using the stochastic volatility model for methods comparison, various data-generating processes have to be considered to make a fair assessment of the methods. Finally, we present a challenging specification of the multivariate stochastic volatility model, which is rarely used to illustrate the methods but constitutes an important practical application.

Keywords:

Bayesian inference; Markov Chain Monte Carlo; Sequential Monte Carlo; Riemann Manifold Hamiltonian Monte Carlo; integrated nested laplace approximation; fixed-form variational Bayes; stochastic volatility

1. Introduction

The field of Bayesian statistics and machine learning has advanced in recent years quite rapidly. The methods that have been developed do not often find fast assimilation across different fields. In this review, we aim to provide the reader with methodologies that try to solve the estimation problem in models with latent variables and intractable likelihoods. We are in particular interested in the methods that can be used to estimate nonlinear state-space models and in particular stochastic (latent) volatility models. There are multiple studies that conducted review and comparison of the methods of estimation of the stochastic volatility models [1,2,3]. We briefly mention some of the methods that have been reviewed; however, most of the methods considered in this paper have not entered those reviews. In this paper, we focus in particular on comparing methods that target posterior distribution exactly and the methods that try to approximate it. We also conduct the review from the point of view of estimating multivariate models with these methods and discuss what the bottleneck is in each of them when extending to higher-dimensional stochastic volatility (SV) models. We consider different data-generating processes for simulating data in the empirical studies and conclude that the choice of the data-generating process can heavily affect performance of a method. Thus, illustrating the performance of a method on just one data generating process or one real-world data set is not sufficient.

In financial econometrics literature, GARCH-type models prevail since they are much simpler to estimate. Stochastic (latent) volatility models, however, can be more natural frameworks for modeling asset returns. They can provide flexible and intuitive tools for applications in financial econometrics as well as some other disciplines. In particular, multivariate stochastic volatility models offer an attractive framework for detection and measuring volatility spillover effects. Volatility spillovers in this framework can be defined through Granger-causal links in the latent (unobservable) volatility process, which is modeled with a Vector Autoregressive model (VAR

(p)

). Insights about the causal structure can help to identify the relationship (Granger-causality or/and contemporaneous correlation) between the financial markets. Such information can be insightful and helpful in the decision-making process of portfolio managers and policymakers. These models are, however, rarely considered in practice. Multiple Bayesian inference methods have been proposed for the estimation of this class of models in recent years. In this paper, we identify the bottlenecks in different classes of methods for the estimation of these models in the multivariate case.

One of the stepping stones of estimation of the nonlinear state-space models in general (and stochastic volatility models in particular) lies in the intractability of the likelihood, which is the result of the presence of an unobservable process in the model and nonlinear dependence between this process and the observed data. The likelihood can be estimated with particle filter methods, also known as Sequential Monte Carlo. This is a computationally intensive procedure; however, depending on the problem and the data, it can provide excellent results. The second stepping stone of the estimation is the intractable posterior distribution. A standard starting point for sampling from the posterior distribution is the Metropolis-Hastings algorithm, which is a general method and can be applied straightforwardly to different models. It works well when the number of parameters in the model is small. However, the convergence of the algorithm can be slow in larger models, due to inefficiency of the sampling with random walk proposals. Particle Metropolis-Hastings [4] combines Sequential Monte Carlo for the likelihood estimation with Metropolis-Hastings for the sampling from the posterior, which results in a state-of-the-art method in terms of the estimation quality since it targets the exact posterior. The downside of this method is that it is computationally extremely demanding. Note that, while we consider particle Metropolis-Hastings in this paper, the class of methods from [4] is more general.

Two main downsides of particle Metropolis-Hastings are random walk behavior of the proposals and computational burden. One of the possible solutions to the first problem are the algorithms that use gradient information for the construction of the proposal distribution and thus explore the parameter space more efficiently. An additional step in improving these algorithms is defining them on a Riemann manifold instead of Euclidean space as proposed in [5]. The resulting algorithm, which we consider for the comparison in this paper, is Riemann Manifold Hamiltonian Monte Carlo. For extensive comparison of the methods that exploit gradient information and Langevin dynamics—such as Metropolis-adjusted Langevin algorithm, Hamiltonian Monte Carlo, Riemann manifold Metropolis adjusted Langevin algorithm, andRiemann Manifold Hamiltonian Monte Carlo—we refer to [5].

Thus far, we have discussed the methods that target the posterior distribution exactly and have a high computational burden, which makes empirical investigation of their performance in high-dimensional cases infeasible. In the last decade, a large number of methods have been published on approximate posterior inference thaat allow much faster computations, but lose in terms of precision of the estimation. In this paper, we consider two such methods that deal with different types of approximation. Fixed-form variational Bayes, proposed in [6], assumes hierarchical factorization of the prior and posterior distributions, and the factorized distributions are approximated by an analytically tractable distribution from a certain family of distributions

q (\cdot)

. Then, instead of solving integration problem, one solves the optimization problem of minimizing the Kullback–Leibler divergence between

q (\cdot)

and

p (\cdot)

, where

p (\cdot)

is the target distribution. The second approximate method that we consider is the integrated nested Laplace approximation (INLA) [7]. The method relies on the nested version of the classical Laplace approximation. It became very popular in recent years and made computations in many models feasible.

In this review paper, we focus our attention on the following methodologies and provide a comparison for some of the methods via a simulation study. We consider how the variance of the estimated likelihood is affected by choosing different particle-filtering algorithms. Unlike previous studies, we consider the variance of the estimated likelihood over the whole parameter space and notice that it is affected by some parameters of the model more than by the others. We compare particle Metropolis-Hastings with Riemann Manifold Hamiltonian Monte Carlo as two state-of-the-art sampling methods for this type of problem. We asses how well the INLA method performs in the task of the estimation of the parameters of stochastic volatility model and finally, compare fixed-form variational Bayes methods with sampling by RMHMC. All the between-methods comparisons are performed on multiple simulated data sets with different underlying parameters. We illustrate that, for fair comparison and performance assessment, illustration only on data sets is not sufficient.

The paper is organized as follows. In Section 2, we introduce the model and its different specifications. While in simulation studies we use univariate model, we do introduce multivariate stochastic volatility models with Granger-causal feedback as the model of interest for high-dimensional inference. In Section 3, we review the methods that can be used for the estimation of this class of models. We introduce major ideas behind these methods, and for the details of the derivations we refer to the original papers. In Section 4, we perform empirical case study on different simulated data sets and compare the methods on two real-world time series. We in particular focus on the precision loss of parameter estimation when using approximate methods and how adaptable the methods are to perform multivariate estimation and estimation of various model specifications.

2. Model

2.1. Univariate Stochastic Volatility Model

In this section, we introduce the model of interest that we will use in the simulation studies. Stochastic volatility (SV) models are concerned with modeling asset prices or asset returns depending on how the model is formulated. Let

P_{t}

be the price of the asset at time t or the exchange rate at time t (we consider two applications to real data in Section 3.5: one to exchange rate and one to log-returns). Then the log-return

y_{t}

is

y_{t} = log (1 + R_{t}) = log \frac{P_{t}}{P_{t - 1}} .

(1)

Stochastic volatility models are built in such a way that they can mimic stylized facts about financial markets and log-returns

y_{t}

. Stylized facts are empirically observed statistical properties of asset prices and asset returns. Typical examples of stylized facts are

Volatility clustering and persistence: the big changes in asset returns tend to be followed by big changes, and small changes in asset returns tend to be followed by small changes; in other words, there are periods of large fluctuations and small fluctuations [8].
Leverage effect: the changes in stock prices may be negatively related to the changes in volatility [9].
Co-movements: different stocks tend to exhibit co-movements, which means that if the volatility of one stock changes in a specific direction, volatilities of the other stocks tend to change in the same direction [9].

One of the earlier works that received much attention in the financial literature and proposed a mathematical model that tried to explain the dynamics of financial markets is [10]. Numerous continuous-time stochastic volatility models have been proposed since then, and among the first ones, multiple variants should be mentioned [11,12,13,14]. The model we will be considering in this chapter can be viewed as a discrete version of the model in [13] derived by using Euler–Maruyama approximation. The stochastic volatility model in continuous time can be written as

\begin{matrix} d s (t) = σ (t) d B_{1} (t), \end{matrix}

(2)

\begin{matrix} ln σ^{2} (t) = μ + β ln σ^{2} (t) d t + σ_{η} d B_{2} (t), \end{matrix}

(3)

where

s (t)

is log of asset price,

σ^{2} (t)

is the volatility,

B_{1} (t)

and

B_{2} (t)

are Brownian motions that satisfy

c o r r (B_{1} (t), B_{2} (t)) = ρ

. If

ρ < 0

, there is leverage effect present. Thus, log of asset price follows diffusion and its volatility parameter also follows diffusion [15]. As we often get the data in discrete time, usually the discrete time approximation of the model is used in practice. The discrete model then follows by using Euler–Maruyama approximation

\begin{matrix} y_{t} = σ_{t} ϵ_{t}, \end{matrix}

(4)

\begin{matrix} ln σ_{t + 1}^{2} = μ + ϕ ln σ_{t}^{2} + σ_{η} η_{t + 1}, \end{matrix}

(5)

where

y_{t}

is logarithmic return,

ϵ_{t} = B_{1} (t + 1) - B_{1} (t)

,

η_{t + 1} = B_{2} (t + 1) - B_{2} (t)

,

ϕ = 1 + β

. Further,

ϵ_{t} \sim N (0, 1)

and

η_{t} \sim N (0, 1)

,

c o r r (ϵ_{t}, η_{t + 1}) = ρ .

We get state-space representation of the model that is commonly used by defining

h_{t} = ln σ_{t}^{2}

and

σ_{t}^{2} = exp (h_{2})

\begin{matrix} y_{t} = exp (h_{t} / 2) ϵ_{t}, \end{matrix}

(6)

\begin{matrix} h_{t + 1} = μ + ϕ h_{t} + η_{t + 1}, \end{matrix}

(7)

where

y_{t}

are log-returns that are observed and volatility

h_{t}

is latent and drives the dynamics of

y_{t}

. Figure 1 illustrates this structure of the model. Note that the latent volatility process has an autoregressive form. However, unlike in the standard autoregressive model, the latent volatility is not observed and thus has to be estimated together with the model parameters

μ

,

ϕ

, and

σ_{η}

, which are the scale, the volatility persistence and the noise variance of the latent volatility process, respectively. The persistence parameter

ϕ

reflects one of the stylized facts of financial returns, namely volatility persistence. The intuition is as follows: if

ϕ > 0

and

exp (h_{t - 1} / 2)

is large, then

exp (h_{t} / 2)

will tend to be large too. Hence, the model can account for volatility clustering. In this paper, we consider stationary volatility cases with

| ϕ | < 1

. Finally, one can also incorporate leverage effects by defining negative correlation between noise terms

ϵ_{t}

and

η_{t + 1}

. Intuitive interpretation of the leverage effect goes as follows: bad news tends to decrease the price, which means that financial leverage increases, the firm becomes riskier, and thus expected volatility also increases. The leverage effects in this model have been studied in [16]. The stochastic volatility model can be parametrized in multiple ways; often, the following alternatives are considered [2]. Other ways to parametrize this model are presented in Equarions (8) and (9). The left-hand side version of the model corresponds to that of [17]. The right-hand side version is a different way to define the scaling parameter; in this case, it is

β

. For identifiability reasons, only

β

or

μ

as in Equation (7) should be included in the model.

\begin{matrix} y_{t} = \sqrt{h_{t}} ϵ_{t} & y_{t} = β exp (h_{t} / 2) ϵ_{t} \end{matrix}

(8)

\begin{matrix} log h_{t} = μ + ϕ log h_{t - 1} + η_{t} & h_{t} = ϕ h_{t - 1} + η_{t} . \end{matrix}

(9)

Note that the authors of [17] define the leverage effect as correlation between

ϵ_{t}

and

η_{t}

, so the correlation between noise terms is contemporaneous while [16] model correlation between

ϵ_{t}

and

η_{t + 1}

, which corresponds to correlation of the returns with one-step-ahead volatility. Reference Yu [18] shows that the approach of [16] is preferable. In particular, while in case of [16] the model is a martingale difference sequence, i.e., the past does not help to predict the future of the time series, in the case of [17], it is not. Hence, in the latter case, the efficient market hypothesis is violated.

In the remainder of this manuscript, we will work with either specification of the model defined in Equations (6) and (7) or in right-hand side of Equations (8) and (9). These models are equivalent, and we interchange the representation either for the convenience of using some of the methods or for comparison with other work. In the literature, both specifications are frequently used, and in some papers (for example, ref. [19]) the transition from one specification to another is conducted by observing that

β = exp (μ / 2)

.

Under the assumption that

| ϕ | < 1

, the unconditional first and second moments of the latent process

h_{t}

are

E (h_{t}) = \frac{μ}{1 - ϕ}, V a r (h_{t}) = \frac{σ_{η}^{2}}{1 - ϕ^{2}} .

(10)

The challenge of the estimation of the model lies in the intractability of the likelihood and posterior distribution. The likelihood factorizes as

L (y | θ) = \prod_{t = 1}^{T} p (y_{t} | y_{1 : t - 1}, θ),

(11)

where the terms in the product can be computed recursively, and it becomes clear that the likelihood is a high-dimensional integral

p (y_{t} | y_{1 : t - 1}, θ) = \int p (y_{t} | h_{t}, θ) p (h_{t} | y_{1 : t - 1}, θ) d h_{t} .

(12)

There is no analytical solution to the integral in Equation (12), and in this paper, we consider methods to estimate it using sequential Monte Carlo methods.

2.2. Multivariate Stochastic Volatility Model

In this section, we introduce the multivariate stochastic volatility model, which is rarely used in practice due to the challenges of estimation. One of the objectives of this paper is to assess whether modern methods in Bayesian inference are capable of the estimation of these models in high-dimensional case. Multivariate or high dimensional application of this class of models can give insightful information to practitioners. We deal with the same set-up as before; however, we now consider multiple time series of logarithmic returns that are interconnected through the latent volatility process

y_{t} = Ω_{t} ϵ_{t},

(13)

where

ϵ_{t} \sim N (0, R)

and

R

is a correlation matrix with entries

r_{i i} = 1

,

i = 1, \dots, n

on the diagonal. Furthermore,

Ω_{t}

is a diagonal matrix that contains time-varying volatilities that are driven by an independent stochastic process

h_{t}

,

Ω_{t} = d i a g (exp (h_{t} / 2)) .

The process

h_{t}

of log-volatilities follows a VAR(p) process

h_{t} = μ + \sum_{k = 1}^{p} Φ_{k} (h_{t - i} - μ) + η_{t},

(14)

where

Φ_{k} = {(ϕ_{i j, k})}_{i, j = 1, \dots, n}

are

n \times n

coefficient matrices. Introducing the matrices

Φ_{k} = {(ϕ_{i j, k})}_{i, j = 1, \dots, n}

allows us to model connectivity in financial time series through the concept of Granger-causality in latent volatility process. We say that

h_{i}

does not Granger-cause

h_{j}

if all

{(ϕ_{i j, k})}_{k = 1, \dots, p} = 0

. The standard conditions on stationarity of a vector autoregressive model apply: the root of

| I - λ Φ | = 0

should lie outside the unit circle, and the errors

η_{t}

are independent and identically normally distributed with mean zero and variance-covariance matrix

Σ = d i a g (σ_{1}^{2}, \dots, σ_{n}^{2})

. Equations (13) and (14) are multivariate extensions of the model described in Equations (6) and (7). The representation from the right-hand side of Equations (8) and (9) can be obtained by including a vector of parameters

β

into

Ω_{t}

and removing

μ

from Equation (14). As before, for identifiability, only one vector of the scale parameters—either

μ

or

β

—should be included in the model.

The above MSV model can also be viewed as a non-linear state-space model where (14) is the state equation of the latent process

h_{t}

and (13) is the observation equation that depends non-linearly on the latent state. Note that, in this model, the time series are interconnected and the relationship between them can be interpreted through the concept of Granger-causality in latent volatility processes.

3. Methods

3.1. Bayesian Inference

In this paper, we review various methods that sample from or approximate the posterior distribution of the parameters of the model

θ

. The sampling or approximate methods are necessary since we are working in the framework when the posterior distribution and the likelihood are analytically intractable. The Bayes’ rule allows us to write posterior distribution in the form

p (θ | y) = \frac{π (θ) g (y | θ)}{m (y)},

(15)

where

π (θ)

is the prior distribution of the parameters of the model,

g (y | θ)

is the likelihood of the data given parameters of the model, and

m (y)

is the marginal density of y, which can be viewed as normalizing constant and which we will ignore in this paper. In the remainder of the paper we will work with the Bayes’ rule in proportionality terms:

\begin{matrix} p (θ | y) \propto π (θ) g (y | θ) . \end{matrix}

(16)

Note that in the stochastic volatility model we have to estimate parameters of the model

θ = (μ, ϕ, σ^{2})

and the latent vector of volatilities

h

. Thus, we are interested in the following form of the Bayes’ rule

p (θ, h | y) \propto g (y | θ, h) f (h | θ) π (θ) .

(17)

Multiple approaches can be used for the estimation of

p (θ, h | y)

. One of the challenges is that neither posterior

p (θ, h | y)

nor the likelihood

g (y | θ, h)

is tractable. We start our review by considering sequential Monte Carlo methods, also known as particle filtering, for the estimation of the likelihood

g (y | θ, h)

. We then discuss Metropolis-Hastings algorithm for sampling from the posterior and how these two algorithm can be combined into particle Metropolis-Hastings for sampling from the posterior distribution. We continue the review of the methods by considering RMHMC method in which the parameters and volatilities are sampled within the same framework. Finally, we review two approximate methods: integrated nested Laplace approdximation and fixed-form variational Bayes, two different ways of approximating posterior distribution.

3.2. Sequential Monte Carlo for the Estimation of the Likelihood

The Sequential Monte Carlo (SMC) method, also known in the literature as particle filtering, is considered a state-of-the-art method for estimation of the intractable likelihoods in nonlinear state-space models. The general idea behind this method lies in the estimation of the latent states by drawing multiple samples (particles) and then propagating them in time according to corresponding importance weights. By combining the weights over all time steps, one obtains a marginal likelihood estimate. Standard and well-known schemes are Bootstrap particle filter (BPF) [20], Seqiential Importance Sampling (SIS), and Seqiential Importance Resampling (SIR) [21]. Sequential Monte Carlo methods were elegantly combined with Markov Chain Monte Carlo in [4], and the method was named particle Markov Chain Monte Carlo (PMCMC). This method provides a powerful and coherent approach for Bayesian inference in a wide range of complex models. In the later subsections, we will discuss how sequential Monte Carlo methods are combined with Markov Chain Monte Carlo for fully Bayesian inference in stochastic volatility models. One of the concerns when using and implementing SMC for the likelihood estimation is the variance of the estimated likelihood. Standard SMC techniques such as SIS are prone to have high variance of the estimated likelihood once the dimensionality of the problem increases [22]. A number of studies have tried to address this problem. The common choice of proposal for sample of particles in standard schemes is

f (h_{t} | h_{t - 1})

. Pitt and Shephard [23] propose an auxiliary particle filter as a solution that is using proposal for particles which takes into account the current observation

q (h_{t} | h_{t - 1}, y_{t})

and not only the dynamics of the latent process itself. Scharth and Kohn [24] suggest using efficient importance sampling [25] inside the PMCMC procedure. Guarniero et al. [26] use twisted representation of the model and use the look-ahead type of particle filtering to address the issue of high variance of the estimated likelihood. Johansen and Doucet [27] compare sequential importance resampling (SIR) with auxiliary particle filter and find that APF does not always outperform SIR. Often, the variance of the estimated likelihood is analyzed in the true value of the parameters, such as in [24]. However, when using particle Markov Chain Monte Carlo, it is also of interest whether the same conclusions hold in different points of the parameter space. In particular, we never start running the algorithm at the point of the true parameter values. This means that if the variance of the estimated likelihood is much larger in some areas of the parameter space, the convergence of the algorithm can be affected. Having insights into how the variance of the estimated likelihood is different in the parameter space can help to make a more efficient choice of the starting point for the algorithm.

We first review the sequential Monte Carlo methods for the estimation of the likelihood. After that, we discuss Metropolis-Hastings algorithm and how SMC and Metropolis-Hastings can be combined for Bayesian inference in general and stochastic volatility models in particular.

3.2.1. Sequential Monte Carlo

Assume that we are in the framework with an observed time series process

y_{t}

and a latent Markovian process

h_{t}

. Since we never observe the latent process, we need to infer it. The objective that can be achieved with Sequential Monte Carlo (SMC) is also known as particle filtering. The method operates in sequential manner with arriving observations

y_{t}

. The posterior distribution of the latent process can be computed sequentially

p (h_{0 : t} | y_{1 : t}) = p (h_{0 : t - 1} | y_{0 : t - 1}) \frac{g (y_{1 : t} | h_{0 : t}) f (h_{t} | h_{t - 1})}{p (y_{t} | y_{t - 1})} .

(18)

The denominator of Equation (18) is not analytically tractable, which can be also seen from Equation (12) earlier. SMC allows us to estimate the posterior distribution

p (h_{0 : t} | y_{1 : t})

and additionally get the estimate of the likelihood

\begin{matrix} L (y_{1 : T}) = \int p (y_{1 : T}, h_{1 : T}) d h_{1 : T} = \int g (y_{1 : T} | h_{1 : T}) p (h_{1 : T}) d h_{1 : T} \\ = \int g (y_{1} | h_{1}) p (h_{1}) \prod_{t = 2}^{T} g (y_{t} | h_{t}) f (h_{t} | h_{t - 1}) d h_{1} \dots h_{T} . \end{matrix}

(19)

The basic procedure of particle filtering in this setting can be summarized by three crucial steps: prediction, updating, and resampling. The outline of a basic particle filter can be summarized in the following way.

Initialization: given the prior distribution $π (θ_{0})$ , we draw N independent random samples ${h_{i}^{0}}_{i = 1}^{N}$ ; these samples we call particles.
Prediction: we sample particles according to the importance density

$h_{t}^{(i)} \sim q (h_{t} | h_{t - 1}^{(i)}, y_{t}) .$

(20)
Updating: During updating, we assign a weight $w_{t}^{(i)}$ to every particle

$w_{t}^{(i)} = \frac{p (y_{t} | h_{t}^{(i)}) f (h_{t}^{(i)} | h_{t - 1}^{(i)})}{p (y_{t} | y_{1 : t - 1}) q_{t} (h_{t}^{(i)} | h_{0 : t - 1}^{(i)})}$

(21)

and normalize these weights to sum to 1. Every weight can be interpreted as our “confidence” about a particle.
Resampling: resample the particles if the effective number of particles,

$N_{e f f} = \frac{1}{\sum_{i = 1}^{N} {(ω_{t}^{(i)})}^{2}},$

(22)

is too low. In Equation (22), $ω_{t}^{(i)}$ is the normalized weight of particle i at the time step k. The threshold for the resampling step is set depending on whether particle degeneracy is a problem. In general, we perform resampling when $N_{e f f} < N / c$ , where c is a constant.

The resampling step is performed to find the trade-off between two well-documented problems: particle degeneracy and particle impoverishment [28]. The former happens when the resampling step is ignored or is not performed frequently enough. In this case, one ends up with a particle set that has zero weights. The latter problem happens when the particle set is resampled too frequently; then, eventually one gets one particle with a large weight and hence the particle set lacks the diversity. The way to find the balance between these two problems is resampling when the efficient number of particles is smaller than a certain threshold.

In this paper, we consider two particle filters: bootstrap and auxiliary particle filters. A generic particle filter is presented in Algorithm A1 [28]. The bootstrap filter is a variation of a more general approach—sequential importance sampling (resampling). The distinction of the bootstrap filter is the proposal mechanism for the particles. In the bootstrap particle filter the proposals for the particles are made on the basis of the dynamics of the model

f (h_{t} | h_{t - 1})

. If

q (h_{t} | y_{t}, h_{t - 1}) = f (h_{t} | h_{t - 1})

, then the term

\frac{f (h_{t} | h_{t - 1})}{q (h_{t} | y_{t}, h_{t - 1})}

is equal to 1. In the case of the auxiliary particle filter, we also incorporate the current observation into the proposal mechanism

q (h_{t} | h_{t - 1}, y_{t})

. Incorporating the current observation into the proposal for the particles in some cases allows us to reduce the variance of the estimated likelihood. In our case, there is no analytical expression for the proposal density. In the next subsection, we discuss how it can be approximated as proposed in [23].

3.2.2. Auxiliary Particle Filter for SV Model

Incorporating knowledge of

y_{t}

into proposals for particles

q (h_{t} | h_{t - 1}, y_{t})

can help to reduce the variance of the estimated likelihood and improve the approximation of the filtering distribution

p (h_{t} | y_{1 : t})

. Note, however, that it is not always the case as has been shown in [27]. Only in the case of linear Gaussian state-space models does the proposal density from Equation (A2) have an analytical expression. Hence, for the stochastic volatility models, this term must be approximated. Pitt and Shephard [23] propose using non-blind proposals for the next generation of particles by first expanding

log g (y_{t + 1} | h_{t + 1})

to a second-order term around

μ_{t + 1}^{k}

via Taylor expansion

\begin{matrix} log g (y_{t + 1} | h_{t + 1}, μ_{t + 1}^{k}) = log p {(y_{t + 1} | μ_{t + 1}^{k})}^{'} \times \frac{\partial log p (y_{t + 1} | μ_{t + 1}^{k})}{\partial h_{t + 1}} + \\ \frac{1}{2} \times {(h_{t + 1} - μ_{t + 1}^{k})}^{'} \times \frac{\partial^{2} log p (y_{t + 1} | μ_{t + 1}^{k})}{\partial h_{t + 1} h_{t + 1}^{'}} \times (h_{t + 1} - μ_{t + 1}^{k}) \end{matrix}

(23)

For deriving the expression for

log g (y_{t + 1} | h_{t + 1}, μ_{t + 1}^{k})

, recall that

y_{t} \sim N (0, exp (h_{t}))

and hence

\begin{matrix} g (y_{t} | h_{t}) = \frac{1}{\sqrt{2 π exp (h_{t})}} exp \{- \frac{y_{t}^{2}}{2 exp (h_{t})}\} = \\ \frac{1}{\sqrt{2 π}} exp \{- \frac{y_{t}^{2}}{exp (h_{t})} - \frac{h_{t}}{2}\} \end{matrix}

(24)

and further note that

f (h_{t} | h_{t - 1}) = N (μ + ϕ (h_{t - 1} - μ), σ_{η}^{2})

; thus

f (h_{t} | h_{t - 1}) = \frac{1}{\sqrt{2 π σ_{η}^{2}}} exp \{\frac{{(h_{t} - μ - ϕ (h_{t - 1} - μ))}^{2}}{2 σ_{η}^{2}}\} .

(25)

It follows that the proposal for particles at time

t + 1

when taking into account the observation of the same period is

q (h_{t + 1} ∣ h_{t}^{(k)}, y_{t + 1}; μ_{t + 1}^{(k)}) = N (μ_{t + 1}^{(k)} + \frac{σ^{2}}{2} (\frac{y_{t}^{2}}{β^{2}} exp (- μ_{t + 1}^{(k)}) - 1), σ^{2}) .

(26)

3.2.3. Metropolis–Hastings

In this section, we consider the problem of sampling from the posterior distribution and a general algorithm to construct such a sampling scheme. With the Metropolis–Hastings algorithm, we sample from the posterior distribution by proposing a transition

θ \to θ^{*}

with the density

q (θ^{*} | θ)

, which we accept with probability

α (θ, θ^{*}) = min \{1, \frac{\tilde{p} (θ^{*})}{\tilde{p} (θ)} \frac{q (θ | θ^{*})}{q (θ^{*} | θ)}\},

(27)

where

\tilde{p} (\cdot)

is a function proportional to our target distribution. A common choice for the proposal distribution is a random-walk, which we also use when applying PMCMC later in this paper,

q (θ^{*} | θ) = N (θ^{*} | θ, Σ)

. The Metropolis–Hastings algorithm is one of the off-the-shelf MCMC methods in the statistical community. It is quite general and can be applied to various problems. The implementation of the Metropolis–Hastings algorithm requires specification of multiple quantities. We need to specify a conditional density

q (θ^{*} | θ)

that is a proposal distribution, generally

q (θ^{*} | θ)

should be such that we can easily simulate from it. In many applications, including ours, it is reasonable to take the Gaussian distribution as proposal distribution. In this case, it is also symmetric, meaning

q (θ^{*} | θ) = q (θ | θ^{*})

. The Metropolis-Hastings iteration is outlined in the Algorithm 1.

In this algorithm,

α (θ, θ^{*})

is the Metropolis–Hastings acceptance probability, where

θ

is the current state of the chain and

θ^{*}

is the candidate state of the parameter vector. Generally, in the simulations, it is desired to have around 25% of proposed candidate values accepted [29]. The idea is that when the proposal steps are too large (we make a proposal that is far away from the current state,

θ

, in the Markov chain), we do not explore local regions sufficiently well; moreover many of the candidates are then very likely to be rejected. When the proposal steps are very small, the acceptance rate will be very high, however, then we are not likely to leave regions of the local maximum or the convergence will happen very slowly.

Algorithm 1 Metropolis-Hastings Algorithm.

1:: Given $θ^{(t)}$ ,
2:: Generate $θ_{t}^{*} \sim q (θ^{*} ∣ θ^{(t)})$ ,
3:: Take
$θ^{(t + 1)} = \{\begin{matrix} θ_{t}^{*}, with probability α (θ^{(t)}, θ_{t}^{*}) \\ θ^{(t)} with probability 1 - α (θ^{(t)}, θ_{t}^{*}), \end{matrix}$
where
$α (θ, θ^{*}) = min (1, \frac{\tilde{p} (θ^{*})}{\tilde{p} (θ_{t})} \frac{q (θ ∣ θ^{*})}{q (θ^{*} ∣ θ)})$

The performance of Metropolis–Hastings depends on the choice of

q (\cdot)

proposal distribution. In the simulation studies, we consider random-walk proposals of the form

θ_{i + 1}^{*} = θ_{i} + ϵ_{i}

, where i is iteration of the algorithm and

ϵ_{i}

is assumed to be Gaussian. More information on the theoretical properties of this algorithm can be found in [30].

3.2.4. Particle Metropolis-Hastings

Particle Markov Chain Monte Carlo (PMCMC) methods were introduced in [4]. The basic idea is that MCMC methods, and in particular, Metropolis–Hastings algorithm, which is of interest to us, can be combined with Sequential Monte Carlo to make draws from the posterior distributions of the parameters. Algorithm 2 presents the particle Metropolis–Hastings algorithm. The difference from the standard Metropolis-Hastings is in the quantity

{\hat{p}}_{θ^{*}} (y_{1 : T})

, which is the estimate of the likelihood obtained with a particle filter conditioning on the parameters vector

θ

. In this algorithm,

q (θ (i - 1) | θ^{*})

is the proposal distribution (which cancels out when it is symmetric), and

π (\cdot)

is prior distribution.

Algorithm 2 Particle Metropolis-Hastings.

1:: Initialize algorithm at $i = 0$ and initialize parameters $θ (0)$
2:: Run an SMC algorithm targeting $p_{θ^{(0)}} (h_{1 : T} ∣ y_{1 : T})$ , sample $h_{1 : T}^{(0)} \sim {\hat{p}}_{θ^{(0)}} (\cdot ∣ y_{1 : T})$ and let ${\hat{p}}_{θ}^{(0)} (y_{1 : T})$ denote the marginal likelihood estimate for the initialized parameters
3:: for $i = 1, \dots, M$ do
4:: Generate $θ^{*} \sim q (θ^{*} ∣ θ^{(i - 1)})$ ,
5:: Run an SMC algorithm targeting $p_{θ^{*}} (h_{1 : T} ∣ y_{1 : T})$ , sample $h_{1 : T}^{*} \sim {\hat{p}}_{θ^{*}} (\cdot ∣ y_{1 : T})$ and let $\hat{p_{θ^{*}}} (y_{1 : T})$ denote the marginal likelihood estimate for the proposed parameters $θ^{*}$
6:: With probability

$min \{1, \frac{{\hat{p}}_{θ^{*}} (y_{1 : T})}{{\hat{p}}_{θ^{(i - 1)}} (y_{1 : T})} \frac{π (θ^{*})}{π (θ^{(i - 1)})} \frac{q (θ^{(i - 1)} | θ^{*})}{q (θ^{*} | θ^{(i - 1)})}\}$

(28)
7:: Set $θ^{(i)} = θ^{*}$ , $h_{1 : T}^{(i)} = h_{1 : T}^{*}$ and ${\hat{p}}_{θ^{(i)}} (y_{1 : T}) = {\hat{p}}_{θ^{*}} (y_{1 : T})$ ;
8:: Otherwise set $θ^{(i)} = θ^{(i - 1)}$ , $h_{1 : T}^{(i)} = h_{1 : T}^{(i - 1)}$ and ${\hat{p}}_{θ^{(i)}} (y_{1 : T}) = {\hat{p}}_{θ^{(i - 1)}} (y_{1 : T})$ .
9:: end for

3.3. MCMC with Gradient Information

In this section, we discuss Riemann Manifold Langevin Hamiltonial Monte Carlo methods that are introduced in [5] and in particular can be applied to stochastic volatility models.

The method originates in physics statistical literature and provides a tool that allows one to make large transitions with high acceptance probability, something that standard methods such as Metropolis–Hastings fail to achieve. The idea of HMC is based on relation between differential geometry and statistical theory (MCMC in particular). Girolami and Calderhead [5] propose the Metropolis-adjusted Langevin algorithm and Hamiltonian Monte Carlo sampling algorithms that are defined on the Riemann manifold. Their methods allow us to overcome the problem of sampling from high-dimensional densities that may show strong correlation. We further provide the general background and summary of the algorithms together with the necessary quantities for their implementation in the case of stochastic volatility models. It is not our goal to provide theoretical foundations of these methods in this article. For deeper theoretical foundations, see [31,32,33].

In standard MCMC setting, one uses probability distribution to make a proposal for the next state of the Markov chain. Hamiltonian Monte Carlo methods exploit physical system dynamics to make proposals for the next state. It can improve the mixing drastically and result in a more efficient algorithm. Especially since we are interested in multivariate modeling, a more efficient exploration of the posterior distribution is of interest. Once the dimension of the model grows with standard random walk, it is very hard to make proposals that would be accepted frequently enough and result in a good mixing Markov chain. We first introduce some basic ideas on which Hamiltonian Monte Carlo method is built; for an extensive introduction, we refer to [33].

3.3.1. Metropolis-Adjusted Langevin Algorithm

Previously we have discussed the Metropolis-Hastings algorithm. The idea of the Metropolis-Hastings algorithm is to make a new proposal

θ^{*}

using random walk. Then this proposal is accepted with probability.

α (θ, θ^{*}) = min \{1, \frac{\tilde{p} (θ^{*})}{\tilde{p} (θ)} \frac{q (θ | θ^{*})}{q (θ^{*} | θ)}\} .

(29)

Although this algorithm benefits from desirable theoretical guarantees, the random walk proposal is not efficient, especially when the number of parameters in the model becomes large. Metropolis-adjusted Langevin algorithm (MALA), originally proposed in [34], is designed to solve the same problem—sample from the target distribution. The big advantage of MALA in comparison to Metropolis–Hastings is the construction for the proposal of the candidate parameter

θ^{*}

. The proposal mechanism for MALA originates from the stochastic differential equation based on Langevin diffusion; the proposal mechanism reads

θ^{*} = θ^{n} + ϵ^{2} \nabla_{θ} L (θ^{n}) / 2 + ϵ z^{n},

(30)

where we define

L (θ^{n}) = log (p (θ))

and

z \sim N (z | 0, I)

and

ϵ

—integration step size. Convergence for this proposal is not guaranteed unless we employ a Metropolis acceptance probability after every integration step. For convenience, let us define

μ (θ^{n}, ϵ) = θ^{n} + \frac{ϵ^{2}}{2} \nabla_{θ} L (θ^{n});

(31)

then the proposal density can be written as

q (θ^{*} | θ^{n}) = N (θ^{*} | μ (θ^{n}, ϵ), ϵ^{2} I) .

The standard acceptance probability follows

min {1, p (θ^{*}) q (θ^{n} | θ^{*}) / p (θ^{n}) q (θ^{*} | θ^{n})} .

(32)

The type of proposal in Equation (29) is inefficient for strongly correlated parameters

θ

. To solve this issue, one can use a preconditioning matrix

M

θ^{*} = θ^{n} + ϵ^{2} M \nabla_{θ} L (θ^{n}) / 2 + ϵ \sqrt{M} z^{n} .

(33)

Unfortunately there is no principled way to choose matrix

M

. As we will see later, HMC encounters the same problem. Generally, MALA iterates between two general steps. First, Langevin dynamics is used for the proposals, and it exploits the gradients of the target. Second, the proposals are accepted or rejected similarly to those of the Metropolis–Hastings algorithm.

3.3.2. Hamiltonian Monte Carlo Algorithm

The HMC algorithm [31] also uses gradient information for constructing the proposal of the parameters in the MCMC scheme. In particular, it exploits the ideas from simulating the behavior of the physical systems. Similarly to describing the behavior of the physical system, HMC performs sampling by exploiting Hamiltonian dynamics. A conceptual introduction to this class of methods and its relationship to differential geometry can be found in [33]. In this section, we discuss the general idea behind the algorithm without performing detailed derivations. We focus on the final proposal machinery that can be used in practice and investigate which quantities need to be manually computed before implementing the algorithm and which variables need to be calibrated for the successful performance of the algorithm. First, let us consider a general set-up. In Hamiltonian Monte Carlo, we consider a Hamiltonian function

H (θ, p) = - log p (θ) + \frac{1}{2} {log {(2 π)}^{D} | M |} + \frac{1}{2} p^{T} M^{- 1} p,

(34)

which consists of potential energy in the system

E (θ) = - L (θ)

and kinetic energy

K (p) = \frac{1}{2} {log {(2 π)}^{D} | M |} + \frac{1}{2} p^{T} M^{- 1} p

; variables

p

are called momentum variables. The dynamics of the system then evolves according to Hamiltonian equations

\frac{d θ}{d τ} = \frac{\partial H}{\partial p} = M^{- 1} p,

(35)

\frac{d p}{d τ} = - \frac{\partial H}{\partial θ} = \nabla_{θ} L (θ),

(36)

where by

τ

in physical interpretation of the system we denote continuous time. Practical implementation requires discretization, and the commonly used scheme for this purpose is the leapfrog discretezation:

p (τ + ϵ / 2) = p (τ) + ϵ \nabla_{θ} L {θ (τ)} / 2,

(37)

θ (τ + ϵ) = θ (τ) + ϵ M^{- 1} p (τ + ϵ / 2),

(38)

p (τ + ϵ) = p (τ + ϵ / 2) + ϵ \nabla_{θ} L {θ (τ + ϵ)} / 2 .

(39)

This scheme does not sample from the target distribution and to correct for that, implementation of Metropolis acceptance probability is necessary. For a proposal

(θ, p) \to (θ^{*}, p^{*})

, acceptance probability in this algorithm is defined as

min {1, exp {- H (θ^{*}, p^{*}) + H (θ, p)} .

Thus, HMC iterates between updating momentum variables, proposal for the parameter values, additional update to the momentum variables, and then an acceptance/rejection step. The Gibbs sampler provides a good understanding for the system evolution in this algorithm:

p^{n + 1} | θ^{n} \sim p (p^{n + 1} | θ^{n}) = p (p^{n + 1}) = N (p^{n + 1} | 0, M),

(40)

θ^{n + 1} | p^{n + 1} \sim p (θ^{n + 1} | p^{n + 1})

(41)

Similarly to MALA, the choice of matrix

M

is crucial for good performance of HMC. While the choice of the step size and the leapfrog steps can be tuned relatively easily by considering acceptance rate, the choice of the matrix

M

is challenging, and there is no principled way to define it. Leapfrog step and step size proposal are two variables that need to be calibrated when implementing HMC. Usually, different combinations of these two variables are considered, and the combination leading to the highest acceptance rate is picked.

3.4. Riemann Manifold Hamiltonian Monte Carlo

The further improvement of HMC and MALA can done by defining the algorithms on Riemann manifold instead of Euclidean space. Proposals guided by Riemann metric instead of Euclidean distance have the potential to explore parameter space more efficiently, especially in the cases when the target density is high-dimensional or exhibits strong correlation [5]. The method originally proposed in [5] and multiple algorithms were compared in the paper: MALA, MMALA, HMC, and RMHMC. For detailed comparison between these methods, we refer to [5], while in our simulation studies, we will focus on comparing RMHMC and particle Metropolis–Hastings for the estimation of parameters in stochastic volatility models.

Girolami and Calderhead [5] define HMC methods in the form of Riemann manifold, and this can be seen as generalization of HMC. The Hamiltonian on the Riemann manifold is defined as follows

H (θ, p) = - log p (θ) + \frac{1}{2} log ({(2 π)}^{n} ∣ G (θ) ∣) + \frac{1}{2} p^{T} G (θ) p

(42)

with

exp (- H (θ, p)) = p (θ, p) = p (θ) p (p ∣ θ)

and the marginal target density

\begin{matrix} p (θ) \propto \int exp (- H (θ, p)) d p = \frac{exp {log p (θ)}}{\sqrt{2 π^{n} ∣ G (θ) ∣}} \int exp \{- \frac{1}{2} p^{T} G {(θ)}^{- 1} p\} d p \\ = exp (log p (θ)) . \end{matrix}

(43)

The general idea behind the updates in RMHMC is similar to that of HMC, and the updates for the momentum variables and parameters of the model are defined in Equations (44)–(46).

p (τ + \frac{ϵ}{2}) = p (τ) - \frac{ϵ}{2} \nabla_{θ} H \{θ (τ), p (τ + \frac{ϵ}{2})\},

(44)

θ (τ + ϵ) = θ (τ) + ϵ / 2 [\nabla_{p} H \{θ (τ), p (τ + \frac{ϵ}{2})\} + \nabla_{p} H \{θ (τ + ϵ), p (τ + \frac{ϵ}{2})\}],

(45)

p (τ + ϵ) = p (τ + \frac{ϵ}{2}) - \frac{ϵ}{2} \nabla_{θ} H \{θ (τ + ϵ), p (τ + \frac{ϵ}{2})\}

(46)

Therefore, as in standard HMC algorithm, we iterate between half-step update of the momentum variables, and then we update position variables, and we finish iteration with additional half-step update of the momentum variables and Metropolis acceptance/rejection step with the probability

min {1, exp {- H (θ^{*}, p^{*}) + H (θ^{n}, p^{n + 1})}} .

Similarly to HMC, RMHMC can be viewed as a Gibbs sampling scheme

p^{n + 1} | θ^{n} \sim p (p^{n + 1} | θ^{n}) = N {p^{n + 1} | 0, G (θ^{n})},

(47)

θ^{n + 1} | p^{n + 1} \sim p (θ^{n + 1} | p^{n + 1}) .

(48)

Recall that in the case of MALA and HMC, matrix

M

has to be chosen manually and there is no principled way to choose it. In RMHMC, matrix

G (θ)

is defined at each step by underlying geometry; see for more details [5]. Below we discuss quantities that need to be computed for the implementation of RMHMC in the case of stochastic volatility model and in particular

G (θ)

.

Recall stochastic volatility model parametrized through

β

y_{t} = β exp (h_{t} / 2) ϵ_{t},

(49)

h_{t + 1} = ϕ h_{t} + η_{t + 1},

(50)

ϵ_{t} \sim N (0, 1)

,

η_{t} \sim N (0, σ^{2})

, with

h_{1} \sim N (0, σ^{2} / (1 - ϕ^{2}))

.

The joint likelihood of the model is

p (y, h, β, ϕ, σ) = \prod_{t = 1}^{T} p (y_{t} ∣ h_{t}, β) \prod_{t = 2}^{T} p (h_{t} ∣ h_{t - 1}, ϕ, σ) π (β) π (ϕ) π (σ)

(51)

The prior distributions are chosen as follows

β \propto exp (β), σ^{2} \sim I n v - χ^{2} (10, 0.05), (ϕ + 1) / 2 \sim B e t a (20, 1.5) .

(52)

Further, following [5], we write the partial derivatives for

L = p (y, h ∣ β, ϕ, σ)

\frac{\partial L}{\partial β} = - \frac{T}{β} + \sum_{t = 1}^{T} \frac{y^{2}}{β^{3} exp (h_{t})},

(53)

\frac{\partial L}{\partial ϕ} = - \frac{ϕ}{(1 - ϕ^{2})} + \frac{ϕ h_{1}^{2}}{σ^{2}} + \sum_{t = 2}^{T} \frac{h_{t - 1} (h_{t} - ϕ h_{t - 1})}{σ^{2}},

(54)

\frac{\partial L}{\partial σ} = - \frac{T}{σ} + \frac{h_{1}^{2} (1 - ϕ^{2})}{σ^{3}} + \sum_{t = 2}^{T} \frac{{(h_{t} - ϕ h_{t - 1})}^{2}}{σ^{3}} .

(55)

To implement the algorithms, we require the expressions for the individual components of the metric tensor for the likelihood. Following [5], the expressions are

\begin{matrix} E \{\frac{\partial L}{\partial β} \frac{\partial L}{\partial β}\} = \frac{2 T}{β^{2}}, E \{\frac{\partial L}{\partial σ} \frac{\partial L}{\partial σ}\} = \frac{2 T}{σ^{2}}, E \{\frac{\partial L}{\partial β} \frac{\partial L}{\partial σ}\} = E \{\frac{\partial L}{\partial β} \frac{\partial L}{\partial ϕ}\} = 0, \end{matrix}

(56)

\begin{matrix} E \{\frac{\partial L}{\partial σ} \frac{\partial L}{\partial ϕ}\} = \frac{2 ϕ}{σ^{3} (1 - ϕ^{2})}, E \{\frac{\partial L}{\partial ϕ} \frac{\partial L}{\partial ϕ}\} = \frac{2 ϕ^{2}}{{(1 - ϕ^{2})}^{2}} + \frac{T - 1}{1 - ϕ^{2}} . \end{matrix}

(57)

Furthermore, the expressions for the metric tensor for the likelihood and its partial derivatives follow

G (ϕ, σ, β) = [\begin{matrix} \frac{2 T}{β^{2}} & 0 & 0 \\ 0 & \frac{2 T}{σ^{2}} & \frac{2 ϕ}{σ^{3} (1 - ϕ^{2})} \\ 0 & \frac{2 ϕ}{σ^{3} (1 - ϕ^{2})} & \frac{2 ϕ^{2}}{{(1 - ϕ^{2})}^{2}} + \frac{T - 1}{1 - ϕ^{2}} \end{matrix}],

(58)

\frac{\partial G}{\partial β} = [\begin{matrix} - \frac{4 T}{β^{3}} & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{matrix}],

(59)

\frac{\partial G}{\partial σ} = [\begin{matrix} 0 & 0 & 0 \\ 0 & - \frac{4 T}{σ^{3}} & - \frac{6 ϕ}{σ^{4} (1 - ϕ^{2})} \\ 0 & - \frac{6 ϕ}{σ^{4} (1 - ϕ^{2})} & 0 \end{matrix}],

(60)

\frac{\partial G}{\partial ϕ} = [\begin{matrix} 0 & 0 & 0 \\ 0 & 0 & \frac{2}{σ^{3} (1 - ϕ^{2})} + \frac{4 ϕ^{2}}{σ^{3} {(1 - ϕ^{2})}^{2}} \\ 0 & \frac{2}{σ^{3} (1 - ϕ^{2})} + \frac{4 ϕ^{2}}{σ^{3} {(1 - ϕ^{2})}^{2}} & \frac{2 ϕ (1 + T)}{{(1 - ϕ^{2})}^{2}} + \frac{6 ϕ^{3}}{{(1 - ϕ^{2})}^{3}} \end{matrix}] .

(61)

The proposal machinery in RMHMC provides advantages for exploring parameter space efficiently. However, it is not easily adaptable for different model specifications, especially when increasing the model’s dimensionality, as we discussed in Section 1. In particular, although matrix G can be computed in the multivariate model specified in Equations (12) and (13) exactly, it scales quadratically with the number of parameters. This might be one of the reasons why the method has not been used on multivariate stochastic volatility models we introduced in Section 1. However, probabilistic programming languages [35,36] and automatic differentiation possibilities developed in recent years allow the efficient and adaptable implementation of these algorithms in practice.

3.5. Integrated Nested Laplace Approximation

Integrated Nested Laplace Approximation was introduced in [7]. The method is based on the nested version of the classical Laplace approximation and was introduced for latent Gaussian models (LGMs). It became a popular approach in Bayesian inference due to its good performance in the variety of models in the class of LGMs and its computational advantages over other methods in Bayesian literature. The computational appeal of this method comes from the possibility of exploiting sparse matrix computations when evaluating certain approximations. INLA has found its applications in many fields in the models where high-dimensional problems arise. Stochastic volatility models have been analyzed using INLA in [37,38]. Bivariate stochastic volatility model has been considered in [39], where the authors present and solve some issues that arise in using INLA in the multivariate case of the model. One of the conclusions of this study was that INLA loses its computational advantage with increased dimensionality of the stochastic volatility model. We further discuss the details of the method and the implementation shortcomings in a multivariate case and present the reader with a simulation study that illustrates the discussed approach’s performance.

Stochastic volatility model can be written in the form of LGMs

y ∣ h, θ_{1} \sim \prod_{i \in I} π (y_{i} ∣ h_{i}, θ_{1}),

(62)

h ∣ θ_{2} \sim N (μ (θ_{2}), Q^{- 1} (θ_{2})) .

(63)

As before,

y_{t}

is the data that we observe and

h_{t}

is the latent volatility process, and we are interested in the posterior distribution of the parameters of the model

θ

and the latent process given the data

p (h, θ ∣ y) \propto p (θ) p (h ∣ θ) \prod_{t = 1}^{T} p (y ∣ h_{t}, θ) .

(64)

The outline of the INLA approach can be summarized in the following steps [7,37]

(1): Build an approximation $p (θ ∣ y)$
(2): Build an approximation to $p (h_{t} ∣ θ, y)$
(3): Compute an approximation to $p (h_{t} ∣ y)$ using the approximations from steps 1 and 2.

The first approximation

p (θ ∣ y)

relies on the Gaussian approximation of the form

p (x ∣ y, θ) \propto exp \{- \frac{1}{2} x^{T} Q x + \sum g_{t} (h_{t})\},

(65)

where

x = (μ, h)

,

g_{t} (h_{t}) = log p (y_{t} ∣ h_{t}, θ)

. By matching the mode and curvature in the mode, we obtain the Gaussian approximation for our model

{\tilde{p}}_{G} (x ∣ y, θ) = K_{1} exp \{- \frac{1}{2} {(x - μ)}^{T} (Q + d i a g (c)) (x - m)\},

(66)

where

K_{1}

is a normalizing constant, m is the modal value of

p (x ∣ y, θ)

, the vector c contains the second order terms in the Taylor expansion of

\sum g_{t} (h_{t})

at the modal value m, and Q is the precision matrix that has the form

Q = [\begin{matrix} 1 & - ϕ \\ - ϕ & 1 + ϕ^{2} & - ϕ \\ ⋱ & ⋱ & ⋱ \\ - ϕ & 1 + ϕ^{2} & - ϕ \\ - ϕ & 1 \end{matrix}] .

(67)

The sparsity of the precision matrix above allows one to exploit efficient sparse matrix computational methods and thus gain computational speed. Note that in the multivariate case, this advantage disappears since the matrix Q is not sparse anymore.

When it comes to the estimation of stochastic volatility models, approximation of the marginals

p (h_{t} ∣ θ, y)

is always the most challenging task. The solution that is proposed in [7] is (simplified) Laplace approximation of the form

log {\tilde{p}}_{S L A} (x_{t} ∣ θ, y) = c o n s t - \frac{1}{2} x_{t}^{2} + γ_{t}^{(1)} (θ) x_{t} + \frac{1}{6} x_{t}^{3} γ_{t}^{3} (θ) + \dots,

(68)

where

γ_{t}^{(1)}

and

γ_{t}^{(3)}

are the terms in the Taylor expansion. The final step of the method is to approximate

p (x_{t} ∣ y)

with the numerical integration scheme

\tilde{p} (x_{t} ∣ y) = \sum_{k} \tilde{p} (x_{t} ∣ θ^{k}, y) \tilde{p} (θ^{k} ∣ y) ∆_{k},

(69)

for some

θ^{k}

of

θ

, where

θ^{k}

is selected by creating a grid of points that covers the area of high density for

\tilde{p} (θ ∣ y)

. For more details on implementation of the simplified Laplace approximation and the selection of grid of points for

θ^{k}

, see [7,37].

3.6. Fixed-Form Variational Bayes

In this section, we discuss how the posterior distribution can be approximated using the fixed-form variational Bayes method proposed in [6]. The general idea of fixed-form variational inference consists in assuming a certain factorization of the prior distribution, which naturally leads to the factorized structure of the posterior. The factorizing distributions of the posterior are then assumed to come from a certain parametric family of distributions (for example, exponential) and instead of a sampling task, as in the previous section, we would perform the optimization task of minimizing the distance between the approximating distribution and the unknown posterior distribution.

As before, assume we observe a process

{y_{t}}_{t = 1}^{T}

that is driven by an unobservable or latent process

{h_{t}}_{t = 1}^{T}

. Recall that Bayes’ rule gives us the posterior distribution of the parameters of the system

p (h ∣ y) \propto g (y ∣ h) π (h) .

(70)

In the Bayesian framework, we formulate our prior beliefs, which we update once we acquire more data. In general, Variational Bayes methods focus on approximating the posterior distribution

p (h ∣ y)

with some distribution

q (h ∣ y)

. Further, it is common to choose blocks of the parameters and impose independence for these blocks

p (h ∣ y) \approx q (h ∣ y) = q (h_{1} ∣ y) q (h_{2} ∣ y) .

(71)

By construction, the posterior of the blocks of the parameters is independent. In the literature, this is referred to as the mean-field assumption. To find the optimal approximation, we minimize the Kullback–Leibler (KL) divergence from

q (h ∣ y)

to

p (h ∣ y)

\tilde{p} (h ∣ y) = \underset{q (h_{1} ∣ \cdot) q (h_{2} ∣ \cdot)}{arg min} K L (q (h_{1} ∣ y) q (h_{2} ∣ y) ∣ ∣ p (h ∣ y)) .

(72)

Distributional approximation can be viewed as an optimization problem; i.e., an optimal distribution has to be chosen from the space of all possible distributions, and the KL divergence is chosen as a loss function [40]. Salimans et al. [6] propose a specific approach to the minimization problem of KL divergence, which is based on the similarities between the optimal solution to the problem and linear regression. The general idea of the approach is summarized as comprising the following steps:

initialize all the posterior approximations $q (θ)$ ;
iterate over the parameters updating every one of them given the others;
repeat until convergence.

Consider the stochastic volatility model

y_{t} = β exp (h_{t} / 2) ϵ_{t}

(73)

h_{t + 1} = ϕ h_{t} + η_{t + 1},

(74)

with

h_{1} \sim N (0, σ^{2} / (1 - ϕ^{2}))

and

ϵ_{t} \sim N (0, 1)

,

η_{t} \sim N (0, σ_{η}^{2})

. We specify our a priori beliefs in the following manner

p (β) \propto β^{- 1} (ϕ + 1) / 2 \sim B e t a (20, 1.5), σ^{2} \sim I G (5, 0.25) .

(75)

To apply the Variational Bayes method, we need to specify the posterior approximations

q (θ)

. It is convenient to assume a hierarchical structure of the prior, in which case it factorizes to

p (ϕ, σ^{2}, β, f) = p (ϕ) p (σ^{2}) p (f ∣ ϕ, σ^{2}) p (y ∣ f),

(76)

where

f = (log (β), h^{'})

. The hierarchical structure of the prior leads to the following factorization of the posterior approximation

q_{ξ} (σ_{η}^{2}, f ∣ f) = q_{ξ} (σ_{η}^{2}) q_{ξ} (f ∣ ϕ, σ^{2}) = \frac{q_{ξ} (σ_{η}^{2} ∣ ϕ) p (f ∣ ϕ, σ^{2}) q_{ξ} (y ∣ f)}{q_{ξ} (y ∣ ϕ, σ^{2})} .

(77)

Thus, the posterior approximations can be chosen as follows

q_{ξ} ((ϕ + 1) / 2) = B e t a (ξ_{1}, ξ_{2}),

(78)

q_{ξ} (σ^{2} ∣ ϕ) \sim I G (ξ_{3}, ξ_{4} + ξ_{5} ϕ^{2}),

(79)

q (log (β), h ∣ ϕ, σ^{2}) = N (m, V),

(80)

where

V^{- 1} = P (ϕ, σ^{2}) + ξ_{6}, m = V^{- 1} ξ_{7},

with

P (ϕ, σ^{2})

precision matrix of

p (l o g (β), h ∣ ϕ, σ^{2})

.

Once the posterior approximations are initialized, we proceed with the next step and iterate over the parameters. The parameters are updated in blocks that correspond to the factorization of the posterior approximations. First, we update the block for the persistence parameter in the latent process

q_{ξ} (ϕ)

ϕ^{*} = s_{1} (ξ, z_{1}^{*}), with s_{1} () and z_{1}^{*} such that σ^{2 *} \sim q_{ξ} (σ^{2} ∣ ϕ^{*}),

(81)

σ^{2 *} = s_{2} (ξ, z_{2}^{*}, ϕ^{*}), with s_{2} () and z_{2}^{*} such that σ^{2 *} \sim q_{ξ} (σ^{2} ∣ ϕ^{*}),

(82)

{\hat{C}}_{1} = \nabla_{ξ} [s_{1} (ξ, z_{1}^{*})] \nabla_{ϕ} [T_{1} (ϕ^{*})],

(83)

{\hat{g}}_{1} \approx \nabla_{ξ} [s_{1} (ξ, z_{1}^{*})] {\nabla_{ϕ} [log p (ϕ^{*}) + log q_{ξ} (y ∣ ϕ^{*}, σ^{2 *}) - log q_{ξ} (σ^{2 *} ∣ ϕ^{*})]} .

(84)

Second, we update the block for the variance of the latent process

q_{ξ} (σ^{2} ∣ ϕ)

{\hat{C}}_{2} = \nabla_{ξ} [s_{2} (ξ, z_{2}^{*}, ϕ^{*})] \nabla_{σ^{2}} [T_{2} (σ^{2 *})]

(85)

{\hat{g}}_{2} \approx \nabla_{ξ} [s_{2} (ξ, z_{2}^{2 *}, ϕ^{*})] \nabla_{σ^{2}} [log p (σ^{2 *}) + log q_{ξ} (y ∣ ϕ^{*}, σ^{2 *})],

(86)

where

T_{2} (σ^{2 *})

are the sufficient statistics of

q_{ξ} (σ^{2} ∣ ϕ)

. The last update is the update of the likelihood approximation

a_{t + 1} = (1 - ω) a_{t} + ω E_{q_{ξ} (f ∣ ϕ^{*}, σ^{2 *})} [\nabla_{f} log p (y ∣ f)],

(87)

z_{t + 1} = (1 - ω) z_{t} + ω E_{q_{ξ} (f ∣ ϕ^{*}, σ^{2 *})} [f],

(88)

ξ_{6, t + 1} = (1 - ω) ξ_{6, t} - ω E_{q_{ξ} (f ∣ ϕ^{*}, σ^{2 *})} [\nabla_{f} \nabla_{f} log p (y ∣ f)],

(89)

ξ_{7, t + 1} = a_{t + 1} + ξ_{6, t + 1} z_{t + 1} .

(90)

For more extensive derivations of the updates, we refer the reader to [6]. Further, one might wonder how the latent process is estimated in this procedure. Salimans et al. [6] propose using the Kalman filter to estimate the filtering distribution. Even though it is a valid approach that is also undertaken in quasi-maximum likelihood method [41], its weakness lies in the linearization of the observation equation which implies that the distribution of the noise process is not longer Gaussian.

4. Results

In this section, we present results for the comparison of the discussed methods. We compare two particle filters (bootstrap and auxiliary particle filters) on the basis of bias, variance and on the estimated effective number of particles. We choose the better performing procedure of the two for using in the particle Metropolis–Hastings algorithm. We compare particle Metropolis–Hastings (PMH), Riemann Manifold Hamiltonian Monte Carlo (RMHMC), integrated nested Laplace approximation (INLA), and fixed-form variational Bayes (VB) on the basis of how well the posterior distributions obtained with these methods capture the ground truth (e.g., true parameter values). The ability of the methods to recover ground truth is assessed based on five simulated data sets with different underlying parameters. We additionally provide effective sample sizes for the comparison of the sampling methods (PMH and RMHMC). For illustration purposes, we also provide comparison on two real-world data sets.

4.1. Variance of the Estimated Likelihood

As we mentioned before, the marginal likelihood can be approximated sequentially through particle filtering. The marginal likelihood approximation of the parameters

θ

reads

p (y_{1 : T} | θ) \approx \prod_{t} \hat{p} (y_{t} | y_{1 : t - 1}, θ),

(91)

where the right hand side is obtained by running particle filter presented in Algorithm A1. In practice, usually the log-likelihood

log p_{θ} (y_{1 : T}) = log p_{θ} (y_{1}) + \sum_{t = 2}^{T} log p_{θ} (y_{t} | y_{1 : t - 1})

(92)

is estimated for the purpose of numerical stability (as the product of small weights would lead to unstable results). The estimate of the log-likelihood is the by-product of the particle filtering, as it is the average over log-weights that are assigned to the particles at every time step. In this section, we compare the bootstrap (BPF) and auxiliary particle filters (APF) in terms of bias, variance, and number of effective particles. Both of them can be used for obtaining simulated likelihood estimates, which can be further used in the particle Metropolis–Hastings algorithm. We denote by

\hat{L}

the estimate of the likelihood obtained with a particle filter. Then, the bias and the variance can be estimated as follows

B i a s = 5000^{- 1} \sum_{i = 1}^{K} \sum_{j = 1}^{M} (log {\hat{L}}^{j} - log \bar{L} (y^{i})),

(93)

V a r i a n c e = 5000^{- 1} \sum_{i = 1}^{K} \sum_{j = 1}^{M} {(log {\hat{L}}^{j} - log \bar{L} (y^{i}))}^{2},

(94)

where

y_{i}

is the i-th time series, and

log L

is the “true” log-likelihood value. For the comparison, we use

K = 50

different time series generated from the stochastic volatility model and

M = 100

Monte Carlo iterations. We use

N = 100

,

N = 1000

, and N = 10,000 number of particles for this study. As the true value of the likelihood is not available, we substitute it with an estimate that is obtained with N = 1,000,000 number of particles. First, we conduct the analysis of the variance of the estimated likelihood in true parameter values as discussed in [24]. The authors of, ref. [27] showed theoretically that the asymptotic variance is not always smaller for the APF in comparison to the BPF. We run additional simulation studies to examine whether the variance of the estimated likelihood varies in the parameter space. Since we are interested in using the estimated likelihood in the Markov Chain Monte Carlo setting, it is relevant how the variance behaves in different points of the parameter space. If we start far away from the true value and the variance of the estimated likelihood is larger in that part of the parameter space, it can affect the convergence and calibration of the algorithm. Table 1 shows variance of the estimated likelihood for bootstrap and auxiliary particle filters. N indicates the number of the particles that we used for the estimation of the likelihood. It is clear that, on average, APF performs better in terms of the variance of the estimated likelihood. Table 2 indicates results for a similar experiment, but on the level of individual times series. We consider different data-generating processes and find that, in particular, higher variance of the latent volatility process is associated with higher variance of the estimated likelihood. Finally, in Figure 2, Figure 3 and Figure 4, we illustrate that the variance of the estimated likelihood changes depending on the location in the parameter space, and these changes can be specific to a data-generating process. These figures correspond to the experiments with time series 2, 3, and 4 from Table 2. The likelihood was estimated with

N = 1000

particles. We observe that the variance of the latent process has a strong effect on the landscape of the variance of the estimated likelihood in the parameter space. From Figure 4c,d, we see that the variance of the estimated likelihood obtained with the bootstrap particle filter appears to be more strongly affected by the location in the parameter space than the variance of the estimated likelihood obtained with the auxiliary particle filter. In Figure 3, we observe that the variance of the estimated likelihood is affected by the scale parameter

β

in the case of auxiliary particle filter, but not so much in the case of the bootstrap particle filter. Thus, initialization of PMCMC and the choice of number of particles should be considered with care for the optimal performance of the algorithm as the variance of the estimated likelihood can differ in the parameter space, and these changes can vary across different time series.

4.2. Particle Metropolis–Hastings and Riemann Manifold Hamiltonian Monte Carlo

In this section, we compare particle Metropolis–Hastings (PMH) and Riemann Manifold Hamiltonian Monte Carlo. We evaluate the algorithms based on how well they recover the true parameters of the model

β

,

ϕ

, and

σ_{η}

and on the basis of the effective sample size. We obtained 20,000 samples and discarded the first 1000 as burn-in. Further, Figure 5 and Figure A1, Figure A2, Figure A3 and Figure A4 show results for both samplers: trace plots, histograms, and autocorrelation function are depicted. Table 3 presents the moments and highest posterior density intervals for the parameters of the model. The marginal likelihood in PMH was estimated with auxiliary particle filter as discussed in [23]. The Metropolis–Hastings part of the algorithm was calibrated to achieve 20–40% acceptance rate. RMHMC was implemented as in [5] with openly available implementation of the method by the authors. Both PMH and RMHMC require careful calibration of the step-size, and RMHMC additionally needs calibration of the number of the leapfrog steps; thus in Table A1, we additionally present results for the no-u-turn sampler (NUTS). NUTS is an extension of HMC algorithm that allows one to tune the algorithm automatically. From Figure 5 and Figure A1, Figure A2, Figure A3 and Figure A4, we observe that autocorrelation of the samples indicated in the third column of the plots decreases faster for PMH than for RMHMC, in particular for the parameters

ϕ

and

σ_{η}

. Effective sample size is similarly high for parameter

β

for both samplers. Effective sample size for the parameters

ϕ

and

σ_{η}

is lower in the case of RMHMC. However, if we compute the ESS per second as presented in the last column of Table 3, this advantage disappears. This result is not surprising since PMH is the most computationally intensive procedure we are considering. Both the likelihood and the posterior use sequential sampling methods, which makes computations very demanding. Nevertheless, PMH allows us to recover the underlying parameters more accurately. In particular, in most of the presented examples, true variance of the latent volatility process lies inside the 95% highest posterior density interval for PMH. RMHMC tends to overestimate this parameter. As Table A1 indicates, the highest posterior density intervals obtained with NUTS are larger than those obtained with PMH and RMHMC. Number of gradient evaluations for RMHMC are 69718, 70042, 69802, 69801, and 70041 for Experiments 1–5, respectively.

4.3. Integrated Nested Laplace Approximation

We provide two simulation studies for the integrated nested Laplace approximation. First, we replicate and extend the simulation study provided in [38] by analyzing data-generating processes with different values of

μ

and

σ_{η}

since the estimation of the variance parameter appears to be a challenge for existing methods. Table 4 replicates results from [38] with the parametrization of the model with the scale parameter

μ

, and Table 5 presents results for the parametrization with the scale parameter

β

and different data-generating processes. Both Monte Carlo studies are conducted with 1000 iterations. Our findings are comparable to those of [38]: the mean of the volatility process and the persistence parameter are estimated quite accurately, while the variance of the latent volatility process estimated with INLAis biased—usually, it is overestimated. Second, we provide the posterior moments for INLA similarly to Table 3 for PMH and RMHMC. These results are presented in Table 6. These results also suggest that the variance of the latent volatility process tends to be overestimated with INLA to a larger degree than with RMHMC, which also overestimates this parameter, as can be seen from the Table 3. Moreover, highest posterior density intervals for the parameter

ϕ

obtained with INLA are larger than those obtained with PMH and RMHMC.

4.4. Fixed-Form Variational Bayes

In this section, we discuss results for the simulation study with fixed-form variational Bayes. We consider the same time series as in the case of comparison between PMH and RMHMC. In Table 7, we present estimated variational parameters and in Figure 6, comparison of the posterior with fixed-form variational Bayes (in blue), RMHMC (histograms from the posterior samples), and INLA (green). It is clear that in some cases, the variational Bayes method performs quite well; in particular, parameter

β

is very well estimated in most of the cases. Only in Figure 6j is the approximate posterior for

β

far from the truth. The variance parameter is underestimated in all cases with VB; this is less severe in the cases when the true variance is relatively small. However, when the true variance is relatively large, the discrepancy between VB estimate and the true value increases, as can be seen from Figure 6o. We observe the opposite picture with INLA: it tends to overestimate the variance of the latent volatility process. Overestimation of the variance of the latent volatility process for stochastic volatility models with INLA has been previously reported in [38]. Additionally, it is reported in [38] that this effect decreases with larger values of

σ_{η}

. The source of this has to be investigated further. RMHMC overestimates the variance to a lesser degree than INLA, and as can be seen from Figure 6o, this is also connected to the value of the ground truth for

σ_{η}

: with larger true value of

σ_{η}

, RMHMC provides more accurate results.

4.5. Comparison of the Methods on the Real Data

In this section, we present posterior distributions of the parameters estimated with different Bayesian inference methods on two real-world time series. First, we consider the mean corrected log-returns of the Australian dollar against the US dollar. The data range from January 1994 to December 2003 with a total of 519 weekly observations. Resulting posterior distributions obtained with different inference methods are presented in Figure 7. Second, we consider daily log-returns for the DAX index from 3 January 2000 until 17 May 2001, which in total constitute 1000 observations. We provide descriptive statistics for both time series in Table A2. Resulting posterior distributions for this time series are presented in Figure 8. The main discrepancies between the methods are largest in the estimation of the parameter

σ_{η}

for both time series, and the results are consistent with the simulation studies in terms of the difference of these discrepancies. As we can see from Figure 7c and Figure 8d, the posterior distribution of

σ_{η}

obtained with variational Bayes is concentrated in smaller values in comparison to the other methods. INLA suggests the higher values for

σ_{η}

in comparison to the other methods. The posterior samples obtained with RMHMC are concentrated in values higher than the ones obtained with PMH. Both sampling methods appear to give results larger than VB but smaller than INLA for the parameter

σ_{η}

.

In Table 8, we present results for efficient sample size (ESS) for both empirical applications and both samplers. Similarly to what is found in simulation studies, ESS is higher in the case of the PMH algorithm. However, if the computational time were taken into account, this advantage would have disappeared, similarly to the results in Table 2.

5. Discussion

This paper reviewed multiple methods for the estimation of nonlinear state-space modes and stochastic volatility modes in particular that appear in Bayesian statistics and machine learning. We in particular focused on representative inference methods from different classes: methods that can recover the posterior distribution ‘exactly’ and the ones that build an approximation. We discussed which methods have the potential to be applied in a multivariate or high-dimensional situation and why they have this potential. Finally, we discovered that while stochastic volatility models are common for use in simulation studies for demonstrating the performance of the methods, usually not enough possible data-generating processes are considered to make a fair comparison. In particular, the performance of the methods is heavily connected to the variance of the latent volatility process.

State-space models can be powerful tools for modeling latent variables in different scientific fields. However, already for univariate time series, they are challenging to estimate. This paper’s main aim was to review and understand the existing classes of methods of estimation (targeting exact posterior or approximating it) and define the direction one can undertake for the estimation of multivariate nonlinear state-space models. The challenge arises from both statistical and computational perspectives. By this, we mean it is hard to develop methods that both provide sufficiently good results from the estimation point of view and are computationally feasible. We have reviewed a number of methods that allow a trade-off between these two aspects. In particular, we have considered particle Markov Chain Monte Carlo and reviewed multiple particle filtering approaches for this method, Riemann Manifold Langevin Hamiltonian Monte Carlo, Integrated Nested Laplace Approximation, and Variational Bayes. All these methods are equipped with the ability to estimate models with intractable likelihoods.

5.1. Sequential Monte Carlo

We compared the auxiliary particle filter with the bootstrap particle filter in terms of the variance of the estimated likelihood. We found that the auxiliary particle filter outperformed the bootstrap particle filter for most of the data-generating processes. As discussed in [27], auxiliary particle filter does not always have a smaller variance of the estimated likelihood. Additionally, we looked into how the variance of the estimated likelihood changes in the parameter space. We found that, in particular, the variance of the latent process affects the variance of the estimated likelihood. This implies that one has to find the balance for the number of particles used in Sequential Monte Carlo and a clever way of finding initial parameter values for the sampling from the posterior, especially when considering multivariate models. The advantage of the auxiliary particle filter from the methodological point of view is that it takes into account current observation

y_{t}

when constructing the proposal for the particles

q (h_{t} ∣ h_{t - 1}, y_{t})

. The method that we did not include in our simulation study, but that possibly can solve the problem with the variance of the estimated likelihood, is the iterated auxiliary particle filter (iAPF): for the proposal of the particles, it uses not only current observation

y_{t}

, but all observations

q (h_{t} ∣ h_{t - 1}, y_{1 : T})

. A backward sequential procedure with an optimization step is used in this proposal mechanism for the particles, which makes the algorithm computationally intensive. The multivariate application of the stochastic volatility model in [26] considers only diagonal case of the matrix

Φ

, and the proposed procedure for the particle proposals does not incorporate such dependence. While this method does introduce an additional computational burden on already computationally intensive method (particle Metropolis–Hastings), it is promising for getting state-of-the-art results for the task of parameter estimation.

5.2. Particle Metropolis-Hastings

Metropolis-Hastings is a general MCMC method that is easy to implement and works well for the univariate model. The estimation results are satisfying when it is properly calibrated, and good mixing of the chains is achieved. It works well in low-dimensional problems but is unlikely to be successful in the case of multivariate stochastic volatility models. Considering the non-diagonal matrix

Φ

in a five-dimensional case, we would have 45 parameters to be estimated. The random walk proposal would be very inefficient even with a reasonable sparseness assumption on

Φ

. Nevertheless, in the low-dimensional model, we get the best estimation results with particle Metropolis–Hastings, where the particle filtering scheme is chosen to be an auxiliary particle filter. From the methods considered in this paper, particle Markov Chain Monte Carlo methods are easiest to adapt to different specifications of the model and are easiest to implement.

5.3. Riemann Manifold Hamiltonian Monte Carlo Methods

Hamiltonian Monte Carlo is a very attractive method for high-dimensional problems as it allows us to explore the parameter space efficiently. In particular, the gain in efficiency comes from avoiding random walk behavior in the proposals. The disadvantage comes from the need of careful calibration since there is no principled way of choosing matrix

M

. RMHMC avoids this problem by exploiting underlying geometry in the proposal mechanism. In our study, we notice that RMHMC results in good mixing of the Markov chains, and the method is generally easy to calibrate, but the estimation of the parameters is not very good. In particular, it appears that the variance of the latent volatility process is challenging for the method. It is not surprising that the PMH algorithm performs better in terms of parameter estimation since we use an auxiliary particle filter for the volatility process estimation and thus take current observation

y_{t}

for the particle proposals. RMHMC does not benefit from similar information when estimating model parameters. Therefore, improved estimation of the volatility process can be one of the directions for improving the performance of RMHMC for the parameter estimation of stochastic volatility models.

5.4. Variational Bayes

As one can see from the illustrative example, in some cases, variational Bayes performs quite well; however, there are also situations when it is far off from the underlying truth. The challenge with stochastic volatility models remains the same: it is difficult to estimate the latent states. In the approach of [6], this is done via Kalman filtering. Therefore, the drawback of linearization of the model will remain and will show in the final results. In this respect, the possible combination of VB and SMC can be of interest. Some advances in this direction have already been made [42].

5.5. Integrated Nested Laplace Approximation

Integrated Nested Laplace Approximation is another approach that works well considering how fast the method is, but it clearly overestimates the variance of the latent volatility process. Additionally, the sparse matrix computation that is used in univariate models is not applicable to the multivariate case. In the multivariate case, the precision matrix in Equation (67) is not sparse, and thus, the method does not benefit from fast sparse matrix computation. An approach that we have not considered in this paper is the Expectation Propagation algorithm. In particular, the authors of [43] propose a way to improve approximate marginals

p (x_{t} ∣ θ, y)

in latent Gaussian fields by using EP. The motivation of the approach builds on the fact that EP can give better approximations than the Laplace approximation in this case. The improvements, however, would come at computational costs. In the univariate case, the extra computational costs do not play a significant role as the algorithm can be parallelized. However, it is hard to say how big the difference would be in the multivariate model, both in terms of improvement in the estimation and loss in computational speed.

6. Conclusions

We reviewed multiple Bayesian inference methods, which both target the exact posterior distribution and approximate it. By comparing methods on various data-generating processes, we notice that variational Bayes tends to underestimate the latent volatility process variance, while INLA and RMHMC, in the cases considered, overestimated this parameter. We also get similar disposition of the results on two real-world data sets. We achieved the best performance with PMH in terms of recovering ground truth and uncertainty quantification. In PMH, the particle filtering step was performed with an auxiliary particle filter. This indicates that filtering with look-ahead approaches, which include current (or future) observations into proposal machinery can improve the performance of the inference method. It is important to note that different data-generating processes for simulation studies would indicate different performance results. Thus, we stress that when using stochastic volatility models, more than one data-generating process should be considered for methods comparison. This practice would allow indicating in which situation a method can fail or perform differently. Our results indicate that fixed-form variational Bayes tends to underestimate the variance of the latent process, while RMHMC and INLA overestimated this parameter. To estimate the stochastic volatility model in the multivariate case, the combination of different strategies appears to be necessary. In a high-dimensional case, the random-walk proposal would become extremely inefficient. At the same time, approximate methods lose their outstanding computational advantage (for example, INLA), and the implementation of these methods in the multivariate case is not straightforward.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The author sincerely thanks anonymous referees for constructive and useful comments. Their suggestions helped improve and clarify this manuscript.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SV	Stochastic volatility
SMC	Sequential Monte Carlo
iAPF	Iterated Auxiliary Particle Filter
PMCMC	Particle Markov Chain Monte Carlo
PMH	Particle Metropolis-Hastings
HMC	Hamiltonian Monte Carlo
RMHMC	Riemann Manifold Hamiltonian Monte Carlo
VB	Variational Bayes
INLA	Integrated Nested Laplace Approximation

Appendix A