Article

Comparing the MCMC Efficiency of JAGS and Stan for the Multi-Level Intercept-Only Model in the Covariance- and Mean-Based and Classic Parametrization

Martin Hecht, Sebastian Weirich and Steffen Zitzmann
1 Hector Research Institute of Education Sciences and Psychology, University of Tübingen, 72072 Tübingen, Germany
2 Institute for Educational Quality Improvement, Humboldt-Universität zu Berlin, 10117 Berlin, Germany
* Author to whom correspondence should be addressed.
Psych 2021, 3(4), 751-779; https://doi.org/10.3390/psych3040048
Submission received: 17 October 2021 / Revised: 19 November 2021 / Accepted: 25 November 2021 / Published: 30 November 2021

Abstract

Bayesian MCMC is a widely used model estimation technique, and software from the BUGS family, such as JAGS, has been popular for over two decades. Recently, Stan entered the market with promises of higher efficiency fueled by advanced and more sophisticated algorithms. With this study, we want to contribute empirical results to the discussion about the sampling efficiency of JAGS and Stan. We conducted three simulation studies in which we varied the number of warmup iterations, the prior informativeness, and the sample sizes and employed the multi-level intercept-only model in the covariance- and mean-based and in the classic parametrization. The target outcome was MCMC efficiency measured as effective sample size per second (ESS/s). Based on our specific (and limited) study setup, we found that (1) MCMC efficiency is much higher for the covariance- and mean-based parametrization than for the classic parametrization, (2) Stan clearly outperforms JAGS when the covariance- and mean-based parametrization is used, and (3) JAGS clearly outperforms Stan when the classic parametrization is used.

1. Introduction

Bayesian statistics is gaining in popularity in many disciplines and is used for many different purposes, for instance, to include previous knowledge, to estimate otherwise intractable models, to model uncertainty (e.g., [1]), and to stabilize parameter estimates (e.g., [2]).
A popular software platform for Bayesian estimation is the BUGS family, including BUGS [3,4] (see [5] for an overview of the history of BUGS), WinBUGS [6], OpenBUGS [7,8], JAGS [9], and NIMBLE [10]. Monnahan et al. (p. 339, [11]) even call BUGS the “workhorse for Bayesian analyses in ecology and other fields for the last 20 years.” More recently, the software Stan, whose development was inspired by the “pathbreaking programs” BUGS and JAGS (p. 538, [12]) and “motivated by the desire to solve problems that could not be solved in a reasonable time [] using other packages” (p. 537, [12]), entered the market with promises of higher computational and algorithmic efficiency. The claimed superiority of Stan over JAGS, which is a more modern member of the BUGS family, is often attributed to more advanced MCMC algorithms. Whereas JAGS uses conjugate and slice sampling, Stan uses the No-U-Turn Sampler (NUTS; [13]), which is an adaptive variant of Hamiltonian Monte Carlo (HMC; [14]). A comprehensive illustration of NUTS and HMC can be found in the work of Monnahan et al. [11], and more detailed technical descriptions are given by Nishio and Arakawa [15] and Betancourt [16].
There has been much debate on efficiency differences between Stan and JAGS, and some authors have explored this question by conducting comparison studies. For instance, Carpenter et al. (p. 10, [17]) found that “Compared to BUGS and JAGS, Stan is often relatively slow per iteration but relatively fast to generate a target effective sample size.” Monnahan et al. (p. 339, [11]) conclude that “[f]or small, simple models there is little practical difference between the two platforms, but Stan outperforms BUGS as model size and complexity grows,” which is in line with Gelman et al.’s (p. 538, [12]) statement that “Stan is faster for complex models and scales better than Bugs or Jags for large data sets”. However, Grant et al. [18] compared the performance (total time per effective sample size) of various software (including StataStan and JAGS) depending on the number of parameters in the Rasch and hierarchical Rasch model and found that “no one software option dominated” (p. 350, [18]). Additionally, Merkle et al. (p. 2, [19]) report that the original Stan implementation in the package blavaan “was not much faster or more efficient than the JAGS approach,” and Wingfeet [20] concluded that “[n]either JAGS nor Stan came out clearly on top.” Further, Bølstad [21] reports mixed results depending on the conjugacy of the priors, with JAGS beating Stan for a fully conjugate hierarchical model. For a completely non-conjugate model with t-distributions instead of normal distributions, the effect reversed, with Stan being much faster.
In summary, the competition between JAGS and Stan has not been settled, and performance might depend on several factors, for instance, the model itself, its complexity, the number of parameters, the priors, and the parametrization.

Purpose and Scope

The purpose of the present work is to contribute to the discussion about the efficiency of JAGS and Stan. To this end, we conducted three simulation studies in which we varied the number of warmup iterations, the informativeness of the prior distributions, the sample sizes, and the model parametrization, and compared the MCMC efficiency operationalized as the effective sample size per second (ESS/s) between JAGS and Stan. The targeted model was the multi-level intercept-only model, which is a popular model in, for example, psychological research and the building block for many more complex multi-level models.
The article is organized into the following sections. First, we describe Simulation Study 1, including the data generation, the simulation design, the analysis approaches and procedures, and the results. As suggested by anonymous reviewers, we extended the scope of our work by adding Simulation Study 2 (in which we explored a small sample scenario) and Simulation Study 3 (in which we used another model parametrization). We conclude with a discussion of our work. Annotated JAGS/rjags and Stan/rstan code and an example generated data file are provided in the Supplementary Materials and in Appendices A–G.

2. Simulation Study 1

In this simulation study, the MCMC efficiency of estimating the covariance- and mean-based parametrization of the multi-level intercept-only model with JAGS and Stan is explored.

2.1. Data Generation

The data generating model was the multi-level intercept-only model with overall mean $\mu = 0$, level 2 variance $\sigma_\theta^2 = 1$, residual variance $\sigma_\epsilon^2 = 1$, $J = 1000$ level 2 units, and $P = 20$ level 1 units:

$y_{jp} \sim N(\theta_j, \sigma_\epsilon^2)$, (1)

$\theta_j \sim N(\mu, \sigma_\theta^2)$, (2)

where $y_{jp}$ is the $p$th value of level 2 unit $j$, and $\theta_j$ is unit $j$'s mean parameter. The number of generated data sets (replications) was $N_\text{repl} = 1000$. Each of these data sets was analyzed within all 16 design cells of the simulation design.
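The data-generating process is fully specified above; the authors' own simulation code is provided in Appendix G and in the Supplementary Materials. As an illustration only, a minimal R sketch (ours, with our own variable names) for generating one data set and the sufficient statistics used later could look as follows:

```r
# Minimal sketch (not the authors' code): generate one data set under the
# multi-level intercept-only model of Equations (1) and (2).
set.seed(1)
J <- 1000            # number of level 2 units
P <- 20              # number of level 1 units per level 2 unit
mu <- 0              # overall mean
sigma2_theta <- 1    # level 2 variance
sigma2_eps   <- 1    # residual variance

theta <- rnorm(J, mean = mu, sd = sqrt(sigma2_theta))                # Equation (2)
y <- matrix(rnorm(J * P, mean = rep(theta, each = P),
                  sd = sqrt(sigma2_eps)), nrow = J, byrow = TRUE)    # Equation (1)

# Sufficient statistics for the covariance- and mean-based parametrization (Section 2.3)
ybar <- colMeans(y)          # sample mean vector (length P)
S    <- (J - 1) * cov(y)     # sample scatter matrix (P x P)
```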

2.2. Simulation Design

We varied the following factors in our simulation study: software (JAGS, Stan), number of warmup iterations (150, 1000), and prior informativeness (ordered categories A (lower informativeness), B, C, D (higher informativeness)). These factors were fully crossed, yielding 16 design cells. As priors, we used an inverse gamma distribution $\text{IG}(\alpha, \beta)$ with shape $\alpha$ and scale $\beta$ for the variance parameters ($\sigma_\theta^2$, $\sigma_\epsilon^2$) and a normal distribution $N(0, \sigma_p^2)$ with variance $\sigma_p^2$ for the mean $\mu$. The levels of the prior informativeness factor refer to differing degrees of prior informativeness: A: $\alpha = \beta = 0.001$, $\sigma_p^2 = 10{,}000$; B: $\alpha = \beta = 0.01$, $\sigma_p^2 = 2500$; C: $\alpha = \beta = 0.1$, $\sigma_p^2 = 100$; D: $\alpha = \beta = 1$, $\sigma_p^2 = 25$. Thus, the prior informativeness ranges from lower (A) to higher (D).
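As an illustration of the design, a small R sketch (ours) of the fully crossed factors and the four prior settings could look like this:

```r
# Sketch (ours): the fully crossed simulation design (2 x 2 x 4 = 16 cells)
design <- expand.grid(
  software = c("JAGS", "Stan"),
  warmup   = c(150, 1000),
  prior    = c("A", "B", "C", "D")
)
nrow(design)   # 16

# Prior settings: IG(alpha, beta) for the variances, N(0, sigma2_p) for mu
priors <- data.frame(
  level    = c("A", "B", "C", "D"),
  alpha    = c(0.001, 0.01, 0.1, 1),
  beta     = c(0.001, 0.01, 0.1, 1),
  sigma2_p = c(10000, 2500, 100, 25)
)
```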

2.3. Analysis

The simulation study was conducted with the statistical software R [22]. For JAGS [23], the R package rjags [24] was used, and for Stan [25], the R package rstan [26]. The analysis model was similar to the data generating model, but we used the covariance- and mean-based implementation of the multi-level intercept-only model [27]:
$S \sim W_P(\Sigma, J - 1)$, (3)

$\bar{y} \sim N_P\!\left(\mu_\Sigma, \tfrac{1}{J}\Sigma\right)$, (4)

where $S$ is the sample scatter matrix, $\bar{y}$ is the sample mean vector, $W_P$ is the Wishart distribution, and $\Sigma$ and $\mu_\Sigma$ are the model-implied covariance matrix and mean vector:

$\Sigma = \begin{pmatrix} \sigma_\theta^2 + \sigma_\epsilon^2 & \sigma_\theta^2 & \cdots & \sigma_\theta^2 \\ \sigma_\theta^2 & \sigma_\theta^2 + \sigma_\epsilon^2 & \cdots & \sigma_\theta^2 \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_\theta^2 & \sigma_\theta^2 & \cdots & \sigma_\theta^2 + \sigma_\epsilon^2 \end{pmatrix}$, (5)

$\mu_\Sigma = (\mu, \mu, \ldots, \mu)'$. (6)

The inverse of $\Sigma$ is

$C = \Sigma^{-1} = \begin{pmatrix} \frac{(P-1)\sigma_\theta^2 + \sigma_\epsilon^2}{\xi} & -\frac{\sigma_\theta^2}{\xi} & \cdots & -\frac{\sigma_\theta^2}{\xi} \\ -\frac{\sigma_\theta^2}{\xi} & \frac{(P-1)\sigma_\theta^2 + \sigma_\epsilon^2}{\xi} & \cdots & -\frac{\sigma_\theta^2}{\xi} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{\sigma_\theta^2}{\xi} & -\frac{\sigma_\theta^2}{\xi} & \cdots & \frac{(P-1)\sigma_\theta^2 + \sigma_\epsilon^2}{\xi} \end{pmatrix}$, (7)

with $\xi = P\sigma_\theta^2\sigma_\epsilon^2 + \sigma_\epsilon^4$.
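Because the inverse in Equation (7) is available in closed form, it can be checked numerically. The following R sketch (ours) builds $\Sigma$ and $C$ for the Simulation Study 1 setting and compares $C$ with solve(Sigma):

```r
# Sketch (ours): model-implied covariance matrix Sigma (Equation (5)) and its
# analytic inverse C (Equation (7)); the analytic inverse matches solve(Sigma).
P <- 20; sigma2_theta <- 1; sigma2_eps <- 1

Sigma <- matrix(sigma2_theta, nrow = P, ncol = P)
diag(Sigma) <- sigma2_theta + sigma2_eps

xi <- P * sigma2_theta * sigma2_eps + sigma2_eps^2
C  <- matrix(-sigma2_theta / xi, nrow = P, ncol = P)
diag(C) <- ((P - 1) * sigma2_theta + sigma2_eps) / xi

max(abs(C - solve(Sigma)))   # numerically zero
```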
This “covariance- and mean-based approach” [27] is conceptually similar to the “marginal Stan method” [19], as both groups of authors capitalize on the idea of integrating latent variables out of the model likelihood; this strategy has, for instance, been shown to be beneficial for MCMC estimation of continuous-time models [28]. Throughout the paper, we consistently use the terms “covariance- and mean-based” and “classic” as defined in the work of Hecht et al. [27], although other labels exist. The former approach is also called the “marginal” approach, as parameters are integrated out of the likelihood. Formulating models for continuous variables in terms of multivariate normal and Wishart distributions has also previously been described in, for instance, the work of Goldstein [29]. Hecht et al. [27] coined the term “classic” to distinguish approaches that include certain parameters from those in which they are integrated out.
The parametrization of the Wishart distribution differs between JAGS and Stan. Whereas JAGS’s dwish(Σ⁻¹, df) function uses the inverse covariance matrix, Stan’s wishart(df, Σ) function uses the covariance matrix. To make the comparison between JAGS and Stan as fair as possible, we avoided time-consuming in-software matrix inversion by passing the parametrization-consistent matrix to the functions, that is, matrix C from Equation (7) for setting up the model in JAGS and matrix Σ from Equation (5) for setting up the model in Stan. Avoiding the inversion of large covariance matrices has also been recommended by Goldstein [29].
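The full model code is given in the appendices and the Supplementary Materials; purely as an illustration of this parametrization difference, the key likelihood statements might look roughly as follows (a sketch based on the descriptions in this section, not the authors' exact code; the variable names S, ybar, C, Sigma, muvec, and JC are ours):

```r
# JAGS-style likelihood: dwish() expects the inverse scale matrix, so the
# analytically inverted matrix C (Equation (7)) is used directly.
jags_likelihood <- "
  S    ~ dwish(C, J - 1)      # C = Sigma^{-1}, built inside the model from Equation (7)
  ybar ~ dmnorm(muvec, JC)    # dmnorm() expects a precision matrix; JC[i,j] = J * C[i,j]
"

# Stan-style likelihood: wishart() expects the scale (covariance) matrix itself,
# so the model-implied matrix Sigma (Equation (5)) is used directly.
stan_likelihood <- "
  S ~ wishart(J - 1, Sigma);
  ybar ~ multi_normal(muvec, (1.0 / J) * Sigma);
"
```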
Each replication ran on one Intel Xeon E5–2687W (3.0 GHz) CPU of a 64-bit Windows Server 2008 system with a total of 48 CPUs, on which we ran 36 replications (each on one core) in parallel. Run times for the following process steps were obtained. For JAGS, the warmup run time [jags.model()] and the sampling run time [jags.samples()] were determined by the before-after difference of time stamps obtained from the function Sys.time(). For Stan, the warmup and sampling run times can be obtained from the console output of the function sampling() and are also retrievable from the returned results object (attributes(fitted@sim$samples[[1]])[["elapsed_time"]]). Additionally, we recorded the time to set up and compile the Stan model [stan_model(), stanc()] and also the run time of the function sampling() via Sys.time() differences. We report all run times in seconds. The number of warmup iterations [set via argument n.adapt in jags.model() and argument warmup in Stan’s sampling()] was varied in the simulation design (see above), whereas the number of sampling iterations [set via argument n.iter in jags.samples() and argument iter in Stan’s sampling()] was 100,000. Only these 100,000 sampling iterations served as the basis for computing further statistics (see next paragraph). Hence, the warmup iterations were excluded from further processing and can be considered as omitted “burn-in”. Whether omission of additional burn-in was needed or not was determined by visual inspections of trace plots and based on convergence statistics (potential scale reduction (PSR)). One chain per parameter was used without thinning. Starting values were the true parameter values (see Data Generation above).
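A minimal R sketch (ours) of this timing procedure, with hypothetical file, object, and parameter names (model.jags, model.stan, data_jags, data_stan, inits, and the monitored parameter names), could look like this:

```r
library(rjags)
library(rstan)

# JAGS: warmup and sampling run times via Sys.time() differences
t0  <- Sys.time()
jm  <- jags.model(file = "model.jags", data = data_jags, inits = inits,
                  n.chains = 1, n.adapt = 150)                       # warmup
t1  <- Sys.time()
smp <- jags.samples(jm, variable.names = c("mu", "sigma2.theta", "sigma2.eps"),
                    n.iter = 100000)                                 # sampling
t2  <- Sys.time()
jags_warmup_time   <- as.numeric(difftime(t1, t0, units = "secs"))
jags_sampling_time <- as.numeric(difftime(t2, t1, units = "secs"))

# Stan: compilation, then warmup/sampling times reported by rstan
sm  <- stan_model(file = "model.stan")                               # compilation
fit <- sampling(sm, data = data_stan, chains = 1,
                warmup = 150, iter = 150 + 100000)   # iter counts warmup + sampling
stan_times <- attributes(fit@sim$samples[[1]])[["elapsed_time"]]     # seconds
```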
The effective sample size (ESS) and the PSR were computed with the R package shinystan [30] for both JAGS and Stan samples. The mode of the converged chain served as the parameter estimate. As parameter recovery and precision statistics, bias, root mean squared error (RMSE), and the 95% coverage rate were computed. Our target outcome variable is the MCMC efficiency, which we calculated as the ratio of the ESS and the run time (in seconds) of the sampling on one CPU (a similar definition of MCMC efficiency can be found, for example, in the work of Turek et al. [31]). Carpenter et al. (p. 10, [17]) termed this ESS/s (or its inverse) the “most relevant statistic for comparing the efficiency of sampler implementations” because the estimation accuracy is governed by the square root of the ESS, a fact that has also been shown by Zitzmann and Hecht [32] (see also [33]).
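For illustration, ESS/s could be computed along the following lines (a sketch, ours; we use coda::effectiveSize() here, whereas the authors computed the ESS with shinystan, so the numbers may differ slightly):

```r
library(coda)

# JAGS: ESS of mu from the jags.samples() output, divided by the sampling time
ess_jags <- effectiveSize(as.mcmc(as.vector(smp$mu)))
eff_jags <- ess_jags / jags_sampling_time            # ESS per second

# Stan: ESS of mu from the fitted object, divided by the reported sampling time
stan_draws <- rstan::extract(fit, pars = "mu", permuted = FALSE)
ess_stan   <- effectiveSize(as.mcmc(as.vector(stan_draws)))
eff_stan   <- ess_stan / stan_times[["sample"]]      # assuming the named element "sample"
```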

2.4. Results

Descriptive statistics (based on the 100,000 sampling iterations) are presented in Table 1. The maximum PSR values are 1.0002 (JAGS) and 1.0003 (Stan), indicating that all chains had converged. Visual inspection of randomly selected trace plots confirmed trouble-free convergence. Mean ESS values were 74,334 for JAGS and 86,486 for Stan. Hence, Stan produced a higher ESS than JAGS on average.
Average bias is very close to zero ($M_{\text{bias,JAGS}} = -0.0009$, $M_{\text{bias,Stan}} = 0.0005$), RMSEs are practically equal ($M_{\text{RMSE,JAGS}} = 0.0297$, $M_{\text{RMSE,Stan}} = 0.0303$), and the average coverage rates nearly hit 0.95 ($M_{\text{CR,JAGS}} = 0.9608$, $M_{\text{CR,Stan}} = 0.9520$). Bias, RMSE, and coverage rates of JAGS and Stan are very comparable. Thus, both software estimate the parameters of the multi-level intercept-only model similarly well.
Stan needed on average 0.10 s to set up the model [stan_model()] and 56.83 s for compilation [stanc()]. For the warmup, Stan [console output from rstan’s function sampling()] took 0.36 s on average, whereas JAGS [rjags’s function jags.model()] needed 0.87 s on average (however, time to set up the model was included in this warmup run time for JAGS, but not for Stan). For the sampling of the 100,000 values, JAGS [jags.samples()] needed 114.99 s and Stan [console output from rstan’s function sampling()] 52.23 s. Thus, Stan was about twice as fast as JAGS. Considering that a higher ESS was achieved in much shorter time, Stan clearly samples more efficiently than JAGS on average (in our simulation setup). However, in addition to warmup and sampling, Stan’s sampling() function took another 215 seconds on average before returning the samples. Thus, users need to wait much longer for the results of the sampling process when using rstan instead of rjags.
Figure 1 shows the MCMC efficiency (i.e., the effective sample size produced in one second) split by the levels of the simulation factors and the model parameters. In addition to these marginal mean MCMC efficiencies, we present mean MCMC efficiencies for the simulation factors and model parameters split by software (JAGS, Stan) in Figure 2 to investigate interaction effects. The overall mean MCMC efficiency was 1180 ESS/s. Considering software, the average MCMC efficiency is 649 for JAGS and 1712 for Stan. Thus, Stan outperforms JAGS by roughly a factor of two and a half on average. Mean MCMC efficiency is lower for 150 warmup iterations ($M_{\text{warmup}=150} = 1034$) than for 1000 warmup iterations ($M_{\text{warmup}=1000} = 1327$). From Figure 2 it becomes clear that there is an interaction between software and number of warmup iterations. Whereas JAGS's MCMC efficiency is approximately equal for 150 and 1000 warmup iterations ($M_{\text{warmup}=150,\text{JAGS}} = 652$, $M_{\text{warmup}=1000,\text{JAGS}} = 647$), Stan profits very much from more warmup iterations ($M_{\text{warmup}=150,\text{Stan}} = 1416$, $M_{\text{warmup}=1000,\text{Stan}} = 2007$). The prior informativeness has practically no effect on the MCMC efficiency ($M_\text{A} = 1180$, $M_\text{B} = 1175$, $M_\text{C} = 1181$, $M_\text{D} = 1186$). With respect to the three model parameters, the MCMC efficiency differs on average ($M_\mu = 1243$, $M_{\sigma_\theta^2} = 1053$, $M_{\sigma_\epsilon^2} = 1246$), and an interaction with software is evident. Whereas JAGS's MCMC efficiency is equal for both variance parameters ($M_{\sigma_\theta^2,\text{JAGS}} = 539$, $M_{\sigma_\epsilon^2,\text{JAGS}} = 540$), it is higher for the mean ($M_{\mu,\text{JAGS}} = 869$). For Stan, the picture is different. Here, $\mu$ and $\sigma_\theta^2$ show lower MCMC efficiency ($M_{\mu,\text{Stan}} = 1616$, $M_{\sigma_\theta^2,\text{Stan}} = 1567$) than the residual variance $\sigma_\epsilon^2$ ($M_{\sigma_\epsilon^2,\text{Stan}} = 1952$).
Effect sizes $\eta^2$ for the simulation factors are presented in Table 2. Software exhibits by far the highest variance explanation (68.4%). Warmup iterations and parameter explain 5.2% and 2.0% of the variance in MCMC efficiency, respectively, whereas variance explanation by prior informativeness is essentially zero. Interactions with above-zero variance explanation are Software × Warmup Iterations (5.4%), Software × Parameter (4.5%), Warmup Iterations × Parameter (1.0%), and Software × Warmup Iterations × Parameter (1.0%).
In summary, both software estimate the multi-level intercept-only model equally well, but Stan outperforms JAGS in the production of effective sample size per time unit. Further, Stan profits from more warmup iterations, the prior informativeness is practically unrelated to the MCMC efficiency, and the MCMC efficiency differs between model parameters.

3. Simulation Study 2: Small Sample Size

In this simulation study, the MCMC efficiency of estimating the covariance- and mean-based parametrization of the multi-level intercept-only model with JAGS and Stan is explored for a small sample size scenario. The simulation design and the analysis strategy were similar to Simulation Study 1. The data generation was similar as well, except that the number of level 2 units was reduced to J = 100 and the number of level 1 units to P = 5.
The overall mean efficiency in this small sample scenario is 8343 ESS/s and thus roughly seven times higher than in Simulation Study 1, where sample sizes were larger (J = 1000, P = 20). Investigating the mean MCMC efficiencies split by the simulation factors, model parameters, and software (Figure 3) yields basically the same pattern as in Simulation Study 1. Stan outperforms JAGS; however, JAGS's underperformance is not as pronounced as in the large sample scenario. Concerning warmup, the figure again shows that JAGS does not profit from more warmup iterations, whereas Stan does. Prior informativeness exhibits no clear effect. An interaction of parameter and software can again be identified. Whereas JAGS performs better in efficiently estimating $\mu$ than $\sigma_\theta^2$ and $\sigma_\epsilon^2$, Stan is approximately equally efficient in estimating all three model parameters.
In summary, reducing the sample size resulted in higher overall MCMC efficiency, but software differences in MCMC efficiency remained, although JAGS caught up somewhat to Stan.

4. Simulation Study 3: Classic Parametrization

In this simulation study, the MCMC efficiency of estimating the classic parametrization of the multi-level intercept-only model with JAGS and Stan is explored. The data generation and the simulation design were similar to Simulation Study 1. For the analysis model, the classic parametrization was now chosen, in which the parameters of the level 2 units ($\theta_j$ in Equations (1) and (2)) were part of the model formulation and thus needed to be sampled (see [27] for further details on the differences between the classic and the covariance- and mean-based parametrization of the multi-level intercept-only model). With J = 1000 level 2 units, this meant that 1000 additional parameters needed to be sampled (an increase by a factor of 334 compared to Simulation Study 1). Hence, to keep the simulation within manageable boundaries, the number of sampling iterations was reduced to 10,000.
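The classic parametrization code is given in Appendices C and F. As an illustration only, a minimal JAGS-style model (a sketch, ours, embedded as an R string; not the authors' code) shows how the $\theta_j$ enter the model as sampled quantities:

```r
# Sketch (ours): classic parametrization, in which all J level 2 means theta_j are sampled.
classic_jags <- "
model {
  # Priors: gamma priors on the precisions imply IG(alpha, beta) priors on the variances
  inv.sigma2.theta ~ dgamma(alpha, beta)
  inv.sigma2.eps   ~ dgamma(alpha, beta)
  sigma2.theta <- 1 / inv.sigma2.theta
  sigma2.eps   <- 1 / inv.sigma2.eps
  mu ~ dnorm(0, 1 / sigma2.p)                       # dnorm() uses the precision

  for (j in 1:J) {
    theta[j] ~ dnorm(mu, inv.sigma2.theta)          # Equation (2): J additional parameters
    for (p in 1:P) {
      y[j, p] ~ dnorm(theta[j], inv.sigma2.eps)     # Equation (1)
    }
  }
}"
```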
Compared to Simulation Studies 1 and 2 (with the covariance- and mean-based parametrization), two major differences emerged: (1) the overall mean efficiency (137 ESS/s) is far lower, and (2) the software rank order reversed: when employing the classic parametrization, JAGS is much more efficient than Stan (see Figure 4). Again, Stan profits from more warmup iterations, whereas JAGS does not, and prior informativeness has no effect on MCMC efficiency. Concerning the parameters, there is no clear differential picture (i.e., within software, the parameters are estimated with roughly the same efficiency), although Stan seems to have problems with estimating the residual variance $\sigma_\epsilon^2$ (with just 6 ESS/s).
In summary, choosing the classic instead of the covariance- and mean-based implementation resulted in lower overall MCMC efficiency, and JAGS clearly outperformed Stan.

5. Discussion

With this study, we want to contribute empirical results to the discussion about the sampling efficiency of JAGS and Stan. In our limited simulations, Stan outperformed JAGS (i.e., exhibited a higher ESS/s) for the covariance- and mean-based parametrization of the multi-level intercept-only model. However, when the classic parametrization was chosen, JAGS outperformed Stan. Additionally, we found that Stan profited from more warmup iterations and that prior informativeness had no effect on the MCMC efficiency.
The results from our simulation study most certainly will not generalize to other models (or model parametrizations), conditions, or other values/levels of our simulation factors.
Our model was a very simple one. Stan is often said to gain in efficiency with increasing model complexity (e.g., [11,12]). We already found higher efficiency for Stan (roughly by a factor of 2.5) for this simple model. It would be interesting to see whether Stan outperforms JAGS even more for more complex multi-level models and other models relevant for psychological research.
Besides the model itself, the parametrization of the model is crucial as well. We showed that with the classic parametrization of the multi-level intercept-only model, JAGS clearly outperformed Stan. As the classic parametrization is presumably the most intuitive and easiest to implement for psychologists, the software recommendation clearly leans towards JAGS here. However, as shown, users can speed up their analyses (i.e., estimate the model parameters more efficiently) by switching to the covariance- and mean-based parametrization (a comprehensive tutorial on how to set up this parametrization is given by Hecht et al. [27]). For this parametrization, Stan clearly outperforms JAGS and should be the software of choice (if MCMC efficiency is the target criterion). Our results are in line with those of Merkle et al. [19], who reported superiority of their “new Stan approach”, which is conceptually similar to Hecht et al.’s [27] covariance- and mean-based approach. The efficiency of this new approach was so convincing that the authors of the R package blavaan made it their new default method (p. 12, [19]). However, in contrast to Hecht et al. ([27], Equations (13) and (14)), Merkle et al. ([19], Equations (5) and (6)) use multivariate normal marginal distributions of the sample value vectors (presumably because this is a more flexible and convenient approach, for instance, for handling missing data). Future research could investigate performance differences between the Wishart and the multivariate normal parametrization. In our experience with JAGS (not published), multivariate normal modeling of the sample value vectors was much slower than Wishart modeling of the sample scatter matrix, but this might be different for Stan. Future research could compare more parametrizations of multi-level models.
An anonymous reviewer pointed us to an interesting advantage of marginalized model parametrizations which is detailed in the work of Nielsen et al. [34]. In the classic parametrization that includes random effects, positive within-cluster correlations are assumed. The covariance-based parametrization, in which random effects are integrated out, is more general because negative correlations are allowed as well.
Another approach to improve the efficiency of MCMC sampling is to formulate the model in a way that reduces autocorrelation in the chains. Various strategies for various models have been proposed (e.g., [35,36,37]). According to Monnahan et al. (p. 344, [11]), “MCMC efficiency for hierarchical models depends on the random effects parameterization, with the centered and non-centered complementary forms being useful for a broad class of models”. In our simulations, we solely used the centered form (Equation (2)) because it is presumably the “natural” form researchers obtain when they translate their model equations into code. In the alternative non-centered form, the random effects ($\theta_j$) would not be modeled directly but instead expressed as $\theta_j = \mu + \sigma_\theta Z$ with $Z \sim N(0, 1)$. Results from Monnahan et al.’s (p. 346, [11]) comparisons of JAGS and Stan suggest that Stan is “more sensitive to the parameterization of the random effects, suggesting analysts use non-centered parameterizations to improve performance”. Future research could generate further evidence on how to improve MCMC efficiency by model reformulations and explore which software profits most.
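As an illustration (ours, not code used in this study), the non-centered form could be written in Stan-style code roughly as follows (the data block with J, P, and y is omitted for brevity):

```r
noncentered_stan <- "
parameters {
  real mu;
  real<lower=0> sigma_theta;
  real<lower=0> sigma_eps;
  vector[J] z;                                 // standardized random effects
}
transformed parameters {
  vector[J] theta = mu + sigma_theta * z;      // theta_j = mu + sigma_theta * Z
}
model {
  // priors on mu, sigma_theta, and sigma_eps omitted in this fragment
  z ~ normal(0, 1);                            // Z ~ N(0, 1)
  for (j in 1:J)
    y[j] ~ normal(theta[j], sigma_eps);        // level 1 likelihood
}"
```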
Concerning sample sizes, other studies (e.g., [11]) report that Stan’s relative efficiency over JAGS increases with increasing sample size. Our results are in line with this finding (for the covariance- and mean-based parametrization); in our small sample size scenario, Stan did not outperform JAGS as much as in the large sample size scenario. As we only had two sample size scenarios (and varied sample size only for the covariance- and mean-based parametrization), generalizations are limited. Future research could investigate the dependency of the efficiency on sample size and the moderating effect of sample size on the difference in efficiency between both software in more detail, especially with respect to the model parametrization. Additionally, the relative performance of different methods might depend on the ratio of the variance components in the simulation model.
We used a high number of iterations to achieve sensibly sized absolute runtimes and to obtain reliable ESS estimates [38]. We assume that the MCMC efficiency (ESS/s) is constant over the course of sampling. Hence, given the chain has converged, the length of the sampling should have no effect on the MCMC efficiency.
We limited our study to two Bayesian software packages that are popular in psychological research, namely JAGS and Stan. Other Bayesian software packages, for instance, NIMBLE [10], PyMC3 [39], and LaplacesDemon [40], exist and have been the focus of research (e.g., [38,41,42,43]). Future software comparisons should take these packages even more strongly into account.
The effect of the warmup iterations on Stan’s MCMC efficiency is in line with the functioning of the NUTS algorithm as this algorithm needs to tune the step size to achieve a target acceptance rate and to tune the mass matrix whose function is to transform the posterior to have a simpler geometry for sampling (e.g., [11]). As the optimization of the step size and the mass matrix are mutually dependent, a sufficiently long warmup is needed [11]. We had only two warmup iteration sizes (150 and 1000), with 150 being the lowest default size for one warmup cycle (p. 150, [44]). An anonymous reviewer pointed out that the marginal variances of the parameters in our model are rather small; therefore, it is not surprising that one warmup cycle was not sufficient for optimization. In future research, it would be interesting to explore the shape of the functional dependency of the MCMC efficiency on the number of warmup iterations and determine areas of diminishing marginal utility to derive rule-of-thumb thresholds for sufficient warmup of the NUTS algorithm. Additionally, one could facilitate warmup optimization by programming all parameters “so that they have unit scale and so that posterior correlation is reduced; [] For Hamiltonian Monte Carlo, this implies a unit mass matrix, which requires no adaptation as it is where the algorithm initializes.” (p. 266, [45]). Further, the shown warmup dependency should be taken into consideration when comparing results from studies that differed in this aspect. Merkle et al. (p. 8, [19])—who used 300 warmup iterations for Stan—already pointed to this problem and concluded that therefore “the ESS/s metric is somewhat crude”.
We found no effect of the prior informativeness on the MCMC efficiency. However, this result is just valid for the specific four variations of prior informativeness (and all other specifications) in our simulation setup and might not generalize. In fact, in simulation runs with much lower informativeness (not reported), Stan’s MCMC efficiency was lower. Future research could investigate the prior informativeness effect on efficiency with a wider range of informativeness. Additionally, in our simulation, the amount of data was rather high, marginalizing the effect of the priors. In scenarios with less data, prior effects might arise.
The conjugacy of priors may play a role in MCMC efficiency as well. Using conjugate priors is usually considered computationally more efficient than using non-conjugate priors. Some software/algorithms might even profit from conjugate priors more than others. For instance, results from Bølstad [21] suggest that JAGS performs better than Stan when priors are conjugate and worse when priors are non-conjugate. In our simulations, we used conjugate priors for all parameters in all conditions, leaving prior conjugacy a constant. Thus, we cannot contribute to the discussion of the effect of prior conjugacy on MCMC efficiency. Researchers could pick up this interesting topic in the future.
Although Stan was clearly more efficient than JAGS in the sampling phase, the total time users encounter until samples are returned depends on additional steps. Stan models need to be compiled into C++ code prior to sampling, which took considerable time in our study (on average, model compilation took longer than the sampling of 100,000 iterations). Of course, once compiled, models may be reused to avoid the extensive compilation time. Still, users who run a model for the first time (or are not aware that previously compiled models may be reused) must incur the compilation time, and rstan’s primary user-level function stan(), which includes all processes, may mask the opportunity for reusing compiled models and contribute to a lengthy user experience. Further, we encountered a relatively large consumption of additional time besides warmup and sampling by rstan’s sampling() function. In fact, the additional time was about twice the time for all other steps combined. We are not sure what this function needs this additional time for. According to an anonymous reviewer, the additional time is partly due to calculating ESS and PSR. As this needs to be done for JAGS as well, fair software comparisons would also need to take this into account. Maybe future versions of rstan’s sampling() function could include an option to return the samples directly after sampling or explicitly report the time needed for additional calculations of convergence and precision statistics.
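For instance, with rstan the compilation step can be separated from the sampling step so that a compiled model is reused across data sets (a sketch, ours; data_list is a hypothetical list of data sets):

```r
# Compile once with stan_model(), then reuse the compiled model for each data set,
# instead of calling stan(), which bundles compilation and sampling in one call.
sm <- stan_model(file = "model.stan")            # C++ translation and compilation, done once
fits <- lapply(data_list, function(d)
  sampling(sm, data = d, chains = 1, warmup = 1000, iter = 1000 + 100000)
)
```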
To conclude, in our specific study setup, the picture concerning MCMC efficiency was mixed. Stan clearly outperformed JAGS when the covariance- and mean-based parametrization of the multi-level intercept-only model was used, and JAGS clearly outperformed Stan when the classic parametrization was used. In both software, MCMC efficiency is much higher for the covariance- and mean-based parametrization than for the classic parametrization.

Supplementary Materials

The following are available online at https://www.mdpi.com/article/10.3390/psych3040048/s1. The supplementary materials include an exemplary generated data set in Rdata format, JAGS and Stan code for the multi-level intercept-only model in the covariance- and mean-based and in the classic parametrization, R code to run the models with rjags and rstan, and R code to run simulations.

Author Contributions

The authors declare the following contributions (as defined by http://credit.niso.org (accessed on 30 November 2021)) to this article: M.H.: conceptualization, data curation, formal analysis, investigation, methodology, project administration, software, supervision, validation, visualization, writing: original draft, writing—review and editing; S.W.: conceptualization, data curation, formal analysis, investigation, methodology, software, validation, writing—review and editing; S.Z.: conceptualization, methodology, supervision, validation, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

In this study, only generated data was used. The data generating code is provided in Appendix G.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. rjags Code

(The rjags code listing is provided as images in the published article; see also the Supplementary Materials.)

Appendix B. JAGS Code (Covariance- and Mean-Based Parametrization)

(The JAGS model code for the covariance- and mean-based parametrization is provided as an image in the published article; see also the Supplementary Materials.)

Appendix C. JAGS Code (Classic Parametrization)

(The JAGS model code for the classic parametrization is provided as an image in the published article; see also the Supplementary Materials.)

Appendix D. rstan Code

(The rstan code listing is provided as images in the published article; see also the Supplementary Materials.)

Appendix E. Stan Code (Covariance- and Mean-Based Parametrization)

(The Stan model code for the covariance- and mean-based parametrization is provided as an image in the published article; see also the Supplementary Materials.)

Appendix F. Stan Code (Classic Parametrization)

(The Stan model code for the classic parametrization is provided as an image in the published article; see also the Supplementary Materials.)

Appendix G. Simulation Code

(The simulation code is provided as images in the published article; see also the Supplementary Materials.)

References

1. van de Schoot, R.; Winter, S.D.; Ryan, O.; Zondervan-Zwijnenburg, M.; Depaoli, S. A systematic review of Bayesian articles in psychology: The last 25 years. Psychol. Methods 2017, 22, 217–239.
2. Zitzmann, S. A computationally more efficient and more accurate stepwise approach for correcting for sampling error and measurement error. Multivar. Behav. Res. 2018, 53, 612–632.
3. Gilks, W.R.; Thomas, A.; Spiegelhalter, D.J. A language and program for complex Bayesian modelling. Statistician 1994, 43, 169–177.
4. Lunn, D.; Jackson, C.; Best, N.; Thomas, A.; Spiegelhalter, D. The BUGS Book; CRC Press: Boca Raton, FL, USA, 2013.
5. Lunn, D.; Spiegelhalter, D.; Thomas, A.; Best, N. The BUGS project: Evolution, critique and future directions. Stat. Med. 2009, 28, 3049–3067.
6. Lunn, D.J.; Thomas, A.; Best, N.; Spiegelhalter, D. WinBUGS—A Bayesian modelling framework: Concepts, structure, and extensibility. Stat. Comput. 2000, 10, 325–337.
7. Spiegelhalter, D.; Thomas, A.; Best, N.; Lunn, D. OpenBUGS Version 3.2.3 User Manual. 2014. Available online: http://www.openbugs.net/w/Manuals (accessed on 15 May 2021).
8. Thomas, A.; O’Hara, R.; Ligges, U.; Sturtz, S. Making BUGS open. R News 2006, 6, 12–17. Available online: http://cran.r-project.org/doc/Rnews (accessed on 30 October 2021).
9. Plummer, M. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, 20–22 March 2003.
10. de Valpine, P.; Turek, D.; Paciorek, C.J.; Anderson-Bergman, C.; Temple Lang, D.; Bodik, R. Programming with models: Writing statistical algorithms for general model structures with NIMBLE. J. Comput. Graph. Stat. 2017, 26, 403–413.
11. Monnahan, C.C.; Thorson, J.T.; Branch, T.A. Faster estimation of Bayesian models in ecology using Hamiltonian Monte Carlo. Methods Ecol. Evol. 2017, 8, 339–348.
12. Gelman, A.; Lee, D.; Guo, J. Stan: A probabilistic programming language for Bayesian inference and optimization. J. Educ. Behav. Stat. 2015, 40, 530–543.
13. Hoffman, M.D.; Gelman, A. The No-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. J. Mach. Learn. Res. 2014, 15, 1593–1623.
14. Neal, R. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo; Brooks, S., Gelman, A., Jones, G.L., Meng, X.-L., Eds.; Chapman and Hall/CRC: Boca Raton, FL, USA, 2011; pp. 116–162.
15. Nishio, M.; Arakawa, A. Performance of Hamiltonian Monte Carlo and No-U-Turn Sampler for estimating genetic parameters and breeding values. Genet. Sel. Evol. 2019, 51, 1–12.
16. Betancourt, M. A conceptual introduction to Hamiltonian Monte Carlo. arXiv 2018, arXiv:1701.02434.
17. Carpenter, B.; Gelman, A.; Hoffman, M.D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; Riddell, A. Stan: A probabilistic programming language. J. Stat. Softw. 2017, 76, 1–32.
18. Grant, R.L.; Furr, D.C.; Carpenter, B.; Gelman, A. Fitting Bayesian item response models in Stata and Stan. Stata J. 2017, 17, 343–357.
19. Merkle, E.C.; Fitzsimmons, E.; Uanhoro, J.; Goodrich, B. Efficient Bayesian structural equation modeling in Stan. arXiv 2020, arXiv:2008.07733.
20. Wingfeet. JAGS and Stan. 24 August 2014. Available online: https://www.r-bloggers.com/2014/08/jags-and-stan (accessed on 25 June 2021).
21. Bølstad, J. How Efficient is Stan Compared to JAGS? Conjugacy, Pooling, Centering, and Posterior Correlations. 2 January 2019. Available online: www.boelstad.net/post/stan_vs_jags_speed (accessed on 1 July 2021).
22. R Core Team. R: A Language and Environment for Statistical Computing (Version 4.0.5) [Software]; R Foundation for Statistical Computing: Vienna, Austria, 2021. Available online: https://www.r-project.org (accessed on 15 May 2021).
23. Plummer, M. JAGS (Version 4.3.0) [Software]. 2017. Available online: https://sourceforge.net/projects/mcmc-jags/files (accessed on 15 May 2021).
24. Plummer, M. rjags: Bayesian Graphical Models Using MCMC (Version 4-10) [Software]. 2019. Available online: https://cran.r-project.org/package=rjags (accessed on 15 May 2021).
25. Stan Development Team. Stan (Version 2.27) [Software]. 2021. Available online: https://mc-stan.org (accessed on 21 June 2021).
26. Stan Development Team. rstan: The R Interface to Stan (Version 2.21.2) [Software]. 2020. Available online: https://cran.r-project.org/package=rstan (accessed on 15 May 2021).
27. Hecht, M.; Gische, C.; Vogel, D.; Zitzmann, S. Integrating out nuisance parameters for computationally more efficient Bayesian estimation—An illustration and tutorial. Struct. Equ. Model. A Multidiscip. J. 2020, 27, 483–493.
28. Hecht, M.; Zitzmann, S. A computationally more efficient Bayesian approach for estimating continuous-time models. Struct. Equ. Model. A Multidiscip. J. 2020, 27, 829–840.
29. Goldstein, H. Multilevel Statistical Models; John Wiley & Sons: Hoboken, NJ, USA, 2011.
30. Gabry, J. Shinystan: Interactive Visual and Numerical Diagnostics and Posterior Analysis for Bayesian Models (Version 2.5.0) [Software]. 2018. Available online: http://cran.r-project.org/package=shinystan (accessed on 15 May 2021).
31. Turek, D.; de Valpine, P.; Paciorek, C.J. Efficient Markov chain Monte Carlo sampling for hierarchical hidden Markov models. Environ. Ecol. Stat. 2016, 23, 549–564.
32. Zitzmann, S.; Hecht, M. Going beyond convergence in Bayesian estimation: Why precision matters too and how to assess it. Struct. Equ. Model. A Multidiscip. J. 2019, 26, 646–661.
33. Zitzmann, S.; Weirich, S.; Hecht, M. Using the effective sample size as the stopping criterion in Markov chain Monte Carlo with the Bayes module in Mplus. Psych 2021, 3, 336–347.
34. Nielsen, N.M.; Smink, W.A.C.; Fox, J.-P. Small and negative correlations among clustered observations: Limitations of the linear mixed effects model. Behaviormetrika 2021, 48, 51–77.
35. Gelman, A.; Hill, J. Data Analysis Using Regression and Multilevel/Hierarchical Models; Cambridge University Press: Cambridge, UK, 2006.
36. Kruschke, J.K. Doing Bayesian Data Analysis: A Tutorial Introduction with R and BUGS, 2nd ed.; Academic Press: Cambridge, MA, USA, 2015.
37. Papaspiliopoulos, O.; Roberts, G.O.; Sköld, M. A General Framework for the Parametrization of Hierarchical Models. Stat. Sci. 2007, 22.
38. Paganin, S.; Paciorek, C.J.; Wehrhahn, C.; Rodriguez, A.; Rabe-Hesketh, S.; de Valpine, P. Computational methods for Bayesian semiparametric Item Response Theory models. arXiv 2021, arXiv:2101.11583.
39. Salvatier, J.; Wiecki, T.V.; Fonnesbeck, C. Probabilistic programming in Python using PyMC3. PeerJ Comput. Sci. 2016, 2.
40. Statisticat, LLC. LaplacesDemon: Complete Environment for Bayesian Inference [Software]. 2021. Available online: https://web.archive.org/web/20150206004624/http://www.bayesian-inference.com/software (accessed on 22 August 2021).
41. Beraha, M.; Falco, D.; Guglielmi, A. JAGS, NIMBLE, Stan: A Detailed Comparison Among Bayesian MCMC Software. arXiv 2021, arXiv:2107.09357.
42. De Valpine, P. Some Comparisons between NIMBLE, JAGS and Stan for a Couple of Examples from Gelman and Hill (2007). 8 August 2021. Available online: https://nature.berkeley.edu/~pdevalpine/MCMC_comparisons/some_ARM_comparisons/nimble_ARM_comparisons.html (accessed on 27 August 2021).
43. Ponisio, L.C.; Valpine, P.; Michaud, N.; Turek, D. One size does not fit all: Customizing MCMC methods for hierarchical models using NIMBLE. Ecol. Evol. 2020, 10, 2385–2416.
44. Stan Development Team. Stan Reference Manual (Version 2.27). 2021. Available online: https://mc-stan.org/docs/2_27/reference-manual-2_27.pdf (accessed on 21 June 2021).
45. Stan Development Team. Stan User’s Guide (Version 2.27). 2021. Available online: https://mc-stan.org/docs/2_27/stan-users-guide-2_27.pdf (accessed on 21 June 2021).
Figure 1. Simulation Study 1: ESS performance (ESS/s) of simulation factors software (JAGS/Stan), number of warmup iterations (150/1000), prior informativeness (A/B/C/D), and model parameters ($\mu$, $\sigma_\theta^2$, $\sigma_\epsilon^2$). Number of level 2 units: J = 1000. Number of level 1 units: P = 20. Covariance- and mean-based model parametrization.
Figure 2. Simulation Study 1: ESS performance (ESS/s) of number of warmup iterations (150/1000), prior informativeness (A/B/C/D), and model parameters ($\mu$, $\sigma_\theta^2$, $\sigma_\epsilon^2$) by software (JAGS/Stan). Number of level 2 units: J = 1000. Number of level 1 units: P = 20. Covariance- and mean-based model parametrization.
Figure 3. Simulation Study 2 (small sample size): ESS performance (ESS/s) of number of warmup iterations (150/1000), prior informativeness (A/B/C/D), and model parameters ($\mu$, $\sigma_\theta^2$, $\sigma_\epsilon^2$) by software (JAGS/Stan). Number of level 2 units: J = 100. Number of level 1 units: P = 5. Covariance- and mean-based model parametrization.
Figure 4. Simulation Study 3 (classic parametrization): ESS performance (ESS/s) of number of warmup iterations (150/1000), prior informativeness (A/B/C/D), and model parameters ($\mu$, $\sigma_\theta^2$, $\sigma_\epsilon^2$) by software (JAGS/Stan). $M_\theta$ is the average MCMC efficiency of the level 2 unit parameters $\theta_j$. Number of level 2 units: J = 1000. Number of level 1 units: P = 20. Classic model parametrization.
Table 1. Descriptive statistics of convergence, precision, recovery, accuracy, and run time by software (Simulation  Study  1).
| Statistic | Software | M | SD | Min | Max |
|---|---|---|---|---|---|
| Convergence/Precision | | | | | |
| PSR | JAGS | 1.0000 | 0.0000 | 1.0000 | 1.0002 |
| | Stan | 1.0000 | 0.0000 | 1.0000 | 1.0003 |
| ESS | JAGS | 74,334 | 17,950 | 46,727 | 100,000 |
| | Stan | 86,486 | 17,591 | 27,583 | 100,000 |
| Recovery/Accuracy | | | | | |
| Bias | JAGS | −0.0009 | 0.0016 | −0.0030 | 0.0010 |
| | Stan | 0.0005 | 0.0005 | −0.0002 | 0.0010 |
| RMSE | JAGS | 0.0297 | 0.0155 | 0.0101 | 0.0474 |
| | Stan | 0.0303 | 0.0164 | 0.0096 | 0.0489 |
| Coverage rate 95% | JAGS | 0.9608 | 0.0054 | 0.9530 | 0.9670 |
| | Stan | 0.9520 | 0.0075 | 0.9420 | 0.9620 |
| Run time (s) | | | | | |
| Warmup | JAGS | 0.87 | 0.81 | 0.13 | 21.95 |
| | Stan | 0.36 | 0.24 | 0.08 | 2.05 |
| Sampling | JAGS | 114.99 | 7.48 | 89.55 | 155.42 |
| | Stan | 52.23 | 8.23 | 30.18 | 97.65 |
| Translation | Stan | 0.10 | 0.02 | 0.06 | 0.37 |
| Compilation | Stan | 56.83 | 7.29 | 42.09 | 132.18 |
| sampling() | Stan | 214.65 | 12.86 | 185.62 | 270.97 |
Note. Run time for rstan’s sampling() function is the additional time that this function runs besides warmup and sampling.
Table 2. Effect size η 2 for the simulation factors (Simulation  Study  1).
| Factor | η² |
|---|---|
| Software | 0.684 |
| Warmup Iterations | 0.052 |
| Prior Informativeness | 0.000 |
| Parameter | 0.020 |
| Software × Warmup Iterations | 0.054 |
| Software × Prior Informativeness | 0.000 |
| Software × Parameter | 0.045 |
| Warmup Iterations × Prior Informativeness | 0.000 |
| Warmup Iterations × Parameter | 0.010 |
| Prior Informativeness × Parameter | 0.000 |
| Software × Warmup Iterations × Prior Informativeness | 0.000 |
| Software × Warmup Iterations × Parameter | 0.010 |
| Software × Prior Informativeness × Parameter | 0.000 |
| Warmup Iterations × Prior Informativeness × Parameter | 0.000 |
| Software × Warmup Iterations × Prior Informativeness × Parameter | 0.000 |
Note. Dependent variable: MCMC efficiency (ESS/s).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
