
Inference Using Simulated Neural Moments

Department of Economics and Economic History and MOVE, Universitat Autònoma de Barcelona, 08193 Bellaterra, Spain
Barcelona School of Economics, 08005 Barcelona, Spain
Econometrics 2021, 9(4), 35;
Submission received: 25 February 2021 / Revised: 15 September 2021 / Accepted: 18 September 2021 / Published: 24 September 2021


This paper studies method of simulated moments (MSM) estimators that are implemented using Bayesian methods, specifically Markov chain Monte Carlo (MCMC). Motivation and theory for the methods are provided by Chernozhukov and Hong (2003). The paper shows, experimentally, that confidence intervals using these methods may have coverage which is far from the nominal level, a result which has parallels in the literature that studies overidentified GMM estimators. A neural network may be used to reduce the dimension of an initial set of moments to the minimum number that maintains identification, as in Creel (2017). When MSM-MCMC estimation and inference are based on such moments, and a continuously updating criterion function is used, confidence intervals have statistically correct coverage in all cases studied. The methods are illustrated by application to several test models, including a small DSGE model, and to a jump-diffusion model for returns of the S&P 500 index.

1. Introduction

It has long been known that classical inference methods based on first-order asymptotic theory, when applied to the generalized method of moments (GMM) estimator, may lead to unreliable results, in the form of substantial finite sample biases and variances, and incorrect coverage of confidence intervals, especially when the model is overidentified (Donald et al. 2009; Hall and Horowitz 1996; Hansen et al. 1996; Tauchen 1986). In another strand of the literature, Chernozhukov and Hong (2003) introduced Laplace-type estimators, which allow estimation and inference for classical statistical methods (those defined by optimization of an objective function) to be done by working with the elements of a tuned Markov chain, so that potentially difficult or unreliable steps, such as optimization or computation of asymptotic standard errors, may be avoided. A third important strand of literature is simulation-based estimation. The strands of moment-based estimation, simulation, and Laplace-type methods meet in certain applications. The code by Gallant and Tauchen (Gallant and Tauchen 2010) for efficient method of moments estimation (Gallant and Tauchen 1996), which has been used in numerous papers, is an example. Another is Christiano et al. (2010) (see also Christiano et al. 2016), which proposes a Laplace-type estimation methodology that uses simulated moments defined in terms of impulse response functions for estimation of macroeconomic models. Very similar methodologies may be found in the broad Approximate Bayesian Computation literature, some of which uses MCMC methods and criterion functions that involve simulated moments (e.g., Marjoram et al. 2003).
Given the uneven performance of inference in classical GMM applications, one may wonder how reliable inferences made using the combination of Laplace-type methods and simulated moments are. Henceforth, this combination is referred to as MSM-MCMC, because the specific Laplace-type method considered here uses the criterion function of the MSM estimator to define the likelihood that determines acceptance/rejection in Metropolis-Hastings MCMC, as was the focus of Chernozhukov and Hong (2003). This paper provides experimental evidence that confidence intervals derived from such estimators may have poor coverage when the moments over-identify the parameters, a result that parallels the above-cited results for classical GMM estimators. It goes on to provide evidence that the simulated neural moments introduced in Creel (2017), which are just-identifying, cause inferences to become much more reliable when used with MSM-MCMC techniques, especially when the continuously updating version of the GMM criterion is used. This paper is a continuation of the line of research in Creel (2017), its main new contribution being the experimental confirmation that inferences based upon simulated neural moments are reliable. The paper concludes with an example that uses the methods to estimate a jump-diffusion model for returns of the S&P 500 index.
Section 2 reviews how Laplace-type methods may be used with simulated moments, giving the MSM-MCMC combination, and Section 3 then discusses how neural networks may be used to reduce the dimension of the moment conditions. Section 4 presents four test models, and Section 5 gives results for these models. Section 6 illustrates the methods in the context of an empirical analysis of a more complex model, namely, a jump-diffusion model for financial returns, and Section 7 summarizes the conclusions. The SNM archive (release version 1.2) contains all the code and results reported in this paper. These results were obtained using the Julia package SimulatedNeuralMoments.jl (release version 0.1.0), which provides a convenient way to use the methods for other research projects.

2. Simulated Moments, Indirect Likelihood, and MSM-MCMC Inference

This section summarizes results from the part of the simulation-based estimation literature that bases estimation on a statistic, including (Gallant and Tauchen 1996; Gouriéroux et al. 1993; McFadden 1989; Smith 1993), among others, which is reviewed in (Jiang and Turnbull 2004). Suppose there is a model $M(\theta)$ which generates data from a probability distribution $P(\theta)$ which depends on the unknown parameter vector $\theta$. $M(\theta)$ is fully known up to $\theta$, so that we can make draws of the data from the model, given $\theta$. Let $Y = Y(\theta)$ be a sample drawn at the parameter vector $\theta$, where $\theta \in \Theta \subset \mathbb{R}^k$ and $\Theta$ is a known parameter space. Suppose we have selected a finite-dimensional statistic $Z = Z(\theta) = Z(Y(\theta))$ upon which to base estimation, and assume that the statistic satisfies a central limit theorem, uniformly, for all values of $\theta$ of interest:
$$\sqrt{n}\left(Z - E_\theta Z\right) \xrightarrow{d} N\left(0, \bar{\Sigma}(\theta)\right) \tag{1}$$
Let $Z^s(\theta) = Z(Y^s(\theta))$ be the statistic evaluated using an artificial sample drawn from the model at the parameter value $\theta$. This statistic has the same asymptotic distribution as does $Z(\theta)$, and furthermore, the two statistics are independent of one another. With $S$ such simulated statistics, define $m(\theta) = Z(\theta) - S^{-1}\sum_s Z^s(\theta)$ and $\bar{V}(\theta) = (1 + S^{-1})\,\bar{\Sigma}(\theta)$. We can easily obtain
$$\sqrt{n}\, m(\theta) \xrightarrow{d} N\left(0, \bar{V}(\theta)\right). \tag{2}$$
Now, suppose we have a real sample which was generated at the unknown true parameter value $\theta_0$, and let $\hat{Z}$ be the associated value of the statistic. Define $\hat{m}(\theta) = \hat{Z} - S^{-1}\sum_s Z^s(\theta)$. With this, and Equation (2), we can define the indirect likelihood function1
$$L = L(\theta \mid \hat{Z}) = \left|2\pi \hat{\bar{V}}(\theta)\right|^{-1/2} \exp\left(-\tfrac{1}{2}H\right) \tag{3}$$
$$H = H(\theta \mid \hat{Z}) = n \cdot \hat{m}(\theta)^{\top}\, \hat{\bar{V}}^{-1}(\theta)\, \hat{m}(\theta), \tag{4}$$
where $\hat{\bar{V}}(\theta)$ is a consistent estimate of $\bar{V}(\theta)$.
To estimate $\bar{V}(\theta)$, one possibility is to use a fixed sample-based estimate that does not rely on an estimate of $\theta_0$ (see, for example, Christiano et al. 2010, 2016). Another possibility is to (1) compute the estimate $\hat{\bar{\Sigma}}(\theta)$ of the covariance matrix in expression (1) as the sample covariance of $R$ draws of $\sqrt{n}\, Z^s(\theta)$:
$$\hat{\bar{\Sigma}}(\theta) = \frac{1}{R}\sum_{r=1}^{R}\left(\sqrt{n}\, Z^r(\theta) - M\right)\left(\sqrt{n}\, Z^r(\theta) - M\right)^{\top}, \tag{5}$$
where $M = \frac{1}{R}\sum_r \sqrt{n}\, Z^r(\theta)$ is the sample mean of the draws, and then (2) multiply the result by $1 + S^{-1}$ to obtain the estimate
$$\hat{\bar{V}}(\theta) = (1 + S^{-1})\, \hat{\bar{\Sigma}}(\theta). \tag{6}$$
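The paper's implementation is in Julia; as a language-agnostic illustration, the simulation-based covariance estimate of Equations (5) and (6) can be sketched in a few lines of standard-library Python. The model and the `simulate_statistic` function below are toy stand-ins, not the paper's code:

```python
import math
import random

def estimate_V(simulate_statistic, theta, n, R=500, S=20):
    """Equations (5)-(6): sample covariance of R draws of sqrt(n)*Z^s(theta),
    scaled by (1 + 1/S)."""
    draws = [[math.sqrt(n) * z for z in simulate_statistic(theta, n)]
             for _ in range(R)]
    p = len(draws[0])
    mean = [sum(d[i] for d in draws) / R for i in range(p)]
    sigma = [[sum((d[i] - mean[i]) * (d[j] - mean[j]) for d in draws) / R
              for j in range(p)] for i in range(p)]
    return [[(1.0 + 1.0 / S) * sigma[i][j] for j in range(p)]
            for i in range(p)]

# toy statistic: (sample mean, sample std. dev.) of N(theta, 1) data
def simulate_statistic(theta, n):
    y = [random.gauss(theta, 1.0) for _ in range(n)]
    m = sum(y) / n
    return [m, math.sqrt(sum((v - m) ** 2 for v in y) / n)]

V = estimate_V(simulate_statistic, theta=0.5, n=200)
```

In the continuously updating (CUE) variant, `estimate_V` would be called afresh at every trial value of $\theta$; in the two-step variant, it is called once at an initial consistent estimate and then held fixed.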
This estimator may be used in a continuously updating fashion, by updating $\hat{\bar{V}}(\theta)$ in Equation (3) or (4) every time the respective function is evaluated. Alternatively, if we obtain an initial consistent estimator of $\theta_0$, then $\hat{\bar{V}}(\theta)$ can be computed at this estimate, and kept fixed in subsequent computations, in the usual two-step manner. Please note that if a fixed covariance estimator is used, then the maximizer of $L$ is the same as the minimizer of $H$.
Extremum estimators may be obtained by maximizing $\log L$, or minimizing $H$. Laplace-type estimators, as defined by Chernozhukov and Hong (2003), may be defined by setting their general criterion function, $L_n(\theta)$, as defined in their Section 3.1, to either $\log L$ or $-\frac{1}{2}H$. Once this is done, the practical methodology is to use Markov chain Monte Carlo (MCMC) methods to draw a chain $C = \{\theta_r\}, r = 1, 2, \ldots, R$, given the sample statistic $\hat{Z}$, where acceptance/rejection is determined using the chosen $L_n(\theta)$, along with a prior, and standard proposal methods2. This specific version of Laplace-type methods is referred to as MSM-MCMC in this paper. The paper relies directly on the theory and methods of Chernozhukov and Hong (2003), as MSM-MCMC falls within the class of methods they study. In what follows, a primary use of the Chernozhukov and Hong (2003) methodology is to obtain confidence intervals. For a function $f(\theta)$, Theorem 3 of Chernozhukov and Hong (2003) proves that a valid confidence interval can be obtained using the quantiles of $\{f(\theta_r)\}, r = 1, 2, \ldots, R$, based on the final chain $C$. For example, a 95% confidence interval for a parameter $\theta_j$ is given by the interval $(Q_{\theta_j}(0.025),\, Q_{\theta_j}(0.975))$, where $Q_{\theta_j}(\tau)$ is the $\tau$th quantile of the $R$ values of the parameter $\theta_j$ in the chain $C$.
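To make the acceptance/rejection mechanics concrete, here is a minimal standard-library Python sketch of MSM-MCMC for a toy scalar model (estimating the mean of a normal distribution, with the sample mean as the statistic), using a fixed, two-step-style weight and confidence intervals taken from chain quantiles. All names, tuning constants, and the toy model are illustrative assumptions, not the paper's code:

```python
import math
import random

random.seed(42)

n, S = 100, 10
theta0 = 1.0
# "real" sample statistic: the mean of n draws from N(theta0, 1)
Z_hat = sum(random.gauss(theta0, 1.0) for _ in range(n)) / n

def sim_stat(theta):
    # statistic computed from one artificial sample drawn at theta
    return sum(random.gauss(theta, 1.0) for _ in range(n)) / n

V = (1.0 + 1.0 / S) * 1.0  # asymptotic variance of sqrt(n)*mean is 1 here

def H(theta):
    # the MSM criterion of Equation (4), scalar case, fixed V
    m_hat = Z_hat - sum(sim_stat(theta) for _ in range(S)) / S
    return n * m_hat ** 2 / V

# Metropolis-Hastings with L_n = -H/2 and a flat prior on [-5, 5]
chain, theta, h = [], 0.0, H(0.0)
for _ in range(1500):
    trial = theta + random.gauss(0.0, 0.2)
    if -5.0 <= trial <= 5.0:
        h_trial = H(trial)
        log_alpha = (h - h_trial) / 2.0
        if log_alpha >= 0 or random.random() < math.exp(log_alpha):
            theta, h = trial, h_trial
    chain.append(theta)

# 95% confidence interval from the quantiles of the chain (burn-in dropped)
draws = sorted(chain[500:])
ci = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])
```

The posterior quantile interval `ci` is the Theorem-3-style confidence interval described above; for the CUE variant, `V` would be re-estimated inside `H` at every trial value.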

3. Neural Moments

The dimension of the statistics used for estimation, $Z$, can be made minimal (equal to the dimension of the parameter to estimate, $\theta$) by filtering an initial set of statistics, say, $W$, through a trained neural net. Details of this process are explained in Creel (2017) and references cited therein, and the process is made explicit in the code which accompanies this paper3. A summary of this process is: Suppose that $W$ is a $p$-vector of statistics $W = W(Y)$, with $p \ge k$, where $k = \dim \theta$. We may generate a large sample of $(W, \theta)$ pairs, as follows:
  • Draw $\theta^s$ from the parameter space $\Theta$, using some prior distribution.
  • Draw a sample $Y^s$ from the model $M(\theta)$ at $\theta^s$.
  • Compute the vector of raw statistics $W(Y^s)$.
We can repeat this process to generate a large data set $\{\theta^s, W^s\}, s = 1, 2, \ldots, S$, which can be used to train a neural network which predicts $\theta$, given $W$. This process can be done without knowledge of the real sample data, and can in fact be done before the real sample data are gathered. The prediction from the net will be of the same dimension as $\theta$, and, according to results collectively known as the universal approximation theorem, will be a very accurate approximation to the posterior mean of $\theta$ conditional on $W$ (Hornik et al. 1989; Lu et al. 2017). The output of the net may be represented as $\hat{\theta} = f(W, \hat{\phi})$, where $f(W, \phi): \mathbb{R}^p \to \mathbb{R}^k$ is the neural net, with parameters $\phi$, that takes as inputs the $p$ statistics $W$, and has $k = \dim \theta$ outputs. The parameters of the net, $\phi$, are adjusted using standard training methods from the neural net literature to obtain the trained parameters, $\hat{\phi}$. Then we can think of $\hat{\theta} = f(W, \hat{\phi})$ as a $k$-dimensional statistic which can be computed essentially instantaneously once provided with $W$. We will use this statistic $\hat{\theta}$ as the $Z$ of the previous section. Because the statistic is an accurate approximation to the posterior mean conditional on $W$ (supposing the net was well trained), it has two virtues: it is informative for $\theta$ (supposing that the initial statistics $W$ contain information on $\theta$), and it has the minimal dimension needed to identify $\theta$. From the related GMM literature, GMM methods are known to lead to inaccurate inference when the dimension of the moments is large relative to the dimension of the parameter vector (Donald et al. 2009). Use of a neural net as described here reduces the dimension of the statistic to the minimum required for identification.
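The three-step data-generation loop above can be sketched as follows. This is a toy Python stand-in for the paper's Julia code: the prior, model, and statistics shown are hypothetical, and the subsequent net training, omitted here, would be done with any deep-learning library:

```python
import random

def draw_prior():
    # hypothetical uniform prior over a 2-parameter space
    return [random.uniform(0.0, 1.0), random.uniform(0.0, 2.0)]

def simulate_model(theta, n=100):
    # stand-in for the model M(theta): a simple AR(1)-type process
    y, x = [], 0.0
    for _ in range(n):
        x = theta[0] * x + theta[1] * random.gauss(0.0, 1.0)
        y.append(x)
    return y

def raw_statistics(y):
    # stand-in for the p-vector W(Y): a few sample moments
    n = len(y)
    m = sum(y) / n
    v = sum((u - m) ** 2 for u in y) / n
    ac = sum(y[i] * y[i - 1] for i in range(1, n)) / (n - 1)
    return [m, v, ac]

# the (theta, W) training pairs; a net predicting theta from W would
# then be fit to this data set with any deep-learning library
training = []
for _ in range(1000):
    theta_s = draw_prior()
    training.append((theta_s, raw_statistics(simulate_model(theta_s))))
```

Once trained, the net's prediction $\hat{\theta} = f(W, \hat{\phi})$ replaces the raw $W$ as the statistic $Z$ in the MSM-MCMC procedure.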
When the statistic $Z$ is the output of a neural net $f(W, \phi)$, where the parameter vector of the net, $\phi$, can have a very high dimension (hundreds or thousands of parameters are not uncommon), the simulated likelihood of Equation (3) will be a wavy function, with many local maxima. This will occur even if the net is trained using regularization methods. Because of this waviness, gradient-based methods will not be effective when attempting to maximize $\log L$ or to minimize $H$ (Equations (3) and (4)), and attempts to compute the covariance matrix of the estimator that rely on derivatives of the log likelihood function will also be unlikely to succeed. However, derivative-free methods can be used to compute extremum estimators, to obtain point estimators or to initialize an MCMC chain, and the simulation-based estimator of the covariance matrix $\bar{\Sigma}(\theta)$ of Equation (1) discussed in the previous section does not depend on derivatives. A major motivation for using Laplace-type estimators in the first place is to overcome problems of local extrema, as Chernozhukov and Hong (2003) emphasize. It is worth noting that the output of the net evaluated at the real sample statistic, $\hat{\theta}$, will also provide an excellent starting value for computing extremum estimators, or for initializing an MCMC chain. Likewise, the covariance estimator of Equation (6) can be used to define a random walk multivariate normal proposal density for MCMC, by drawing the trial value $\theta_{s+1}$ from $N(\theta_s, \hat{\bar{V}})$, where $\theta_s$ is the current value of the chain. Experience with this proposal density, as reported below, is that it is easy to tune, by scaling the covariance by a scalar, to achieve an acceptance rate within the desired limits4.
Creel (2017) used neural moments to compute a Laplace-type estimator, similarly to what is done here. That paper used nonparametric regression quantiles applied to the set of draws from the Laplace-type posterior to compute confidence intervals, and the posterior draws were generated by a procedure similar to sequential Monte Carlo, rather than MCMC. Additionally, the metric used for selection of particles was different from the GMM criterion used here. The use of nonparametric regression quantiles is very costly to study by Monte Carlo. Thus, this paper focuses on straightforward use of the methods that Chernozhukov and Hong (2003) focus on: traditional MCMC using the GMM criterion function, with confidence intervals computed using the direct quantiles from the posterior sample. These simplifications give a simpler and more tractable procedure that can reasonably be studied and verified by Monte Carlo. For theoretical support, we can note that the methods fall within the class of methods studied by Chernozhukov and Hong, with the only innovation being the use of statistics filtered through a previously trained neural net. The neural nets used here consist of a finite series of nonstochastic nonlinear mappings to the $(-1, 1)$ interval, followed by a final linear transformation. As such, the conjecture that the final statistics that are the output of the net follow a uniform law of large numbers and a uniform central limit theorem seems reasonable, but this is not formally verified in this paper.

4. Examples

This section presents example models that are used to investigate the performance of the proposed methods. For all models, the code used (for the Julia language) is available in an archive5, release version 1.2, where the details of each example may be consulted. The example models also serve as templates that may be used to apply the proposed methods to models of the reader's interest: one simply needs to provide functions similar to those found in the directory for each example, for the model of interest. These are, fundamentally, (1) a prior from which to draw the parameters; (2) code to simulate the model given the parameter value; and (3) code to compute the initial statistics, $W$, given the data generated from the model. For the examples, uniform and fairly uninformative priors were used in all cases. The details regarding priors and statistics, $W$, may be consulted in the links provided below.

4.1. Stochastic Volatility

The simple stochastic volatility (SV) model is
$$y_t = \phi \exp(h_t/2)\, \epsilon_t$$
$$h_t = \rho h_{t-1} + \sigma u_t$$
where $\epsilon_t$ and $u_t$ are independent standard normal random variables. We use a sample size of 500 observations, and the true parameter values are $\theta_0 = (\phi_0, \rho_0, \sigma_0) = (0.692, 0.9, 0.363)$. These parameter values have been chosen to facilitate comparison with results of several previous studies that have used the same SV model to check properties of estimators. For estimation, 11 statistics are used to form the initial set, $W$, which include moments of $y_t$ and of $|y_t|$, as well as the estimated parameters of a heterogeneous autoregressive (HAR) auxiliary model (Corsi 2009) fit to $|y_t|$.6
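As an illustration, the SV data-generating process can be simulated in a few lines. This is a standard-library Python sketch, not the paper's Julia code; drawing the initial log volatility from its stationary distribution is an assumption:

```python
import math
import random

def simulate_sv(theta, n):
    # theta = (phi, rho, sigma); the initial h is drawn from the
    # stationary distribution of the log-volatility AR(1) (an assumption)
    phi, rho, sigma = theta
    h = random.gauss(0.0, sigma / math.sqrt(1.0 - rho ** 2))
    y = []
    for _ in range(n):
        h = rho * h + sigma * random.gauss(0.0, 1.0)   # h_t = rho*h_{t-1} + sigma*u_t
        y.append(phi * math.exp(h / 2.0) * random.gauss(0.0, 1.0))  # y_t
    return y

y = simulate_sv((0.692, 0.9, 0.363), 500)
```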

4.2. ARMA

The next example is a simple ARMA(1, 1) model
$$x_t = \alpha x_{t-1} + f_t - \beta f_{t-1}, \qquad f_t \sim IIN(0, \sigma^2),$$
with true values $\theta_0 = (\alpha_0, \beta_0, \sigma_0^2) = (0.95, 0.5, 1.0)$. The sample size is $n = 300$. The 13 statistics used to define the initial set, $W$, include sample moments and correlations, OLS estimates of an AR(1) auxiliary model fit to $x_t$, as well as another AR(1) model fit to the residuals of the first model, plus partial autocorrelations of $x_t$.7

4.3. Mixture of Normals

For the mixture of normals example, the variable $y$ is drawn from the distribution $N(\mu_1, \sigma_1^2)$ with probability $p$ and from $N(\mu_1 - \mu_2,\, \sigma_1^2 + \sigma_2^2)$ with probability $1 - p$. Samples of 1000 observations are drawn. The true parameter values are $\theta_0 = (\mu_1, \sigma_1, \mu_2, \sigma_2, p) = (1.0, 1.0, 0.2, 1.8, 0.4)$, and the prior restricts all parameters to be positive. Thus, the parameterization and the prior together impose that the first component has a larger mean and a lower variance than does the second component, in order to ensure identification. Additionally, the probability that either component is sampled is restricted to be at least 0.05. The 15 auxiliary statistics are the sample mean, standard deviation, skewness, kurtosis, and 11 quantiles of $y$.8
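A draw from this mixture parameterization can be sketched as follows (standard-library Python, illustrative only):

```python
import random

def draw_mixture(theta, n):
    # theta = (mu1, sigma1, mu2, sigma2, p); the second component has
    # mean mu1 - mu2 and variance sigma1^2 + sigma2^2
    mu1, s1, mu2, s2, p = theta
    out = []
    for _ in range(n):
        if random.random() < p:
            out.append(random.gauss(mu1, s1))
        else:
            out.append(random.gauss(mu1 - mu2, (s1 ** 2 + s2 ** 2) ** 0.5))
    return out

y = draw_mixture((1.0, 1.0, 0.2, 1.8, 0.4), 1000)
```

With the true values, the second component has mean $0.8 < 1.0$ and variance $4.24 > 1.0$, consistent with the identification restrictions described above.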

4.4. DSGE Model

The previous models are all simple, quickly simulated, and with relatively few parameters. This section presents a model which is more representative of an actual research problem. The model is a simple dynamic stochastic general equilibrium model with two shocks:
At the beginning of period $t$, the representative household owns a given amount of capital ($k_t$), and chooses consumption ($c_t$), investment ($i_t$) and hours of labor ($n_t$) to maximize expected discounted utility
$$E_t \sum_{s=0}^{\infty} \beta^s \left( \frac{c_{t+s}^{1-\gamma}}{1-\gamma} + (1 - n_{t+s})\, \eta_t\, \psi \right)$$
subject to the budget constraint $c_t + i_t = r_t k_t + w_t n_t$, available time $0 \le n_t \le 1$, and the accumulation of capital $k_{t+1} = i_t + (1-\delta)k_t$, each of which must hold for all $t$. The shock, $\eta_t$, that affects the desirability of leisure relative to consumption, evolves according to $\ln \eta_t = \rho_\eta \ln \eta_{t-1} + \sigma_\eta \epsilon_t$.
The single competitive firm maximizes profits $y_t - w_t n_t - r_t k_t$ from production of the good ($y_t$), taking wages ($w_t$) and the interest rate ($r_t$) as given, using the constant returns to scale technology
$$y_t = k_t^{\alpha}\, n_t^{1-\alpha}\, z_t.$$
The technology shock, $z_t$, also follows an AR(1) process in logarithms: $\ln z_t = \rho_z \ln z_{t-1} + \sigma_z u_t$. The innovations to the preference and technology shocks, $\epsilon_t$ and $u_t$, are independent standard normal random variables. Production ($y_t$) can be allocated by the consumer to consumption or investment: $y_t = c_t + i_t$. The consumer provides capital and labor to the firm, and is paid at the competitive rates $r_t$ and $w_t$, respectively.
From this model, samples of size 160, which simulate 40 years of quarterly data, are drawn, given the 9 parameters $\alpha, \beta, \gamma, \delta, \rho_z, \sigma_z, \rho_\eta, \sigma_\eta$ and $\psi$. The variables available for estimation are $y, c, n, w$, and $r$. It is possible to recover the parameters $\alpha$ and $\delta$ exactly, given the observable variables, so these two parameters are set to fixed values, and the remaining 7 parameters are estimated. To facilitate setting priors, the steady state value of hours ($n$) is estimated instead of $\psi$, which may then be recovered. For estimation, 45 statistics are used, including means and standard deviations of the observable variables, and estimates from auxiliary regressions9.

5. Monte Carlo Results

This section reports results for MSM-MCMC estimation of each of the test models, using the GMM-like criterion function $H$ (Equation (4)) as the $L_n$ of Chernozhukov and Hong (2003). Results using the criterion $L$ (Equation (3)) were qualitatively very similar in all cases where the two versions were computed, and are thus not reported10. In all cases, 500 Monte Carlo replications were done. For all the test models, the number of artificial samples used to train the neural net was 20,000 times the number of parameters of the model. This is actually a fairly small number, given that generating the samples and training the nets is an operation that takes only 10 min or less for the test models, other than the DSGE model. The reason that a larger number of samples was not used is that it was desired to obtain results that may be more relevant for cases where it is more costly to simulate from the model, as is the case of the jump-diffusion model studied below.
First, we report results for the SV and ARMA models, where MSM-MCMC estimators were computed using both the overidentifying statistic, $W$, and the exactly identifying neural moments, $Z$. For the plain overidentifying statistics, the results are computed using the CUE GMM criterion. For the neural moments, both the two-step and the CUE criteria were used. Table 1 reports RMSE. The three versions of the MSM-MCMC estimators lead to similar RMSEs, generally speaking. The version that uses the raw statistics has somewhat higher RMSE than do the versions based on the neural statistics, $Z$, in most cases, but the differences are not important.
Table 2, Table 3 and Table 4 address the main point of the paper, inference, reporting confidence interval coverage, which is the proportion of times that the true parameter lies inside the computed confidence interval. Critical coverage proportions that would lead one to reject correct coverage may be computed from the binomial(500, $p$) distribution, where $p$ is the nominal coverage probability of the respective confidence interval. These critical coverage proportions are 0.864 and 0.932 for 90% intervals, 0.924 and 0.974 for 95% intervals, and 0.976–1.0 for 99% intervals. Looking at the column labeled W (CUE) in these tables, we see that the results on the unreliability of inferences for overidentified GMM estimators, which were reviewed in the Introduction, carry over to Bayesian MCMC methods, at least for the models considered. In all entries but one, the coverage is significantly different from correct coverage, erring on the side of being too low, and, in many cases, considerably so. This implies that the probability of Type-I error is higher than the associated nominal significance level. For the neural net statistics, $Z$, coverage is improved. For the two-step version, coverage is in all cases closer to the correct proportion than when the raw statistics, $W$, are used. In several cases, correct coverage is also statistically rejected, but now the error is on the side of conservative confidence intervals, which contain the true parameters more often than the nominal coverage. In this case, the probability of Type-I error will be less than the nominal significance level associated with the confidence intervals. For the CUE version that uses the neural statistics, $Z$, coverage is very good, and is close to the nominal proportion in all cases. Statistically correct coverage is never rejected when neural moments and the CUE criterion are used.
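Critical coverage proportions of this kind can be reproduced, up to the exact test convention used, with a short exact-binomial computation. This Python sketch uses one common two-sided 1% convention, which is an assumption and may differ slightly from the paper's:

```python
from math import comb

def coverage_bounds(p, n=500, alpha=0.01):
    """Smallest and largest observed coverage proportions out of n Monte
    Carlo replications that would NOT reject correct coverage p, under a
    two-sided exact binomial test at level alpha."""
    cdf, acc = [], 0.0
    for k in range(n + 1):
        acc += comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        cdf.append(acc)
    lo = next(k for k in range(n + 1) if cdf[k] >= alpha / 2)
    hi = next(k for k in range(n + 1) if cdf[k] >= 1.0 - alpha / 2)
    return lo / n, hi / n

bounds_90 = coverage_bounds(0.90)  # close to the 0.864/0.932 quoted above
```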
For the other two test models, MN and DSGE, results were computed only for the neural moments, as the results for the SV and ARMA models, as well as the results from the GMM literature, already indicated that inferences based on the raw statistics, W , were very likely to be unreliable. Table 5 has the RMSE results for the MN and DSGE models. We can see that the use of the two-step or CUE criteria makes little difference for RMSE, in common with the above results for the SV and ARMA models. Table 6, Table 7 and Table 8 hold the confidence interval coverage results for these two models. Again, the intervals based on the two-step criteria often contain the true parameters more often than they should. The coverage of intervals based on the CUE criteria is very good in all cases, and is never statistically significantly different from correct.
In summary, this section has shown that confidence intervals based on raw overidentifying statistics may be unreliable, rejecting the true parameter values more often than they should. Intervals based on exactly identifying neural moments are more reliable, in general. When the two-step version is used, the intervals are often too broad, so the probability of Type-I error is less than what it should be, and power to reject false hypotheses is lower than it could be. When the CUE version is used, coverage is very accurate: correct coverage was never rejected in any of the cases.
It is to be noted that the CUE version is computationally more demanding than is the two-step version, as the weight matrix must be estimated at each MCMC trial vector. Each of these estimations requires a reasonably large number of simulations to be drawn, to estimate the covariance matrix accurately. If a researcher is primarily concerned with limiting the probability of Type-I error, and is willing to accept a loss of power to accelerate computations, then the two-step version might be preferred. If one is willing to accept more costly computations, all the examples considered here indicate that the CUE version will lead to accurate confidence intervals.

6. Application: A Jump-Diffusion Model of S&P 500 Returns

The previous examples are mostly small models that are not costly to simulate, except for the DSGE example. As an example of a more computationally challenging model that may be more representative of actual research problems, this section presents results for estimation of a jump-diffusion model of S&P 500 returns. Solving and simulating11 the model for each MCMC trial parameter acceptance/rejection decision takes about 15 s, when the CUE criterion is used, so training a net and estimation by MCMC is somewhat costly, requiring approximately 2.5 days to complete using a moderately powerful workstation12 and threads-based parallelization, where possible. This example is intended to show that the methods are feasible for moderately complex models.
The jump-diffusion model is
$$dp_t = \mu\, dt + \exp(h_t)\, dW_{1t} + J_t\, dN_t$$
$$dh_t = \kappa(\alpha - h_t)\, dt + \sigma\, dW_{2t}$$
where $p_t$ is 100 times log price, $h_t$ is log volatility, $J_t$ is jump size, and $N_t$ is a Poisson process with jump intensity $\lambda_0$. $W_{1t}$ and $W_{2t}$ are two standard Brownian motions with correlation $\rho$. When a jump occurs, its size is $J_t = a \lambda_1 \exp(h_t)$, where $a$ is 1 with probability 0.5 and $-1$ with probability 0.5. Therefore, jump size depends on the current standard deviation, and jumps are positive or negative with equal probability. Log price, $p_t$, is simulated using 10-minute ticks, and the observed log price adds a $N(0, \tau^2)$ measurement error to $p_t$, when $\tau$ is greater than zero. From this model, 1000 daily observations on returns, realized volatility (RV), and bipower variation (BV) are generated. Because log price has been scaled by 100 in the parameterization of the model, returns, computed as the first difference of $p_t$ at the close of trading days, are directly in percentage terms. Both RV and BV are informative about volatility, and, because BV is somewhat robust to jumps, while RV is not, the difference between the two can help to identify the frequency and size of jumps (Barndorff-Nielsen and Shephard 2002). The model is simulated on a continuous 24-hour basis, and returns are computed using the change in daily log closing price, for trading days only. Overnight periods and weekends are simulated, but returns, RV and BV are recorded only at the close of trading days. In summary, the eight parameters are $\theta = (\mu, \kappa, \alpha, \sigma, \rho, \lambda_0, \lambda_1, \tau)$, and the simulated data consist of 1000 daily observations on returns, RV and BV. The model studied here is quite similar to that studied in (Creel 2017; Creel and Kristensen 2015), except that the drift process is simplified to be constant, and the jump process is modeled somewhat differently, with constant intensity, and with the magnitude of a jump depending on the current instantaneous volatility. These changes were motivated by the results of the previous papers, and by the better tractability of the present specification.
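A minimal Euler-type discretization of the jump-diffusion, omitting the trading-day bookkeeping and the RV/BV computations, might look as follows. This is a standard-library Python sketch, not the paper's Julia code; the tick count, the placeholder values of $\alpha$ and $\sigma$, and all names are illustrative assumptions:

```python
import math
import random

def simulate_jd(theta, days, ticks_per_day=144):
    # theta = (mu, kappa, alpha, sigma, rho, lam0, lam1, tau)
    mu, kappa, alpha, sigma, rho, lam0, lam1, tau = theta
    dt = 1.0 / ticks_per_day  # 10-minute ticks over a 24-hour day
    p, h = 0.0, alpha         # start log volatility at its long-run mean
    closes = []
    for _ in range(days):
        for _ in range(ticks_per_day):
            e1 = random.gauss(0.0, 1.0)
            e2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
            jump = 0.0
            if random.random() < lam0 * dt:  # Poisson arrival over dt
                a = 1.0 if random.random() < 0.5 else -1.0
                jump = a * lam1 * math.exp(h)  # size scales with current sd
            p += mu * dt + math.exp(h) * math.sqrt(dt) * e1 + jump
            h += kappa * (alpha - h) * dt + sigma * math.sqrt(dt) * e2
        closes.append(p + random.gauss(0.0, tau))  # measurement error
    return [closes[t] - closes[t - 1] for t in range(1, len(closes))]

# illustrative parameter values, roughly in line with the reported posteriors
returns = simulate_jd((0.04, 0.125, 0.0, 0.2, -0.85, 0.014, 4.0, 0.0055),
                      days=100)
```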
The raw statistics, W , which are used to train the net and to do estimation, are a combination of coefficients from auxiliary regressions between the three observed variables, summary statistics, and functions of quantiles of the variables. The details of the 25 statistics are found in the file JDlib.jl (this same file also gives details of the priors, which are uniform over fairly broad supports, for all parameters). The neural net was fit using 160,000 draws from the prior to generate the training and testing data. The importance of each of the statistics can be assessed by examining the maximal absolute weights on each of the raw statistics in the first layer of the neural net, as is discussed in Creel (2017). These may be seen in Figure 1. We see that most of the 25 statistics have a non-negligible importance, which means that these statistics are contributing information to the fit of the neural net. The output of the net is of dimension 8, the same as the dimension of the parameters of the model. The net is combining the information of the overidentifying statistics to construct a just-identifying vector of statistics, which is then used to define the moments for MSM-MCMC estimation.
The model was fit, using MSM-MCMC and the CUE criteria, to S&P 500 data13 from 16 December 2013 to 4 December 2017, which is an interval of 1000 trading days, the same as was used to train the neural net. The data may be seen in Figure 2, where we observe typical volatility clusters and some jumps. For example, the Brexit drop of June 2016 is clearly seen, and the more extreme spike in RV versus BV at this point illustrates the fact that jumps can be identified by comparing the two. Over the sample period, this price index climbed from about 1800 to 2600, which is approximately a 44% increase, or approximately 0.04% per trading day.
Ten MCMC chains of length 1000 were drawn independently using threads-based parallelization, for a final chain of length 10,000. The estimation results are in Figure 3, which shows nonparametric plots of the marginal posterior density for each parameter, along with posterior means and medians, and 90% confidence intervals defined by the limits of the green areas. All posteriors are considerably more concentrated than are the priors. Drift ($\mu$) is concentrated around a value slightly below 0.04, which is consistent with the average daily returns over the sample period. There is quite a bit of persistence in volatility, as mean reversion, $\kappa$, is estimated to be quite low, concentrated around 0.125. Leverage ($\rho$) is quite strong, concentrated around $-0.85$. The jump probability per day ($\lambda_0$) is concentrated around 0.014, and is significantly different from zero. Therefore, jumps are a statistically important feature of the model. When a jump does occur, its magnitude ($\lambda_1$) is approximately 4 times the current instantaneous standard deviation, but this parameter is not very well identified, as the posterior is quite dispersed. An interesting result is that $\tau$, the standard deviation of the measurement error of log price, is concentrated around 0.0055. It is not significantly different from zero at the 0.10 significance level, but it is close to being so. From the Figure, one can see that $H_0: \tau \le 0$ would be very close to being rejected at the 10% significance level. The parameterization of the model is such that there is no measurement error when $\tau \le 0$. Thus, it appears that it is a safer option to allow for measurement error in the model, as the evidence suggests that it is very likely present, and its omission could bias the estimates of the other parameters.

7. Conclusions

This paper has shown, through Monte Carlo experimentation, that confidence intervals based upon quantiles of a tuned MCMC chain may have coverage which is far from the nominal level, even for simple models with few parameters. The results on poor reliability of inferences when using overidentified GMM estimators, which were referenced in the Introduction, carry over to a Bayesian version of overidentified MSM, implemented using the Chernozhukov and Hong (2003) methodology, for the models considered in this paper. The paper proposes to use neural networks to reduce the dimension of an initial set of moments to the minimum number needed to maintain identification, following Creel (2017). When estimation and inference using the MSM-MCMC methodology is based on such neural moments, which are exactly identifying, and the CUE version of MSM is used, confidence intervals have statistically correct coverage in all cases studied by Monte Carlo. Thus, there seems to be no generic problem with the MSM-MCMC methodology for the purpose of inference. A potential problem has to do with the choice of moments upon which MSM is based: too much overidentification results in poor inferences, for the models studied. The use of neural moments solves this problem by reducing the number of moments without losing the information that they contain. The fact that RMSE does not rise when one moves from raw to neural moments illustrates that the neural moments retain the information contained in the larger set. The methods have been illustrated empirically by the estimation of a jump-diffusion model for S&P 500 data. An interesting result of the empirical work is that measurement error in log prices is likely to be present.
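The CUE version of the MSM criterion, which delivered correct coverage in the experiments, re-estimates the weight matrix at every parameter value rather than fixing it in a first step. A minimal sketch, where `simulate_stats` and the toy statistic are hypothetical stand-ins rather than the paper's actual code:

```python
import numpy as np

def cue_objective(theta, z_data, simulate_stats, S=100, seed=0):
    """Continuously-updated MSM criterion (illustrative sketch).

    m(theta) = z_data - mean over s of z_s(theta); the weight matrix
    is the inverse covariance of the simulated statistics, re-computed
    at every theta (the 'continuous updating' of CUE).
    """
    rng = np.random.default_rng(seed)
    zs = np.array([simulate_stats(theta, rng) for _ in range(S)])
    m = z_data - zs.mean(axis=0)
    # small ridge keeps the covariance invertible in short simulations
    sigma = np.cov(zs, rowvar=False) + 1e-8 * np.eye(zs.shape[1])
    return float(m @ np.linalg.solve(sigma, m))

# Toy check: statistic = (sample mean, sample variance) of N(theta, 1) data
def sim(theta, rng, n=200):
    x = rng.normal(theta[0], 1.0, n)
    return np.array([x.mean(), x.var()])

z_obs = sim(np.array([0.5]), np.random.default_rng(42))
val_true = cue_objective(np.array([0.5]), z_obs, sim)
val_far = cue_objective(np.array([3.0]), z_obs, sim)
print(val_true < val_far)   # criterion is smaller near the true value
```

In the MSM-MCMC setting, exp(-0.5 times this criterion) plays the role of the quasi-likelihood in the Metropolis-Hastings acceptance ratio.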
It is to be noted that the step of filtering moments through a neural net is very easy and quick to perform using modern deep learning software environments. The software archive that accompanies this paper provides a function for automatic training, requiring no human intervention. It only requires functions that provide simulated moments computed using data drawn from the model at parameter values drawn from the prior. Filtering moments through a neural net gives an informative, minimal-dimension statistic as the output. This provides a convenient and automatic alternative to moment selection procedures: uninformative moments are essentially removed, and correlated moments are combined.
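The filtering step can be sketched compactly. The paper's actual training code is the Julia function MakeNeuralMoments.jl; the following is a minimal NumPy sketch of the idea under a toy data-generating process. The net is trained on pairs (W, θ), with θ drawn from the prior and W the raw statistic from data simulated at θ; the fitted output Z = f(W) then has the same dimension as θ, giving exactly identifying neural moments:

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 2, 10, 5000                # params, raw moments, training draws
theta = rng.uniform(-1, 1, (n, p))   # draws from a (toy) prior
A = rng.normal(size=(p, k))
W = theta @ A + 0.1 * rng.normal(size=(n, k))   # toy "raw moments"

# One-hidden-layer regression net, trained by full-batch gradient descent
h = 16
W1 = rng.normal(0, 0.1, (k, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, p)); b2 = np.zeros(p)
lr = 0.1
for _ in range(2000):
    H = np.tanh(W @ W1 + b1)
    Z = H @ W2 + b2
    G = (Z - theta) / n              # gradient of 0.5 * mean squared error
    gW2 = H.T @ G; gb2 = G.sum(axis=0)
    GH = (G @ W2.T) * (1 - H ** 2)   # backprop through tanh
    gW1 = W.T @ GH; gb1 = GH.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

def neural_moments(w):
    """Filter raw moments through the trained net: Z = f(W)."""
    return np.tanh(w @ W1 + b1) @ W2 + b2

mse = np.mean((neural_moments(W) - theta) ** 2)
print(mse)   # typically well below the prior variance of theta (1/3)
```

The output of `neural_moments` replaces the raw statistic in the MSM criterion; no moment is selected by hand, which is what makes the procedure automatic.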
This paper has examined how inference using the MSM-MCMC estimator may be improved when neural moments are used instead of a vector of overidentifying moments. It seems likely that other inference methods which are used with simulation-based estimators, such as Hamiltonian Monte Carlo and sequential Monte Carlo, among others, may be made more reliable if neural moments are used, as dimension reduction while maintaining relevant information is likely to be generally beneficial.


Funding

This research was funded by the Government of Spain/FEDER, grant number PGC2018-094364-B-I00, and the Government of Catalonia, Agència de Gestió d'Ajuts Universitaris i de Recerca, grant number 2017-SGR-1765.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here:

Conflicts of Interest

The author declares no conflict of interest.


Notes

These definitions and notation are loosely based on Jiang and Turnbull (2004).
It may be noted that methods other than MCMC may be used to generate the set of draws from the posterior, C. For example, one might use sequential Monte Carlo. Point estimation and inference using C remains the same regardless of how C is generated.
The function which specifies and trains the neural net is MakeNeuralMoments.jl.
See the file MCMC.jl for the details of how this proposal density is implemented.
See the file SVlib.jl for details.
Details are in the file ARMAlib.jl.
Details are in the file MNlib.jl.
The details of the model and priors may be seen at CKlib.jl. The model is solved using third order projection, making use of the SolveDSGE.jl package. The model is discussed in more detail in Chapter 14 of the document econometrics.pdf.
These results are available for the SV and ARMA models, as well as an unreported additional model, in the WP branch of the GitHub archive.
The model is solved and simulated using the SRIW1 strong order 1.5 solver from the DifferentialEquations.jl package for the Julia language.
The workstation has 4 Opteron 6380 processors, each with 4 physical cores, running at 2500 MHz.
The data source is the Oxford–Man Institute’s realized library, v. 0.3,


References

1. Barndorff-Nielsen, Ole E., and Neil Shephard. 2002. Econometric analysis of realized volatility and its use in estimating stochastic volatility models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64: 253–80.
2. Chernozhukov, Victor, and Han Hong. 2003. An MCMC approach to classical estimation. Journal of Econometrics 115: 293–346.
3. Christiano, Lawrence J., Martin S. Eichenbaum, and Mathias Trabandt. 2016. Unemployment and business cycles. Econometrica 84: 1523–69.
4. Christiano, Lawrence J., Mathias Trabandt, and Karl Walentin. 2010. DSGE models for monetary policy analysis. In Handbook of Monetary Economics. Amsterdam: Elsevier, vol. 3, pp. 285–367.
5. Corsi, Fulvio. 2009. A simple approximate long-memory model of realized volatility. Journal of Financial Econometrics 7: 174–96.
6. Creel, Michael. 2017. Neural nets for indirect inference. Econometrics and Statistics 2: 36–49.
7. Creel, Michael, and Dennis Kristensen. 2015. ABC of SV: Limited information likelihood inference in stochastic volatility jump-diffusion models. Journal of Empirical Finance 31: 85–108.
8. Donald, Stephen G., Guido W. Imbens, and Whitney K. Newey. 2009. Choosing instrumental variables in conditional moment restriction models. Journal of Econometrics 152: 28–36.
9. Gallant, A. Ronald, and George Tauchen. 1996. Which moments to match? Econometric Theory 12: 363–90.
10. Gallant, A. Ronald, and George Tauchen. 2010. EMM: A Program for Efficient Method of Moments Estimation, Version 2.6, User's Guide. Chapel Hill: University of North Carolina.
11. Gourieroux, Christian, Alain Monfort, and Eric Renault. 1993. Indirect inference. Journal of Applied Econometrics 8: S85–S118.
12. Hall, Peter, and Joel L. Horowitz. 1996. Bootstrap critical values for tests based on generalized-method-of-moments estimators. Econometrica 64: 891–916.
13. Hansen, Lars Peter, John Heaton, and Amir Yaron. 1996. Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics 14: 262–80.
14. Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural Networks 2: 359–66.
15. Jiang, Wenxin, and Bruce Turnbull. 2004. The indirect method: Inference based on intermediate statistics, a synthesis and examples. Statistical Science 19: 239–63.
16. Lu, Zhou, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. 2017. The expressive power of neural networks: A view from the width. Paper presented at the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, December 4–9; pp. 6232–40.
17. Marjoram, Paul, John Molitor, Vincent Plagnol, and Simon Tavaré. 2003. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences 100: 15324–28.
18. McFadden, Daniel. 1989. A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica 57: 995–1026.
19. Smith, Anthony A. 1993. Estimating nonlinear time-series models using simulated vector autoregressions. Journal of Applied Econometrics 8: S63–S84.
20. Tauchen, George. 1986. Statistical properties of generalized method-of-moments estimators of structural parameters obtained from financial market data. Journal of Business & Economic Statistics 4: 397.
Figure 1. Importance of statistics, jump-diffusion model.
Figure 2. Plot of returns, RV and BV, S&P 500, 16 December 2013–5 December 2017.
Figure 3. MCMC results for the jump-diffusion model of S&P 500 data. Posterior mean in blue, posterior median in black. The green-yellow borders define the limits of a 90% confidence interval.
Table 1. RMSE for SV and ARMA models, using raw (W) or neural net (Z) statistics.

Model   Parameter   True Value   W (CUE)   Z (Two-Step)   Z (CUE)
SV      ϕ           0.692        0.123     0.064          0.076
SV      ρ           0.90         0.086     0.082          0.086
SV      σ           0.363        0.138     0.105          0.105
ARMA    α           0.95         0.030     0.028          0.047
ARMA    β           0.5          0.078     0.067          0.068
ARMA    σ²          1.0          0.099     0.091          0.084
Table 2. 90% confidence interval coverage for SV and ARMA models, using raw (W) or neural net (Z) statistics. Correct coverage rejected when outside 0.864–0.932.

Model   Parameter   W (CUE)   Z (Two-Step)   Z (CUE)
SV      ϕ           0.876     0.884          0.912
SV      ρ           0.732     0.976          0.910
SV      σ           0.762     0.956          0.928
ARMA    α           0.786     0.988          0.916
ARMA    β           0.814     0.954          0.918
ARMA    σ²          0.808     0.920          0.912
Table 3. 95% confidence interval coverage for SV and ARMA models, using raw (W) or neural net (Z) statistics. Correct coverage rejected when outside 0.924–0.974.

Model   Parameter   W (CUE)   Z (Two-Step)   Z (CUE)
SV      ϕ           0.916     0.938          0.954
SV      ρ           0.796     0.990          0.944
SV      σ           0.824     0.976          0.958
ARMA    α           0.838     0.994          0.966
ARMA    β           0.856     0.984          0.942
ARMA    σ²          0.880     0.966          0.954
Table 4. 99% confidence interval coverage for SV and ARMA models, using raw (W) or neural net (Z) statistics. Correct coverage rejected when outside 0.976–1.000.

Model   Parameter   W (CUE)   Z (Two-Step)   Z (CUE)
SV      ϕ           0.936     0.968          0.990
SV      ρ           0.848     0.998          0.978
SV      σ           0.888     0.994          0.986
ARMA    α           0.898     1.000          0.988
ARMA    β           0.916     0.998          0.986
ARMA    σ²          0.920     0.994          0.990
Table 5. RMSE for MN and DSGE models.

Model   Parameter   True Value   Z (Two-Step)   Z (CUE)
MN      μ₁          1.0          0.019          0.018
MN      σ₁          0.2          0.087          0.089
MN      μ₂          0.0          0.021          0.020
MN      σ₂          2.0          0.064          0.065
DSGE    β           0.99         0.001          0.000
DSGE    γ           2.00         0.083          0.085
DSGE    ρz          0.9          0.009          0.008
DSGE    σz          0.02         0.001          0.001
DSGE    ρη          0.7          0.050          0.055
DSGE    ση          0.01         0.001          0.001
DSGE    n̄ss         1/3          0.001          0.001
Table 6. 90% confidence interval coverage for MN and DSGE models. Correct coverage rejected when outside 0.864–0.932.

Model   Parameter   Z (Two-Step)   Z (CUE)
MN      μ₁          0.920          0.914
MN      σ₁          0.934          0.922
MN      μ₂          0.906          0.918
MN      σ₂          0.934          0.920
DSGE    β           0.950          0.914
DSGE    γ           0.968          0.920
DSGE    ρz          0.928          0.928
DSGE    σz          0.910          0.892
DSGE    ρη          0.892          0.890
DSGE    ση          0.972          0.906
DSGE    n̄ss         0.924          0.902
Table 7. 95% confidence interval coverage for MN and DSGE models. Correct coverage rejected when outside 0.924–0.974.

Model   Parameter   Z (Two-Step)   Z (CUE)
MN      μ₁          0.956          0.962
MN      σ₁          0.976          0.962
MN      μ₂          0.944          0.952
MN      σ₂          0.964          0.958
DSGE    β           0.972          0.962
DSGE    γ           0.990          0.962
DSGE    ρz          0.960          0.958
DSGE    σz          0.952          0.946
DSGE    ρη          0.950          0.938
DSGE    ση          0.996          0.952
DSGE    n̄ss         0.966          0.956
Table 8. 99% confidence interval coverage for MN and DSGE models. Correct coverage rejected when outside 0.976–1.000.

Model   Parameter   Z (Two-Step)   Z (CUE)
MN      μ₁          0.990          0.990
MN      σ₁          0.996          0.992
MN      μ₂          0.986          0.984
MN      σ₂          0.992          0.994
DSGE    β           0.996          0.990
DSGE    γ           1.000          0.986
DSGE    ρz          0.976          0.980
DSGE    σz          0.986          0.990
DSGE    ρη          0.990          0.982
DSGE    ση          1.000          0.992
DSGE    n̄ss         0.988          0.988
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Creel, M. Inference Using Simulated Neural Moments. Econometrics 2021, 9, 35.