# Learning Functions and Approximate Bayesian Computation Design: ABCD


## Abstract


## 1. Introduction

We denote by f_{X,θ}(x, θ) = f(x|θ)π(θ) the joint density of X and θ and use f_X(x) for the marginal density of X. The nature of expectations will be clear from the notation. To make the development straightforward, we shall look at the case of distributions with densities (with respect to Lebesgue measure) or, occasionally, discrete distributions with finite support. All necessary conditions for conditional densities, integration and differentiation will be implicitly assumed.

## 2. Information-Based Learning

Let U be a random quantity with density π_U(u). Let g(·) be a function g : R⁺ → R and define a measure of information of the Shannon type for U with respect to g as:

$$I_g(U) = \int \pi_U(u)\, g(\pi_U(u))\, du.$$

We compare the expected posterior information, I_g(θ; X) = E_X[∫ π(θ|x) g(π(θ|x)) dθ], with the prior information:

$$I_g(\theta) = \int \pi(\theta)\, g(\pi(\theta))\, d\theta.$$

#### Theorem 1

The expected posterior information, I_g(θ; X), and the prior value, I_g(θ), satisfy:

$$I_g(\theta; X) \ge I_g(\theta)$$

for all f_{X,θ}(x, θ) if and only if h(u) = ug(u) is convex on R⁺.

#### Theorem 2

A functional φ satisfies E_X[φ(π(θ|X))] ≥ φ(π(θ)) for all f_{X,θ}(x, θ) if and only if φ is convex as a functional:

$$\phi((1-\alpha)\pi_1 + \alpha\pi_2) \le (1-\alpha)\phi(\pi_1) + \alpha\phi(\pi_2), \quad 0 \le \alpha \le 1,$$

for all π_1, π_2.

#### Proof

The "if" part follows from Jensen's inequality, since the prior is the mean of the posterior: π(θ) = E_X[π(θ|X)]. For the "only if" part, we use a special construction: given {π_1, π_2, α}, take X binary with Pr(X = 1) = α and build a joint distribution f_{X,θ} from {π_1, π_2, α}, such that π(θ|X = 0) = π_1(θ) and π(θ|X = 1) = π_2(θ); the prior is then the mixture (1 − α)π_1 + απ_2.

#### Proof

Write π_α(θ) = (1 − α)π_1(θ) + απ_2(θ). If h(u) = ug(u) is convex as a function of its argument u, then I_g(θ) = ∫ h(π(θ)) dθ is convex in π, since h(π_α(θ)) ≤ (1 − α)h(π_1(θ)) + αh(π_2(θ)) pointwise. Conversely, if I_g is convex for all π, then h is convex. For this, again, we need a special construction. We carry this out in one dimension, the extension to more than one dimension being straightforward. For ease of exposition, we also make the necessary differentiability assumptions. The second directional derivative of I_g(θ) in the space of distributions (which is convex) at π_1 towards π_2 is:

$$\frac{d^2}{d\alpha^2} I_g(\pi_\alpha)\Big|_{\alpha=0} = \int h''(\pi_1(\theta))\,(\pi_2(\theta) - \pi_1(\theta))^2\, d\theta.$$

Let π_1 represent a uniform distribution on [0, ${\scriptstyle \frac{1}{z}}$], for some z ≥ 0, and let π_2 be a distribution with support contained in [0, ${\scriptstyle \frac{1}{z}}$]. Then, the above becomes:

$$h''(z) \int_0^{1/z} (\pi_2(\theta) - z)^2\, d\theta.$$

Choosing a non-convex h, i.e., one with h''(z) < 0 at some z, together with a π_2, which makes the integral on the right-hand side positive, shows that I_g(θ) is not convex at z. This completes the proof.

Note that the convexity of h(u) = ug(u) on R⁺ is equivalent to $g({\scriptstyle \frac{1}{u}})$ being convex, which is referred to as g(u) being “reciprocally convex” by Goldman and Shaked [14]; see also Fallis and Liddell [15].
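As a numerical illustration (our own sketch, not from the paper), the inequality of Theorem 1 can be checked for a randomly generated discrete joint distribution with the Shannon choice g(u) = log u, for which h(u) = u log u is convex:

```python
import numpy as np

# Illustrative sketch (not from the paper): check Theorem 1 numerically for a
# random discrete joint distribution f(x, theta). With g(u) = log(u), the
# function h(u) = u*g(u) is convex, so the expected posterior information
# should be at least the prior information.
rng = np.random.default_rng(0)
f = rng.random((4, 5))           # unnormalized joint table over (x, theta)
f /= f.sum()

g = np.log                       # Shannon case: I_g is negative entropy
prior = f.sum(axis=0)            # marginal pi(theta)
I_prior = np.sum(prior * g(prior))

fx = f.sum(axis=1)               # marginal f_X(x)
post = f / fx[:, None]           # rows are pi(theta | x)
I_post = np.sum(fx * np.sum(post * g(post), axis=1))   # E_X I_g(theta; X)

print(I_prior <= I_post)         # True, as Theorem 1 predicts
```

Here the inequality holds for every joint table, by Jensen's inequality, since the prior is the average of the posteriors.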

## 3. Distance-Based Information Functions

Let Z_1, Z_2 be independent copies from π(z), and let d(z_1, z_2) be a distance or metric. Define d-information as:

$$I_d(\pi) = -E_{Z_1,Z_2}[d(Z_1, Z_2)].$$

Taking π_α(z) = (1 − α)π_1(z) + απ_2(z), convexity of the d-information as a functional of π reduces to the condition:

$$\int \int d(z_1, z_2)\,(\pi_1(z_1) - \pi_2(z_1))(\pi_1(z_2) - \pi_2(z_2))\, dz_1\, dz_2 \le 0. \qquad (5)$$

Since ∫(π_1(z_1) − π_2(z_1)) dz_1 = 0, (5) is a generalized version of the following condition:

A symmetric matrix A = {−d_ij}, with d_ij = d(z_i, z_j), satisfying ∑_{i,j} a_ij c_i c_j ≥ 0 for all vectors c with ∑_i c_i = 0, is called almost positive and is the necessary and sufficient condition for an abstract set of points P_1, . . . , P_k, with interpoint distances {d_ij}, to be embedded in Euclidean space.

#### Theorem 3

If d_ij = d_ji, 1 ≤ i < j ≤ n, are ${\scriptstyle \frac{1}{2}}n(n-1)$ positive quantities, then a necessary and sufficient condition that the d_ij are the interpoint distances between points P_i, i = 1, . . . , n, in Rⁿ is that the distance matrix D = −{d_ij} is an almost positive matrix.

#### Theorem 4

The d-information I_d(π) is a learning function for distributions in L_2 if and only if A(x, y) = −d(x, y) is an almost positive kernel.

We may seek transformations of the metric, of the form B(d(x, y)²), such that, when d(x, y) is a Euclidean or Hilbert space metric, the space with the new metric can still be embedded into the Hilbert space. Schoenberg [10] gives the following major result that such B(·) comprise the Bernstein functions, defined as follows (see Theorem 12.14 in [11]):

#### Definition 1

A function f : (0, ∞) → R is a Bernstein function if f is of class C^∞, f(λ) ≥ 0 for all λ > 0 and the derivatives f^{(n)} satisfy (−1)^{n−1} f^{(n)}(λ) ≥ 0 for all positive integers n and all λ > 0.

#### Theorem 5

- (1) B(||x − y||²) (x, y ∈ H) is the square of a distance function, which isometrically embeds into Hilbert space H, i.e., there exists a φ : H ↦ H, such that: $$B({\Vert x-y\Vert}^{2})={\Vert \phi (x)-\phi (y)\Vert}^{2}.$$
- (2) B is a Bernstein function.
- (3) e^{−B(t)} is the Laplace transform of an infinitely divisible distribution, i.e., $$B(t)=-\text{log\hspace{0.17em}}{\int}_{0}^{\infty}\frac{{e}^{-tu}}{u}d\gamma (u),$$ where γ is an infinitely divisible distribution.
- (4) B has the Lévy-Khintchine representation: $$B(t)={B}_{\mu ,b}(t)=bt+{\int}_{0}^{\infty}(1-{e}^{-tu})d\mu (u)$$ for some b ≥ 0 and a measure μ, such that ${\int}_{0}^{\infty}(1\wedge t)d\mu (t)<\infty $, with the condition that B_{μ,b}(t) > 0 for t > 0.
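As an illustrative check of the equivalence of (1) and (2) (our sketch; the choice B(t) = √t and the double-centering test are ours), applying a Bernstein function to a matrix of squared Euclidean distances must again yield squared distances of a Hilbert-space-embeddable configuration:

```python
import numpy as np

# Illustrative sketch (our choice of B and test): by (1) <=> (2), applying the
# Bernstein function B(t) = sqrt(t) to squared Euclidean distances must again
# give squared distances of an embeddable configuration. We test embeddability
# by double centering: the implied Gram matrix must be positive semidefinite.
rng = np.random.default_rng(5)
X = rng.standard_normal((6, 3))                        # 6 points in R^3
D2 = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)   # squared distances

n = len(X)
J = np.eye(n) - np.ones((n, n)) / n                    # centering matrix
G = -0.5 * J @ np.sqrt(D2) @ J                         # Gram matrix for B(D2)
print(np.linalg.eigvalsh(G).min() >= -1e-9)            # True: still embeddable
```

The double-centering step is the classical multidimensional-scaling construction used in Gower [17].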

#### Theorem 6

If d(z_1, z_2) is a Euclidean distance, then φ(π) = −E_{Z_1,Z_2}(B(d(Z_1, Z_2)²)) is a learning function.

For example, take two independent copies Z_1, Z_2 of Z and use the Euclidean distance; since E||Z_1 − Z_2||² = 2 tr(Γ), we have that minus the trace of the covariance matrix of Z, Γ, is a learning function. More generally, Bernstein functions, applied to d(z_1, z_2)², give a learning function of the form φ(π) = −E_{Z_1,Z_2}(B(d(Z_1, Z_2)²)).
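A quick Monte Carlo check (a sketch with an arbitrary covariance matrix, not from the paper) of the identity E||Z_1 − Z_2||² = 2 tr(Γ), which links the squared-Euclidean d-information to the trace criterion:

```python
import numpy as np

# Quick Monte Carlo check (a sketch with an arbitrary covariance matrix) of
# the identity E||Z1 - Z2||^2 = 2*tr(Gamma) for independent copies Z1, Z2.
rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
cov = A @ A.T                                          # some covariance matrix
Z1 = rng.multivariate_normal(np.zeros(3), cov, size=200_000)
Z2 = rng.multivariate_normal(np.zeros(3), cov, size=200_000)

mc = np.mean(np.sum((Z1 - Z2) ** 2, axis=1))           # estimate of E||Z1-Z2||^2
print(mc, 2 * np.trace(cov))                           # the two agree closely
```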

## 4. Counterexamples

Consider (X, θ) taking values in R² with joint distribution having support on [0, 1]². Let π(θ) be the prior distribution and define a sampling distribution f(x|θ) as in (9). The marginal density is then f_X(x) = 1 on [0, 1], since the integral of (9) is unity, so that (9) is also the posterior distribution π(θ|x). Note that, in order for (9) to be a proper density, we require that $\pi (\theta )\ge {\scriptstyle \frac{1}{2}}$ for 0 ≤ θ ≤ 1.

At $x={\scriptstyle \frac{1}{2}}$, the values of I_1 and I_0 are equal: I_0 = I_1. When x = 1, the integrand of I_1 is zero, as expected. Thus, for a non-uniform prior, we have less posterior information in a neighborhood of x = 1, as we aimed to achieve.

The posterior information I_1 decreases from a maximum of $\text{log}(2)-{\scriptstyle \frac{1}{2}}$ at x = 0, through the value I_0 at $x={\scriptstyle \frac{1}{2}}$, to the value zero at x = 1; see also Figure 1. Thus, I_0 > I_1 for ${\scriptstyle \frac{1}{2}}<x\le 1$. Since the marginal distribution of X is uniform on [0, 1], we have the challenging fact that the posterior information is smaller than the prior information with probability ${\scriptstyle \frac{1}{2}}$.

#### 4.1. Surprise and Ignorance

#### 4.2. Minimal Information Prior Distributions

For example, the minimal Shannon information prior is the uniform distribution on the simplex ∑θ_i = 1, but the minimal information prior for the trace of the covariance matrix criterion puts mass ${\scriptstyle \frac{1}{k}}$ at each corner of the simplex ∑θ_i = 1.

## 5. The Role of Majorization

For two distributions π_1(θ) and π_2(θ), the second is more peaked than the first if and only if:

$$\int h(\pi_1(\theta))\, d\theta \le \int h(\pi_2(\theta))\, d\theta$$

for all continuous convex functions h. One would hope that this holds when π_1 is the prior distribution and π_2 is the posterior distribution. We have seen from the counterexamples that it does not hold in general, but, loosely speaking, always holds in expectation, by Theorem 1. However, it is natural to try to understand the partial ordering, and we shall now indicate that the ordering is equivalent to a well-known majorization ordering for distributions.

In the discrete case, π_2 is said to majorize π_1, written π_1 ≼ π_2, when either of the following equivalent conditions holds:

- A1. There is a doubly stochastic matrix P_{n×n}, such that π_1 = Pπ_2;
- A2. ${\sum}_{i}^{n}h({\pi}_{i}^{(1)})\le {\sum}_{i}^{n}h({\pi}_{i}^{(2)})$ for all continuous convex functions h(x).

In the continuous case, consider densities with ∫π_1(θ)dθ = ∫π_2(θ)dθ = 1. The natural analogue of the ordered values in the discrete case is that every density π has a unique density π̃, called a “decreasing rearrangement”, obtained by a reordering of the probability mass to be non-increasing, by direct analogy with the discrete case above. In the theory, π and π̃ are then referred to as being equimeasurable, in the sense that the supports are transformed in a measure-preserving way.

#### Definition 2

#### Definition 3

π_2 majorizes π_1, written π_1 ≼ π_2, if and only if, for the decreasing rearrangements, one of the following equivalent conditions holds:

- B1. π_1(θ) = ∫_{Θ} P(θ, z)π_2(z)dz for some non-negative doubly stochastic kernel P(x, y).
- B2. ∫_{Θ} h(π_1(z))dz ≤ ∫_{Θ} h(π_2(z))dz for all continuous convex functions h.
- B3. ∫_{Θ} (π_1(z) − c)_+ dz ≤ ∫_{Θ} (π_2(z) − c)_+ dz for all c > 0.

When one (and hence each) of B1–B3 holds, we say π_1(θ) ≼ π_2(θ). We also see that ≼ is equivalent to standard first-order stochastic dominance of the decreasing rearrangements, since $\tilde{F}(\theta )={\int}_{0}^{\theta}\tilde{\pi}(z)dz$ is the cdf corresponding to π̃(θ). Condition B3 says that the probability mass under the density above a “slice” at height c is more for π_2 than for π_1.
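In the discrete case, the ordering can be tested directly via the classical partial-sum criterion of Hardy, Littlewood and Pólya [23], which sorts each probability vector into decreasing order (the discrete analogue of the decreasing rearrangement) and compares cumulative sums. A sketch (the helper name `majorizes` is ours):

```python
import numpy as np

# Sketch: test discrete majorization pi1 <= pi2 via the Hardy-Littlewood-Polya
# criterion: equal totals and dominated partial sums of the decreasingly
# rearranged values.
def majorizes(pi2, pi1, tol=1e-12):
    """True if pi2 majorizes pi1, i.e., pi1 <= pi2 in the ordering above."""
    a = np.sort(pi1)[::-1].cumsum()    # partial sums of decreasing rearrangement
    b = np.sort(pi2)[::-1].cumsum()
    return abs(a[-1] - b[-1]) < tol and bool(np.all(b >= a - tol))

uniform = np.full(4, 0.25)             # the least peaked distribution
peaked = np.array([0.7, 0.2, 0.07, 0.03])
print(majorizes(peaked, uniform))      # True: every pmf majorizes the uniform
print(majorizes(uniform, peaked))      # False
```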

#### Proposition 1

## 6. Learning Based on Covariance Functions

#### Definition 4

#### Definition 5

#### Theorem 7

#### Proof

Let π_α = (1 − α)π_1 + απ_2. Then, with obvious notation,

$$\Gamma(\pi_{\alpha}) = (1-\alpha)\Gamma(\pi_{1}) + \alpha\Gamma(\pi_{2}) + \alpha(1-\alpha)(\mu_{1}-\mu_{2})(\mu_{1}-\mu_{2})^{T},$$

where (μ_1 − μ_2)(μ_1 − μ_2)^T is non-negative definite. Then, since φ is Loewner increasing and concave, φ(Γ(π_α)) ≥ φ((1 − α)Γ(π_1) + αΓ(π_2)) ≥ (1 − α)φ(Γ(π_1)) + αφ(Γ(π_2)), and by Theorem 2, −φ is a learning function.

Consider the rank-one matrix zz^T, for some vector z, and take two distributions with equal covariance functions, but with means satisfying μ_1 − μ_2 = 2z. Then, the mixture term α(1 − α)(μ_1 − μ_2)(μ_1 − μ_2)^T = 4α(1 − α)zz^T enters the covariance of π_α. The extension to general mixtures with components {π^{(i)}}, i = 1, . . . , m, follows by induction from the last result.

For two Gaussian priors π_1 and π_2, with covariances Γ_1 and Γ_2, respectively, we have, for any Shannon-type learning function, that I_g(θ_1) ≤ I_g(θ_2) if and only if det(Γ_1) ≥ det(Γ_2). We should note that in many Bayesian set-ups, such as regression and Gaussian process prediction, we have a joint multivariate distribution between x and θ. Suppose that, with obvious notation, the joint covariance matrix is:

$$\Gamma = \begin{pmatrix} \Gamma_{\theta} & \Gamma_{\theta,X} \\ \Gamma_{X,\theta} & \Gamma_{X} \end{pmatrix}.$$

In this Gaussian case, −φ(π(θ)) ≤ −E_X(φ(π(θ|X))), by Theorem 7. However, as the conditional covariance matrix does not depend on X, we have learning in the strong sense: −φ(π(θ)) ≤ −φ(π(θ|X)). Classifying learning functions for θ and Γ_{θ,X} in the case where they are both unknown is not yet fully developed.
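For the jointly Gaussian case just described, the strong-sense learning can be seen directly (a sketch; the dimensions and the random joint covariance are our arbitrary choices): the conditional covariance is the Schur complement, which does not involve the observed value of X.

```python
import numpy as np

# Sketch (standard Gaussian conditioning; the dimensions and the random joint
# covariance are arbitrary choices): the conditional covariance of theta given
# X is the Schur complement, which does not depend on the observed X, so the
# trace criterion improves in the strong sense.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
joint = M @ M.T                        # joint covariance of (theta in R^2, X in R^3)
G_theta = joint[:2, :2]
G_cross = joint[:2, 2:]
G_x = joint[2:, 2:]

G_post = G_theta - G_cross @ np.linalg.solve(G_x, G_cross.T)   # Schur complement
print(np.trace(G_post) <= np.trace(G_theta))                   # True: guaranteed learning
```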

## 7. Approximate Bayesian Computation Designs

In the linear case, the criterion is a linear functional of the joint distribution of X_D and θ. In terms of integration, this only requires a single double integral. The non-linear case requires the evaluation of an “internal” integral for E_{θ|X_D} U(X_D, θ) and an external integral for E_{X_D}. It is important to note that Shannon-type functionals are special types of linear functionals, where U(θ, X_D) = g(π(θ|X_D)). The distance-based functionals are non-linear in that they require a repeated single integral.

In addition, there may be a utility U(X_D) solely dependent on the outcome of the experiment: if it really does snow, then snow plows may need to be deployed. The overall (preposterior) expected value of the experiment might then be E_{X_D}[φ(π(θ|X_D)) + U(X_D)].

In general, we must sample from π(θ|X_D) before evaluating φ. For simplicity, we use ABC rejection sampling (see Marjoram et al. [28]) to obtain an approximate sample from π(θ|X_D) that allows us to estimate the functional φ(π(θ|X_D)). In many cases, it is hard to find an analytical solution for π(θ|X_D), especially if f(x|θ) is intractable. These are the cases where ABC methods are most useful. Furthermore, ABC rejection sampling has the advantage that it is easily possible to re-compute φ̂(π(θ|X_D)) for different values of X_D, which is an important feature, because we have to integrate over the marginal distribution of X_D in order to obtain ψ(f) = E_{X_D} φ(π(θ|X_D)).

We then integrate φ̂(π(θ|X_D)) with respect to the marginal distribution f_X, which we can achieve using Monte Carlo integration. The ABC rejection algorithm to estimate φ(π(θ|x_D)) given x_D is as follows.

- (1) Sample from π(θ): {θ_1, . . . , θ_H}.
- (2) For each θ_i, sample from f(x|θ_i) to obtain a sample: ${x}^{(i)}=({x}_{1}^{(i)},\dots ,{x}_{n}^{(i)})$. This gives a sample from the joint distribution f_{X,θ}.
- (3) For each θ_i, compute a vector of summary statistics: T(x^{(i)}) = (T_1(x^{(i)}), . . . , T_m(x^{(i)})).
- (4) Split T-space into disjoint neighborhoods.
- (5) Find the neighborhood N for which T(x_D) ∈ N and collect the θ_i for which T(x^{(i)}) ∈ N, forming an approximate posterior distribution π̃(θ|T), which, if T is approximately sufficient, should be close to π(θ|x_D). If T is sufficient, we have that π̃(θ|T) → π(θ|x_D) as |N| → 0.
- (6) Approximate π(θ|x_D) by π̃(θ|T).
- (7) Evaluate φ(π(θ|x_D)) by integration (internal integration).
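The steps above can be sketched for a toy conjugate model (our choices, not the paper's: standard normal prior, normal likelihood with known unit variance, the sample mean as a sufficient summary statistic, and an ε-ball around T(x_D) in place of a fixed partition of T-space):

```python
import numpy as np

# Sketch of steps (1)-(7) for a toy conjugate model (our choices, not the
# paper's): theta ~ N(0, 1), x | theta ~ N(theta, 1) with n observations,
# summary T = sample mean, and an epsilon-ball around T(x_D).
rng = np.random.default_rng(3)

def abc_posterior_sample(x_D, n, H=100_000, eps=0.05):
    theta = rng.standard_normal(H)                     # (1) sample the prior
    x = theta[:, None] + rng.standard_normal((H, n))   # (2) sample f(x | theta)
    T = x.mean(axis=1)                                 # (3) summary statistic
    keep = np.abs(T - x_D.mean()) < eps                # (4)-(5) neighborhood of T(x_D)
    return theta[keep]                                 # (6) approximate posterior sample

x_D = rng.normal(1.0, 1.0, size=20)                    # stand-in observed data
sample = abc_posterior_sample(x_D, n=20)
phi_hat = -np.var(sample)                              # (7) e.g. minus posterior variance
print(len(sample), sample.mean(), phi_hat)
```

Because the sample mean is sufficient here, the accepted θ-values concentrate near the exact conjugate posterior mean n x̄/(n + 1) as ε shrinks.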

This procedure is repeated over draws x_D ~ f_X, which gives an estimate of ψ(f) = E_{X_D} φ(π(θ|X_D)), if we are happy to use the naive approximation to the double integral.

#### 7.1. Selective Sampling

The design variables are the two locations (“slits”) z_1 and z_2. Here, the model is equivalent (in the limit as the slit widths become small) to replacing f(x|θ) by the discrete distribution on x ∈ {1, 2} with Pr(x = i | θ, z_1, z_2) ∝ π(z_i | θ).

With f(x|z_1, z_2) = ∫ f(x|θ, z_1, z_2)π(θ)dθ denoting the marginal distribution of x, the posterior distribution is given by:

$$\pi(\theta \mid x, z_1, z_2) = \frac{f(x \mid \theta, z_1, z_2)\,\pi(\theta)}{f(x \mid z_1, z_2)}.$$

Without loss of generality, we assume z_2 ≥ z_1 and z_i ∈ [−a, a].

- (1) For fixed z_1 and z_2, sample H numbers {θ^{(j)}, j = 1, . . . , H} from the prior.
- (2) For each θ^{(j)}, repeat:
  - (a) sample z^{(k)} ~ π(z|θ^{(j)}) until #{z^{(k)} ∈ {N_ε(z_1), N_ε(z_2)}} = K_z, where N_ε(z) = [z − ε/2, z + ε/2];
  - (b) drop all z^{(k)} ∉ {N_ε(z_1), N_ε(z_2)};
  - (c) sample x^{(j)} from the discrete distribution with probabilities $\text{Pr}({x}^{(j)}=i)=\frac{\#\{{z}^{(k)}\in {N}_{\varepsilon}({z}_{i})\}}{{K}_{z}}$, i = 1, 2.
- (3) For i = 1, 2, select all θ^{(j)} for which x^{(j)} = i, compute a kernel density estimate for these θ^{(j)} and obtain its maximum → φ̂(π̂(θ|x = i, z_1, z_2)).
- (4) $\widehat{\psi}({z}_{1},{z}_{2})={\displaystyle \sum _{i=1}^{2}}\widehat{\phi}(\widehat{\pi}(\theta \mid x=i,{z}_{1},{z}_{2}))\frac{\#\{{x}^{(j)}=i\}}{H}$.
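These selective-sampling steps can be sketched with a hypothetical latent model (our assumption, not the paper's): θ ~ N(0, 1) and z | θ ~ N(θ, 1). For brevity we use minus the posterior variance as the functional in step (3) instead of the mode of a kernel density estimate.

```python
import numpy as np

# Sketch of the selective-sampling steps with a hypothetical latent model
# (our assumption, not the paper's): theta ~ N(0, 1), z | theta ~ N(theta, 1).
# Minus the posterior variance replaces the kernel-density mode of step (3).
rng = np.random.default_rng(4)

def psi_hat(z1, z2, H=300, Kz=100, eps=0.1):
    thetas = rng.standard_normal(H)                      # (1) prior sample
    xs = np.empty(H, dtype=int)
    for j, th in enumerate(thetas):                      # (2) for each theta:
        counts = np.zeros(2)
        while counts.sum() < Kz:                         # (a) sample z until Kz hits
            z = rng.normal(th, 1.0, size=2000)
            counts[0] += np.count_nonzero(np.abs(z - z1) < eps / 2)
            counts[1] += np.count_nonzero(np.abs(z - z2) < eps / 2)
        xs[j] = rng.choice(2, p=counts / counts.sum())   # (c) sample a slit index
    # (3)-(4): weight the per-outcome functional by the outcome frequencies
    return sum(-np.var(thetas[xs == i]) * np.mean(xs == i) for i in (0, 1))

result = psi_hat(-0.5, 0.5)
print(result)
```

Sampling z in batches (rather than one draw at a time) keeps step (a) fast even for prior draws of θ far from both slits.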

We experimented with several values of ε and K_z (K_z = 50, 100, 200) in order to assess the effect of these parameters on the accuracy of the ABC estimates of the criterion ψ. The most notable effect was found for the ABC sample size H.

Figure 2 shows the estimated criterion values for symmetric slit designs z_2 = −z_1 when a = 1.5. We set ε = 0.01 and K_z = 100. The ABC sample size H is set to H = 100 (left), H = 1,000 (center), and H = 10,000 (right). The criterion was evaluated at the eight points z_1 = 0.1, 0.3, 0.5, 0.7, 0.9, 1.1, 1.3, 1.5. The theoretical criterion function ψ(z_1) is plotted as a solid line.

#### 7.2. Spatial Sampling for Prediction

We write X_i = X_i(z_i), i = 1, . . . , n to indicate sampling at sites (the design) D_n = {z_1, . . . , z_n}. We would typically take the design space, Z, to be a compact region.

The objective is the prediction of X_{n+1}, namely x_{n+1}(z_{n+1}), given x_D = x(D_n) = (x_1(z_1), . . . , x_n(z_n)). In the Gaussian case, the background parameter θ could be related to a fixed effect (drift) or the covariance function of the process, or both. In the analysis, x_{n+1} is regarded as an additional parameter, and we need its (marginal) conditional distribution.

This conditional distribution of x_{n+1} may be interpreted as the posterior distribution in (11). The optimality criterion ψ is found by integrating φ with respect to X_1, . . . , X_n.

We sample at the design D_n and then perform ABC at each test point z_{n+1}. The learning functional φ(x_D) is estimated by generating the sample $I={\{{x}_{D}^{(j)},{x}_{n+1}^{(j)}\}}_{j=1}^{H}={\{{x}_{1}^{(j)},{x}_{2}^{(j)},\dots ,{x}_{n}^{(j)},{x}_{n+1}^{(j)}\}}_{j=1}^{H}$ at the sites {z_1, z_2, . . . , z_n, z_{n+1}} and calculating φ̂(x_D) from this sample.

To estimate ψ(D_n) = E_{X_D}(φ(X_D)), we obtain a sample $O={\{{x}_{D}^{(i)}\}}_{i=1}^{G}$ from the marginal distribution of the random field at the design D_n and perform Monte Carlo integration:

$$\widehat{\psi}({D}_{n})=\frac{1}{G}{\sum}_{i=1}^{G}\widehat{\phi}({x}_{D}^{(i)}). \qquad (12)$$

In our example, (x_1(z_1), x_2(z_2), x_3(z_3), x_4(z_4)) are assumed to be distributed according to a one-dimensional Gaussian random field with mean zero, a marginal variance of one and z_i ∈ [0, 1]. We want to select an optimal design D_3 = (z_1, z_2, z_3) that minimizes the prediction criterion ψ(D_3).

The correlation function of the field is ρ(s, t; θ) = e^{−θ|s−t|}. Two prior distributions for the parameter θ are considered. The first one is a point prior at θ = log(100), so that ρ(h) = ρ(h; log(100)) = 0.01^h. This is the correlation function used by Müller et al. [32] in their study of empirical kriging optimal designs. The second prior distribution is an exponential prior for θ with scale parameter λ = 10 (i.e., θ ~ Exp(10)). The scale parameter λ was chosen such that the average correlation functions of the point and exponential priors are similar. By that, we mean that the average of the mean correlation function for the exponential prior over all pairs of sites s and t, E_{s,t}[E_θ{ρ(|s−t|; θ)|θ ~ Exp(λ)}] = E_{s,t}[1/(1+λ|s−t|)], matches the average of the fixed correlation function ρ(|s − t|; log(100)) = 0.01^{|s−t|} over all pairs of sites s and t, E_{s,t}[0.01^{|s−t|}]. The sites are assumed to be uniformly distributed over the coordinate space.

For sites uniformly distributed over [0, 1], the average correlation under the exponential prior is E_{s,t}[E_θ{ρ(|s − t|; θ)|θ ~ Exp(10)}] = 0.3275.

Figure 3 displays the two priors for the correlation function: the solid line is ρ(h) = 0.01^h. The dotted line and the two dashed lines represent the mean correlation function and the 0.025- and 0.975-quantile functions for ρ(h; θ) under the prior θ ~ Exp(10).

We evaluate the criterion on a grid of values of z_1 and z_3 (z_2 is fixed at z_2 = 0.5). We set G = 1,000, H = 5 · 10⁶ and ε = 0.01 for each design point. The sample ${\{{x}^{j}(z):z\in \mathcal{Z}\}}_{j=1}^{H}$ is simulated at all points z of the grid prior to the actual ABC algorithm. In order to accelerate the computations, it is then reused for all possible designs D_3 to estimate each $\widehat{\phi}({x}_{D}^{(i)})$, i = 1, . . . , G, in (12). The sample size H = 5 · 10⁶ was deemed to provide a sufficiently exhaustive sample from the four-dimensional normal vector (x_1(z_1), x_2(z_2), x_3(z_3), x_4(z_4)) for any z_i ∈ Z, so that the distortive effect of using the same sample for the computations of all $\widehat{\phi}({x}_{D}^{(i)})$ is only of negligible concern for our purposes of ranking the designs.
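For the point prior at θ = log(100), the predictive variance of x(z_new) given the field at the design is available in closed form by Gaussian conditioning, which gives a useful cross-check on ABC-based criterion values. A sketch (the average-predictive-variance criterion over a grid of prediction sites is our stand-in for the exact ψ):

```python
import numpy as np

# Cross-check sketch for the point prior theta = log(100): the predictive
# variance of x(z_new) given the field at the design follows from Gaussian
# conditioning. Averaging it over a grid of prediction sites is a stand-in
# for the exact criterion psi.
def kriging_var(design, z_new, theta=np.log(100.0)):
    """Conditional variance of x(z_new) given x at `design` (unit-variance field)."""
    z = np.asarray(design, dtype=float)
    K = np.exp(-theta * np.abs(z[:, None] - z[None, :]))   # rho(h) = 0.01**h
    k = np.exp(-theta * np.abs(z - z_new))
    return 1.0 - k @ np.linalg.solve(K, k)

design = [0.1, 0.5, 0.9]
grid = np.linspace(0.0, 1.0, 101)
crit = np.mean([kriging_var(design, z) for z in grid])     # average predictive variance
print(crit)
```

At any design site, the predictive variance is zero, so degenerate designs with coincident points (as on the diagonal of the criterion map) behave very differently from three distinct sites.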

Figure 4 shows the criterion map over (z_1, z_3), when the prior distribution of θ is the point prior at θ = log(100). It can be seen that the minimum of the criterion is attained at about (z_1, z_3) = (0.9, 0.1) or (z_1, z_3) = (0.1, 0.9), which is comparable to the results obtained in Müller et al. [32] for empirical kriging optimal designs. Note that the diverging criterion values at the diagonal and at z_1 = 0.5 and z_3 = 0.5 are attributable to a specific feature of the ABC method used. At these designs, the actual dimension of the design is lower than three, so for a given ε, there are more elements in the neighborhood than for the other designs with three distinct design points. Hence, a much larger fraction of the total sample, ${\{{x}_{n+1}^{(j)}\}}_{j=1}^{H}$, is retained at these designs.

## 8. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

- Blackwell, D. Comparison of Experiments. Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 31 July–12 August 1950; University of California Press: Berkeley, CA, USA, 1951; pp. 93–102. [Google Scholar]
- Torgersen, E. Comparison of Statistical Experiments; Encyclopedia of Mathematics and its Applications 36; Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
- Rényi, A. On Measures of Entropy and Information. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
- Lindley, D.V. On a Measure of the Information Provided by an Experiment. Ann. Math. Stat
**1956**, 27, 986–1005. [Google Scholar] - Goel, P.K.; DeGroot, M.H. Comparison of Experiments and Information Measures. Ann. Stat
**1979**, 7, 1066–1077. [Google Scholar] - Ginebra, J. On the measure of the information in a statistical experiment. Bayesian Anal
**2007**, 2, 167–211. [Google Scholar] - Chaloner, K.; Verdinelli, I. Bayesian Experimental Design: A Review. Stat. Sci
**1995**, 10, 273–304. [Google Scholar] - Sebastiani, P.; Wynn, H.P. Maximum entropy sampling and optimal Bayesian experimental design. J. R. Stat. Soc.: Ser. B (Stat. Methodol.)
**2000**, 62, 145–157. [Google Scholar] - Chater, N. The Probability Heuristics Model of Syllogistic Reasoning. Cogn. Psychol
**1999**, 38, 191–258. [Google Scholar] - Schoenberg, I.J. Metric Spaces and Positive Definite Functions. Trans. Am. Math. Soc
**1938**, 44, 522–536. [Google Scholar] - Schilling, R.L.; Song, R.; Vondracek, Z. Bernstein Functions: Theory and Applications; De Gruyter Studies in Mathematics 37; De Gruyter: Berlin, Germany, 2012. [Google Scholar]
- Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys
**1988**, 52, 479–487. [Google Scholar] - DeGroot, M.H. Optimal Statistical Decisions, WCL edition; Wiley-Interscience: Hoboken, NJ, USA, 2004. [Google Scholar]
- Goldman, A.I.; Shaked, M. Results on inquiry and truth possession. Stat. Probab. Lett
**1991**, 12, 415–420. [Google Scholar] - Fallis, D.; Liddell, G. Further results on inquiry and truth possession. Stat. Probab. Lett
**2002**, 60, 169–182. [Google Scholar] - Torgerson, W.S. Theory and Methods of Scaling; John Wiley and Sons, Inc: New York, NY, USA, 1958. [Google Scholar]
- Gower, J.C. Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika
**1966**, 53, 325–338. [Google Scholar] - Gower, J.C. Euclidean distance geometry. Math. Sci
**1982**, 7, 1–14. [Google Scholar] - Itti, L.; Baldi, P. Bayesian surprise attracts human attention. Vis. Res
**2009**, 49, 1295–1306. [Google Scholar] - Haykin, S.; Chen, Z. The Cocktail Party Problem. Neural Comput
**2005**, 17, 1875–1902. [Google Scholar] - Berger, J. The case for objective Bayesian analysis. Bayesian Anal
**2006**, 1, 385–402. [Google Scholar] - Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed; Springer Series in Statistics; Springer: Berlin, Germany, 2009. [Google Scholar]
- Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities, 2nd ed; Cambridge Mathematical Library; Cambridge University Press: Cambridge, UK, 1988. [Google Scholar]
- Müller, A.; Stoyan, D. Comparison Methods for Stochastic Models and Risks, 1st ed; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2002. [Google Scholar]
- Ryff, J.V. Orbits of l
^{1}-functions under doubly stochastic transformations. Trans. Am. Math. Soc**1965**, 117, 92–100. [Google Scholar] - DeGroot, M.H.; Fienberg, S. Comparing probability forecasters: Basic binary concepts and multivariate extensions. In Bayesian Inference and Decision Techniques; Goel, P., Zellner, A., Eds.; North-Holland: Amsterdam, The Netherlands, 1986; pp. 247–264. [Google Scholar]
- Dawid, A.P.; Sebastiani, P. Coherent dispersion criteria for optimal experimental design. Ann. Stat
**1999**, 27, 65–81. [Google Scholar] - Marjoram, P.; Molitor, J.; Plagnol, V.; Tavaré, S. Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA
**2003**, 100, 15324–15328. [Google Scholar] - Hainy, M.; Müller, W.; Wynn, H. Approximate Bayesian Computation Design (ABCD), an Introduction. In mODa 10—Advances in Model-Oriented Design and Analysis; Ucinski, D., Atkinson, A.C., Patan, M., Eds.; Contributions to Statistics; Springer International Publishing: Heidelberg/Berlin, Germany, 2013; pp. 135–143. [Google Scholar]
- Drovandi, C.C.; Pettitt, A.N. Bayesian Experimental Design for Models with Intractable Likelihoods. Biometrics
**2013**, 69, 937–948. [Google Scholar] - Hainy, M.; Müller, W.G.; Wagner, H. Likelihood-free Simulation-based Optimal Design; Technical Report; Johannes Kepler University: Linz, Austria, 2013. [Google Scholar]
- Müller, W.G.; Pronzato, L.; Waldl, H. Beyond space-filling: An illustrative case. Procedia Environ. Sci
**2011**, 7, 14–19. [Google Scholar]

**Figure 2.** Estimated values of the criterion ψ̂(z_1) (points) and theoretical criterion function ψ(z_1) (solid line) for ε = 0.01, K_z = 100, and H = 100 (**a**), H = 1,000 (**b**), H = 10,000 (**c**).

**Figure 3.** Prior distributions of correlation function ρ(h; θ): correlation function ρ(h) = 0.01^h under point prior θ = log(100) (solid line); mean correlation function (dotted line) and 0.025- and 0.975-quantile functions (dashed lines) for ρ(h; θ) under the prior θ ~ Exp(10).

**Figure 4.** Spatial prediction criterion map for the point prior at θ = log(100) (**left**) and for the exponential prior θ ~ Exp(10) (**right**).

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite


Hainy, M.; Müller, W.G.; P. Wynn, H.
Learning Functions and Approximate Bayesian Computation Design: ABCD. *Entropy* **2014**, *16*, 4353-4374.
https://doi.org/10.3390/e16084353
