Probabilistic Pairwise Model Comparisons Based on Bootstrap Estimators of the Kullback–Leibler Discrepancy

Dajles, Andres; Cavanaugh, Joseph

doi:10.3390/e24101483



by Andres Dajles ^*,† and Joseph Cavanaugh ^†

Department of Biostatistics, University of Iowa, 145 N. Riverside Drive, Iowa City, IA 52242, USA

^* Author to whom correspondence should be addressed.

^† These authors contributed equally to this work.

Entropy 2022, 24(10), 1483; https://doi.org/10.3390/e24101483

Submission received: 27 September 2022 / Revised: 15 October 2022 / Accepted: 16 October 2022 / Published: 18 October 2022

(This article belongs to the Special Issue Information and Divergence Measures)


Abstract:

When choosing between two candidate models, classical hypothesis testing presents two main limitations: first, the models being tested have to be nested, and second, one of the candidate models must subsume the structure of the true data-generating model. Discrepancy measures have been used as an alternative method to select models without the need to rely upon the aforementioned assumptions. In this paper, we utilize a bootstrap approximation of the Kullback–Leibler discrepancy (BD) to estimate the probability that the fitted null model is closer to the underlying generating model than the fitted alternative model. We propose correcting for the bias of the BD estimator either by adding a bootstrap-based correction or by adding the number of parameters in the candidate model. We exemplify the effect of these corrections on the estimator of the discrepancy probability and explore their behavior in different model comparison settings.

Keywords:

bootstrap discrepancy comparison probability (BDCP); discrepancy comparison probability (DCP); likelihood ratio test (LRT); model selection; p-value

1. Introduction

Hypothesis testing and p-values are routinely used in applied, empirically oriented research. However, practitioners of statistics often misinterpret p-values, particularly in settings where hypothesis tests are used for model comparisons. Riedle, Neath and Cavanaugh [1] attempt to address this issue by providing an alternate conceptualization of the p-value. The authors introduce and investigate the concept of the discrepancy comparison probability (DCP) and its bootstrapped estimator, called the bootstrap discrepancy comparison probability (BDCP). The authors establish a clear connection between the BDCP based on the Kullback–Leibler discrepancy (KLD) and the p-values derived from likelihood ratio tests. However, this connection only exists when using the bootstrap discrepancy (BD) that arises from the “plug-in” principle, which yields a biased approximation to the KLD. Similarly to complexity penalization of the Akaike Information Criterion (AIC), we establish that an intuitive bias correction to the BD is the addition of k, the number of functionally independent parameters in the candidate model. We also propose utilizing a bootstrap-based correction, which can be justified under less stringent assumptions. We analyze how well the bootstrap approach corrects the bias of the BDCP and the BD, and we show that, in most settings, its performance is comparable to simply adding k.

2. Methodological Development

2.1. Background

When faced with the task of choosing amongst competing models, statisticians often use discrepancy or divergence functions. One of the most flexible and ubiquitous divergence measures is the Kullback–Leibler information. To introduce this measure in the present context, consider a vector of independent observations

y = {(y_{1}, y_{2}, \dots, y_{n})}^{T}

such that y is generated from an unknown distribution

g (y)

. Suppose that a candidate model

f (y | θ)

is proposed as an approximation for

g (y)

, and that this model belongs to the parametric class of densities

\mathcal{F} = \{f (y | θ) : θ \in Θ\},

where

Θ

is the parameter space for

θ

. The Kullback–Leibler information, given by

I_{K L} (g, θ) = E_{g} [log \frac{g (y)}{f (y | θ)}],

captures the separation between the proposed model

f (y | θ)

and the true data-generating model

g (y)

.

Although not a formal metric,

I_{K L} (g, θ)

is characterized by two desirable properties. First, by Jensen’s inequality,

I_{K L} (g, θ) \geq 0

with equality if and only if

g (y) = f (y | θ)

. Second, as the dissimilarity between

g (y)

and

f (y | θ)

increases,

I_{K L} (g, θ)

increases accordingly.

Note that we can write

\begin{aligned} 2 I_{K L} (g, θ) & = E_{g} [- 2 \log (f (y | θ))] - E_{g} [- 2 \log (g (y))] \\ & = E_{g} [- 2 ℓ (θ | y)] - E_{g} [- 2 \log (g (y))], \end{aligned}

where

log (f (y | θ)) = ℓ (θ | y)

. In the preceding relation, for any proposed candidate model, the quantity

E_{g} [- 2 log (g (y))]

is constant. Only the quantity

E_{g} [- 2 ℓ (θ | y)]

changes across different models, which means it is the only quantity needed to distinguish among various models. The expression

d (g, θ) = E_{g} [- 2 ℓ (θ | y)]

is known as the Kullback–Leibler discrepancy (KLD) and is often used as a substitute for

I_{K L} (g, θ)

.

In practice, the goal is to determine the propriety of fitted models of the form

f (y | \hat{θ})

, where

\hat{θ} = {argmax}_{θ \in Θ} ℓ (θ | y)

. The KL discrepancy for the fitted model is given by

d (g, \hat{θ}) = E_{g} [- 2 ℓ (θ | y)] |_{θ = \hat{θ}} .
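As a concrete illustration of the Kullback–Leibler information defined above, the following minimal sketch compares a Monte Carlo estimate of I_{KL} with its closed form for a single observation when both g and f are normal densities. The normal choice, the function names, and the parameter values are assumptions made purely for illustration; they are not part of the original development.

```python
import numpy as np


def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (y - mean) ** 2 / var)


def kl_information_mc(mean_g, var_g, mean_f, var_f, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E_g[log g(y) - log f(y)] with g and f normal."""
    rng = np.random.default_rng(seed)
    y = rng.normal(mean_g, np.sqrt(var_g), size=n_draws)  # draws from g
    log_ratio = normal_logpdf(y, mean_g, var_g) - normal_logpdf(y, mean_f, var_f)
    return float(np.mean(log_ratio))


def kl_information_exact(mean_g, var_g, mean_f, var_f):
    """Closed-form KL information between two univariate normal densities."""
    return 0.5 * (np.log(var_f / var_g)
                  + (var_g + (mean_g - mean_f) ** 2) / var_f
                  - 1.0)


if __name__ == "__main__":
    # Hypothetical example: g = N(0, 1), candidate f = N(1, 4).
    print(kl_information_exact(0.0, 1.0, 1.0, 4.0))
    print(kl_information_mc(0.0, 1.0, 1.0, 4.0))
```

The two values should agree up to Monte Carlo error, and the closed form returns zero when f coincides with g, in line with the first property noted above.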

2.2. The Discrepancy Comparison Probability and Bootstrap Discrepancy Comparison Probability

Suppose that we have two nested models that are formulated to characterize the sample y, and we designate one of the models the null, represented by

θ_{1}

, and the other model the alternative, represented by

θ_{2}

. The discrepancies under the fitted null and alternative models are given by

d (g, {\hat{θ}}_{1})

and

d (g, {\hat{θ}}_{2})

, respectively. We can use these discrepancies to define the Kullback–Leibler discrepancy comparison probability (KLDCP), which is given by

P = \Pr [d (g, {\hat{θ}}_{1}) < d (g, {\hat{θ}}_{2})] .

The KLDCP evaluates the probability that the fitted null model is closer to the true data-generating model than the fitted alternative. The values of

d (g, {\hat{θ}}_{1})

and

d (g, {\hat{θ}}_{2})

are calculated from the same sample. For example, a KLDCP of

0.8

means that the fitted null has a smaller discrepancy than the fitted alternative in 80% of the samples drawn from the same distribution and of the same size. The development and interpretation of the KLDCP is presented in depth by Riedle, Neath and Cavanaugh [1].

We can estimate the KLDCP using the bootstrap approximation of the joint distribution of

d (g, {\hat{θ}}_{1})

and

d (g, {\hat{θ}}_{2})

. The bootstrap joint distribution is based on the discrepancy estimators that arise from the “plug-in” principle, as described by Efron and Tibshirani [2], which replaces all the elements of the KLD by their bootstrap analogues. Specifically, we replace g by the empirical distribution

\hat{g}

; y by the bootstrap sample from

\hat{g}

, which we call

y^{*}

; and finally,

\hat{θ}

by the maximum likelihood estimate (MLE) derived under the bootstrap sample

y^{*}

, which we call

{\hat{θ}}^{*}

. With these replacements, the bootstrap version of the KLD is given by

\begin{aligned} d (\hat{g}, {\hat{θ}}^{*}) & = E_{\hat{g}} [- 2 ℓ (θ | y)] |_{θ = {\hat{θ}}^{*}} \\ & = \sum_{i = 1}^{n} - 2 ℓ_{i} ({\hat{θ}}^{*} | y_{i}) \quad \text{(because the } y_{i} \text{ are independent)} \\ & = - 2 ℓ ({\hat{θ}}^{*} | y), \end{aligned}

where

ℓ_{i}

represents the contribution to the likelihood based on the ith response

y_{i}

.

Now, in order to build a bootstrap distribution, we must draw various bootstrap samples from y. Suppose that we draw

j = 1, 2, \dots, J

bootstrap samples, and for each of these samples, we calculate the MLE of

θ

, which we denote as

{\hat{θ}}^{*} (j)

. This allows us to obtain a set of J different bootstrap discrepancies; this set is defined as

\{d (\hat{g}, {\hat{θ}}^{*} (j)) : j = 1, \dots, J\},

and these variates can be used to construct the bootstrap analogue of the discrepancy distribution.

Finally, we can extend this procedure to the setting of the null and alternative models. For each bootstrap sample, we calculate

{\hat{θ}}_{2}^{*} (j)

and

{\hat{θ}}_{1}^{*} (j)

, which are the bootstrap sample MLEs of

θ_{2}

and

θ_{1}

, respectively. We then compute the discrepancies

d (\hat{g}, {\hat{θ}}_{2}^{*} (j))

and

d (\hat{g}, {\hat{θ}}_{1}^{*} (j))

for the alternative and null models, respectively. This collection of J pairs of null and alternative bootstrap discrepancies defines the set

\{(d (\hat{g}, {\hat{θ}}_{1}^{*} (j)), d (\hat{g}, {\hat{θ}}_{2}^{*} (j))) : j = 1, \dots, J\},

which characterizes the bootstrap analogue of the joint distribution of

d (g, {\hat{θ}}_{1})

and

d (g, {\hat{θ}}_{2}) .

The bootstrap distribution can be utilized to estimate the bootstrap analogue of the DCP, given by

P^{*} = \Pr^{*} [d (\hat{g}, {\hat{θ}}_{1}^{*}) < d (\hat{g}, {\hat{θ}}_{2}^{*})] .

By the law of large numbers, we can approximate

P^{*}

by calculating the proportion of times when

d (\hat{g}, {\hat{θ}}_{1}^{*} (j)) < d (\hat{g}, {\hat{θ}}_{2}^{*} (j))

in the J bootstrap samples that were drawn. Thus, if I is an indicator function, we can define an estimator of the DCP, which we call the bootstrap discrepancy comparison probability (BDCP), as follows:

BDCP = \frac{1}{J} \sum_{j = 1}^{J} I [d (\hat{g}, {\hat{θ}}_{1}^{*} (j)) < d (\hat{g}, {\hat{θ}}_{2}^{*} (j))] .

(1)
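The estimator in (1) can be sketched in a few lines for Gaussian linear regression models. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the bootstrap samples are assumed to be drawn by resampling cases (rows) with replacement, which is one way of sampling from the empirical distribution. Each bootstrap MLE is evaluated on the original sample, as in the expression for the bootstrap version of the KLD.

```python
import numpy as np


def fit_gaussian_lm(X, y):
    """MLEs for y = X beta + eps, eps ~ N(0, sigma^2 I): OLS beta and RSS / n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / len(y)


def neg2_loglik(X, y, beta, sigma2):
    """-2 times the Gaussian log-likelihood of (beta, sigma2) on the data (X, y)."""
    resid = y - X @ beta
    return len(y) * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2


def bdcp(X_null, X_alt, y, n_boot=200, seed=0):
    """Uncorrected BDCP: share of bootstrap samples with d_null < d_alt."""
    rng = np.random.default_rng(seed)
    n = len(y)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # case resampling (an assumption)
        b1, s1 = fit_gaussian_lm(X_null[idx], y[idx])
        b2, s2 = fit_gaussian_lm(X_alt[idx], y[idx])
        # Bootstrap MLEs are plugged into the likelihood of the ORIGINAL sample.
        d1 = neg2_loglik(X_null, y, b1, s1)
        d2 = neg2_loglik(X_alt, y, b2, s2)
        wins += int(d1 < d2)
    return wins / n_boot


if __name__ == "__main__":
    # Hypothetical data: intercept-only null versus a one-covariate alternative.
    rng = np.random.default_rng(1)
    x = rng.normal(size=100)
    y = 1.0 + 2.0 * x + rng.normal(size=100)
    print(bdcp(np.ones((100, 1)), np.column_stack([np.ones(100), x]), y))
```

With a strong covariate effect, as in the hypothetical data above, the fitted alternative should almost always have the smaller bootstrap discrepancy, so the returned value should be near 0.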

3. Bias Corrections for the BDCP

An important issue that arises in the bootstrap estimation of the KLD is the negative bias of the discrepancy estimators that materializes from the “plug-in” principle. The following lemma establishes and quantifies this bias for large-sample settings under an appropriately specified candidate model.

Lemma 1.

For a large sample size, assuming that the candidate model subsumes the true model, we have

E_{g} \{E_{*} [- 2 ℓ ({\hat{θ}}^{*} | y)]\} \approx E_{g} [d (g, \hat{θ})] - k,

where

E_{*}

is the expectation with respect to the bootstrap distribution, and k is the dimension of the model.

Proof.

For a maximum likelihood estimator

\hat{θ}

, it is well known that for a large sample size and under certain regularity conditions, we have

{(\hat{θ} - θ)}^{T} I (θ | y) (\hat{θ} - θ) \sim χ_{k}^{2},

(2)

provided that the model is adequately specified. In the preceding,

χ_{k}^{2}

denotes a central chi-square random variable with k degrees of freedom.

Now, consider the second-order Taylor series expansion of

- 2 ℓ ({\hat{θ}}^{*} | y)

about

\hat{θ}

, which results in

- 2 ℓ ({\hat{θ}}^{*} | y) \approx - 2 ℓ (\hat{θ} | y) + {({\hat{θ}}^{*} - \hat{θ})}^{T} I (\hat{θ} | y) ({\hat{θ}}^{*} - \hat{θ}) .

(3)

By taking the expected value of both sides of (3) with respect to the bootstrap distribution of

{\hat{θ}}^{*}

, we obtain

\begin{aligned} E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) & \approx - E_{*} (2 ℓ (\hat{θ} | y)) + E_{*} ({({\hat{θ}}^{*} - \hat{θ})}^{T} I (\hat{θ} | y) ({\hat{θ}}^{*} - \hat{θ})) \\ & \approx - 2 ℓ (\hat{θ} | y) + k \quad \text{(by the approximation in (2))} \\ & = AIC - k, \end{aligned}

where AIC denotes the Akaike information criterion.

Finally, it has been established that if the true model is contained in the candidate class at hand, and if the large sample properties of MLEs hold, then AIC serves as an asymptotically unbiased estimator of the KLD. Thus,

\begin{aligned} E_{g} (E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y))) & \approx E_{g} (AIC) - k \\ & \approx E_{g} (d (g, \hat{θ})) - k . \end{aligned}

□

The preceding expression can be re-written as

E_{g} (d (g, \hat{θ})) \approx E_{g} (E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y))) + k,

which implies that the bias correction k must be added to the bootstrap discrepancy in the estimation of the KLD. The BD estimator corrected by the addition of k will be called BDk.

Now, focus again on Equation (3). By subtracting

(- 2 ℓ (\hat{θ} | y))

from both sides of the equation, we obtain

- 2 ℓ ({\hat{θ}}^{*} | y) - (- 2 ℓ (\hat{θ} | y)) \approx {({\hat{θ}}^{*} - \hat{θ})}^{T} I (\hat{θ} | y) ({\hat{θ}}^{*} - \hat{θ}) .

(4)

As mentioned previously, if the candidate model is adequately specified, then the distributional approximation in (2) holds true. However, if this model specification assumption is not met, then we can utilize the approximation in (4) to find a suitable bias correction via the bootstrap. The bootstrap has been used for bias corrections in similar problem contexts [3,4].

By applying the expected value with respect to the bootstrap distribution of

{\hat{θ}}^{*}

to both sides of (4), we obtain

E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) - (- 2 ℓ (\hat{θ} | y)) \approx E_{*} ({({\hat{θ}}^{*} - \hat{θ})}^{T} I (\hat{θ} | y) ({\hat{θ}}^{*} - \hat{θ})) .

(5)

The goal is then to find an approximation of

E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) - (- 2 ℓ (\hat{θ} | y))

. Note that by the law of large numbers, we have that when

J \to \infty

,

\frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) \to E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) .

Thus, for

J \to \infty

, we can assert

\frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) - (- 2 ℓ (\hat{θ} | y)) \to E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) - (- 2 ℓ (\hat{θ} | y)) .

The preceding result shows that

\frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) - (- 2 ℓ (\hat{θ} | y))

serves as an asymptotically unbiased estimator of

E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) - (- 2 ℓ (\hat{θ} | y))

. We therefore propose using

k_{b} = \frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) - (- 2 ℓ (\hat{θ} | y))

as a bootstrap-based correction of the BD. A more in-depth derivation and exploration of the

k_{b}

correction can be found in Cavanaugh and Shumway [5].
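To make the bootstrap-based correction concrete, the sketch below computes k_{b} for a Gaussian linear model and allows a comparison with k. It is an illustration under stated assumptions: the function names are hypothetical, cases are resampled with replacement, and k is taken to be the number of regression coefficients plus one for the error variance.

```python
import numpy as np


def fit_gaussian_lm(X, y):
    """MLEs for a Gaussian linear model: OLS coefficients and RSS / n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / len(y)


def neg2_loglik(X, y, beta, sigma2):
    """-2 times the Gaussian log-likelihood of (beta, sigma2) on the data (X, y)."""
    resid = y - X @ beta
    return len(y) * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2


def bootstrap_correction(X, y, n_boot=200, seed=0):
    """k_b: mean bootstrap discrepancy minus -2 l(theta_hat | y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    fitted = neg2_loglik(X, y, *fit_gaussian_lm(X, y))  # -2 l(theta_hat | y)
    boot = np.empty(n_boot)
    for j in range(n_boot):
        idx = rng.integers(0, n, size=n)  # case resampling (an assumption)
        b, s2 = fit_gaussian_lm(X[idx], y[idx])
        boot[j] = neg2_loglik(X, y, b, s2)  # -2 l(theta_hat*(j) | y)
    return float(boot.mean() - fitted)


if __name__ == "__main__":
    # Hypothetical, correctly specified model with k = 3 (two betas, one variance).
    rng = np.random.default_rng(5)
    x = rng.normal(size=200)
    X = np.column_stack([np.ones(200), x])
    y = 1.0 + 0.5 * x + rng.normal(size=200)
    print(bootstrap_correction(X, y, n_boot=300, seed=6), "versus k =", X.shape[1] + 1)
```

Because \hat{θ} maximizes the likelihood on the original sample, every term in the average is at least -2 ℓ(\hat{θ} | y), so k_{b} is nonnegative by construction. Under correct specification and a moderately large sample, it should be of the same order as k.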

Subsequently, the bootstrap approximation of the KLD with a bootstrap-based bias correction is expressed by

E_{*} (- 2 ℓ ({\hat{θ}}^{*} | y)) + k_{b}

, and is estimated by

BDb = \frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) + k_{b} .

It follows that the bootstrap bias-corrected BDCP would be defined as

\begin{matrix} BDCPb = \frac{1}{J} \sum_{j = 1}^{J} I [d (\hat{g}, {\hat{θ}}_{1}^{*} (j)) + k_{1 b} < d (\hat{g}, {\hat{θ}}_{2}^{*} (j)) + k_{2 b}], \end{matrix}

(6)

where

k_{1 b}

and

k_{2 b}

correspond to the bootstrap-based corrections for the null and alternative models, respectively.

Similarly, the k bias-corrected BD is expressed as

BDk = \frac{1}{J} \sum_{j = 1}^{J} - 2 ℓ ({\hat{θ}}^{*} (j) | y) + k,

and the k bias-corrected BDCP is given by

\begin{matrix} BDCPk = \frac{1}{J} \sum_{j = 1}^{J} I [d (\hat{g}, {\hat{θ}}_{1}^{*} (j)) + k_{1} < d (\hat{g}, {\hat{θ}}_{2}^{*} (j)) + k_{2}], \end{matrix}

(7)

where

k_{1}

and

k_{2}

are the number of functionally independent parameters that define the null and alternative models, respectively.
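The three estimators in (1), (6) and (7) differ only in the penalty added to each bootstrap discrepancy, so they can be computed from the same set of bootstrap fits. The following sketch does this for two Gaussian linear models. The function names, the case-resampling scheme, and the convention that k counts the regression coefficients plus the error variance are assumptions made for illustration.

```python
import numpy as np


def fit_gaussian_lm(X, y):
    """MLEs for a Gaussian linear model: OLS coefficients and RSS / n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / len(y)


def neg2_loglik(X, y, beta, sigma2):
    """-2 times the Gaussian log-likelihood of (beta, sigma2) on the data (X, y)."""
    resid = y - X @ beta
    return len(y) * np.log(2.0 * np.pi * sigma2) + resid @ resid / sigma2


def bdcp_variants(X_null, X_alt, y, n_boot=200, seed=0):
    """Return the uncorrected BDCP together with BDCPk and BDCPb."""
    rng = np.random.default_rng(seed)
    n = len(y)
    designs = (X_null, X_alt)
    boot = np.empty((n_boot, 2))  # column 0: null, column 1: alternative
    for j in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same resample for both models
        for m, X in enumerate(designs):
            b, s2 = fit_gaussian_lm(X[idx], y[idx])
            boot[j, m] = neg2_loglik(X, y, b, s2)
    fitted = np.array([neg2_loglik(X, y, *fit_gaussian_lm(X, y)) for X in designs])
    k = np.array([X.shape[1] + 1 for X in designs], dtype=float)  # betas + variance
    k_b = boot.mean(axis=0) - fitted  # bootstrap-based corrections

    def prob(penalty):
        return float(np.mean(boot[:, 0] + penalty[0] < boot[:, 1] + penalty[1]))

    return {"BDCP": prob(np.zeros(2)), "BDCPk": prob(k), "BDCPb": prob(k_b)}
```

When the alternative model has more parameters than the null, the k correction can only enlarge the set of bootstrap samples that favor the null, so BDCPk is never smaller than the uncorrected BDCP computed from the same resamples.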

4. Simulation Studies

The following simulation sets are designed to explore the bias when estimating both the DCP based on the Kullback–Leibler discrepancy (KLDCP) and the expected value of the KLD. We present different hypothesis testing scenarios, not all of which are conventional, under a linear data-generating model and for varying sample sizes. Each setting exhibits three different approaches to formulating the BD: adding the bootstrap-based correction (BDb), adding k (BDk), and leaving the estimator uncorrected.

4.1. Settings for Simulation Sets

For Sets 1 to 5, the true data-generating model is of the form

y_{i} = x_{i}^{T} β_{0} + ϵ_{i},

with

β_{0}^{T} = [\begin{matrix} β_{0, 1} & β_{0, 2} & \dots & β_{0, p} \end{matrix}],

x_{i}^{T} = [\begin{matrix} 1 & x_{i 2} & \dots & x_{i p} \end{matrix}],

and

{[\begin{matrix} x_{i 2} & \dots & x_{i p} \end{matrix}]}^{T} \sim N_{p - 1} (μ, Σ),

(8)

where the entries of

μ

are chosen from

\{- 1, 1\}

with equal probability, and

Σ = \mathrm{diag}_{p - 1} (100)

. For Sets 1 to 4, we have

ϵ_{i} \sim N (0, σ_{0}^{2})

; for Set 5, we have that

ϵ_{i} \sim t_{df = 5}

, where

t_{df}

denotes the Student’s t distribution based on

df

degrees of freedom; and for Set 6, we have that

ϵ_{i} \sim Z \cdot N (0, 1) + (1 - Z) \cdot N (0, 50)

, where

Z \sim \mathrm{Bernoulli} (π)

with

π = 0.85

.

In the setting at hand, the true data-generating model g has parameters

θ = {(β_{0}^{T}, σ_{0}^{2})}^{T}

. Hurvich and Tsai [6] showed that for the family of approximating models

y = X β + ϵ

, where X is the design matrix and

ϵ \sim N (0, σ^{2} I_{n})

, with maximum likelihood estimators given by

\hat{β} = {(X^{T} X)}^{- 1} X^{T} y

and

{\hat{σ}}^{2} = \frac{{(y - X \hat{β})}^{T} (y - X \hat{β})}{n},

the KLD measure

d (g, \hat{θ})

is given by

d (g, \hat{θ}) = n log (2 π {\hat{σ}}^{2}) + \frac{n σ_{0}^{2}}{{\hat{σ}}^{2}} + \frac{{(X β_{0} - X \hat{β})}^{T} (X β_{0} - X \hat{β})}{{\hat{σ}}^{2}} .

(9)
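Equation (9) translates directly into code. In the minimal sketch below, the true mean vector X β_{0} and the fitted mean vector X \hat{β} are passed in as arrays. The function name and argument names are hypothetical.

```python
import numpy as np


def kld_fitted_lm(mean_true, sigma0_sq, mean_fitted, sigma_hat_sq):
    """KLD of a fitted Gaussian linear model, following Equation (9).

    mean_true    : X beta_0, the mean vector under the generating model
    sigma0_sq    : sigma_0^2, the true error variance
    mean_fitted  : X beta_hat, the fitted mean vector
    sigma_hat_sq : the MLE of the error variance under the candidate model
    """
    n = len(mean_true)
    gap = mean_true - mean_fitted
    return (n * np.log(2.0 * np.pi * sigma_hat_sq)
            + n * sigma0_sq / sigma_hat_sq
            + gap @ gap / sigma_hat_sq)
```

If the fitted mean and variance coincide with the true values, the last term vanishes and the expression reduces to n log(2 π σ_{0}^{2}) + n. Any discrepancy in the fitted mean adds a nonnegative quantity.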

The expected value of the KLD for the null and the alternative models was approximated by averaging the KLD over 5000 samples generated from g. These 5000 KLD values, computed using (9), approximate the joint distribution of

d (g, {\hat{θ}}_{1})

and

d (g, {\hat{θ}}_{2})

; hence, the simulation-based estimator of the KLDCP is given by

\hat{P} = \frac{1}{5000} \sum_{i = 1}^{5000} I [d (g, {\hat{θ}}_{1} (i)) < d (g, {\hat{θ}}_{2} (i))] .

(10)

This KLDCP estimate is calculated 100 times in order to estimate the KLDCP distribution and its expected value.
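The simulation-based estimator in (10) can be sketched as follows for a setting patterned on Set 1, with a correctly specified null and an alternative carrying four extra covariates. This is an illustration only: the function names are hypothetical, the covariates are assumed to be redrawn for every replicate, and the default sample size is an arbitrary choice.

```python
import numpy as np


def fit_gaussian_lm(X, y):
    """MLEs for a Gaussian linear model: OLS coefficients and RSS / n."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, resid @ resid / len(y)


def kld_fitted_lm(mean_true, sigma0_sq, mean_fitted, sigma_hat_sq):
    """KLD of a fitted Gaussian linear model, following Equation (9)."""
    n = len(mean_true)
    gap = mean_true - mean_fitted
    return (n * np.log(2.0 * np.pi * sigma_hat_sq)
            + n * sigma0_sq / sigma_hat_sq
            + gap @ gap / sigma_hat_sq)


def simulate_kldcp(n=50, n_rep=5000, seed=0):
    """Share of simulated samples in which the fitted null has the smaller KLD."""
    rng = np.random.default_rng(seed)
    beta0 = np.array([1.0, 0.5, 0.5])  # true coefficients, as in Set 1
    sigma0_sq = 50.0  # true error variance
    mu = rng.choice([-1.0, 1.0], size=6)  # covariate means for x_2, ..., x_7
    wins = 0
    for _ in range(n_rep):
        covs = rng.normal(mu, 10.0, size=(n, 6))  # standard deviation 10, variance 100
        X_alt = np.column_stack([np.ones(n), covs])  # intercept plus six covariates
        X_null = X_alt[:, :3]  # intercept, x_2, x_3
        mean_true = X_null @ beta0
        y = mean_true + rng.normal(0.0, np.sqrt(sigma0_sq), size=n)
        d = []
        for X in (X_null, X_alt):
            b, s2 = fit_gaussian_lm(X, y)
            d.append(kld_fitted_lm(mean_true, sigma0_sq, X @ b, s2))
        wins += int(d[0] < d[1])
    return wins / n_rep


if __name__ == "__main__":
    print(simulate_kldcp(n=50, n_rep=1000, seed=1))
```

Repeating the call with different seeds would give replicate KLDCP estimates, in the spirit of the 100 replicates described above.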

Finally, for each of the 5000 samples, we calculate the BD and the BDb using 200 bootstrap samples. However, to attenuate the simulation variability incurred by the mixture distribution, the number of bootstrap samples in Set 6 was increased to 500. The results displayed in the tables are based on averages over the 5000 samples.

Set 1: Null hypothesis is correctly specified, and alternative hypothesis is overspecified.

Consider the true data-generating model given by

y_{i} = β_{0, 1} + β_{0, 2} x_{i 2} + β_{0, 3} x_{i 3} + ϵ_{i},

where

ϵ_{i} \sim N (0, 50)

,

β_{0, 1} = 1

,

β_{0, 2} = β_{0, 3} = 0.5

and

{[\begin{matrix} x_{i 2} & x_{i 3} \end{matrix}]}^{T}

is sampled as indicated in (8).

For the hypothesis testing setting in Set 1, the null and alternative models are defined as

\begin{matrix} H_{1} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3}, \\ H_{2} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 4} + β_{5} x_{i 5} + β_{6} x_{i 6} + β_{7} x_{i 7} . \end{matrix}

Note that the null model is adequately specified, while the alternative model contains the true model plus four additional explanatory variables. These extra explanatory variables are generated from the distribution indicated in (8).

Set 2: Null hypothesis is underspecified, and alternative hypothesis is correctly specified.

Consider the true data-generating model given by

y_{i} = β_{0, 1} + β_{0, 2} x_{i 2} + β_{0, 3} x_{i 3} + β_{0, 4} x_{i 4} + β_{0, 5} x_{i 5} + ϵ_{i},

where

ϵ_{i} \sim N (0, 45)

,

β_{0, 1} = 1, β_{0, 2} = 0.11, β_{0, 3} = 0.13, β_{0, 4} = 0.12, β_{0, 5} = - 0.11

, and

{[\begin{matrix} x_{i 2} & x_{i 3} & \dots & x_{i 5} \end{matrix}]}^{T}

is sampled as indicated in (8).

For the hypothesis testing setting in Set 2, the null and alternative models are

\begin{matrix} H_{1} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 4}, \\ H_{2} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3} + β_{4} x_{i 4} + β_{5} x_{i 5} . \end{matrix}

Here, the alternative model has the same structure as the data-generating model, but the null model is missing one of the explanatory variables in the true model, namely

x_{5} .

Set 3: Both null and alternative models are underspecified, but the null is closer to the data-generating model.

Consider the true data-generating model given by

y_{i} = β_{0, 1} + β_{0, 2} x_{i 2} + β_{0, 3} x_{i 3} + β_{0, 4} x_{i 4} + β_{0, 5} x_{i 5} + β_{0, 6} x_{i 6} + ϵ_{i},

where

ϵ_{i} \sim N (0, 50)

,

β_{0, 1} = 1, β_{0, 2} = β_{0, 3} = 0.5, β_{0, 4} = β_{0, 5} = - 0.5, β_{0, 6} = 0.1

, and

{[\begin{matrix} x_{i 2} & x_{i 3} & \dots & x_{i 6} \end{matrix}]}^{T}

is sampled as indicated in (8).

For the hypothesis testing setting in Set 3, the null and alternative models are

\begin{matrix} H_{1} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3}, \\ H_{2} : y_{i} = β_{1} + β_{4} x_{i 4} + β_{6} x_{i 6} . \end{matrix}

In this setting, both the null and alternative candidate models have the same number of explanatory variables, and they are both missing variable

x_{5}

. However, there is a slight difference in the effect sizes of the variables for these models. For the alternative, the effect sizes are −0.5 and 0.1 for

x_{4}

and

x_{6}

, respectively. On the other hand, the effect size for the null model is 0.5 for both

x_{2}

and

x_{3}

. When comparing the null and alternative models, the smaller effect size on

x_{6}

sets the alternative further away from the true model.

Set 4: Both null and alternative models are equally underspecified.

Consider the true data-generating model given by

y_{i} = β_{0, 1} + β_{0, 2} x_{i 2} + β_{0, 3} x_{i 3} + β_{0, 4} x_{i 4} + β_{0, 5} x_{i 5} + β_{0, 6} x_{i 6} + β_{0, 7} x_{i 7} + ϵ_{i},

with

ϵ_{i} \sim N (0, 50)

,

β_{0, 1} = 1, β_{0, 2} = β_{0, 3} = β_{0, 6} = β_{0, 7} = 0.5, β_{0, 4} = β_{0, 5} = - 0.5

, and

{[\begin{matrix} x_{i 2} & x_{i 3} & \dots & x_{i 7} \end{matrix}]}^{T}

is sampled as indicated in (8).

For the hypothesis testing setting in Set 4, the null and alternative models are

\begin{matrix} H_{1} : y_{i} = β_{1} + β_{2} x_{i 2} + β_{3} x_{i 3}, \\ H_{2} : y_{i} = β_{1} + β_{4} x_{i 4} + β_{5} x_{i 5} . \end{matrix}

Here, the null and alternative candidate models are equally underspecified because they have the same number of explanatory variables with the same effect sizes, and neither model captures the true data-generating model.

Set 5: Null model has correct mean specification and alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a Student’s t distribution.

Consider the true data generating model given by

y_{i} = β_{0, 1} + ϵ_{i},

with

ϵ_{i} \sim t_{df = 5}

and

β_{0, 1} = 1

. Therefore,

σ_{0}^{2} = \frac{5}{3}

.

For the hypothesis testing setting in Set 5, the null and alternative models are

\begin{matrix} H_{1} : y_{i} = β_{1}, \\ H_{2} : y_{i} = β_{1} + β_{2} x_{i 2}, \end{matrix}

where

x_{i 2} \sim N (1, 100)

. This setting is similar to the one displayed in Set 1, where the null is properly specified while the alternative is overspecified. However, the models in the setting at hand inadequately specify the distribution of the errors.

Set 6: Null model has correct mean specification, and the alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a mixture of normals.

Consider the true data-generating model given by

y_{i} = β_{0, 1} + ϵ_{i},

with

ϵ_{i} \sim Z \cdot N (0, 1) + (1 - Z) \cdot N (0, 50)

, where

Z \sim \mathrm{Bernoulli} (π)

with

π = 0.85

. Therefore,

\begin{aligned} σ_{0}^{2} & = 0.85 (1) + 0.15 (50) \\ & = 8.35 . \end{aligned}

For the hypothesis testing setting in Set 6, the null and alternative models are

\begin{matrix} H_{1} : y_{i} = β_{1}, \\ H_{2} : y_{i} = β_{1} + β_{2} x_{i 2}, \end{matrix}

where

x_{i 2} \sim N (1, 100)

. This setting is similar to the one featured in Set 5. However, the errors in the setting at hand are generated from a mixture of normal distributions.
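The error variances quoted for Sets 5 and 6 can be checked by simulation. The sketch below draws errors from the Student's t distribution with 5 degrees of freedom and from the normal mixture, reading N(0, 50) as a normal distribution with variance 50, consistent with the calculation of σ_{0}^{2} above. The function names and the number of draws are assumptions for illustration.

```python
import numpy as np


def draw_t_errors(n, df=5, seed=0):
    """Errors from a Student's t distribution with df degrees of freedom."""
    return np.random.default_rng(seed).standard_t(df, size=n)


def draw_mixture_errors(n, pi=0.85, var_narrow=1.0, var_wide=50.0, seed=0):
    """Errors from Z * N(0, var_narrow) + (1 - Z) * N(0, var_wide), Z ~ Bernoulli(pi)."""
    rng = np.random.default_rng(seed)
    z = rng.random(n) < pi  # True with probability pi
    sd = np.where(z, np.sqrt(var_narrow), np.sqrt(var_wide))
    return rng.normal(0.0, 1.0, size=n) * sd


if __name__ == "__main__":
    print(draw_t_errors(400_000).var(), "versus 5/3")
    print(draw_mixture_errors(400_000).var(), "versus 8.35")
```

Both sample variances should be close to the stated values of σ_{0}^{2}, up to simulation error.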

4.2. KLDCP Estimates from Simulations

For the tables showing the KLDCP simulation results, the columns are labeled as follows.

(1): KLDCP corresponds to results based on the distribution of 100 replicates of KLDCP, where each KLDCP is calculated using (10). Note that the null and alternative KLD joint distribution is characterized based on discrepancy replicates obtained through (9).
(2): BDCPb corresponds to results based on the distribution of 5000 replicates of BDCPb. Each BDCPb is computed using (6) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.
(3): BDCPk corresponds to results based on the distribution of 5000 replicates of BDCPk. Each BDCPk is computed using (7) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.
(4): BDCP corresponds to results based on the distribution of 5000 replicates of the uncorrected BDCP. Each BDCP is computed using (1) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.

4.3. Estimates of the Expected KLD from Simulations

For the tables showing the KLD results, the columns are labeled as follows.

(1): E(KLD) corresponds to the average of 5000 discrepancies calculated using (9).
(2): E(BD) corresponds to the average of 5000 replicates of BD, where each BD is calculated by

$\frac{1}{M} \sum_{m = 1}^{M} - 2 ℓ ({\hat{θ}}^{*} (m) | y) .$

We have that $M = 200$ for Sets 1–5 and $M = 500$ for Set 6.
(3): $Δ$ BDb corresponds to the difference between the estimate of E(BD), with each BD corrected by $k_{b}$, and the estimate of E(KLD) described in (1). In other words, if we let $j \in \{1, 2, \dots, 5000\}$ index the simulated data sets, ${\tilde{B D}}_{j}$ be the BD estimate for data set j, and $k_{j b}$ be the $k_{b}$ correction for data set j, then

$Δ BDb = \frac{1}{5000} \sum_{j = 1}^{5000} [{\tilde{B D}}_{j} + k_{j b}] - E (K L D) .$
(4): $Δ$ BDk shows the same difference described in (3), but using k instead of $k_{b}$ , which results in

$Δ BDk = \frac{1}{5000} \sum_{j = 1}^{5000} [{\tilde{B D}}_{j} + k] - E (K L D) .$

4.4. Discussion of Simulation Results

As mentioned previously, in the conventional hypothesis testing scenario for comparing nested models, Riedle, Neath and Cavanaugh [1] established that the uncorrected BDCP approximates the p-value derived from the likelihood ratio test. Therefore, in the case where the null candidate model is correctly specified, both the uncorrected BDCP and the p-value have a

\mathrm{Uniform} (0, 1)

distribution. This behavior is displayed in Table 1, where for large sample sizes, the mean and median of the BDCP distribution are around 0.5. This is a problematic feature of the uncorrected BDCP and p-values because the measure does not reliably favor the null model in those settings where the null is true. However, we see that for large sample sizes, both the BDCPk and the BDCPb values are close to 1, which clearly favors the null model.

Table 2 shows the results from the setting where the alternative hypothesis is correctly specified, while the null is underspecified. Here, we would expect all the discrepancy probabilities to be close to 0, as seen in the case where the sample size is

N = 500

. However, for smaller sample sizes, i.e.,

N = 25

and

N = 50

, we observe larger values for the discrepancy probabilities. In fact, for

N = 25

, the BDCPb is

0.89

and, with a mean and median close to

0.5

, the uncorrected BDCP exhibits similar behavior to the case where the null is true. This phenomenon is expected within the framework of model selection, where additional explanatory variables are favorable if there is a sufficient sample size to adequately estimate their effects. If the sample size is too small to construct reliable estimates, then it is best to choose smaller models, even at the expense of model misspecification.

The results from Table 1, Table 3, Table 4, Table 5 and Table 6 show that when estimating the KLDCP with a small sample size (

N = 25

to

N = 100

), the BDb performs either better than or as well as the BDk. For large sample sizes, all simulation sets exhibit a similar performance for both corrections.

For discrepancy estimation, Table 7, Table 8, Table 9 and Table 10 show that across all sample sizes,

k_{b}

over-corrects for the bias of the discrepancy approximation, and the over-correction is more prominent for small sample sizes. It is worth noting that this evident over-estimation from the BDb is accompanied by a superior bias reduction of the corresponding KLDCP estimator. For instance, Table 7 shows a substantial over-estimation by BDb compared to BDk, especially in the small sample settings. However, the corresponding estimator of the KLDCP, displayed in Table 1, exhibits less bias for BDCPb than for BDCPk.

Finally, Table 11 and Table 12 show that, across all sample sizes, the correction by

k_{b}

markedly reduces the bias compared to the correction by k. This means that in the setting where the mean structure is correctly specified for the null and overspecified for the alternative, but both models are incorrectly specified with respect to the error distribution, the bootstrap-based correction evidently outperforms the simple correction by k.

In most cases, however, the bias reductions resulting from the

k_{b}

and the k corrections are comparable. Therefore, our simulation studies suggest that if the null and/or the alternative models are misspecified, then correcting by either

k_{b}

or k will generally yield comparable estimators of the expected KLDCP.

5. Application: Creatine Kinase Levels during Football Preseason

In this section, we apply the BDCP to a data set from a biomedical setting. The goal of this application is to understand the changes in creatine kinase (CK) levels observed in the blood samples of college football players during preseason training. In order to properly explain the variation of CK, we must select between competing models that use different demographic and clinical variables. We will analyze the models selected by the

k_{b}

corrected, the k corrected and the uncorrected BDCP, and we will compare the results to the selection of models via the more conventional p-value approach.

5.1. Overview of Application

During strenuous exercise, skeletal muscle cells break down and release a variety of intracellular contents. When in excess, a condition known as exertional rhabdomyolysis (ER) can occur, which may result in life-threatening complications such as renal failure, cardiac arrhythmia and compartment syndrome. Creatine kinase (CK) is one of the proteins released during muscle breakdown, and measuring its levels is the most sensitive test for assessing muscular damage that could lead to ER [7].

During the off-season workouts in January 2011, a group of 13 University of Iowa football players developed ER. This event led to a prospective study where 30 University of Iowa football athletes were followed during a 34-day preseason workout camp. Variables such as body mass index (BMI) and CK levels were obtained from blood samples that were drawn on the first, third, and seventh days of the camp. Other demographic and clinical variables such as age, number of semesters in the program and history of rhabdomyolysis were also collected.

The initial results of the study, published by Smoot et al. [8], show that the CK levels at later time points were significantly different than the levels at earlier times. However, most of the clinical and demographic variables were not significant in explaining the levels of CK. One of the underlying issues with this type of modeling analysis is that the significance of each variable can only be assessed by hypothesis tests with nested models. For example, suppose that we wish to determine the significance of BMI in the presence of semesters in the program. To obtain a p-value for BMI, we need to formulate a hypothesis test where the null model only contains semesters in the program, while the alternative model contains both BMI and semesters in the program.

Although this setting may be useful in some scenarios, it is too limiting. For instance, suppose that we wish to choose between two non-nested models where one contains BMI and the other contains semesters in the program. Although a conventional test based on linear regression models would not be able to answer this question, the BDCP approach could indeed determine the propriety of either model in this type of non-nested setting.

In the analysis of this data set, we let $CK3$ be the log of CK levels measured at the seventh day of the camp, $CK1$ be the log of CK levels measured at the first day of the camp, and $Semesters$ be the number of semesters at the program. Of note, the log transformation is routinely applied in studies involving CK levels in order to justify approximate normality, as the raw levels tend to have heavily right-skewed distributions.
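The effect of the log transformation on right-skewed measurements can be illustrated with a small sketch. Because the study data are confidential, the snippet uses synthetic log-normal draws as a stand-in for raw CK levels; the distribution, its parameters, and the helper `sample_skewness` are assumptions made purely for illustration.

```python
import math
import random

def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(2022)
# Hypothetical stand-in for raw CK levels: log-normal, hence right-skewed.
raw = [rng.lognormvariate(5.0, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = sample_skewness(raw)
skew_log = sample_skewness(logged)
print(f"skewness of raw values: {skew_raw:.2f}")
print(f"skewness of log values: {skew_log:.2f}")
```

In this synthetic case the raw values are strongly right-skewed, while the logged values are close to symmetric, which is the property that motivates modeling the log of CK with normal-error regressions.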

Now, consider the following hypothesis testing settings.

Setting 1: Testing the propriety of the model containing $CK1$.

$\begin{matrix} H_{1} : CK3 = β_{1}, \\ H_{2} : CK3 = β_{1} + β_{2} CK1 . \end{matrix}$

Setting 2: Testing the propriety of the model containing $CK1$ and $Semesters$ over the model containing only $CK1$.

$\begin{matrix} H_{1} : CK3 = β_{1} + β_{2} CK1, \\ H_{2} : CK3 = β_{1} + β_{2} CK1 + β_{3} Semesters . \end{matrix}$

Setting 3: Head-to-head comparison of non-nested models.

$\begin{matrix} H_{1} : CK3 = β_{1} + β_{2} CK1 + β_{3} BMI, \\ H_{2} : CK3 = β_{1} + β_{2} CK1 + β_{3} Semesters . \end{matrix}$
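A non-nested comparison in the spirit of Setting 3 can be sketched on synthetic data. The snippet below is a minimal illustration, not the authors' implementation: it assumes that the bootstrap discrepancy for a model is $-2$ times the Gaussian log-likelihood of the original data evaluated at parameters fitted to a case-resampled bootstrap sample, and that the uncorrected BDCP is the share of bootstrap samples in which the null model has the smaller discrepancy. For brevity, each model contains a single predictor rather than $CK1$ plus one additional variable; the names `fit_line`, `neg2_loglik`, and `bdcp`, and the data-generating values, are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """ML fit of y = b0 + b1*x with Gaussian errors; returns (b0, b1, sigma2)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    sigma2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n
    return b0, b1, sigma2

def neg2_loglik(params, x, y):
    """-2 log-likelihood of the data (x, y) at the given parameter values."""
    b0, b1, sigma2 = params
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return len(y) * math.log(2 * math.pi * sigma2) + rss / sigma2

def bdcp(x_null, x_alt, y, n_boot=200, seed=1):
    """Share of bootstrap samples in which the null model has the smaller BD."""
    rng = random.Random(seed)
    n = len(y)
    null_wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        y_star = [y[i] for i in idx]
        # Fit on the bootstrap sample, then evaluate on the ORIGINAL data.
        bd_null = neg2_loglik(fit_line([x_null[i] for i in idx], y_star), x_null, y)
        bd_alt = neg2_loglik(fit_line([x_alt[i] for i in idx], y_star), x_alt, y)
        null_wins += bd_null < bd_alt
    return null_wins / n_boot

# Synthetic data: the response depends on x1 only; x2 is pure noise.
rng = random.Random(7)
n = 30
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + rng.gauss(0, 1) for a in x1]

p_null_x1 = bdcp(x1, x2, y)  # null model uses x1, alternative uses x2
p_null_x2 = bdcp(x2, x1, y)  # roles swapped
print(p_null_x1, p_null_x2)
```

Neither model is nested in the other, yet the comparison is well defined: the proportion is close to 1 when the informative predictor sits in the null model and close to 0 when the roles are swapped. Because both models here have the same number of parameters, adding $k$ to each discrepancy would shift both by the same amount and leave the proportion unchanged.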

5.2. Results of Application

The results for the application are summarized in Table 13. Settings 1 and 2 illustrate the congruence between BDCP and p-values in the case of hypothesis testing based on nested models. Setting 1 assesses the propriety of a model that includes only the intercept against a model that includes both the intercept and the levels of $CK1$. The p-value for $CK1$ in this setting is 0.001, which means that, using a level $α$ of 0.05, $CK1$ is significant in explaining the variation in $CK3$ levels. Both the BDCPk and BDCPb are 0.075, which means that the null model is preferred in only 7.5% of the bootstrap samples, indicating that the model containing $CK1$ is superior.

Once we establish that $CK1$ is an important variable to include in our model, the next step is to determine if additional variables can improve our model fit. Setting 2 displays a hypothesis test where the null model only contains $CK1$, while the alternative contains both $CK1$ and $Semesters$. The p-value for $Semesters$ is 0.734, which means that $Semesters$ is not statistically significant, and a reasonable investigator would choose to exclude $Semesters$ from the final model. The corrected BDCP values arrive at the same conclusion. For instance, the BDCPb is 0.995, which indicates that across multiple bootstrap samples, the null model is chosen 99.5% of the time; therefore, the BDCP encourages us to choose the model that excludes $Semesters$.

The rationale for testing $Semesters$ is based on the idea that more senior athletes tend to rigorously maintain their workout habits during the off season, mostly because of experience and maturity. Therefore, $Semesters$ is a variable that may confound the effects of $CK1$ on the variation of $CK3$. Additionally, medical literature has shown that BMI highly correlates with CK levels and the development of ER [9], which means that one should also test for the propriety of models that include $BMI$. Thus, one could ask if a model featuring $BMI$ would be better than a model featuring $Semesters$. This results in a hypothesis testing scenario where the null and alternative models are non-nested, as exhibited in Setting 3.

First, note that the p-values displayed in the table for Setting 3 do not answer the question at hand. These p-values are obtained from partial tests applied to the full model containing both variables. On the other hand, the BDCP gives us meaningful information about the performance of adding $BMI$ versus adding $Semesters$. The BDCPb tells us that there is a 78% probability that the model containing $BMI$ is a better fit than the model containing $Semesters$. If we use the BDCPk instead, the probability increases to 81.5%. In both cases, if we are debating whether to include $BMI$ or $Semesters$ as an adjusting variable, the BDCP clearly favors the inclusion of $BMI$.

6. Conclusions

When deciding between two competing models, practitioners of statistics normally utilize traditional hypothesis testing methods that rely on the assumption that one of the candidate models is properly specified. This approach is problematic because it is unreasonable to assume that one of the proposed models is precisely true. In addition, these methods are only applicable to nested models. To avoid these assumptions and model structure limitations, Riedle, Neath and Cavanaugh [1] propose the use of the bootstrap discrepancy comparison probability (BDCP) to assess the propriety of the fit of two candidate models. However, the bootstrap discrepancy (BD) utilized in this work provides a biased estimator of the Kullback–Leibler discrepancy (KLD).

When hypothesis testing assumptions are met, the BDCP asymptotically approximates the likelihood ratio test p-value. Therefore, similarly to p-values, the distribution of the BDCP is uniform if the null hypothesis is true. Hence, in settings when the null is true, the BDCP would be of limited value in choosing the appropriate model.
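The link between the BDCP and the likelihood ratio test p-value can be explored numerically. The following sketch, on synthetic data, compares an intercept-only null model with an alternative that adds one slope. It assumes the same working definition of the uncorrected BDCP as the share of case-resampled bootstrap fits for which the null model attains the smaller $-2$ log-likelihood on the original data; all function names and data-generating values are hypothetical. The p-value uses the exact identity that the upper tail of a chi-square distribution with one degree of freedom equals `erfc(sqrt(stat / 2))`.

```python
import math
import random

def fit(x, y, with_slope):
    """Gaussian ML fit of y = b0 + b1*x; the null model fixes b1 = 0."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b1 = 0.0
    if with_slope:
        sxx = sum((xi - mx) ** 2 for xi in x)
        b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sigma2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n
    return b0, b1, sigma2

def neg2_loglik(params, x, y):
    """-2 log-likelihood of the data (x, y) at the given parameter values."""
    b0, b1, sigma2 = params
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return len(y) * math.log(2 * math.pi * sigma2) + rss / sigma2

def chi2_sf_1df(stat):
    """Upper tail of a chi-square with 1 df: P(Z**2 > stat)."""
    return math.erfc(math.sqrt(stat / 2.0))

def lrt_pvalue(x, y):
    """Likelihood ratio test of the intercept-only model against the slope model."""
    stat = neg2_loglik(fit(x, y, False), x, y) - neg2_loglik(fit(x, y, True), x, y)
    return chi2_sf_1df(max(stat, 0.0))  # guard against tiny negative rounding error

def bdcp(x, y, n_boot=200, seed=3):
    """Share of bootstrap samples in which the null model has the smaller BD."""
    rng = random.Random(seed)
    n = len(y)
    null_wins = 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        bd_null = neg2_loglik(fit(xs, ys, False), x, y)
        bd_alt = neg2_loglik(fit(xs, ys, True), x, y)
        null_wins += bd_null < bd_alt
    return null_wins / n_boot

rng = random.Random(5)
n = 50
x = [rng.gauss(0, 1) for _ in range(n)]
y_effect = [0.5 + 1.0 * xi + rng.gauss(0, 1) for xi in x]  # slope matters
y_null = [0.5 + rng.gauss(0, 1) for _ in range(n)]         # null model is true

for label, y in (("effect", y_effect), ("null true", y_null)):
    print(label, "LRT p-value:", round(lrt_pvalue(x, y), 3), "BDCP:", bdcp(x, y))
```

When the slope is real, both quantities are small. When the null model is true, a single run yields one arbitrary draw from a roughly uniform distribution, which is why the uncorrected BDCP offers little guidance in that case.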

In this paper, we proposed utilizing the $k_{b}$-corrected or the $k$-corrected BDCP, namely BDCPb and BDCPk, respectively. The BDCPb employs the BDb, a bootstrap-corrected estimator of the KLD, while the BDCPk uses the BDk, a BD corrected by adding the number of functionally independent parameters in the candidate model. We showed that for most settings, the BDb serves as an over-corrected estimator of the KLD, but the corresponding BDCPb is less biased than the BDCPk for the estimation of the KLDCP. However, in the case when there is distributional misspecification, we showed that the BDb has negligible bias for the estimation of the expected value of the KLD.

Moreover, the estimation of the bootstrap correction $k_{b}$ utilizes the same bootstrap samples that were used to calculate the BD; therefore, we argue that the computational requirements of estimating $k_{b}$ are not too burdensome. However, if the sample size is moderately large compared to the number of parameters in the model, then we showed that using $k$ to correct the bias generally results in comparable values of the KLDCP estimates.
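The reuse of bootstrap samples for the correction can be made concrete with a short sketch. It rests on assumptions that should be checked against the formal definitions: the bootstrap discrepancy is taken as $-2$ times the Gaussian log-likelihood of the original data at the bootstrap estimate, BDk adds the parameter count $k$, and $k_{b}$ is taken as the average bootstrap discrepancy minus $-2$ times the log-likelihood at the full-sample estimate. The function names, the dictionary layout, and the synthetic data are hypothetical.

```python
import math
import random

def fit(x, y, with_slope):
    """Gaussian ML fit of y = b0 + b1*x; the null model fixes b1 = 0."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    b1 = 0.0
    if with_slope:
        sxx = sum((xi - mx) ** 2 for xi in x)
        b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    sigma2 = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / n
    return b0, b1, sigma2

def neg2_loglik(params, x, y):
    """-2 log-likelihood of the data (x, y) at the given parameter values."""
    b0, b1, sigma2 = params
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return len(y) * math.log(2 * math.pi * sigma2) + rss / sigma2

def corrected_bdcps(x, y, n_boot=200, seed=11):
    """Return BDCP, BDCPk, BDCPb and the k_b values from one set of bootstrap samples."""
    rng = random.Random(seed)
    n = len(y)
    models = {"null": False, "alt": True}  # the alternative adds the slope
    k = {"null": 2, "alt": 3}              # (b0, sigma2) versus (b0, b1, sigma2)
    bd = {name: [] for name in models}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        for name, with_slope in models.items():
            bd[name].append(neg2_loglik(fit(xs, ys, with_slope), x, y))
    # Assumed form of k_b: mean BD minus -2 log L at the full-sample estimate.
    # No extra resampling is needed; the stored BD values are reused.
    k_b = {name: sum(bd[name]) / n_boot - neg2_loglik(fit(x, y, ws), x, y)
           for name, ws in models.items()}

    def null_share(penalty):
        wins = sum(bd0 + penalty["null"] < bd1 + penalty["alt"]
                   for bd0, bd1 in zip(bd["null"], bd["alt"]))
        return wins / n_boot

    return {
        "BDCP": null_share({"null": 0.0, "alt": 0.0}),
        "BDCPk": null_share(k),
        "BDCPb": null_share(k_b),
        "k_b": k_b,
    }

rng = random.Random(13)
n = 30
x = [rng.gauss(0, 1) for _ in range(n)]
y = [0.5 + rng.gauss(0, 1) for _ in range(n)]  # null model is true
result = corrected_bdcps(x, y)
print(result)
```

Under these assumptions the only additional cost of $k_{b}$ is one full-sample fit per model. Because the alternative carries the larger $k$, the $k$-corrected proportion can never fall below the uncorrected one in a nested comparison of this kind.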

Author Contributions

Conceptualization, A.D. and J.C.; Formal analysis, A.D. and J.C.; Methodology, A.D. and J.C.; Supervision, J.C.; Writing—original draft, A.D. and J.C.; Writing—review and editing, A.D. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The R code used in generating the data for the simulation study is available on request from the corresponding author. The data for the application are not publicly available since the dataset is confidential.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Riedle, B.; Neath, A.; Cavanaugh, J.E. Reconceptualizing the p-Value From a Likelihood Ratio Test: A Probabilistic Pairwise Comparison of Models Based on Kullback-Leibler Discrepancy Measures. J. Appl. Stat. 2020, 47, 13–15.
2. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap, 2nd ed.; Chapman & Hall: New York, NY, USA, 1993; pp. 31–37.
3. Efron, B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J. Am. Stat. Assoc. 1983, 78, 316–331.
4. Efron, B. How Biased is the Apparent Error Rate of a Prediction Rule? J. Am. Stat. Assoc. 1986, 81, 461–470.
5. Cavanaugh, J.E.; Shumway, R.H. A Bootstrap Variant of AIC for State-Space Model Selection. Stat. Sin. 1997, 7, 473–496.
6. Hurvich, C.M.; Tsai, C. Regression and Time Series Model Selection in Small Samples. Biometrika 1989, 76, 297–307.
7. Torres, P.; Helmstetter, J.; Kaye, A.M.; Kaye, A.D. Rhabdomyolysis: Pathogenesis, Diagnosis, and Treatment. Ochsner J. Spring 2015, 15, 58–69.
8. Smoot, M.K.; Cavanaugh, J.E.; Amendola, A.; West, D.R.; Herwaldt, L.A. Creatine Kinase Levels During Preseason Camp in National Collegiate Athletic Association Division I Football Athletes. Clin. J. Sport Med. 2014, 5, 438–440.
9. Vasquez, C.R.; DiSanto, T.; Reilly, J.P.; Forker, C.M.; Holena, D.N.; Wu, Q.; Lanken, P.N.; Christie, J.D.; Shashaty, M.G.S. Relationship of Body Mass Index, Serum Creatine Kinase, and Acute Kidney Injury After Severe Trauma. J. Trauma Acute Care Surg. 2020, 89, 179–185.

Table 1. Distribution approximations for Set 1, where the null model is correctly specified, while the alternative model is overspecified.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	1.000	0.878	0.868	0.515
Median	1.000	1.000	1.000	0.515
SD	0.000	0.233	0.241	0.282
N = 100
Mean	1.000	0.918	0.864	0.564
Median	1.000	1.000	0.995	0.580
SD	0.000	0.186	0.225	0.256
N = 50
Mean	1.000	0.966	0.875	0.631
Median	1.000	1.000	0.980	0.650
SD	0.000	0.111	0.193	0.220
N = 25
Mean	1.000	0.999	0.886	0.739
Median	1.000	1.000	0.955	0.755
SD	0.000	0.012	0.144	0.156

Table 2. Distribution approximations for Set 2, where the null model is underspecified, while the alternative model is correctly specified.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	0.001	0.022	0.021	0.011
Median	0.001	0.000	0.000	0.000
SD	0.000	0.088	0.085	0.043
N = 100
Mean	0.156	0.470	0.428	0.264
Median	0.156	0.340	0.280	0.170
SD	0.005	0.390	0.378	0.257
N = 50
Mean	0.372	0.691	0.597	0.409
Median	0.372	0.905	0.630	0.360
SD	0.007	0.350	0.354	0.266
N = 25
Mean	0.617	0.890	0.698	0.536
Median	0.617	0.990	0.785	0.535
SD	0.006	0.213	0.280	0.222

Table 3. Distribution approximations for Set 3, where the null and alternative models are underspecified, but the null model is closer to the true data-generating model.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	1.000	1.000	1.000	1.000
Median	1.000	1.000	1.000	1.000
SD	0.000	0.013	0.013	0.013
N = 100
Mean	0.979	0.910	0.910	0.910
Median	0.979	1.000	1.000	1.000
SD	0.002	0.244	0.244	0.244
N = 50
Mean	0.916	0.807	0.808	0.808
Median	0.916	0.970	0.970	0.970
SD	0.004	0.311	0.309	0.309
N = 25
Mean	0.804	0.692	0.699	0.699
Median	0.805	0.845	0.840	0.840
SD	0.005	0.314	0.303	0.303

Table 4. Distribution approximations for Set 4, where the null and alternative models are equally underspecified.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	0.498	0.507	0.507	0.507
Median	0.498	0.570	0.580	0.580
SD	0.007	0.478	0.478	0.478
N = 100
Mean	0.500	0.510	0.509	0.509
Median	0.500	0.562	0.567	0.567
SD	0.007	0.442	0.442	0.442
N = 50
Mean	0.500	0.502	0.502	0.502
Median	0.500	0.505	0.515	0.515
SD	0.007	0.407	0.406	0.406
N = 25
Mean	0.501	0.501	0.501	0.501
Median	0.501	0.490	0.495	0.495
SD	0.007	0.353	0.345	0.345

Table 5. Distribution approximations for Set 5, where the null and alternative models are misspecified with respect to the error distribution. Here, the errors are generated from a Student’s t distribution.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	1.000	0.794	0.794	0.499
Median	1.000	1.000	1.000	0.500
SD	0.000	0.329	0.328	0.289
N = 100
Mean	1.000	0.807	0.794	0.507
Median	1.000	1.000	1.000	0.515
SD	0.000	0.318	0.323	0.284
N = 50
Mean	1.000	0.825	0.790	0.508
Median	1.000	1.000	0.995	0.505
SD	0.000	0.301	0.315	0.273
N = 25
Mean	1.000	0.862	0.790	0.525
Median	1.000	1.000	0.985	0.530
SD	0.000	0.270	0.306	0.261

Table 6. Distribution approximations for Set 6, where the null and alternative models are misspecified with respect to the error distribution. Here, the errors are generated from a mixture of normal distributions.

Statistic	KLDCP	BDCPb	BDCPk	BDCP
N = 500
Mean	1.000	0.783	0.786	0.487
Median	1.000	1.000	1.000	0.484
SD	0.000	0.338	0.335	0.289
N = 100
Mean	1.000	0.808	0.793	0.495
Median	1.000	1.000	0.998	0.496
SD	0.000	0.322	0.325	0.283
N = 50
Mean	1.000	0.851	0.793	0.502
Median	1.000	1.000	0.994	0.494
SD	0.000	0.286	0.311	0.269
N = 25
Mean	1.000	0.906	0.787	0.509
Median	1.000	1.000	0.986	0.490
SD	0.000	0.229	0.300	0.246

Table 7. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 1. Here, the null model is correctly specified, while the alternative model is overspecified.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	3378.949	3375.407	0.488	0.411
Alternative	3383.138	3375.578	0.686	0.362
N = 100
Null	679.282	675.291	0.385	−0.030
Alternative	684.115	676.667	2.518	0.521
N = 50
Null	342.167	338.498	1.267	0.268
Alternative	348.245	342.348	7.476	2.065
N = 25
Null	174.334	171.169	3.657	0.910
Alternative	183.828	193.249	43.328	17.290

Table 8. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 2. Here, the null model is underspecified, while the alternative model is correctly specified.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	3340.491	3335.733	0.410	0.290
Alternative	3328.467	3322.581	0.319	0.143
N = 100
Null	672.373	667.928	1.210	0.520
Alternative	671.137	665.628	1.493	0.454
N = 50
Null	339.515	334.726	1.891	0.226
Alternative	339.923	334.181	2.888	0.305
N = 25
Null	174.136	171.376	7.446	2.223
Alternative	176.073	174.320	13.270	4.106

Table 9. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 3. Here, the null and alternative models are underspecified, but the null model is closer to the true data-generating model.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	3726.902	3726.159	3.401	3.332
Alternative	3832.770	3832.395	3.704	3.626
N = 100
Null	745.967	745.809	4.358	3.943
Alternative	766.212	766.813	4.947	4.528
N = 50
Null	373.419	373.704	5.309	4.325
Alternative	383.156	384.020	5.843	4.858
N = 25
Null	187.563	188.745	8.082	5.245
Alternative	191.924	194.082	8.878	6.088

Table 10. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 4. Here, the null and alternative models are equally underspecified.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	3923.423	3923.908	5.022	4.948
Alternative	3923.580	3924.705	5.475	5.399
N = 100
Null	784.021	784.917	5.080	4.670
Alternative	784.042	785.026	5.241	4.823
N = 50
Null	391.751	393.155	6.335	5.343
Alternative	391.753	393.131	6.222	5.239
N = 25
Null	195.732	198.616	9.602	6.821
Alternative	195.862	198.690	9.598	6.804

Table 11. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 5. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a Student’s t distribution.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	1678.652	1672.369	−2.224	−4.178
Alternative	1679.695	1672.387	−2.248	−4.231
N = 100
Null	338.728	334.154	−0.920	−2.471
Alternative	339.866	334.300	−0.728	−2.438
N = 50
Null	171.377	167.500	−0.231	−1.839
Alternative	172.640	167.847	0.283	−1.714
N = 25
Null	87.689	83.577	−0.434	−2.077
Alternative	89.311	84.495	0.869	−1.785

Table 12. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 6. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a mixture of normal distributions.

Hypothesis	E(KLD)	E(BD)	$Δ$ BDb	$Δ$ BDk
N = 500
Null	2488.932	2480.154	−0.389	6.554
Alternative	2490.012	2480.141	−0.310	6.659
N = 100
Null	508.122	497.000	−0.383	8.404
Alternative	509.426	497.237	−0.597	8.459
N = 50
Null	263.382	252.424	−2.852	8.590
Alternative	264.974	253.245	−3.930	8.361
N = 25
Null	144.895	131.870	−4.361	10.842
Alternative	147.551	134.298	−7.782	9.956

Table 13. From left to right: results for Setting 1, Setting 2, and Setting 3. BDCPk is the BDCP corrected by $k$, BDCPb is the BDCP corrected by $k_{b}$, and BDCP is the uncorrected BDCP. Results are based on 200 bootstrap samples.

Statistic	Setting 1	Setting 2	Setting 3
BDCPk	0.075	0.990	0.815
BDCPb	0.075	0.995	0.780
BDCP	0.055	0.495	0.815
p-Value (CK1)	0.001	0.001	0.001
p-Value (Semesters)	–	0.734	0.936
p-Value (BMI)	–	–	0.176

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

