Article

Minimum Message Length Inference of the Exponential Distribution with Type I Censoring

by Enes Makalic 1,* and Daniel Francis Schmidt 2
1 Melbourne School of Population and Global Health, The University of Melbourne, Parkville, VIC 3010, Australia
2 Faculty of Information Technology, Monash University, Clayton, VIC 3168, Australia
* Author to whom correspondence should be addressed.
Entropy 2021, 23(11), 1439; https://doi.org/10.3390/e23111439
Submission received: 13 September 2021 / Revised: 13 October 2021 / Accepted: 19 October 2021 / Published: 30 October 2021
(This article belongs to the Collection Feature Papers in Information Theory)

Abstract

Data with censoring is common in many areas of science, and the associated statistical models are generally estimated with the method of maximum likelihood combined with a model selection criterion such as Akaike's information criterion. This manuscript demonstrates how the information theoretic minimum message length principle can be used to estimate statistical models in the presence of type I random and fixed censoring. The exponential distribution with fixed and random censoring is used as an example to demonstrate the process, where we observe that the minimum message length estimate of the mean survival time has some advantages over the standard maximum likelihood estimate.

1. Introduction

In Type I random censoring we observe for each item $i$ either the true survival time $T_i = t_i$ ($t_i > 0$) or the censoring time $C_i = c_i$ ($c_i > 0$), where capital letters are used to denote random variables. The data consist of joint realisations of the random variables $(Y_i = y_i, \Delta_i = \delta_i)$ ($i = 1, \dots, n$) where
$$Y_i = \min(T_i, C_i), \tag{1}$$
$$\Delta_i = I(T_i \le C_i) = \begin{cases} 1, & \text{if } T_i \le C_i \ (\text{observed survival}) \\ 0, & \text{if } T_i > C_i \ (\text{observed censoring}). \end{cases} \tag{2}$$
The censoring time $C_i$ may be fixed (i.e., $C_i = c$ for all $i = 1, \dots, n$) or a random variable that may depend on other factors (e.g., loss to follow-up). The likelihood function of $n$ observed data points $D = \{(y_1, \delta_1), \dots, (y_n, \delta_n)\}$ is
$$p(D) = \prod_{i=1}^{n} \left[ p_T(y_i)\left(1 - F_C(y_i)\right) \right]^{\delta_i} \left[ p_C(y_i)\left(1 - F_T(y_i)\right) \right]^{1-\delta_i}, \tag{3}$$
where $p_T(t \mid \theta)$ and $F_T(t \mid \theta)$ denote the probability density function and the cumulative distribution function of the random variable $T$, respectively. Inference about the survival times $(t_1, \dots, t_n)$ is of key interest in many areas of science and is commonly done by maximizing the likelihood and dropping the terms that are relevant to $C$ only.
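To make the sampling mechanism and the likelihood (3) concrete, the following sketch simulates Type I randomly censored data and evaluates the corresponding negative log-likelihood, assuming (for illustration only) that both the survival and censoring times are exponentially distributed, as in the model introduced in Section 2; the function names are ours, not from the paper.

```python
import numpy as np

def simulate_type1_random_censoring(n, beta, alpha, rng):
    """Draw (y_i, delta_i): y_i = min(t_i, c_i) and delta_i = 1 if t_i <= c_i."""
    t = rng.exponential(scale=beta, size=n)   # true survival times, mean beta
    c = rng.exponential(scale=alpha, size=n)  # censoring times, mean alpha
    return np.minimum(t, c), (t <= c).astype(int)

def neg_log_likelihood(y, delta, beta, alpha):
    """Negative logarithm of the likelihood (3) for exponential T and C."""
    # delta_i = 1: p_T(y_i)(1 - F_C(y_i)) = (1/beta) exp(-y_i/beta) exp(-y_i/alpha)
    # delta_i = 0: p_C(y_i)(1 - F_T(y_i)) = (1/alpha) exp(-y_i/alpha) exp(-y_i/beta)
    log_obs = -np.log(beta) - y / beta - y / alpha
    log_cens = -np.log(alpha) - y / alpha - y / beta
    return -np.sum(delta * log_obs + (1 - delta) * log_cens)

rng = np.random.default_rng(1)
y, delta = simulate_type1_random_censoring(n=100, beta=2.0, alpha=3.0, rng=rng)
print(neg_log_likelihood(y, delta, beta=2.0, alpha=3.0))
```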
This manuscript examines inference of models in the presence of censored data under the minimum message length (MML) framework. MML (see Section 3) is a Bayesian technique for model selection and parameter estimation that is based on data compression and key principles of information theory. MML is known to possess strong theoretical properties [1,2,3] and has previously been successfully applied to a wide range of statistical models [1]. Here, we demonstrate how MML can be used to infer models under fixed censoring as well as type I random censoring. We use the exponential distribution (see Section 2) as a simple example to demonstrate the key steps and compare the MML estimator to the well-known maximum likelihood estimator in this setting (see Section 2.1). Although MML analysis of the exponential distribution is not new (see, for example, [1,4]), the MML principle has not been applied to any kind of survival data with censoring to date.
The main contributions of this manuscript are to: (i) introduce the MML principle of inductive inference and demonstrate how the Wallace–Freeman MML approximation can be used to infer exponential models with type I censored data; (ii) show that the MML estimate of the mean lifetime has some advantages over the usual maximum likelihood estimate for small samples and that it converges to the maximum likelihood estimate for large sample sizes; (iii) incorporate the proposed codelengths for censored exponential distributions into MML finite mixture models, allowing for inference of all parameters as well as the number of mixture classes; and (iv) compare the MML principle to the closely related minimum description length principle.

2. Exponential Distribution

Consider the case of a randomly censored exponential parameter studied in [5], where the lifetime data and the censoring data are assumed to be exponentially distributed,
$$T_i \sim \mathrm{Exp}(\beta), \quad C_i \sim \mathrm{Exp}(\alpha), \quad i = 1, \dots, n,$$
and $\alpha, \beta > 0$ denote the mean censoring time and survival time, respectively. Under this model, the joint probability distribution of $(Y_i = y_i, \Delta_i = \delta_i)$ is
$$p(Y_i = y_i, \Delta_i = 1) = p_T(y_i)\left(1 - F_C(y_i)\right), \tag{4}$$
$$p(Y_i = y_i, \Delta_i = 0) = p_C(y_i)\left(1 - F_T(y_i)\right), \tag{5}$$
where $(Y_i, \Delta_i)$ are defined in (1) and (2), respectively. In contrast to random censoring, in fixed censoring an item is observed for a fixed period of time, say $c > 0$, and its actual survival time $t_i$ is known only if the item fails within the observation period ($t_i \le c$); otherwise, we only know that the item survived past the censoring point ($t_i > c$). In the case of exponentially distributed survival times, the observed data $(Y_i = y_i, \Delta_i = \delta_i)$ follow
$$T_i \sim \mathrm{Exp}(\theta), \quad C_i = c, \quad i = 1, \dots, n, \tag{6}$$
where $c > 0$ is a fixed constant (the follow-up period) known a priori. Given $n$ data points $D = \{(y_1, \delta_1), \dots, (y_n, \delta_n)\}$, the aim is to estimate the unknown mean survival time $\beta > 0$ (random censoring) or $\theta > 0$ (fixed censoring).

2.1. Maximum Likelihood Estimation

The method of maximum likelihood is the most common approach used to obtain parameter estimates in parametric models. Under the censored exponential model, maximum likelihood proceeds by setting the parameter estimate $\hat{\beta}(D)$ to the value that maximises the probability of the data. From (4) and (5), the joint probability of the data $D$ is
$$p(D \mid \alpha, \beta) = \left(\frac{1}{\beta}\right)^{k} \left(\frac{1}{\alpha}\right)^{n-k} \exp\left(-\left(\frac{1}{\beta} + \frac{1}{\alpha}\right) \sum_{i=1}^{n} y_i\right), \tag{7}$$
where $k = \sum_i \delta_i$ is the number of observed uncensored survival times. Maximizing the likelihood function is equivalent to minimizing the negative log-likelihood function
$$-\log p(D \mid \alpha, \beta) = k \log \beta + (n - k) \log \alpha + \left(\frac{1}{\beta} + \frac{1}{\alpha}\right) \sum_{i=1}^{n} y_i. \tag{8}$$
Maximum likelihood estimates of the mean survival and censoring times are
$$\hat{\beta}_{\mathrm{ML}}(D) = \frac{1}{k} \sum_{i=1}^{n} y_i, \qquad \hat{\alpha}_{\mathrm{ML}}(D) = \frac{1}{n - k} \sum_{i=1}^{n} y_i, \tag{9}$$
respectively. Provided the number of observed survival times $k \in (0, n)$, the maximum likelihood estimates $\hat{\alpha}(D)$ and $\hat{\beta}(D)$ are finite; otherwise, if $k = 0$ or $k = n$, one of the maximum likelihood estimates $\hat{\alpha}(D)$ or $\hat{\beta}(D)$ is infinite. Kim [5] showed that the maximum likelihood estimates have infinite mean and variance in this setting. However, the expected value of the maximum likelihood estimate $\hat{\beta}(D)$ is finite if we condition on $k > 0$. Kim [5] further showed that, provided $k \in (0, n)$, the maximum likelihood estimates $\hat{\alpha}(D)$ and $\hat{\beta}(D)$ are unbiased, strongly consistent (without any condition on $k$) and asymptotically normally distributed.
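As a quick illustration of (9), a minimal sketch (our own helper, not from the paper) that returns the closed-form maximum likelihood estimates and reports infinite values in the boundary cases:

```python
import numpy as np

def ml_estimates_random_censoring(y, delta):
    """Maximum likelihood estimates (9): beta_hat = s/k and alpha_hat = s/(n - k)."""
    y, delta = np.asarray(y, dtype=float), np.asarray(delta)
    n, k, s = len(y), int(delta.sum()), float(y.sum())
    beta_hat = s / k if k > 0 else np.inf          # infinite when every observation is censored
    alpha_hat = s / (n - k) if k < n else np.inf   # infinite when no observation is censored
    return beta_hat, alpha_hat

# Three of five items observed to fail; the other two are censored.
print(ml_estimates_random_censoring([1.2, 0.4, 2.5, 0.9, 3.1], [1, 1, 0, 1, 0]))
```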
In the case of Type I censored data with a fixed censoring time $c > 0$, the negative log-likelihood function of the data is
$$-\log p(D \mid \theta; c) = k \log(\theta) + \frac{1}{\theta} \sum_{i=1}^{n} y_i \delta_i + \frac{c(n - k)}{\theta}, \tag{10}$$
where $k = \sum_i \delta_i$ as before. Under fixed censoring, the maximum likelihood estimate of the mean survival time $\hat{\theta}(D)$ is (see, for example, [6])
$$\hat{\theta}(D) = \frac{c(n - k) + \sum_{i=1}^{n} \delta_i y_i}{k}. \tag{11}$$
In the case of no censoring (i.e., $k = n$, implying complete data), (11) reduces to $(\sum_i y_i)/n$, which is the usual maximum likelihood estimate for the exponential distribution with complete data. The sampling distribution of (11) is asymptotically normal with mean $\theta$ and variance
$$\frac{\theta^2}{n\left(1 - \exp(-c/\theta)\right)} = \frac{\theta^2}{n\, F_T(c \mid \theta)}.$$
Conditional upon $k > 0$, Mendenhall and Lehman [7] obtained the exact mean and variance of the maximum likelihood estimate (11),
$$E\{\hat{\theta}\} = \theta - c\left(\frac{q}{p} - n\, E\{k^{-1}\} + 1\right), \qquad V\{\hat{\theta}\} = (nc)^2\, V\{k^{-1}\} + \left(\theta^2 - c^2 q / p^2\right) E\{k^{-1}\},$$
where
$$p = 1 - \exp(-c/\theta), \qquad q = 1 - p, \qquad E\{k^{-a}\} = \frac{1}{1 - q^n} \sum_{k=1}^{n} \frac{1}{k^a} \binom{n}{k} p^k q^{n-k},$$
for $a = 1, 2$, and $V\{k^{-1}\} = E\{k^{-2}\} - \left(E\{k^{-1}\}\right)^2$. However, the large sample normal approximation of the distribution of the maximum likelihood estimate is inaccurate and not representative of the behaviour of the estimate in the small to moderate sample size regime [7]. Balakrishnan and Davies [8] further show that the maximum likelihood estimate computed with a censoring time $c$ is always Pitman closer to the data generating model $\theta$ than the maximum likelihood estimate computed with a shorter censoring time $c' < c$. In the next section, we introduce the MML principle of inductive inference (see Section 3) and demonstrate how MML can be used to infer exponential models with censoring (see Section 4).
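The exact moments above are straightforward to evaluate numerically. The sketch below is a direct transcription of the Mendenhall–Lehman formulas under the stated conditioning on $k > 0$; the function names are illustrative.

```python
import numpy as np
from scipy.stats import binom

def neg_moment_k(n, p, a):
    """E{k^{-a}} for k ~ Binomial(n, p), conditional on k > 0."""
    q = 1.0 - p
    ks = np.arange(1, n + 1)
    return np.sum(binom.pmf(ks, n, p) / ks**a) / (1.0 - q**n)

def theta_hat_mean_var(theta, c, n):
    """Exact mean and variance of the fixed-censoring ML estimate (11), given k > 0."""
    p = 1.0 - np.exp(-c / theta)
    q = 1.0 - p
    e1, e2 = neg_moment_k(n, p, 1), neg_moment_k(n, p, 2)
    mean = theta - c * (q / p - n * e1 + 1.0)
    var = (n * c)**2 * (e2 - e1**2) + (theta**2 - c**2 * q / p**2) * e1
    return mean, var

print(theta_hat_mean_var(theta=1.0, c=0.69, n=10))
```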

3. Minimum Message Length

Introduced in the late 1960s by Wallace and Boulton [9], the minimum message length (MML) principle [1,9,10,11] is a framework for inductive inference based on ideas in information theory and data compression. Under the MML framework, the aim is to transmit a set of data (a message) from a hypothetical sender to a receiver over a noiseless transmission channel. The MML message is designed to consist of two parts:
  • the assertion: an encoding of the model structure and the associated model parameters $\theta \in \Theta \subseteq \mathbb{R}^p$;
  • the detail: a description of the data $D$ using the model $p(D \mid \theta)$ that was specified in the assertion.
The length of the assertion measures the complexity of the model, with complex models requiring longer codelengths compared to simpler models, while the detail captures how well a model fits the data. The length of the two-part message, $I(D, \theta)$, is the sum of the length of the assertion, $I(\theta)$, and the length of the detail, $I(D \mid \theta)$; namely,
$$I(D, \theta) = \underbrace{I(\theta)}_{\text{assertion}} + \underbrace{I(D \mid \theta)}_{\text{detail}}. \tag{13}$$
Within the MML framework we seek the model
$$\hat{\theta}(D) = \arg\min_{\theta \in \Theta}\; I(D, \theta) \tag{14}$$
that minimises the length of this message. Due to the two-part nature of the message, MML automatically balances the trade-off between model complexity and the goodness of fit of the model to the data. By measuring the quality of a model in (say) bits, MML is a yardstick that can be universally used to compare models with different parameters and structures.
There exist several approaches to computing message lengths (13), with the strict MML procedure (SMML) [1,12] and the MML87 approximation [1,10] being the most widely known. In contrast to the SMML procedure, whose construction is known to be NP-hard [13], the MML87 approximation is computationally tractable and the most widely used in practice. The MML87 codelength approximation to (13) is
$$I_{87}(D, \theta) = \underbrace{-\log \pi(\theta) + \frac{1}{2} \log |J_\theta(\theta)| + \frac{p}{2} \log \kappa_p}_{\text{assertion}} + \underbrace{\frac{p}{2} - \log p(D \mid \theta)}_{\text{detail}}, \tag{15}$$
where $\pi(\theta)$ is the prior distribution for the parameters $\theta$, $|J_\theta(\theta)|$ is the determinant of the expected Fisher information matrix, $p(D \mid \theta)$ is the likelihood function of the model and $\kappa_p$ is a quantization constant [14,15] that depends on the number of parameters $p$. Specifically, for small $p$ we have
$$\kappa_1 = \frac{1}{12}, \qquad \kappa_2 = \frac{5}{36\sqrt{3}}, \qquad \kappa_3 = \frac{19}{192 \times 2^{1/3}}, \tag{16}$$
while $\kappa_p$ is well-approximated for large $p$ by [1]:
$$\frac{p}{2}\left(\log \kappa_p + 1\right) \approx -\frac{p}{2} \log 2\pi + \frac{1}{2} \log p\pi - \gamma, \tag{17}$$
where $\gamma \approx 0.5772$ is the Euler–Mascheroni constant. The MML87 codelength, evaluated at the minimum, is the shortest codelength of a two-part message that encodes both the model parameters $\theta \in \Theta$ and the data $D$. The MML87 approximation is known to be invariant under smooth one-to-one reparameterizations of the likelihood function and is asymptotically equivalent to the well-known Bayesian information criterion (BIC) [16] as $n \to \infty$ with $p > 0$ fixed; that is,
$$I_{87}(D, \theta) = -\log p(D \mid \theta) + \frac{p}{2} \log n + O(1), \tag{18}$$
where the $O(1)$ term depends on the prior distribution, the Fisher information and the number of parameters $p$. Unlike model selection criteria such as Akaike's information criterion (AIC) and BIC, MML allows for both parameter estimation and model selection within the same unified framework. Furthermore, in models where the number of parameters grows with $n$ or the sample size is relatively small, the difference between the MML87 codelength and BIC can be substantial. Examples include analysis of multiple short time series, where several measurements are collected over a period of time for a large number of study participants [17], learning finite mixture models [18] and discriminating between Poisson and geometric distributions based on observed data [19]. In the latter example, both the Poisson and geometric distribution have the same number of free parameters so that model selection with BIC is equivalent to choosing the model with the higher likelihood. In contrast, MML87 takes into account the complexity of each distribution [20] and not just the number of parameters, resulting in improved model selection performance for small sample sizes [19].
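The MML87 codelength (15) is simple to evaluate once its three ingredients (negative log-likelihood, prior and Fisher information) are available. The following is a minimal, generic sketch; the function names and the worked example (an uncensored exponential model with an illustrative heavy-tailed prior) are our own assumptions rather than anything prescribed by the paper.

```python
import numpy as np

# Quantization constants kappa_p for small p; see (16).
KAPPA = {1: 1.0 / 12.0, 2: 5.0 / (36.0 * np.sqrt(3.0)), 3: 19.0 / (192.0 * 2.0**(1.0 / 3.0))}

def mml87_codelength(neg_log_lik, neg_log_prior, log_det_fisher, p):
    """I_87 = [-log pi + 0.5 log|J| + (p/2) log kappa_p] + [p/2 - log p(D|theta)], in nats."""
    assertion = neg_log_prior + 0.5 * log_det_fisher + 0.5 * p * np.log(KAPPA[p])
    detail = 0.5 * p + neg_log_lik
    return assertion + detail

# Example: uncensored exponential data, evaluated at theta = sample mean,
# with the illustrative heavy-tailed prior pi(theta) = theta^(-2) exp(-1/theta).
y = np.array([0.7, 1.3, 2.1, 0.4])
n, theta = len(y), y.mean()
print(mml87_codelength(neg_log_lik=n * np.log(theta) + y.sum() / theta,
                       neg_log_prior=2.0 * np.log(theta) + 1.0 / theta,
                       log_det_fisher=np.log(n / theta**2),
                       p=1))
```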
MML has been successfully applied to a wide range of problems (e.g., decision trees [21], factor analysis [22], linear causal models [23], mixture modelling [18,24]), demonstrating excellent parameter estimation properties and model selection performance that is on par with or better than commonly used techniques such as Akaike's information criterion (AIC) [25] and the Bayesian information criterion (BIC). A brief tutorial overview of minimum message length can be found in [19].

4. Minimum Message Length Inference of Type I Censored Exponential Data

To encode and transmit censored data $D = \{(y_1, \delta_1), \dots, (y_n, \delta_n)\}$ between the hypothetical sender and receiver within the MML framework, we have two options:
  • Transmit the censoring indicators $(\delta_1, \dots, \delta_n)$ first and then transmit the lifetime survival data $(y_1, \dots, y_n)$, given that the receiver now knows which of the $n$ data points are censored (see Section 4.1);
  • Transmit the censoring indicators and the lifetime data simultaneously (see Section 4.2).
We shall now derive the MML87 codelength (15) under both the conditional and the joint encoding schemes for the censored exponential distribution setting introduced in Section 2.

4.1. Conditional Encoding of the Data

Under the conditional encoding framework, the sender transmits the censoring indicators $\delta_i$ first, and then transmits the lifetime data $y_i$ using the conditional distribution of the data given the observed censoring indicators. The total message length of the data $D = \{(y_1, \delta_1), \dots, (y_n, \delta_n)\}$ and the parameters $\theta$ with the conditional encoding is
$$I_{87}(D, \theta) = I_{87}(\phi, \delta) + I_{87}(\psi, \mathbf{y} \mid \delta), \tag{19}$$
where $\theta = \{\phi, \psi\}$ are the model parameters defined below, $I_{87}(\phi, \delta)$ denotes the message length of the censoring indicators $\delta = (\delta_1, \dots, \delta_n)$, and $I_{87}(\psi, \mathbf{y} \mid \delta)$ denotes the codelength of the survival data $\mathbf{y} = (y_1, \dots, y_n)$, given that the censoring indicators are known to the receiver. From (3), the probability of observing an uncensored datum, say $\phi > 0$, is
$$\phi = P(T_i \le C_i) = \frac{\alpha}{\alpha + \beta}, \quad (i = 1, \dots, n), \tag{20}$$
implying that the censoring indicators follow a Bernoulli distribution with probability $\phi$; that is, $\delta_i \sim \mathrm{Bernoulli}(\phi)$, or equivalently, $k$ follows the binomial distribution $k \sim \mathrm{Binomial}(\phi, n)$.
The MML87 codelength of the binomial distribution was previously derived in [1,18] and is included here for completeness. Briefly, to compute the MML87 codelength (15) we require the Fisher information $J_\phi(\phi)$ and the prior distribution $\pi_\phi(\phi)$ for the probability of observing an uncensored datum. The Fisher information is well known:
$$J_\phi(\phi) = \frac{n}{\phi(1 - \phi)}. \tag{21}$$
We assume the prior distribution for the censoring probability $\phi$ to be the beta distribution ($\phi \sim \mathrm{Beta}(a, b)$) with probability density function
$$\pi_\phi(\phi \mid a, b) = \frac{\phi^{a-1}(1 - \phi)^{b-1}}{B(a, b)}, \tag{22}$$
where $a, b > 0$ are the shape parameters and $B(a, b)$ is the usual beta function. Substituting (21) and (22) into the MML87 codelength (15) and noting that $\kappa_1 = 1/12$ yields
$$I_{87}(\phi, \delta) = -\left(k + a - \tfrac{1}{2}\right) \log \phi - \left(n + b - \tfrac{1}{2} - k\right) \log(1 - \phi) + \log B(a, b) + \frac{1}{2}\left(1 + \log n - \log 12\right), \tag{23}$$
where, as before, $k = \sum_i \delta_i$. The codelength (23) is minimised at the MML87 estimate
$$\hat{\phi}_{87}(\delta) = \frac{k + a - 1/2}{n + a + b - 1}. \tag{24}$$
Note that, in the special case of a uniform prior distribution ($a = b = 1$), the MML87 estimate simplifies to
$$\hat{\phi}_{87}(\delta) = \frac{k + 1/2}{n + 1}. \tag{25}$$
The shortest MML87 codelength for the censoring indicators is therefore given by $I_{87}(\hat{\phi}_{87}, \delta)$. It remains to work out the conditional codelength of the survival times given the censoring indicators, $I_{87}(\psi, \mathbf{y} \mid \delta)$.
We note that the conditional likelihood of the lifetime datum $y_i$ is
$$p(y_i \mid \alpha, \beta, \delta_i = 0) = p(y_i \mid \alpha, \beta, \delta_i = 1) = \left(\frac{1}{\beta} + \frac{1}{\alpha}\right) \exp\left(-y_i \left(\frac{1}{\beta} + \frac{1}{\alpha}\right)\right), \tag{26}$$
which is the exponential distribution with mean $\psi = (1/\beta + 1/\alpha)^{-1}$; that is,
$$y_i \mid \delta_i \sim \mathrm{Exp}(\psi), \quad i = 1, \dots, n. \tag{27}$$
The Fisher information of the exponential distribution is
$$J_\psi(\psi) = \frac{n}{\psi^2}. \tag{28}$$
In terms of the prior distribution for $\psi$, Schmidt and Makalic [4] consider the conjugate exponential distribution with a hyperparameter $\psi_0$ that controls the prior mean. Here, we would like an objective prior distribution on the mean $\psi$ that is free of hyperparameters and has heavy tails, so that large values of $\psi$ are not penalized too severely. Additionally, our choice of the prior distribution should ideally lead to an easy-to-compute analytic estimate of $\psi$. A reasonable option is the half-Cauchy distribution, which has heavy tails; however, it leads to MML estimates that are roots of polynomial functions of $s = \sum_i y_i$. Instead, we will use the Fréchet (inverse Weibull) distribution with probability density function
$$\pi_\psi(\psi) = \psi^{-2} \exp(-\psi^{-1}), \quad \psi > 0, \tag{29}$$
which is a type of generalized extreme value distribution and has Cauchy-like heavy tails. Substituting (28) and (29) into the MML87 codelength (15), we obtain
$$I_{87}(\psi, \mathbf{y} \mid \delta) = (n + 1) \log \psi + \frac{1}{\psi}\left(1 + \sum_{i=1}^{n} y_i\right) + \frac{1}{2}\left(1 + \log n - \log 12\right). \tag{30}$$
The MML87 estimate of the mean $\psi$ is
$$\hat{\psi}_{87}(\mathbf{y}) = \frac{s + 1}{n + 1}, \tag{31}$$
where, as before, $s = \sum_i y_i$. The MML87 estimate corresponds to the usual maximum likelihood estimate $\hat{\psi}_{\mathrm{ML}}(\mathbf{y}) = s/n$ with one additional data point that has a unit contribution to the mean. The expected mean squared error of the MML87 estimate is
$$E\left\{\left(\hat{\psi}_{87}(\mathbf{y}) - \psi\right)^2\right\} = \frac{\psi\left(\psi(n + 1) - 2\right) + 1}{(n + 1)^2}, \tag{32}$$
which dominates the maximum likelihood estimate for
$$\psi > \frac{n}{n + \sqrt{n(2n + 1)}}, \quad n > 0. \tag{33}$$
As the sample size $n$ increases, we note that
$$\lim_{n \to \infty} \frac{n}{n + \sqrt{n(2n + 1)}} = \sqrt{2} - 1 \approx 0.414, \tag{34}$$
implying the MML87 estimate dominates maximum likelihood for all $\psi > 0.414$ in terms of expected mean squared error for large $n$. However, we note that, unlike the MML87 estimate with this choice of prior distribution, the maximum likelihood estimate is invariant to scaling of the data.
Substituting $I_{87}(\hat{\phi}, \delta)$ and $I_{87}(\hat{\psi}, \mathbf{y} \mid \delta)$ into (19) yields the total (conditional) codelength of the data. The MML87 estimates of the mean lifetime $\hat{\beta}_{87}(D)$ and censoring time $\hat{\alpha}_{87}(D)$ can be recovered from
$$\alpha \mapsto \frac{\psi}{1 - \phi}, \qquad \beta \mapsto \frac{\psi}{\phi}, \tag{35}$$
for $\phi \in (0, 1)$. Next, we examine how the same message can be encoded using joint encoding of the lifetime data and censoring indicators.
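Putting the conditional encoding together, the sketch below computes the MML87 estimates (24) and (31) and then recovers the mean censoring and survival times via (35); the uniform beta prior ($a = b = 1$) is the default and the function name is ours.

```python
import numpy as np

def mml87_conditional_estimates(y, delta, a=1.0, b=1.0):
    """MML87 estimates of (alpha, beta) via the conditional encoding of Section 4.1."""
    y, delta = np.asarray(y, dtype=float), np.asarray(delta, dtype=float)
    n, k, s = len(y), delta.sum(), y.sum()
    phi_hat = (k + a - 0.5) / (n + a + b - 1.0)   # (24); always strictly inside (0, 1)
    psi_hat = (s + 1.0) / (n + 1.0)               # (31)
    alpha_hat = psi_hat / (1.0 - phi_hat)         # (35): mean censoring time
    beta_hat = psi_hat / phi_hat                  # (35): mean survival time
    return alpha_hat, beta_hat

# Finite estimates even when every observation is censored (k = 0),
# a case in which the maximum likelihood estimate of beta is infinite.
print(mml87_conditional_estimates([0.8, 1.9, 0.3], [0, 0, 0]))
```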

4.2. Joint Encoding of the Data

Unlike in the conditional encoding, the sender now transmits the survival times and the indicator variables simultaneously. The negative log-likelihood function of the data $D = \{(y_1, \delta_1), \dots, (y_n, \delta_n)\}$ is given in (8). The determinant of the expected Fisher information in this parameterization is
$$|J(\alpha, \beta)| = \frac{n^2}{\beta \alpha (\alpha + \beta)^2}. \tag{36}$$
We would like to use prior distributions for $\alpha$ and $\beta$ that are comparable to those in the conditional coding described in Section 4.1. Noting that $\phi \sim \mathrm{Beta}(a, b)$, $\psi$ has the standard Fréchet distribution and
$$\phi = \frac{\alpha}{\alpha + \beta}, \qquad \psi = \left(\frac{1}{\alpha} + \frac{1}{\beta}\right)^{-1}, \tag{37}$$
the Jacobian of the transformation from $(\phi, \psi)$ to $(\alpha, \beta)$ is
$$\frac{\alpha \beta}{(\alpha + \beta)^3}, \tag{38}$$
implying that a commensurate joint prior distribution for $\alpha, \beta$ is
$$\pi_{\alpha, \beta}(\alpha, \beta) = \frac{\alpha^{a-2}\, \beta^{b-2}\, (\alpha + \beta)^{-a-b+1}}{B(a, b)} \exp\left(-\frac{\alpha + \beta}{\alpha \beta}\right), \tag{39}$$
where $B(\cdot, \cdot)$ is the beta function. In the special case where $\phi$ is given a uniform prior (i.e., $a = b = 1$), we have
$$\pi_{\alpha, \beta}(\alpha, \beta) = \frac{1}{\alpha \beta (\alpha + \beta)} \exp\left(-\frac{\alpha + \beta}{\alpha \beta}\right). \tag{40}$$
Substituting (36) and (39) into (15), the MML87 codelength is
$$I_{87}(D, \theta) = k \log \beta + (n - k) \log \alpha + \left(\frac{1}{\alpha} + \frac{1}{\beta}\right) \sum_{i=1}^{n} y_i - \log \pi_{\alpha, \beta}(\alpha, \beta) + \log n - \frac{1}{2} \log\left(\alpha \beta (\alpha + \beta)^2\right) + \log \kappa_2 + 1, \tag{41}$$
where $\theta = \{\alpha, \beta\}$ and the quantization constant $\kappa_2 = 5/(36\sqrt{3})$. The MML87 estimates that minimize the codelength (41) are
$$\hat{\alpha}_{87}(D) = \frac{2(s + 1)(a + b + n - 1)}{(n + 1)(2b - 2k + 2n - 1)}, \qquad \hat{\beta}_{87}(D) = \frac{2(s + 1)(a + b + n - 1)}{(n + 1)(2a + 2k - 1)}.$$
If required, the corresponding estimates of ϕ and ψ can be obtained from (37).
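The closed-form estimates above are easy to evaluate directly; the sketch below (illustrative names, uniform beta prior by default) also notes how they relate to the conditional-encoding estimates of Section 4.1 via (35), as expected from the invariance of MML87.

```python
import numpy as np

def mml87_joint_estimates(y, delta, a=1.0, b=1.0):
    """Closed-form MML87 estimates of (alpha, beta) under the joint encoding."""
    n, k, s = len(y), float(np.sum(delta)), float(np.sum(y))
    common = 2.0 * (s + 1.0) * (a + b + n - 1.0) / (n + 1.0)
    alpha_hat = common / (2.0 * b - 2.0 * k + 2.0 * n - 1.0)
    beta_hat = common / (2.0 * a + 2.0 * k - 1.0)
    return alpha_hat, beta_hat

y, delta = [1.2, 0.4, 2.5, 0.9, 3.1], [1, 1, 0, 1, 0]
print(mml87_joint_estimates(y, delta))
# The same values are obtained as psi_hat/(1 - phi_hat) and psi_hat/phi_hat,
# with phi_hat and psi_hat given by (24) and (31).
```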

4.3. Properties

First, we show that the conditional (19) and joint (41) MML codelengths are equivalent up to a constant specified below. From Section 4.1 and Section 4.2, we note that the joint density of $(Y, \Delta)$ can be expressed as a product of the binomial density $p_\Delta(\delta \mid \phi)$ and the exponential density $p_Y(y \mid \psi)$,
$$p_{Y, \Delta}(y, \delta \mid \alpha, \beta) = p_\Delta(\delta \mid \phi)\, p_Y(y \mid \psi) = \phi^\delta (1 - \phi)^{1 - \delta}\, \frac{\exp(-y/\psi)}{\psi},$$
where $Y$ and $\Delta$ are independent random variables (see Kim [5] (p. 104)). Consequently, as MML87 is invariant under smooth one-to-one reparameterizations of the sampling model, the MML87 joint codelength (41) and the corresponding conditional codelength (19) are identical (except for the minor efficiency gain in the joint codelength discussed below). Specifically, the relationship between the joint codelength, $I_{87}(\alpha, \beta, D)$, and the conditional codelength, $I_{87}(\psi, \phi, D)$, can be expressed as
$$I_{87}(\psi, \phi, D) = I_{87}(\alpha, \beta, D) + \log\left(\frac{3\sqrt{3}}{5}\right),$$
where the term $\log(3\sqrt{3}/5) \approx 0.0385$ arises because the quantization constant is smaller in higher dimensions: it is more efficient to encode multiple parameters simultaneously than to encode each parameter independently.
Furthermore, as the MML87 estimate of $\phi$ is $\hat{\phi} \in (0, 1)$ (see (24)), the MML87 estimates of the mean survival and censoring times $(\alpha, \beta)$ are finite for all $k \in [0, n]$, in contrast to the corresponding maximum likelihood estimates (9), which are finite only for $k \in (0, n)$. As $n \to \infty$, it is well known that the MML87 estimates are equivalent to the maximum likelihood estimates (see (18)), which implies that the MML87 estimates are similarly asymptotically normally distributed and strongly consistent.
The expected mean squared error $E\{(\hat{\beta} - \beta^*)^2\}$ of the ML and MML87 estimates of $\beta^*$, conditional on $k > 0$, is expressible in terms of the generalized hypergeometric function for any $n > 0$. Figure 1 (top) depicts the expected mean squared error of the MML87 and ML estimates of $\beta^*$, expressed as a ratio of MML87 to ML, with smaller values indicating preference for the MML87 estimate. The expected mean squared error of the MML87 estimate of $\beta^*$ was generally lower than that of the corresponding maximum likelihood estimate (except when the true censoring proportion $\phi^*$ was small), with the biggest difference observed for small sample sizes, while the two estimates were practically indistinguishable for larger sample sizes ($n \ge 100$).
We also compared the MML87 and ML estimates in terms of the relative entropy, or Kullback–Leibler (KL) divergence, conditional on $k \in (0, n)$. The KL divergence between the true data generating model $(\alpha_1, \beta_1)$ and the approximating model $(\alpha_2, \beta_2)$ is
$$D_{\mathrm{KL}}(\alpha_1, \beta_1 \,\|\, \alpha_2, \beta_2) = \frac{\alpha_1 \beta_1 (\alpha_2 + \beta_2) + \alpha_2 \beta_2 \left(\alpha_1 \log\dfrac{\beta_2}{\beta_1} + \beta_1 \log\dfrac{\alpha_2}{\alpha_1} - \alpha_1 - \beta_1\right)}{\alpha_2 \beta_2 (\alpha_1 + \beta_1)},$$
which, as expected, is the sum of the KL divergences between two exponential and two binomial distributions. The KL divergence may be interpreted as the expected amount of extra information required to encode data from $(\alpha_1, \beta_1)$ using the model $(\alpha_2, \beta_2)$. The expected KL divergence for both the ML and MML estimators is shown in the bottom of Figure 1, conditional on $k > 0$ and $k < n$. It is clear that for $n = 5$ the MML87 estimate dominates the maximum likelihood estimate in terms of the KL divergence for all $\phi^* \in (0.05, 0.95)$. When the sample size is increased ($n = 10$), the MML87 estimate exhibits smaller KL divergence than the ML estimate for all $\phi^*$ except when $\phi^* \to 0$ or $\phi^* \to 1$, where the maximum likelihood estimate has smaller KL divergence.
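The KL divergence above is a one-line computation; a direct transcription (illustrative function name), with the sanity check that it vanishes when the two models coincide:

```python
import numpy as np

def kl_random_censoring(a1, b1, a2, b2):
    """KL divergence from the model (alpha_1, beta_1) to (alpha_2, beta_2)."""
    num = a1 * b1 * (a2 + b2) + a2 * b2 * (a1 * np.log(b2 / b1)
                                           + b1 * np.log(a2 / a1) - a1 - b1)
    return num / (a2 * b2 * (a1 + b1))

print(kl_random_censoring(3.0, 2.0, 3.0, 2.0))  # 0.0 when the models coincide
print(kl_random_censoring(3.0, 2.0, 2.5, 2.5))  # strictly positive otherwise
```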

5. Minimum Message Length Inference with Fixed Censoring

Consider now the fixed censoring scenario (6) introduced in Section 2, where the negative log-likelihood function of the data is given in (10). If we wish to encode the data $D$ using the joint MML code (see Section 4.2), we require the negative log-likelihood, the Fisher information and a prior distribution for the mean survival time $\theta > 0$. The negative log-likelihood is given in (10), while the Fisher information for Type I censored data with fixed censoring is
$$J_\theta(\theta; c) = \frac{n\left(1 - \exp(-c/\theta)\right)}{\theta^2} = \frac{n\, F_T(c \mid \theta)}{\theta^2}, \tag{44}$$
where $F_T(\cdot \mid \theta)$ is the cumulative distribution function of the survival time $T$ (see Section 2). The reduction in information due to censoring is clearly a function of $\theta$ and the cumulative distribution function of $T$, with large $c$ resulting in little information loss compared to small $c$. As expected, as $c$ gets larger,
$$\lim_{c \to \infty} J_\theta(\theta; c) = \frac{n}{\theta^2}, \tag{45}$$
which is the usual Fisher information for the exponential distribution with no censoring. The prior distribution for $\theta$ is chosen to be the Fréchet distribution with scale $c$ and probability density function
$$\pi_\theta(\theta; c) = \frac{1}{c}\left(\frac{\theta}{c}\right)^{-2} \exp\left(-\frac{c}{\theta}\right) = \frac{c}{\theta^2} \exp\left(-\frac{c}{\theta}\right). \tag{46}$$
Substituting (44) and (46) into (15), we obtain the complete MML87 codelength for the joint encoding:
$$I_{87}(D, \theta) = (k + 1)\log(\theta) + \frac{1}{\theta}\left(c\left((n - k) + 1\right) + \sum_{i=1}^{n} y_i \delta_i\right) + \frac{1}{2}\log\left(\frac{1 - \exp(-c/\theta)}{c^2}\right) + \frac{1}{2}\left(1 + \log n - \log 12\right). \tag{47}$$
Due to the form of the Fisher information, the MML87 estimate of θ that minimizes this codelength is unavailable analytically and must be obtained via numerical optimisation. The maximum likelihood estimate (11) may be used as a starting point for the numerical search.
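A minimal sketch of this numerical search, using a bounded one-dimensional optimiser from SciPy (the bounds and the toy data are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def mml87_codelength_fixed(theta, y, delta, c):
    """MML87 codelength (47) for the exponential model with fixed censoring at c."""
    y, delta = np.asarray(y, dtype=float), np.asarray(delta, dtype=float)
    n, k = len(y), delta.sum()
    return ((k + 1) * np.log(theta)
            + (c * ((n - k) + 1) + np.sum(y * delta)) / theta
            + 0.5 * np.log((1.0 - np.exp(-c / theta)) / c**2)
            + 0.5 * (1.0 + np.log(n) - np.log(12.0)))

def mml87_estimate_fixed(y, delta, c):
    """Minimise (47) over theta > 0 by a bounded search."""
    res = minimize_scalar(mml87_codelength_fixed, args=(y, delta, c),
                          bounds=(1e-6, 1e6), method="bounded")
    return res.x

y = [10.0, 35.0, 50.0, 80.0, 120.0, 150.0]   # last item censored at c = 150
delta = [1, 1, 1, 1, 1, 0]
print(mml87_estimate_fixed(y, delta, c=150.0))
```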
Consider now the conditional encoding (see Section 4.1), where the probability of observing an uncensored datum, say $\phi > 0$, is
$$\phi = P(T_i \le c) = F_T(c \mid \theta) = 1 - \exp(-c/\theta), \quad (i = 1, \dots, n), \tag{48}$$
so that the number of uncensored data points $k$ follows the binomial distribution $k \sim \mathrm{Binomial}(\phi, n)$. This implies that the mean survival time can then be written as
$$\theta = -c / \log(1 - \phi), \quad \phi \in (0, 1). \tag{49}$$
A naive conditional coding approach proceeds by encoding the censoring indicators following Section 4.1 with codelength (23). To compute $I(\mathbf{y} \mid \delta)$, one would use the conditional probabilities of the lifetime data, which are
$$p_{Y \mid \Delta}(Y_i = y_i \mid \Delta_i = 1) = \frac{p(T_i = y_i)}{p(T_i \le c)} \quad \text{if } y_i \le c,$$
and
$$p_{Y \mid \Delta}(Y_i = c \mid \Delta_i = 0) = 1, \qquad p_{Y \mid \Delta}(Y_i > c \mid \Delta_i = 0) = 0,$$
for all $i = 1, \dots, n$. The conditional likelihood of the $k = \sum_i \delta_i$ uncensored data points is then
$$p(\mathbf{y} \mid \theta; \delta = 1) = \prod_{i : \delta_i = 1} \frac{(1/\theta)\exp(-y_i/\theta)}{1 - \exp(-c/\theta)} = \prod_{i : \delta_i = 1} (1 - \phi)^{y_i / c}\, \frac{-\log(1 - \phi)}{c\, \phi}.$$
Once the receiver has the censoring data and an estimate of $\phi$, they implicitly know $\theta$ from (49). The length of the message required to transmit the data $\mathbf{y}$ is
$$I_{87}(\theta, \mathbf{y} \mid \delta) = k \log \theta + \frac{1}{\theta} \sum_{i : \delta_i = 1} y_i + k \log\left(1 - \exp(-c/\theta)\right) = -\frac{1}{c}\left(\sum_{i : \delta_i = 1} y_i\right)\log(1 - \phi) + k \log(c\, \phi) - k \log\left(-\log(1 - \phi)\right),$$
which is the negative log-likelihood of the data. However, this codelength is inefficient since the probability of censoring $\phi(\theta)$ is not independent of the mean survival time $\theta$. This implies that the precision to which $\phi(\theta)$ is encoded must depend on the lifetime data $\mathbf{y}$, which is not the case in the naive approach where the precision quantum for $\phi(\theta)$ depends on the censoring data $\delta$ only. Consequently, joint MML coding should be used instead of the conditional encoding approach for the fixed censoring setup.

5.1. Example

We observe $n = 20$ items with an exponential life distribution for $c = 150$ h. Out of the 20 items, $k = 15$ items fail during the observation period, and the sum of their lifetimes (in hours) is $s = \sum_i y_i \delta_i = 835$ [6]. The maximum likelihood estimate of the mean lifetime $\theta$ is
$$\hat{\theta}_{\mathrm{ML}}(D) = \frac{150(5) + 835}{15} = 105.6 \text{ h},$$
with a negative log-likelihood of 84.9 at the minimum. The MML87 estimate is obtained by a numerical search and is $\hat{\theta}_{87}(D) = 110.1$ h with a codelength of 124.905 bits. The MML87 codelength at the maximum likelihood estimate is 124.924 bits, suggesting that there is little difference between the two estimates in this example.
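The numbers in this example are easy to reproduce from the summary statistics $n = 20$, $k = 15$, $c = 150$ and $s = 835$; a short sketch (our own code, written directly in terms of these sufficient statistics):

```python
import numpy as np
from scipy.optimize import minimize_scalar

n, k, c, s = 20, 15, 150.0, 835.0   # summary statistics from the example

def codelength(theta):
    """MML87 codelength (47) expressed through the sufficient statistics (n, k, c, s)."""
    return ((k + 1) * np.log(theta)
            + (c * ((n - k) + 1) + s) / theta
            + 0.5 * np.log((1.0 - np.exp(-c / theta)) / c**2)
            + 0.5 * (1.0 + np.log(n) - np.log(12.0)))

theta_ml = (c * (n - k) + s) / k                              # (11): about 105.7 h
theta_mml = minimize_scalar(codelength, bounds=(1.0, 1e4), method="bounded").x
print(theta_ml, theta_mml)                                    # MML87 estimate is about 110 h
print(codelength(theta_mml) / np.log(2.0))                    # codelength converted to bits
```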

5.2. Properties

To evaluate the performance of the MML87 estimate, we computed the mean squared error risk and the expected Kullback–Leibler (KL) divergence of the MML87 and ML estimates under the data generating model $\theta^* = 1$ and sample sizes $n \in \{5, 25\}$. Since the ML estimate is undefined for $k = 0$, all the results discussed below are conditional on $k > 0$. The KL divergence from the 'true' model $\theta_1$ to the approximating model $\theta_2$ is
$$D_{\mathrm{KL}}(\theta_1 \,\|\, \theta_2) = \left(1 - \frac{\theta_1}{\theta_2} + \log\frac{\theta_1}{\theta_2}\right)\left(\exp\left(-\frac{c}{\theta_1}\right) - 1\right).$$
The results are shown in Figure 2, where the x-axis of each plot is the censoring point $c$, set to the $p$-th percentile of the data generating model $\mathrm{Exp}(\theta^* = 1)$; for example, $c = 0.69$ corresponds to the $p = 0.50$ percentile of $\mathrm{Exp}(\theta^* = 1)$. It is clear that the MML87 estimate is a reasonable alternative to the maximum likelihood estimate under fixed censoring. For $p \ge 0.20$ (i.e., the 20th percentile of $\mathrm{Exp}(\theta^* = 1)$), the MML87 estimate dominates the ML estimate in terms of KL risk, while for $p < 0.20$ the estimates are very similar for both sample sizes tested. In terms of the expected mean squared error, the MML87 estimate dominates the ML estimate for all $0.20 \le p \le 0.80$ when $n = 5$; for $n = 25$, the estimates are indistinguishable for $p < 0.1$ and $p > 0.69$, and the MML87 estimate again dominates ML for all $0.1 < p < 0.5$.
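The comparison in Figure 2 can be approximated with a small Monte Carlo experiment; the sketch below (illustrative settings: $\theta^* = 1$, $n = 5$, censoring at the 50th percentile) averages the KL divergence above over replications for the ML and MML87 estimates, conditioning on $k > 0$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def kl_fixed(theta1, theta2, c):
    """KL divergence from Exp(theta1) to Exp(theta2) under fixed censoring at c."""
    return (1.0 - theta1 / theta2 + np.log(theta1 / theta2)) * (np.exp(-c / theta1) - 1.0)

def codelength_fixed(theta, y, delta, c):
    """MML87 codelength (47), up to terms that do not depend on theta."""
    n, k = len(y), delta.sum()
    return ((k + 1) * np.log(theta) + (c * (n - k + 1) + np.sum(y * delta)) / theta
            + 0.5 * np.log(1.0 - np.exp(-c / theta)))

rng = np.random.default_rng(0)
theta_star, n, reps = 1.0, 5, 2000
c = -np.log(1.0 - 0.5)                       # 50th percentile of Exp(theta* = 1)
kl_ml, kl_mml = [], []
for _ in range(reps):
    t = rng.exponential(theta_star, size=n)
    y, delta = np.minimum(t, c), (t <= c).astype(float)
    k = delta.sum()
    if k == 0:                               # condition on k > 0, as in the text
        continue
    theta_ml = (c * (n - k) + np.sum(y * delta)) / k
    theta_mml = minimize_scalar(codelength_fixed, args=(y, delta, c),
                                bounds=(1e-3, 1e3), method="bounded").x
    kl_ml.append(kl_fixed(theta_star, theta_ml, c))
    kl_mml.append(kl_fixed(theta_star, theta_mml, c))
print(np.mean(kl_ml), np.mean(kl_mml))
```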

6. Discussion

This manuscript has demonstrated how minimum message length can be used to infer models from data with censoring information. Specifically, we have derived MML87 codelengths for the exponential distribution with fixed censoring and random type I censoring. Although information theoretic universal models for the exponential distribution, including those corresponding to MML codes, are known [4], this is the first time MML has been applied to censored data.
The MML87 codelength for the exponential distribution with censoring provides a new means of parameter estimation as well as model selection. In terms of parameter estimation, the MML87 estimate of the mean lifetime $\theta$ under type I censoring described in this paper has some advantages over the usual maximum likelihood estimate for small sample sizes. First, the MML87 estimate is defined for all proportions of censoring, unlike the maximum likelihood estimate which does not exist when all observations are censored; i.e., $k = \sum_i \delta_i = 0$. In addition, the MML87 estimate has on average lower mean squared error risk and lower KL divergence from the data generating model for a wide range of censoring proportions.
In the case of random censoring, the MML87 estimate is available in closed form, while for fixed censoring it can only be obtained by numerical optimisation. Although the experiments in the manuscript utilised heavy-tailed prior distributions for the scale parameter, as recommended in, for example, [26], the Bayesian nature of MML allows prior information to be incorporated directly into the estimation process. The effect of the prior distribution in the examples considered here is expected to be negligible for medium to large sample sizes.
Importantly, the proposed MML87 codelengths can also be used to discriminate between competing models (e.g., exponential vs. lognormal) and offer some advantages over the well-known BIC model selection approach. BIC only considers the sample size and the number of parameters when measuring model complexity. In contrast, MML takes into account not just the number of parameters, but also the complexity of the distribution (i.e., the number of random data strings that are fitted well by the distribution). As the sample size $n \to \infty$, the MML87 codelength converges to the BIC and therefore inherits the favourable asymptotic properties of BIC, such as model selection consistency.
The codelengths derived in this manuscript are extendable to MML inference of other censored data types, such as the Weibull and the lognormal distribution, and can be incorporated into more complex models as shown in the next section.

6.1. Clustering Survival Data

To demonstrate the applicability of the codes derived in the manuscript, we implemented the MML87 codes into a Matlab software package for inference of finite mixture models. Our software, called Matlab Snob, features mixture models with categorical data (e.g., multinomial distribution), count data (e.g., geometric, Poisson and negative binomial distributions), continuous data (e.g., normal, Laplace, gamma and Weibull distributions) and survival data (type I fixed and random censored exponential distribution). As a demonstration of Matlab Snob, we used two publicly available survival data sets: (1) Rossi et al.’s criminal recidivism data [27], and (2) survival from malignant melanoma [28].
The crime data was recently analyzed in [29] using variational Bayes estimated finite mixture models. For clustering we used all n = 432 observations and the following seven attributes: (1) financial aid (no, yes), (2) full-time work experience before incarceration (no, yes), (3) marital status at time of release (married, not married), (4) released on parole (no, yes), (5) number of convictions prior to current incarceration, (6) age in years at time of release and (7) week of first arrest after release (73.6% censored). This is an example of fixed censoring as all censored observations were censored at 52 weeks. We modelled the categorical attributes using a multinomial distribution, number of convictions was modelled with a negative binomial distribution, while a Gaussian distribution was used for age at time of release. For the week of arrest, we used the exponential distribution model with fixed type I censored data (see Section 5).
The melanoma data set consists of n = 205 patients from Denmark who were diagnosed with malignant melanoma. Five attributes were used for clustering: (1) sex (male, female), (2) ulcer (present, absent), (3) age at diagnosis in years, (4) tumour thickness in mm, and (5) censored survival time in years (65.3% censored). Sex and ulcer were modelled via multinomial distributions, while age and tumour thickness were modelled with univariate Gaussian distributions. For the survival time, we used an exponential distribution with random type I censoring (see Section 4) and combined death due to melanoma and death due to other causes as the primary outcome of interest.
Clustering results for the Crime and the Melanoma data sets with Matlab Snob are shown in Table 1. First, since Matlab Snob learns finite mixture models using the MML87 codelength approximation, the same framework is used to estimate all model parameters as well as select the number of classes. For the Crime data, the model with three classes had the smallest codelength, while two classes were selected for the Melanoma data set. We observe that all the classes are relatively well differentiated in terms of average survival time. In the case of the Crime data set, class 1 had the shortest average time to arrest (θ = 119 weeks) and consists of younger individuals (mean age 20.7 years, std. dev. 2.1 years) who are primarily unmarried and have no full-time work experience before incarceration. In contrast, class 3 comprised older individuals (mean age 36.3 years, std. dev. 4.9 years), 82% of whom had full-time work experience, and was estimated to have the longest average time to arrest (θ = 345.9 weeks). For the Melanoma data set, class 1 was estimated to have the shortest average survival time (β = 7.3 years) and consisted of individuals diagnosed at an older age (mean: 57.3 years, std. dev. 17.1 years) with larger tumours (mean: 5.4 mm, std. dev. 3.4 mm). In contrast, patients assigned to class 2 were diagnosed at a younger age, had smaller tumours and were estimated to have a longer survival time on average (β = 37.0 years).
The Matlab Snob clustering software is freely available for download from the MathWorks File Exchange website (ID: 72310) and will be extended to incorporate other survival distributions in the future (e.g., the Weibull and lognormal distributions). We note that the MML87 codelengths for type I censored exponentially distributed data derived in this paper can also be used in decision tree modelling [21,30,31]. For example, one could represent the leaves of the decision tree with a censored exponential distribution and use MML to infer an optimal decision tree for a data set.

6.2. Minimum Message Length and Minimum Description Length

Minimum message length is closely related to minimum description length (MDL), an inductive inference principle independently developed by Rissanen and colleagues [32,33,34,35]. Like MML, the MDL principle is rooted in information theory and, given a data set, seeks a model that would result in the shortest encoding of the data. A recent and popular version of the MDL principle is the normalized maximum likelihood (NML) code, which says that the codelength for data $\mathbf{y}$ with respect to a model class $\mathcal{M}$, parameterised by models $\theta \in \mathbb{R}^p$, is
$$-\log p_{\mathrm{NML}}(\mathbf{y} \mid \mathcal{M}) = -\log p(\mathbf{y} \mid \hat{\theta}(\mathbf{y}), \mathcal{M}) + \log \sum_{\mathbf{x}} p(\mathbf{x} \mid \hat{\theta}(\mathbf{x}), \mathcal{M}),$$
where $\hat{\theta}$ is the maximum likelihood estimate of the $p$ parameters and the sum in the second term is taken over the entire data space; we replace the sum with an integral in the case of continuous data. The first term in the NML codelength is the negative log-likelihood of the data evaluated at the maximum likelihood estimate, while the second term represents the parametric complexity of the model class and measures how well models $\theta \in \mathcal{M}$ within the model class $\mathcal{M}$ approximate random data sequences. In particular, a high parametric complexity says that a large number of data sequences can be well-approximated by models within the class. In contrast, the parametric complexity of a simple model that can only well-approximate a few data sequences will tend to be small.
Rissanen [33] derives an asymptotic approximation for the NML codelength which is accurate for medium to large sample sizes:
$$-\log p_{\mathrm{NML}}(\mathbf{y} \mid \mathcal{M}) = -\log p(\mathbf{y} \mid \hat{\theta}(\mathbf{y}), \mathcal{M}) + \log \int_{\Theta} \sqrt{|J_1(\theta)|}\, d\theta + \frac{p}{2} \log \frac{n}{2\pi} + o(1),$$
where $J_1(\cdot)$ is the per-sample Fisher information matrix. Mera et al. [36] derive a somewhat sharper approximation to the NML codelength using Riemannian geometry tools and apply their new approximation to principal component analysis. Rissanen further shows that, like the MML87 codelength, the NML codelength reduces to the well-known Bayesian information criterion (BIC) in the limit as the sample size $n \to \infty$. Unfortunately, in the case of the exponential distribution, with or without type I censoring, the parametric complexity is infinite for both the exact NML codelength and the asymptotic approximation. To circumvent the problem of infinite parametric complexity, one may consider the restricted approximate normalized maximum likelihood (ANML), the two-part ANML or the objective Bayesian code, among others [37].
Although there exist many similarities in the approaches to inference between MML and MDL, there are some important differences, which we summarize below:
  • MDL relies on the maximum likelihood estimator and does not offer new means for parameter estimation;
  • MDL is decidedly non-Bayesian avoiding the use of any (subjective or objective) prior information;
  • MDL nominates the model class $\mathcal{M}$ that would result in the shortest encoding of the data and does not infer a fully specified model;
  • while MML minimises the expected (average) codelength of the data with respect to the marginal data distribution, MDL minimizes the worst-case codelength relative to the ideal code.
In addition to the NML code, other MDL codes exist including the sequential NML code [38] and the conditional NML distribution [34], among others. Clearly, both MML and MDL approaches to inductive inference have merit, and if used correctly, will result in excellent model selection performance as shown in a wide range of applications. A more detailed discussion of MML and MDL similarities and differences can be found in [39] and [1] (pp. 413–415).

Author Contributions

Methodology, E.M. and D.F.S.; Software, E.M. and D.F.S.; Writing—original draft, E.M.; Writing—review & editing, D.F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AIC    Akaike information criterion
BIC    Bayesian information criterion
KL     Kullback–Leibler
MDL    Minimum description length
MML    Minimum message length
ML     Maximum likelihood
NML    Normalized maximum likelihood

References

  1. Wallace, C.S. Statistical and Inductive Inference by Minimum Message Length, 1st ed.; Information Science and Statistics; Springer: Berlin/Heidelberg, Germany, 2005.
  2. Wallace, C.S. False oracles and SMML estimators. In Proceedings of the International Conference on Information, Statistics and Induction in Science; World Scientific: Singapore, 1996; pp. 304–316.
  3. Wallace, C.S.; Dowe, D.L. Minimum Message Length and Kolmogorov Complexity. Comput. J. 1999, 42, 270–283.
  4. Schmidt, D.F.; Makalic, E. Universal Models for the Exponential Distribution. IEEE Trans. Inf. Theory 2009, 55, 3087–3090.
  5. Kim, J.S. Asymptotic properties of the maximum likelihood estimator of a randomly censored exponential parameter. Commun. Stat. Theory Methods 1986, 15, 3637–3646.
  6. Bartholomew, D.J. The Sampling Distribution of an Estimate Arising in Life Testing. Technometrics 1963, 5, 3.
  7. Mendenhall, W.; Lehman, E.H. An Approximation to the Negative Moments of the Positive Binomial Useful in Life Testing. Technometrics 1960, 2, 227–242.
  8. Balakrishnan, N.; Davies, K.F. Pitman closeness results for Type-I censored data from exponential distribution. Stat. Probab. Lett. 2013, 83, 2693–2698.
  9. Wallace, C.S.; Boulton, D.M. An information measure for classification. Comput. J. 1968, 11, 185–194.
  10. Wallace, C.S.; Freeman, P.R. Estimation and inference by compact coding. J. R. Stat. Soc. (Ser. B) 1987, 49, 240–252.
  11. Wallace, C.S.; Dowe, D.L. Refinements of MDL and MML Coding. Comput. J. 1999, 42, 330–337.
  12. Wallace, C.; Boulton, D. An invariant Bayes method for point estimation. Classif. Soc. Bull. 1975, 3, 11–34.
  13. Farr, G.E.; Wallace, C.S. The complexity of Strict Minimum Message Length inference. Comput. J. 2002, 45, 285–292.
  14. Conway, J.H.; Sloane, N.J.A. Sphere Packing, Lattices and Groups, 3rd ed.; Springer: Berlin/Heidelberg, Germany, 1998; p. 703.
  15. Agrell, E.; Eriksson, T. Optimization of lattices for quantization. IEEE Trans. Inf. Theory 1998, 44, 1814–1828.
  16. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
  17. Schmidt, D.F.; Makalic, E. Minimum message length analysis of multiple short time series. Stat. Probab. Lett. 2016, 110, 318–328.
  18. Wallace, C.S.; Dowe, D.L. MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat. Comput. 2000, 10, 73–83.
  19. Wong, C.K.; Makalic, E.; Schmidt, D.F. Minimum message length inference of the Poisson and geometric models using heavy-tailed prior distributions. J. Math. Psychol. 2018, 83, 1–11.
  20. Balasubramanian, V. MDL, Bayesian inference, and the geometry of the space of probability distributions. In Advances in Minimum Description Length: Theory and Applications; Grünwald, P.D., Myung, I.J., Pitt, M.A., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 81–99.
  21. Wallace, C.S.; Patrick, J.D. Coding Decision Trees. Mach. Learn. 1993, 11, 7–22.
  22. Wallace, C.S.; Freeman, P.R. Single-Factor Analysis by Minimum Message Length Estimation. J. R. Stat. Soc. (Ser. B) 1992, 54, 195–209.
  23. Wallace, C.S.; Korb, K.B. Learning linear causal models by MML sampling. In Causal Models and Intelligent Data Management; Gammerman, A., Ed.; Springer: Berlin/Heidelberg, Germany, 1999; pp. 89–111.
  24. Schmidt, D.F.; Makalic, E. Minimum Message Length Inference and Mixture Modelling of Inverse Gaussian Distributions. In AI 2012: Advances in Artificial Intelligence; Lecture Notes in Computer Science; Thielscher, M., Zhang, D., Eds.; Springer: Berlin/Heidelberg, Germany; Sydney, Australia, 2012; Volume 7691, pp. 672–682.
  25. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control. 1974, 19, 716–723.
  26. Polson, N.G.; Scott, J.G. On the Half-Cauchy Prior for a Global Scale Parameter. Bayesian Anal. 2012, 7, 887–902.
  27. Rossi, P.; Berk, R.A.; Lenihan, K.J. Money, Work, and Crime: Some Experimental Results; Academic Press: Cambridge, MA, USA, 1980.
  28. Andersen, P.K.; Borgan, Ø.; Gill, R.D.; Keiding, N. Statistical Models Based on Counting Processes; Springer: Berlin, Germany, 2012.
  29. Kohjima, M.; Matsubayashi, T.; Toda, H. Variational Bayes for Mixture Models with Censored Data. In Machine Learning and Knowledge Discovery in Databases; Springer International Publishing: Berlin, Germany, 2019; pp. 605–620.
  30. Bou-Hamad, I.; Larocque, D.; Ben-Ameur, H. A review of survival trees. Stat. Surv. 2011, 5, 44–71.
  31. Dauda, K.A.; Pradhan, B.; Shankar, B.U.; Mitra, S. Decision tree for modeling survival data with competing risks. Biocybern. Biomed. Eng. 2019, 39, 697–708.
  32. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
  33. Rissanen, J. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory 1996, 42, 40–47.
  34. Rissanen, J.; Roos, T. Conditional NML Universal Models. In Proceedings of the 2007 Information Theory and Applications Workshop (ITA-07), San Diego, CA, USA, 29 January–2 February 2007; IEEE Press: Piscataway, NJ, USA, 2007; pp. 337–341, (Invited Paper).
  35. Rissanen, J. Optimal Estimation. Inf. Theory Newsl. 2009, 59, 1–20.
  36. Mera, B.; Mateus, P.; Carvalho, A.M. On the minmax regret for statistical manifolds: The role of curvature. arXiv 2020, arXiv:2007.02904.
  37. de Rooij, S.; Grünwald, P. An empirical study of minimum description length model selection with infinite parametric complexity. J. Math. Psychol. 2006, 50, 180–192.
  38. Roos, T.; Rissanen, J. On sequentially normalized maximum likelihood models. In Proceedings of the 1st Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-08), Tampere, Finland, 18–20 August 2008. (Invited Paper).
  39. Baxter, R.A.; Oliver, J. MDL and MML: Similarities and Differences; Technical Report TR 207; Department of Computer Science, Monash University: Clayton, Australia, 1994.
Figure 1. Expected mean squared error of $\beta^*$ and expected KL divergence between the MML87 and ML estimates. Ratio values less than 1 imply that the MML87 estimate has smaller mean squared error in estimating $\beta$.
Figure 2. Expected Kullback–Leibler (KL) divergence and squared error (SE) risk of the maximum likelihood and MML87 estimates for $n = 5$ (top) and $n = 25$ (bottom) data points generated from model $\theta^* = 1$. The x-axis on all plots denotes the censoring point $c$ and is set to a percentile of $\mathrm{Exp}(\theta^*)$.
Table 1. MML finite mixture models for Crime and Melanoma data. The attribute modelling censored survival time is seven in the Crime data set and five in the Melanoma data set.

| Data | Class | Attribute 1 | Attribute 2 | Attribute 3 | Attribute 4 | Attribute 5 | Attribute 6 | Attribute 7 |
|---|---|---|---|---|---|---|---|---|
| Crime | 1 | (50%, 50%) | (73%, 27%) | (3%, 97%) | (42%, 58%) | (r: 2.0, p: 0.4) | (μ: 20.7, σ: 2.1) | (θ: 119.0) |
| Crime | 2 | (55%, 45%) | (15%, 85%) | (23%, 77%) | (28%, 72%) | (r: 13.4, p: 0.8) | (μ: 24.9, σ: 3.3) | (θ: 249.3) |
| Crime | 3 | (40%, 60%) | (16%, 84%) | (18%, 82%) | (51%, 49%) | (r: 2.3, p: 0.5) | (μ: 36.3, σ: 4.9) | (θ: 345.9) |
| Melanoma | 1 | (43%, 57%) | (17%, 83%) | (μ: 57.3, σ: 17.1) | (μ: 5.4, σ: 3.4) | (α: 12.0, β: 7.3) |  |  |
| Melanoma | 2 | (73%, 27%) | (80%, 20%) | (μ: 49.5, σ: 15.8) | (μ: 1.4, σ: 0.9) | (α: 8.1, β: 37.0) |  |  |

