Article

Estimating Gaussian Copulas with Missing Data with and without Expert Knowledge

by Maximilian Kertel 1,2,* and Markus Pauly 2,3

1 BMW Group, Battery Cell Competence Centre, 80788 Munich, Germany
2 Department of Statistics, TU Dortmund University, 44227 Dortmund, Germany
3 Research Center Trustworthy Data Science and Security, UA Ruhr, 44227 Dortmund, Germany
* Author to whom correspondence should be addressed.
Entropy 2022, 24(12), 1849; https://doi.org/10.3390/e24121849
Submission received: 27 October 2022 / Revised: 7 December 2022 / Accepted: 10 December 2022 / Published: 19 December 2022
(This article belongs to the Special Issue Improving Predictive Models with Expert Knowledge)

Abstract

In this work, we present a rigorous application of the Expectation Maximization algorithm to determine the marginal distributions and the dependence structure in a Gaussian copula model with missing data. We further show how to circumvent a priori assumptions on the marginals with semiparametric modeling and outline how expert knowledge on the marginals and the dependency structure can be included. A simulation study shows that the distribution learned through this algorithm is closer to the true distribution than that obtained with existing methods and that the incorporation of domain knowledge provides benefits.

1. Introduction

Even though the amount of data is increasing due to new technologies, big data are by no means good data. For example, missing values are ubiquitous in various fields, from the social sciences [1] to manufacturing [2]. For explanatory analysis or decision making, one is often interested in the joint distribution of a multivariate dataset, and its estimation is a central topic in statistics [3]. At the same time, there exists background knowledge in many domains that can help to compensate for the potential shortcomings of datasets. For instance, domain experts have an understanding of the causal relationships in the data generation process [4]. It is the scope of this paper to unify expert knowledge and datasets with missing data to derive approximations of the underlying joint distribution.
To estimate the multivariate distribution, we use copulas, where the dependence structure is assumed to belong to a parametric family, while the marginals are estimated nonparametrically. Genest et al. [5] showed that for complete datasets, a two-step approach consisting of the estimation of the marginals with an empirical cumulative distribution function (ecdf) and subsequent derivation of the dependence structure is consistent. This idea is even transferable to high dimensions [6].
In the case of missing values, the situation becomes more complex. Here, nonparametric methods do not scale well with the number of dimensions [7]. On the other hand, assuming that the distribution belongs to a parametric family, it can often be derived by using the EM algorithm [8]. However, this assumption is, in general, restrictive. Due to the encouraging results for complete datasets, there have been several works that have investigated the estimation of the joint distribution under a copula model. The authors of [9,10] even discussed the estimation in a missing-not-at-random (MNAR) setting. While MNAR is less restrictive than missing at random (MAR), it demands the explicit modeling of the missing mechanism [11]. On the contrary, the authors of [12,13] provided results in cases in which data were missing completely at random (MCAR). This strong assumption is rarely fulfilled in practice. Therefore, we assume an MAR mechanism in what follows [11].
Another interesting contribution [14] assumed external covariates, such that the probability of a missing value depended exclusively on them and not on the variables under investigation. They applied inverse probability weighting (IPW) and the two-step approach of [5]. While they proved consistency, it is unclear how this approach can be adapted to a setting without such covariates. IPW for general missing patterns is computationally demanding, and no software implementation exists [15,16]. Thus, IPW is mostly applied with monotone missing patterns that appear, for example, in longitudinal studies [17]. The popular work of [18] proposed an EM algorithm in order to derive the joint distribution in a Gaussian copula model with data MAR [11]. However, their approach had weaknesses:
  • The presented algorithm was inexact. Among other things, the algorithm simplified by assuming that the marginals and the copula could be estimated separately (compare Equation (6) in [18] and Equation (11) in this paper).
  • If there was no a priori knowledge of the parametric family of all marginals, Ref. [18] proposed using the ecdf of the observed data points. Afterwards, they exclusively derived the parameters of the copula. This estimator of the marginals was biased [19,20], which is often overlooked in the copula literature, e.g., [21] (Section 4.3), [22] (Section 3), [23] (Section 3), or [24] (Section 3).
  • The description of the simulation study was incomplete and the results were not reproducible.
The aim of this paper is to close these gaps, and our contributions are the following:
  • We give a rigorous derivation of the EM algorithm under a Gaussian copula model. Similarly to [5], it consists of two separate steps, which estimate the marginals and the copula, respectively. However, these two steps alternate.
  • We show how prior knowledge about the marginals and the dependency structure can be utilized in order to achieve better results.
  • We propose a flexible parametrization of the marginals when a priori knowledge is absent. This allows us to learn the underlying marginal distributions; see Figure 1.
  • We provide a Python library that implements the proposed algorithm.
The structure of this paper is as follows. In Section 2, we review some background information about the Gaussian copula. We proceed by presenting the method (Section 3). In Section 4, we investigate its performance and the effect of domain knowledge in simulation studies. We conclude in Section 5. All technical aspects and proofs in this paper are given in Appendix A and Appendix B.

2. The Gaussian Copula Model

2.1. Notation and Assumptions

In the following, we consider a p-dimensional dataset $\{x_1, \dots, x_N\} \subset \mathbb{R}^p$ of size $N$, where $x_1, \dots, x_N$ are i.i.d. samples from a p-dimensional random vector $X = (X_1, \dots, X_p)$ with a joint distribution function $F$ and marginal distribution functions $F_1, \dots, F_p$. We denote the entries of $x_\ell$ by $x_\ell = (x_{\ell 1}, \dots, x_{\ell p})$, $\ell = 1, \dots, N$. The parameters of the marginals are represented by $\theta = (\theta_1, \dots, \theta_p)$, where $\theta_j$ is the parameter of $F_j$, so we write $F_j^{\theta_j}$, where $\theta_j$ can be a vector itself.
For $\ell \in \{1, \dots, N\}$, we define $\mathrm{obs}(\ell) \subseteq \{1, \dots, p\}$ as the index set of the observed and $\mathrm{mis}(\ell) \subseteq \{1, \dots, p\}$ as the index set of the missing columns of $x_\ell$. Hence, $\mathrm{mis}(\ell) \cup \mathrm{obs}(\ell) = \{1, \dots, p\}$ and $\mathrm{mis}(\ell) \cap \mathrm{obs}(\ell) = \emptyset$. $R = (R_1, \dots, R_p) \in \{0,1\}^p$ is a random vector for which $R_i = 0$ if $X_i$ is missing and $R_i = 1$ if $X_i$ can be observed. Further, we define $\phi$ to be the density function and $\Phi$ to be the distribution function of the one-dimensional standard normal distribution. $\Phi_{\mu, \Sigma}$ stands for the distribution function of a p-variate normal distribution with covariance $\Sigma \in \mathbb{R}^{p \times p}$ and mean $\mu \in \mathbb{R}^p$. To simplify the notation, we define $\Phi_\Sigma := \Phi_{0, \Sigma}$. For a matrix $A \in \mathbb{R}^{p \times p}$, the entry of the i-th row and the j-th column is denoted by $A_{ij}$, while for index sets $S, T \subseteq \{1, \dots, p\}$, $A_{S,T}$ is the submatrix of $A$ with the row numbers in $S$ and column numbers in $T$. For a (random) vector $x$ ($X$), $x_S$ ($X_S$) is the subvector containing the entries with index in $S$.
Throughout, we assume $F$ to be strictly increasing and continuous in every component. Therefore, $F_j$ is strictly increasing and continuous for all $j \in \{1, \dots, p\}$, and so is the existing inverse function $F_j^{-1}$. For $S = \{s_1, \dots, s_k\} \subseteq \{1, \dots, p\}$, we define $F_S : \mathbb{R}^{|S|} \to \mathbb{R}^{|S|}$ by
$$F_S(x_{s_1}, \dots, x_{s_k}) = \left(F_{s_1}(x_{s_1}), \dots, F_{s_k}(x_{s_k})\right). \tag{1}$$
This work assumes that data are Missing at Random (MAR), as defined by [11], i.e.,
$$P_{X,R}\left(R = r \mid X_r = x_r, X_{\bar r} = x_{\bar r}\right) = P_{X,R}\left(R = r \mid X_r = x_r\right), \tag{2}$$
where $X_r := X_{\{i : r_i = 1\}}$ are the observed and $X_{\bar r} := X_{\{i : r_i = 0\}}$ are the missing entries of $X$.

2.2. Properties

Sklar’s theorem [25] decomposes $F$ into its marginals $F_1, \dots, F_p$ and its dependency structure $C$ with
$$F(x_1, \dots, x_p) = C\left(F_1(x_1), \dots, F_p(x_p)\right). \tag{3}$$
Here, $C$ is a copula, which means it is a p-dimensional distribution function with support $[0,1]^p$ whose marginal distributions are uniform. In this paper, we focus on Gaussian copulas, where
$$C_\Sigma(u_1, \dots, u_p) = \Phi_\Sigma\left(\Phi^{-1}(u_1), \dots, \Phi^{-1}(u_p)\right) \tag{4}$$
and $\Sigma$ is a covariance matrix with $\Sigma_{jj} = 1$ for all $j \in \{1, \dots, p\}$. Beyond all multivariate normal distributions, there are distributions with non-normal marginals whose copula is Gaussian. Hence, the Gaussian copula model provides an extension of the normality assumption. Consider a random vector $X$ whose copula is $C_\Sigma$. Under the transformation
$$Z := \Phi^{-1}\left(F(X)\right) := \left(\Phi^{-1}\left(F_1(X_1)\right), \dots, \Phi^{-1}\left(F_p(X_p)\right)\right),$$
it holds that
$$F_Z(z_1, \dots, z_p) = P\left(Z_1 \leq z_1, \dots, Z_p \leq z_p\right) = P\left(X_1 \leq F_1^{-1}(\Phi(z_1)), \dots, X_p \leq F_p^{-1}(\Phi(z_p))\right) = \Phi_\Sigma\left(\Phi^{-1}\left(F_1\left(F_1^{-1}(\Phi(z_1))\right)\right), \dots, \Phi^{-1}\left(F_p\left(F_p^{-1}(\Phi(z_p))\right)\right)\right) = \Phi_\Sigma(z_1, \dots, z_p),$$
and hence, $Z$ is normally distributed with mean $0$ and covariance $\Sigma$. The two-step approaches given in [5,6] use this property and apply the following scheme:
  • Find consistent estimates $\hat F_1, \dots, \hat F_p$ for the marginal distributions $F_1, \dots, F_p$.
  • Find $\Sigma$ by estimating the covariance of the random vector
    $$Z = \left(\Phi^{-1}\left(\hat F_1(X_1)\right), \dots, \Phi^{-1}\left(\hat F_p(X_p)\right)\right).$$
From now on, we assume that the marginals of $X$ have existing density functions $f_1, \dots, f_p$. Then, by using Equation (4) and a change of variables, we can derive the joint density function
$$f_{F_1, \dots, F_p, \Sigma}(x_1, \dots, x_p) = f(x_1, \dots, x_p) = |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\, z^T\left(\Sigma^{-1} - I\right) z\right) \prod_{j=1}^{p} f_j(x_j), \tag{5}$$
where $z := \left(\Phi^{-1}\left(F_1(x_1)\right), \dots, \Phi^{-1}\left(F_p(x_p)\right)\right)$. As for the multivariate normal distribution, we can identify the conditional independencies ([6]) from the inverse of the covariance matrix $K := \Sigma^{-1}$ by using the property
$$K_{jk} = K_{kj} = 0 \iff X_j \perp X_k \mid \left\{X_i : i \in \{1, \dots, p\} \setminus \{j, k\}\right\}. \tag{6}$$
$K$ is called the precision matrix. In order to slim down the notation, we define
$$\Phi^{-1}\left(F_S(x_S)\right) := \left(\Phi^{-1}\left(F_{s_1}(x_{s_1})\right), \dots, \Phi^{-1}\left(F_{s_k}(x_{s_k})\right)\right)$$
and similarly
$$F_S^{-1}\left(\Phi(z_S)\right) := \left(F_{s_1}^{-1}\left(\Phi(z_{s_1})\right), \dots, F_{s_k}^{-1}\left(\Phi(z_{s_k})\right)\right).$$
The former function transforms the data of a Gaussian copula distribution to be normally distributed. The latter mapping takes multivariate normally distributed data and returns data following a Gaussian copula distribution with marginals $F_{s_1}, \dots, F_{s_k}$. The conditional density functions have a closed form.
Proposition 1
(Conditional Distribution of Gaussian Copula). Let $S = \{s_1, \dots, s_k\}$ and $T$ be disjoint index sets such that $T \,\dot\cup\, S = \{1, \dots, p\}$.
  • The conditional density of $X_T \mid X_S = x_S$ is given by
    $$f(x_T \mid X_S = x_S) = |\tilde\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\left(z_T - \tilde\mu\right)^T \tilde\Sigma^{-1}\left(z_T - \tilde\mu\right)\right) \exp\left(\frac{1}{2}\, z_T^T z_T\right) \prod_{j \in T} f_j(x_j),$$
    where $\tilde\mu = \Sigma_{T,S}\,\Sigma_{S,S}^{-1}\, z_S$, $\tilde\Sigma = \Sigma_{T,T} - \Sigma_{T,S}\,\Sigma_{S,S}^{-1}\,\Sigma_{S,T}$, $z_T = \Phi^{-1}\left(F_T(x_T)\right)$, and $z_S = \Phi^{-1}\left(F_S(x_S)\right)$.
  • $\Phi^{-1}\left(F_T(X_T)\right) \mid X_S = x_S$ is normally distributed with mean $\tilde\mu$ and covariance $\tilde\Sigma$.
  • The expectation of $h(X_T)$ with respect to the density $f(x_T \mid X_S = x_S)$ can be expressed by
    $$\int h(x_T)\, f(x_T \mid X_S = x_S)\, dx_T = \int h\left(F_T^{-1}\left(\Phi(z_T)\right)\right)\phi_{\tilde\mu, \tilde\Sigma}(z_T)\, dz_T.$$
Proposition 1 shows that the conditional distribution’s copula is Gaussian as well. More importantly, we can derive an algorithm for sampling from the conditional distribution.
Algorithm 1: Sampling from the conditional distribution of a Gaussian copula
Input: $x_S$, $\Sigma$, $F_1, \dots, F_p$
Result: $m$ samples of $X_T \mid X_S = x_S$
1. Calculate $z_S := \Phi^{-1}\left(F_S(x_S)\right)$.
2. Calculate $\tilde\mu$ and $\tilde\Sigma$ as in Proposition 1 using $z_S$ and $\Sigma$.
3. Draw samples $\{z_1, \dots, z_m\}$ from $\mathcal{N}(\tilde\mu, \tilde\Sigma)$.
4. Return $\{F_T^{-1}(\Phi(z_1)), \dots, F_T^{-1}(\Phi(z_m))\}$.
The very last step follows with Proposition 1, as it holds for any measurable $A \subseteq \mathbb{R}^{|T|}$:
$$P\left(X_T \in A \mid X_S = x_S\right) = \int \mathbf{1}_A(x_T)\, f(x_T \mid X_S = x_S)\, dx_T = \int \mathbf{1}_A\left(F_T^{-1}\left(\Phi(z_T)\right)\right)\phi_{\tilde\mu, \tilde\Sigma}(z_T)\, dz_T.$$
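The following is a minimal Python sketch of Algorithm 1, assuming the marginals are available as frozen SciPy distributions; the function name and interface are illustrative and not those of the accompanying misscop package.

```python
import numpy as np
from scipy.stats import norm

def sample_conditional_gaussian_copula(x_S, S, T, Sigma, marginals, m, rng=None):
    """Hypothetical sketch of Algorithm 1: draw m samples of X_T | X_S = x_S.

    marginals: list of frozen scipy.stats distributions (assumed known here),
    one per column of the p-dimensional model.
    """
    rng = np.random.default_rng(rng)
    # Map the observed entries to the latent normal scale: z_S = Phi^{-1}(F_S(x_S)).
    z_S = norm.ppf([marginals[j].cdf(x) for j, x in zip(S, x_S)])
    # Conditional mean and covariance of the latent Gaussian (Proposition 1).
    Sigma_SS = Sigma[np.ix_(S, S)]
    Sigma_TS = Sigma[np.ix_(T, S)]
    Sigma_TT = Sigma[np.ix_(T, T)]
    mu_tilde = Sigma_TS @ np.linalg.solve(Sigma_SS, z_S)
    Sigma_tilde = Sigma_TT - Sigma_TS @ np.linalg.solve(Sigma_SS, Sigma_TS.T)
    # Draw latent samples and push them through F_T^{-1}(Phi(.)).
    z = rng.multivariate_normal(mu_tilde, Sigma_tilde, size=m)
    u = norm.cdf(z)
    return np.column_stack([marginals[j].ppf(u[:, k]) for k, j in enumerate(T)])
```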

3. The EM Algorithm in the Gaussian Copula Model

3.1. The EM Algorithm

Let $\{y_1, \dots, y_N\} \subset \mathbb{R}^p$ be a dataset following a distribution with parameter $\psi$ and corresponding density function $g^\psi(\cdot)$, where observations are MAR. The EM algorithm [8] finds a local optimum of the log-likelihood function
$$\sum_{\ell=1}^{N} \ln g^{\psi}\left(y_{\ell,\mathrm{obs}(\ell)}\right) = \sum_{\ell=1}^{N} \ln \int g^{\psi}\left(y_{\ell,\mathrm{obs}(\ell)}, y_{\ell,\mathrm{mis}(\ell)}\right)\, dy_{\ell,\mathrm{mis}(\ell)}.$$
After choosing a start value $\psi^0$, it does so by iterating the following two steps.
  • E-Step: Calculate
    $$\lambda\left(\psi \mid y_1, \dots, y_N, \psi^t\right) := \sum_{\ell=1}^{N} \mathbb{E}_{\psi^t}\left[\ln g^{\psi}\left(y_{\ell,\mathrm{obs}(\ell)}, Y_{\mathrm{mis}(\ell)}\right) \,\middle|\, Y_{\mathrm{obs}(\ell)} = y_{\ell,\mathrm{obs}(\ell)}\right] = \sum_{\ell=1}^{N} \lambda\left(\psi \mid y_\ell, \psi^t\right). \tag{7}$$
  • M-Step: Set
    $$\psi^{t+1} = \operatorname*{argmax}_{\psi}\, \lambda\left(\psi \mid y_1, \dots, y_N, \psi^t\right) \tag{8}$$
    and $t = t + 1$.
For our purposes, there are two extensions of interest:
  • If there is no closed formula for the right-hand side of Equation (7), one can apply Monte Carlo integration [26] as an approximation. This is called the Monte Carlo EM algorithm.
  • If $\psi = (\psi_1, \dots, \psi_v)$ and the joint maximization of (8) with respect to $\psi$ is not feasible, Ref. [27] proposed a sequential maximization. Thus, we optimize (8) with respect to $\psi_i$ while holding $\psi_1 = \psi_1^{t+1}, \dots, \psi_{i-1} = \psi_{i-1}^{t+1}, \psi_{i+1} = \psi_{i+1}^{t}, \dots, \psi_v = \psi_v^{t}$ fixed before we continue with $\psi_{i+1}$. This is called the Expectation Conditional Maximization (ECM) algorithm.

3.2. Applying the ECM Algorithm on the Gaussian Copula Model

As we need a full parametrization of the Gaussian copula model for the EM algorithm, we assume parametric marginal distributions $F_1^{\theta_1}, \dots, F_p^{\theta_p}$ with densities $f_1^{\theta_1}, \dots, f_p^{\theta_p}$. According to Equation (5), the joint density with respect to the parameters $\theta = (\theta_1, \dots, \theta_p)$ and $\Sigma$ has the form
$$f^{\theta, \Sigma}(x_1, \dots, x_p) = |\Sigma|^{-\frac{1}{2}} \exp\left(-\frac{1}{2}\, z_\theta^T\left(\Sigma^{-1} - I\right) z_\theta\right) \prod_{j=1}^{p} f_j^{\theta_j}(x_j), \tag{9}$$
where $z_\theta := \left(\Phi^{-1}\left(F_1^{\theta_1}(x_1)\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}(x_p)\right)\right)$. Section 3.3 will describe how we can keep the flexibility for the marginals despite the parametrization. However, first, we outline the EM algorithm for general parametric marginal distributions.

3.2.1. E-Step

Set $K := \Sigma^{-1}$ and $K^t := \left(\Sigma^t\right)^{-1}$. For simplicity, we pick one summand in Equation (7). By Equations (7) and (9), it holds with $\psi = (\theta, \Sigma)$ and $x_\ell$ taking the role of $y_\ell$:
$$\lambda\left(\theta, \Sigma \mid x_\ell, \theta^t, \Sigma^t\right) = \mathbb{E}_{\theta^t, \Sigma^t}\left[\ln f^{\theta, \Sigma}\left(x_{\ell,\mathrm{obs}(\ell)}, X_{\mathrm{mis}(\ell)}\right) \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = -\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\,\mathbb{E}_{\Sigma^t, \theta^t}\left[z_\theta^T\left(K - I\right) z_\theta \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] + \sum_{j=1}^{p} \mathbb{E}_{\Sigma^t, \theta^t}\left[\ln f_j^{\theta_j}(X_j) \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right]. \tag{10}$$
The first and last summand depend only on $\Sigma$ and $\theta$, respectively. Thus, of special interest is the second summand, for which we obtain the following with Proposition 1:
$$\mathbb{E}_{\Sigma^t, \theta^t}\left[z_\theta^T\left(K - I\right) z_\theta \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \int z_{\theta, \theta^t}^T\left(K - I\right) z_{\theta, \theta^t}\;\phi_{\tilde\mu, \tilde\Sigma^t}\left(q_{\mathrm{mis}(\ell)}\right) dq_{\mathrm{mis}(\ell)}, \tag{11}$$
where
$$z_{\theta, \theta^t} := \left(\Phi^{-1}\left(F_1^{\theta_1}\left(\left(F_1^{\theta_1^t}\right)^{-1}\left(\Phi(q_1)\right)\right)\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}\left(\left(F_p^{\theta_p^t}\right)^{-1}\left(\Phi(q_p)\right)\right)\right)\right).$$
Here,
$$\tilde\mu = \Sigma_{\mathrm{mis}(\ell), \mathrm{obs}(\ell)}^{t}\left(\Sigma_{\mathrm{obs}(\ell), \mathrm{obs}(\ell)}^{t}\right)^{-1}\Phi^{-1}\left(F_{\mathrm{obs}(\ell)}^{\theta^t}\left(x_{\ell,\mathrm{obs}(\ell)}\right)\right)$$
and
$$\tilde\Sigma^{t} = \Sigma_{\mathrm{mis}(\ell), \mathrm{mis}(\ell)}^{t} - \Sigma_{\mathrm{mis}(\ell), \mathrm{obs}(\ell)}^{t}\left(\Sigma_{\mathrm{obs}(\ell), \mathrm{obs}(\ell)}^{t}\right)^{-1}\Sigma_{\mathrm{obs}(\ell), \mathrm{mis}(\ell)}^{t}.$$
At this point, the authors of [18] neglected that, in general,
$$F_j^{\theta_j^t} \neq F_j^{\theta_j}, \quad j = 1, \dots, p,$$
holds, and hence, (11) depends not only on $\Sigma$, but also on $\theta$. This led us to reconsider their approach, as we describe below.

3.2.2. M-Step

The joint optimization with respect to θ and Σ is difficult, as there is no closed form for Equation (10). We circumvent this problem by sequentially optimizing with respect to Σ and θ by applying the ECM algorithm. The maximization routine is the following.
  • Set $\Sigma^{t+1} = \operatorname*{argmax}_{\Sigma} \sum_{\ell=1}^{N} \lambda\left(\theta^t, \Sigma \mid x_\ell, \theta^t, \Sigma^t\right)$.
  • Set $\theta^{t+1} = \operatorname*{argmax}_{\theta} \sum_{\ell=1}^{N} \lambda\left(\theta, \Sigma^{t+1} \mid x_\ell, \theta^t, \Sigma^t\right)$.
This is a two-step approach consisting of estimating the copula first and the marginals second. However, both steps are executed iteratively, which is typical for the EM algorithm.

Estimating Σ

As we are maximizing Equation (10) with respect to $\Sigma$ with a fixed $\theta = \theta^t$, the last summand can be neglected. By a change-of-variables argument, we show the following in Theorem A1:
$$\mathbb{E}_{\Sigma^t, \theta^t}\left[z_{\theta^t}^T\left(K - I\right) z_{\theta^t} \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \mathrm{tr}\left(\left(\Sigma^{-1} - I\right)V_\ell\right),$$
where $V_\ell$ depends on $\Sigma^t$ and $z_{\theta^t, \mathrm{obs}(\ell)} = \Phi^{-1}\left(F_{\mathrm{obs}(\ell)}^{\theta^t}\left(x_{\ell,\mathrm{obs}(\ell)}\right)\right)$. Thus, considering all observations, we search for
$$\Sigma^{t+1} = \operatorname*{argmax}_{\Sigma,\ \Sigma_{jj} = 1\ \forall j = 1, \dots, p}\ \frac{1}{N}\sum_{\ell=1}^{N} \lambda\left(\theta^t, \Sigma \mid x_\ell, \theta^t, \Sigma^t\right) = \operatorname*{argmax}_{\Sigma,\ \Sigma_{jj} = 1\ \forall j = 1, \dots, p}\ \frac{1}{N}\sum_{\ell=1}^{N}\left(-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1} V_\ell\right)\right) = \operatorname*{argmax}_{\Sigma,\ \Sigma_{jj} = 1\ \forall j = 1, \dots, p}\left(-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1}\frac{1}{N}\sum_{\ell=1}^{N} V_\ell\right)\right), \tag{12}$$
which only depends on the statistic $S := \frac{1}{N}\sum_{\ell=1}^{N} V_\ell$. Generally, this maximization can be formalized as a convex optimization problem that can be solved by a gradient descent. However, the properties of this estimator are not understood (for example, a scaling of $S$ by $a \in \mathbb{R}_{>0}$ leads to a different solution; see Appendix A.3). To overcome this issue, we instead approximate the solution with the correlation matrix
$$\operatorname*{argmax}_{\Sigma,\ \Sigma_{jj} = 1\ \forall j = 1, \dots, p}\left(-\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\,\mathrm{tr}\left(\Sigma^{-1} S\right)\right) \approx P S P,$$
where $P \in \mathbb{R}^{p \times p}$ is the diagonal matrix with entries $P_{jj} = 1/\sqrt{S_{jj}}$, $j = 1, \dots, p$. This was also proposed in [28] (Section 2.2).
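In code, this update reduces to averaging the per-observation statistics and rescaling the average to a correlation matrix. The following is a minimal sketch under the notation of Theorem A1; the function names are ours, not those of the misscop package, and the statistics are assumed to be expressed in the original column order before averaging.

```python
import numpy as np

def V_statistic(z_obs, mu_tilde, Sigma_tilde):
    """Per-observation statistic V_l of Theorem A1, written for the (w.l.o.g.)
    ordering in which the observed entries come before the missing ones."""
    top = np.hstack([np.outer(z_obs, z_obs), np.outer(z_obs, mu_tilde)])
    bottom = np.hstack([np.outer(mu_tilde, z_obs),
                        Sigma_tilde + np.outer(mu_tilde, mu_tilde)])
    return np.vstack([top, bottom])

def correlation_from_statistic(V_list):
    """Sigma update: average the V_l and rescale the result to a correlation
    matrix P S P, as in the approximation of Equation (12). V_list is assumed
    to hold p x p matrices already permuted back to the original column order."""
    S = np.mean(V_list, axis=0)                 # S = (1/N) * sum_l V_l
    P = np.diag(1.0 / np.sqrt(np.diag(S)))      # P_jj = 1 / sqrt(S_jj)
    return P @ S @ P
```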
In cases in which there is expert knowledge on the dependency structure of the underlying distribution, one can adapt Equation (12) accordingly. We discuss this in more detail in Section 4.4.

Estimating θ

We now focus on finding $\theta^{t+1}$, which is the maximizer of
$$\sum_{\ell=1}^{N} \lambda\left(\theta, \Sigma^{t+1} \mid x_\ell, \theta^t, \Sigma^t\right) = \sum_{\ell=1}^{N} \mathbb{E}_{\theta^t, \Sigma^t}\left[\ln f^{\theta, \Sigma^{t+1}}\left(x_{\ell,\mathrm{obs}(\ell)}, X_{\mathrm{mis}(\ell)}\right) \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \sum_{\ell=1}^{N} \int \ln f^{\theta, \Sigma^{t+1}}\left(x_{\ell,\mathrm{obs}(\ell)}, x_{\mathrm{mis}(\ell)}\right)\, f^{\theta^t, \Sigma^t}\left(x_{\mathrm{mis}(\ell)} \mid X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right) dx_{\mathrm{mis}(\ell)}$$
with respect to $\theta$. As there is, in general, no closed formula for the right-hand side, we use Monte Carlo integration. Again, we start by considering a single observation $x_\ell$ to simplify terms. Employing Algorithm 1, we receive $M$ samples $x_{\mathrm{mis}(\ell),1}, \dots, x_{\mathrm{mis}(\ell),M}$ from the distribution of $X_{\mathrm{mis}(\ell)} \mid X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}$ given the parameters $\theta^t$ and $\Sigma^t$. We set $x_{\mathrm{obs}(\ell),m} = x_{\ell,\mathrm{obs}(\ell)}$ for $m = 1, \dots, M$. Then, by Equation (9),
$$\lambda\left(\theta, \Sigma^{t+1} \mid x_\ell, \theta^t, \Sigma^t\right) \approx C + \frac{1}{M}\sum_{m=1}^{M}\left(-\frac{1}{2}\left(\Phi^{-1}\left(F_1^{\theta_1}(x_{\ell 1,m})\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}(x_{\ell p,m})\right)\right)^T\left(K^{t+1} - I\right)\left(\Phi^{-1}\left(F_1^{\theta_1}(x_{\ell 1,m})\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}(x_{\ell p,m})\right)\right) + \sum_{j=1}^{p}\ln f_j^{\theta_j}(x_{\ell j,m})\right). \tag{13}$$
Hence, considering all observations, we set
$$\theta^{t+1} = \operatorname*{argmax}_{\theta}\ \frac{1}{M}\sum_{\ell=1}^{N}\sum_{m=1}^{M}\left(-\frac{1}{2}\left(\Phi^{-1}\left(F_1^{\theta_1}(x_{\ell 1,m})\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}(x_{\ell p,m})\right)\right)^T\left(K^{t+1} - I\right)\left(\Phi^{-1}\left(F_1^{\theta_1}(x_{\ell 1,m})\right), \dots, \Phi^{-1}\left(F_p^{\theta_p}(x_{\ell p,m})\right)\right) + \sum_{j=1}^{p}\ln f_j^{\theta_j}(x_{\ell j,m})\right). \tag{14}$$
Note that we only use the Monte Carlo samples to update the parameters of the marginal distributions θ . We would also like to point out some interesting aspects about Equations (13) and (14):
  • The summand $\sum_{\ell=1}^{N}\sum_{m=1}^{M}\ln f_j^{\theta_j}(x_{\ell j,m})$ describes how well the marginal distributions fit the (one-dimensional) data.
  • The estimations of the marginals are interdependent. Hence, in order to maximize with respect to θ j , we have to take into account all other components of θ .
  • The first summand adjusts for the dependence structure in the data. If all components are assumed to be independent at step $t+1$, then $K^{t+1} = I$, and this term is 0.
  • More generally, the derivative $\frac{\partial \lambda\left(\theta, \Sigma^{t+1} \mid x_\ell, \theta^t, \Sigma^t\right)}{\partial \theta_j}$ depends on $\theta_k$ if and only if $K_{jk}^{t+1} \neq 0$. This means that if $\Sigma^{t+1}$ implies the conditional independence of columns $j$ and $k$ given all other columns (Equation (6)), the optimal $\theta_j$ can be found without considering $\theta_k$. This, e.g., is the case if we set entries of the precision matrix to 0. Thus, the incorporation of prior knowledge reduces the complexity of the identification of the marginal distributions.
The intuition behind the derived EM algorithm is simple. Given a dataset with missing values, we estimate the dependency structure. With the identified dependency structure, we can derive likely locations of the missing values. Again, these locations help us to find a better dependency structure. This leads to the proposed cyclic approach. The framework of the EM algorithm guarantees the convergence of this procedure to a local maximum for $M \to \infty$ in Equation (14).
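To make the conditional maximization for θ concrete, the following sketch writes out the negative Monte Carlo objective of Equation (14) for a toy parametrization with Gamma marginals; this choice of marginal family and all function names are illustrative assumptions (the paper itself uses the mixture marginals of Section 3.3).

```python
import numpy as np
from scipy.stats import norm, gamma
from scipy.optimize import minimize

def neg_theta_objective(log_theta, completed, K_next):
    """Negative Monte Carlo objective of Equation (14), sketched for an
    illustrative parametrization in which every marginal is Gamma(a_j, scale=b_j).

    completed: (N*M, p) array of completed observations (observed entries repeated,
    missing entries filled in by Algorithm 1 under theta^t, Sigma^t).
    K_next: precision matrix (Sigma^{t+1})^{-1}.
    """
    n, p = completed.shape
    a, b = np.exp(log_theta[:p]), np.exp(log_theta[p:])   # log-parametrization keeps a, b > 0
    u = np.column_stack([gamma.cdf(completed[:, j], a[j], scale=b[j]) for j in range(p)])
    z = norm.ppf(np.clip(u, 1e-10, 1 - 1e-10))             # z_j = Phi^{-1}(F_j^{theta_j}(x_j))
    quad = -0.5 * np.einsum("ij,jk,ik->i", z, K_next - np.eye(p), z)
    logf = sum(gamma.logpdf(completed[:, j], a[j], scale=b[j]) for j in range(p))
    # Averaging over all N*M completions only rescales Equation (14); the argmax is unchanged.
    return -np.mean(quad + logf)

# One conditional maximization step for theta, given K_next = inv(Sigma^{t+1}):
# res = minimize(neg_theta_objective, x0=np.zeros(2 * p), args=(completed, K_next),
#                method="Nelder-Mead")
```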

3.3. Modelling with Semiparametric Marginals

In the case in which the missing mechanism is MAR, the estimation of the marginal distribution using only complete observations is biased. Even worse, any moment of the distribution can be distorted. Thus, one needs a priori knowledge in order to identify the parametric family of the marginals [19,20]. If their family is known, one can directly apply the algorithm of Section 3.2. If this is not the case, we propose the use of a mixture model parametrization of the form
$$F_j^{\theta_j}(x_j) = \frac{1}{g}\sum_{k=1}^{g}\Phi\left(\frac{x_j - \theta_{jk}}{\sigma_j}\right), \qquad \theta_{j1} \leq \dots \leq \theta_{jg}, \quad j = 1, \dots, p, \tag{15}$$
where $\sigma_j$ is a hyperparameter and the ordering of the $\theta_{jk}$ ensures the identifiability.
Using mixture models for density estimation is a well-known idea (e.g., [29,30,31]). As the authors of [31] noted, mixture models vary between being parametric and being non-parametric, where flexibility increases with $g$. It is reasonable to choose Gaussian mixture models, as their density functions are dense in the set of all density functions with respect to the $L_1$-norm [29] (Section 3.2). This flexibility and the provided parametrization make the mixture models a natural choice.

3.4. A Blueprint of the Algorithm

The complete algorithm is summarized in Algorithm 2. For the Monte Carlo EM algorithm, Ref. [26] proposed stabilizing the parameters with a rather small number of samples $M$ and increasing this number substantially in the later steps of the algorithm. This seems reasonable for line 2 of Algorithm 2 as well.
If there is no a priori knowledge about the marginals, we propose that we follow Section 3.3. We choose the initial $\theta^0$ such that the cumulative distribution function of the mixture model fits the ecdf of the observed data points. For an empirical analysis of the role of $g$, see Section 4.3.3. For $\sigma_1, \dots, \sigma_p$, we use a rule of thumb inspired by [3] and set
$$\sigma_j = \frac{1.06\,\hat\sigma_j}{g^{1/5}},$$
where $\hat\sigma_j$ is the standard deviation of the observed data points in the $j$-th component.
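The following sketch shows what this semiparametric parametrization looks like in Python; the function names are ours and purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def mixture_cdf(x, theta_j, sigma_j):
    """Equal-weight Gaussian mixture CDF of Equation (15) with location
    parameters theta_j = (theta_j1 <= ... <= theta_jg) and common scale sigma_j."""
    return np.mean(norm.cdf((np.asarray(x)[..., None] - theta_j) / sigma_j), axis=-1)

def mixture_logpdf(x, theta_j, sigma_j):
    """Corresponding log-density, evaluated component-wise and combined stably."""
    comps = norm.logpdf((np.asarray(x)[..., None] - theta_j) / sigma_j) - np.log(sigma_j)
    return np.logaddexp.reduce(comps, axis=-1) - np.log(len(theta_j))

def rule_of_thumb_sigma(x_observed, g):
    """Rule of thumb sigma_j = 1.06 * sigma_hat_j / g**(1/5), using the standard
    deviation of the observed entries of column j."""
    return 1.06 * np.std(x_observed, ddof=1) / g ** (1 / 5)
```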
Algorithm 2: Blueprint for the EM algorithm for the Gaussian copula model
[The pseudocode of Algorithm 2 is given as a figure in the original article.]
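The following loose Python sketch restates the control flow of the blueprint under our reading of Sections 3.2–3.4 and the settings of Section 4.1. It assumes the helper sketches given earlier in this text (sample_conditional_gaussian_copula, V_statistic, correlation_from_statistic) are in scope, and the callbacks make_marginals and update_theta are hypothetical; it is not the reference implementation of the misscop package.

```python
import numpy as np
from scipy.stats import norm

def em_blueprint(X, theta, Sigma, make_marginals, update_theta,
                 schedule=((20, 20), (5, 1000)), seed=0):
    """X: (N, p) array with np.nan marking missing entries.
    make_marginals(theta) is assumed to return frozen marginal distributions;
    update_theta(completed, K_next, theta) is assumed to solve Equation (14).
    The schedule uses few Monte Carlo samples first and many later [26]."""
    rng = np.random.default_rng(seed)
    for n_steps, M in schedule:
        for _ in range(n_steps):
            marginals = make_marginals(theta)
            V_list, completed = [], []
            for x in X:
                obs = np.flatnonzero(~np.isnan(x))
                mis = np.flatnonzero(np.isnan(x))
                u_obs = np.clip([marginals[j].cdf(x[j]) for j in obs], 1e-10, 1 - 1e-10)
                z_obs = norm.ppf(u_obs)
                if mis.size == 0:
                    V_list.append(np.outer(z_obs, z_obs))
                    completed.append(np.tile(x, (M, 1)))
                    continue
                # E-step pieces: conditional moments (Proposition 1) and completions (Algorithm 1)
                Sigma_SS = Sigma[np.ix_(obs, obs)]
                Sigma_TS = Sigma[np.ix_(mis, obs)]
                mu_t = Sigma_TS @ np.linalg.solve(Sigma_SS, z_obs)
                Sig_t = Sigma[np.ix_(mis, mis)] - Sigma_TS @ np.linalg.solve(Sigma_SS, Sigma_TS.T)
                inv = np.argsort(np.concatenate([obs, mis]))
                V_list.append(V_statistic(z_obs, mu_t, Sig_t)[np.ix_(inv, inv)])
                draws = sample_conditional_gaussian_copula(x[obs], obs, mis, Sigma, marginals, M, rng)
                filled = np.tile(x, (M, 1))
                filled[:, mis] = draws
                completed.append(filled)
            # CM-step 1: correlation update (Equation (12) and its PSP approximation)
            Sigma = correlation_from_statistic(V_list)
            # CM-step 2: marginal-parameter update (Equation (14))
            theta = update_theta(np.vstack(completed), np.linalg.inv(Sigma), theta)
    return theta, Sigma
```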

4. Simulation Study

We analyze the performance of the proposed estimator in two studies. First, we consider scenarios for two-dimensional datasets and check the potential of the algorithm. In the second part, we explore how expert knowledge can be incorporated and how this affects the behavior and performance. The proposed procedure, which is indexed with EM in the figures below, is compared with:
  • Standard COPula Estimator (SCOPE): The marginal distributions are estimated by the ecdf of the observed data points. This was proposed by [18] if the parametric family is unknown, and it is the state-of-the-art approach. Thus, we apply an EM algorithm to determine the correlation structure on the mapped data points
    $$z_{\ell j} = \Phi^{-1}\left(\hat F_j\left(x_{\ell j}\right)\right), \quad \ell = 1, \dots, N,\ j = 1, \dots, p,$$
    where $\hat F_j$ is the ecdf of the observed data points in column $j$. Its corresponding results are indexed with SCOPE in the figures and tables.
  • Known marginals: The distribution of the marginals is completely known. The idea is to eliminate the difficulty of finding them. Here, we apply the EM algorithm for the correlation structure on
    $$z_{\ell j} = \Phi^{-1}\left(F_j\left(x_{\ell j}\right)\right), \quad \ell = 1, \dots, N,\ j = 1, \dots, p,$$
    where F j is the real marginal distribution function. Its corresponding results are indexed with a 0 in the figures and tables.
  • Markov chain–Monte Carlo (MCMC) approach [21]: The author proposed an MCMC scheme to estimate the copula in a Bayesian fashion. To this end, Ref. [21] derived the distribution of the multivariate ranks. The marginals are treated as nuisance parameters. We employed the R package sbgcop, which is available on CRAN, as it provides not only a posterior distribution of the correlation matrix $\Sigma$, but also imputations for missing values. In order to compare the approach with the likelihood-based methods, we set
    $$\hat\Sigma_{MCMC} = \frac{1}{M}\sum_{m=1}^{M}\Sigma_m,$$
    where $\{\Sigma_m : m = 1, \dots, M\}$ are samples from the posterior distribution of the correlation matrix. For the marginals, we defined
    $$\hat F_{j, MCMC}(x) = \frac{1}{MN}\sum_{\ell=1}^{N}\sum_{m=1}^{M}\mathbf{1}\left\{x_{\ell j, m} \leq x\right\},$$
    where $x_{\ell j,m}$ is the $m$-th of the total of $M$ imputations for $x_{\ell j}$, and $x_{\ell j,m} = x_{\ell j}$ for $m = 1, \dots, M$ if $x_{\ell j}$ can be observed. The imputations were drawn from the posterior distribution. The corresponding results are indexed with MCMC in the figures and tables.
Sklar’s theorem shows that the joint distribution can be decomposed into the marginals and the copula. Thus, we analyze them separately.

4.1. Adapting the EM Algorithm

In Section 4.3 and Section 4.4, we chose $g = 15$, for which we saw sufficient flexibility. A sensitivity analysis of the procedure with respect to $g$ can be found in Section 4.3.3. The initial $\theta^0$ was chosen by fitting the marginals to the existing observations, and $\Sigma^0$ was the identity matrix. For the number of Monte Carlo samples $M$, we observed that with $M = 20$, $\theta$ stabilized after around 10 steps. Cautiously, we ran 20 steps before we increased $M$ to 1000, for which we ran another five steps. We stopped the algorithm when the condition $\|\Sigma^{t+1} - \Sigma^{t}\|_1 < 10^{-5}$ was fulfilled.

4.2. Data Generation

We considered a two-dimensional dataset (we would have liked to include the setup of the simulation study of [18]; however, the missing mechanism could neither be extracted from the paper nor was it provided by the authors on request) with a priori unknown marginals $F_1$ and $F_2$, whose copula was Gaussian with the correlation parameter $\rho \in [-1, 1]$. The marginals were chosen to be $\chi^2$ with six and seven degrees of freedom. The data matrix $D \in \mathbb{R}^{N \times 2}$ kept $N$ (complete) observations of the random vector. We enforced the following MAR mechanism:
  • Remove every entry in $D$ with probability $0 \leq p_{MCAR} < 1$. We denote the resulting data matrix (with missing entries) as $D^{MCAR} = \left(D_{\ell j}^{MCAR}\right)_{\ell = 1, \dots, N,\ j = 1, 2}$.
  • If $D_{\ell 1}^{MCAR}$ and $D_{\ell 2}^{MCAR}$ are observed, remove $D_{\ell 2}^{MCAR}$ with probability
    $$P\left(R_2 = 0 \mid X_1 = D_{\ell 1}, X_2 = D_{\ell 2}\right) = P\left(R_2 = 0 \mid X_1 = D_{\ell 1}\right) = \frac{1}{1 + \exp\left(-\beta_0 - \beta_1\,\Phi^{-1}\left(F_1\left(D_{\ell 1}\right)\right)\right)}.$$
    We call the resulting data matrix $D^{MAR}$.
The missing patterns were non-monotone. Aside from p M C A R , the parameters β 0 and β 1 controlled how many entries were absent in the final dataset. Assuming that ρ > 0 , β 1 > 0 , and | β 0 | was not too large, the ecdf of the observed values of X 2 was shifted to the left compared to the true distribution function (changing the signs of β 1 and/or ρ may change the direction of the shift, but the situation is analogous). This can be seen in Figure 1, where we chose N = 200 , ρ = 0.5 , β = ( β 0 , β 1 ) = ( 0 , 2 ) . The marginal distribution of X 1 could be estimated well by the ecdf of the observed data.
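For concreteness, the following is a hedged Python sketch of this data-generating process. The value of p_MCAR and the exact sign convention inside the logistic term follow our reconstruction above and are assumptions; the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm, chi2

def generate_mar_dataset(N, rho, beta, p_mcar=0.1, seed=0):
    """Sketch of the two-dimensional data-generating process of Section 4.2:
    chi^2(6) and chi^2(7) marginals coupled by a Gaussian copula with correlation
    rho, followed by the MCAR + logistic MAR removal scheme described above."""
    rng = np.random.default_rng(seed)
    # Draw from the Gaussian copula and push through the chi^2 percentile functions.
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=N)
    D = np.column_stack([chi2.ppf(norm.cdf(z[:, 0]), df=6),
                         chi2.ppf(norm.cdf(z[:, 1]), df=7)])
    X = D.copy()
    # Step 1: remove every entry independently with probability p_mcar.
    X[rng.random(D.shape) < p_mcar] = np.nan
    # Step 2: if both entries survived, remove the second one with a probability
    # that depends on the first column only (hence MAR).
    z1 = norm.ppf(chi2.cdf(D[:, 0], df=6))
    p_mar = 1.0 / (1.0 + np.exp(-beta[0] - beta[1] * z1))
    both_obs = ~np.isnan(X[:, 0]) & ~np.isnan(X[:, 1])
    X[both_obs & (rng.random(N) < p_mar), 1] = np.nan
    return D, X
```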

4.3. Results

This subsection explores how different specifications of the data-generating process presented in Section 4.2 influenced the estimation of the joint distribution. First, we investigate the influence of the share of missing values (controlled via β ) and the dependency (controlled via ρ ) by fixing the number of observations (denoted by N) to 100. Then, we vary N to study the behavior of the algorithms for larger sample sizes. Afterwards, we carry out a sensitivity analysis of the EM algorithm with respect to g, the number of mixtures. Finally, we study the computational demands of the algorithms.

4.3.1. The Effects of Dependency and Share of Missing Values

We investigate two different choices for the setup in Section 4.2 by setting the parameters to ρ = 0.1 , β = ( 1 , 1 ) and ρ = 0.5 , β = ( 0 , 2 ) . For both, we draw 1000 datasets with N = 100 each and apply the estimators. To evaluate the methods, we look at two different aspects.
First, we compare the estimators for ρ with respect to bias and standard deviations. The results are depicted in the corresponding third columns of Table 1 and are summarized as boxplots in Figure A1 in Appendix B.3. We see that no method is clearly superior. While the EM algorithm has a stronger bias for ρ = 0.5 than that of SCOPE, it also has a smaller standard deviation. The MCMC approach shows the largest bias. As even known marginals ($\hat\rho_0$) do not lead to substantially better estimators compared to SCOPE ($\hat\rho_{SCOPE}$) or the proposed approach ($\hat\rho_{EM}$), we deduce that (at least in this setting) the choice of the estimator for the marginals has almost no influence on the estimation of ρ. MCMC performs notably worse.
Second, we investigate the Cramer–von Mises statistics ω between the estimated and the true marginal distributions ($\omega^1$ for the first marginal, $\omega^2$ for the second marginal). The results are shown in Table 1 (corresponding first two columns) and are summarized as boxplots in Figure A2 in Appendix B.3. While for ρ = 0.1 the proposed estimator behaves only slightly better than SCOPE, we see that the benefit becomes larger in the case of high correlation and more missing values, especially when estimating the second marginal. This is in line with the intuition that if the correlation vanishes, the two random variables $X_1$ and $X_2$ become independent. Thus, the missing value indicator $R_2$ and $X_2$ become independent. (Note that this differs from the case ρ ≠ 0, in which $R_2$ is only conditionally independent of $X_2$ given $X_1$.) In that case, we can estimate the marginal of $X_2$ using the ecdf of the observed data points. Hence, SCOPE's estimates of the marginals should be good for small values of ρ. An illustration can be found in Figure 2. Again, the MCMC approach performs the worst.

4.3.2. Varying the Sample Size N

To investigate the behavior of the methods for larger sample sizes, we repeat the experiment from Section 4.2 with ρ = 0.5, β = (0, 2) for the sample sizes N = 100, 200, 500, 1000. The results are depicted in Table 2 and Figure A3, Figure A4 and Figure A5 in Appendix B.3. The biases of SCOPE and the EM algorithm for ρ seem to vanish for large N, while the MCMC approach remains biased. Regarding the estimation of the true marginals, the approximation of the second marginal via MCMC and SCOPE improves only slowly and is still poor for the largest sample size N = 1000. In contrast, the EM algorithm performs best for small sample sizes, and its means (of ω1 and ω2) and standard deviations (of all three values) move towards 0 for increasing N.

4.3.3. The Impacts of Varying the Number of Mixtures g

The proposed EM algorithm relies on the hyperparameter g, the number of mixture components in Equation (15). To analyze the behavior of the EM algorithm with respect to g, we additionally run the EM algorithm with g = 5 and g = 30 on the 1000 datasets of Section 4.2 for ρ = 0.5, β = (0, 2), and N = 100. We did not adjust the number of steps in the EM algorithm in order to keep the results comparable. The results can be found in Table 3. We see that the choice of g does not have a large effect on the estimation of ρ. However, an increased g leads to better estimates for $X_1$. This is in line with the intuition that the ecdf of the first components is an unbiased estimate of the distribution function of $X_1$, and setting g to the number of samples corresponds to a kernel density estimator. On the other hand, the estimator for $X_2$ benefits slightly from g = 5, as $\omega^2_{EM}$ has a lower mean and standard deviation compared to the choice g = 30. However, this effect is small and almost non-existent when we compare g = 5 with g = 15. As the choice g = 15 leads to better estimates of the first marginal than g = 5, we see this choice as a good compromise for our setting. For applications without prior knowledge, we recommend treating g as an additional tuning parameter (e.g., selected via cross-validation).

4.3.4. Run Time

We analyze the computational demands of the different algorithms by comparing their run times in the study of Section 4.3.1 with ρ = 0.5 and β = (0, 2) (the setting ρ = 0.1 and β = (1, 1) leads to similar results and is omitted). The run times of all presented algorithms depend not only on the dataset, but also on the parameters (e.g., the convergence criterion and $\Sigma^0$ for SCOPE). Thus, we do not aim for an extensive study, but focus on the orders of magnitude. We compare the proposed EM algorithm with a varying number of mixture components (g = 5, 15, 30) with MCMC and SCOPE. The results are shown in Table 4. We see that the EM algorithm has the longest run time, which depends on the number of mixture components g. The MCMC approach and the proposed EM algorithm have a higher computational demand than SCOPE, as they model the interaction between the copula and the marginals. As mentioned at the outset, we could reduce the run time of the EM algorithm by going down to only 10 steps instead of 20.

4.4. Inclusion of Expert Knowledge

In the presence of prior knowledge on the dependency structure, the presented EM algorithm is highly flexible. While information on the marginals can be used to parametrize the copula model, expert knowledge on the dependency structure can be incorporated by adapting Equation (12). In the case of soft constraints on the covariance or precision matrix, one can replace Equation (12) with a penalized covariance estimation, where the penalty reflects the expert assessment [32,33]. Similarly, one can define a prior distribution on the covariance matrices and set Σ t + 1 as the mode of the posterior distribution (the MAP estimate) of Σ given the statistic S of Equation (12).
Another possibility is that we are aware of conditional independencies in the data-generating process. This is, for example, the case when causal relationships are known [4]. To exemplify the latter, we consider a three-dimensional dataset $X$ with the Gaussian copula $C_\Sigma$ and marginals $X_1, X_2, X_3$, which are $\chi^2$ distributed with six, seven, and five degrees of freedom. The precision matrix is set to
$$K = \Sigma^{-1} = \Delta^{1/2}\begin{pmatrix} 1 & 0.5 & 0.5 \\ 0.5 & 1 & 0 \\ 0.5 & 0 & 1 \end{pmatrix}\Delta^{1/2},$$
where $\Delta^{1/2}$ is a diagonal matrix that ensures that the diagonal elements of $\Sigma$ are 1. We see that $X_2$ and $X_3$ are conditionally independent given $X_1$. The missing mechanism is similar to the one in Section 4.2: the missingness of $X_3$ depends on $X_1$ and $X_2$, while the probability of a missing $X_1$ or $X_2$ is independent of the other variables. The mechanism is, again, MAR. Details can be found in Appendix B.2. We compare the proposed method with prior knowledge of the zeros in the precision matrix (indexed by KP, EM in the figures) with the EM, SCOPE, and MCMC algorithms without background knowledge. We again sample 1000 datasets with 50 observations each from the real distribution. The background knowledge on the precision matrix is used by restricting the non-zero elements in Equation (12). To this end, we apply the procedure presented in [34] (Chapter 17.3.1) to find $\Sigma^{t+1}$. The means and standard deviations of the estimates are presented in Table 5.
First, we evaluate the estimated dependency structures by calculating the Frobenius norm of the estimation error $\Sigma - \hat\Sigma$. The EM algorithm with background knowledge (KP, EM) performs best and is more stable than its competitors. Apart from MCMC, the other procedures behave similarly, which indicates again that the exact knowledge of the marginal distributions is not too relevant for identifying the dependency structure. MCMC performs the worst.
Second, we see that the proposed EM estimators return marginal distributions that are closer to the truth, while the estimate with background knowledge (KP, EM) performs the best. Thus, the background knowledge on the copula also transfers into better estimates for the marginal distribution—in particular, for X 3 . This is due to Equation (14) and the comments thereafter. The zeros in the precision structure indicate which other marginals are relevant in order to identify the parameter of a marginal. In our case, X 2 provides no additional information for X 3 . This information is provided to the EM algorithm through the restriction of the precision matrix.
Finally, we compare the EM estimates of the joint distribution. The relative entropy or Kullback–Leibler divergence is a popular tool for estimating the difference between two distributions [35,36], where one of them is absolutely continuous with respect to the other. A lower number indicates a higher similarity. Due to the discrete structure of the marginals of SCOPE and MCMC, we cannot calculate their relative entropy with respect to the truth. However, we would like to analyze how the estimate of the proposed procedure improves if we include expert knowledge. The results are depicted in Table 6. Again, we observe that the incorporation of extra knowledge improves the estimates. This is in line with Table 5, as the estimation of all components in the joint distribution of Equation (3) is improved by the domain knowledge.
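Once both distributions have tractable densities (as for the true model and the EM estimates via Equation (5)), the relative entropy can be approximated by Monte Carlo. The snippet below is a minimal sketch of such an estimate; the paper does not spell out its numerical scheme, so the sampling-based approach and the function names are assumptions.

```python
import numpy as np

def kl_monte_carlo(log_p, log_q, samples_from_p):
    """Monte Carlo estimate of KL(p || q) = E_p[log p(X) - log q(X)].

    log_p, log_q: callables returning log-densities for an array of samples;
    samples_from_p: draws from the reference distribution p."""
    x = np.asarray(samples_from_p)
    return float(np.mean(log_p(x) - log_q(x)))
```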

5. Discussion

In this paper, we investigated the estimation of the Gaussian copula and the marginals with an incomplete dataset, for which we derived a rigorous EM algorithm. The procedure iteratively searches for the marginal distributions and the copula. It is, hence, similar to known methods for complete datasets. We saw that if the data are missing at random, a consistent estimate of a marginal distribution depends on the copula and other marginals.
The EM algorithm relies on a complete parametrization of the marginals. The parametric family of the marginals is, in general, a priori unknown and cannot be identified through the observed data points. For this case, we presented a novel idea of employing mixture models. Although this is practically always a misspecification, our simulation study revealed that the combination of our EM algorithm and marginal mixture models delivers better estimates for the joint distribution than currently used procedures do. In principle, uncertainty quantification of the parameters derived by the proposed EM algorithm can be achieved by bootstrapping [37].
There are different possibilities for incorporating expert knowledge. Information on the parametric family of the marginals can be used for their parametrization. However, causal and structural understandings of the data-generating process can also be utilized [4,38,39]. For example, this can be achieved by restricting the correlation matrix or its inverse, the precision matrix. We presented how one can restrict the non-zero elements of the precision, which enforces conditional independencies. Our simulation study showed that this leads not only to an improved estimate for the dependency structure, but also to better estimates for the marginals. This translates into a lower relative entropy between the real distribution and the estimate. We also discussed how soft constraints on the dependency structure can be included.
We note that the focus of this paper is on estimating the joint distribution without precise specification of its subsequent use. Therefore, we did not discuss imputation methods (see, e.g., [40,41,42,43]). However, Gaussian copula models were employed as a device for multiple imputation (MI) with some success [22,24,44]. The resulting complete datasets can be used for inference. All approaches that we are aware of estimate the marginals by using the ecdf of the observed data points. The findings in Section 4 translate into better draws for the missing values.
Additionally, the joint distribution can be utilized for regressing a potentially multivariate Y on Z even if data are missing. By applying the EM algorithm on X : = Y , Z and by Proposition 1, one even obtains the whole conditional distribution of Y given Z = z .
We have shown how to incorporate a causal understanding of the data-generating process. However, in the potential outcome framework of [45], the derivation of a causal relationship can also be interpreted as a missing data problem in which the missing patterns are “misaligned” [46]. Our algorithm is applicable for this.

Author Contributions

Conceptualization, M.K.; methodology, M.K.; software, M.K.; validation, M.P. and M.K.; formal analysis, M.K.; investigation, M.K.; resources, M.P. and M.K.; data curation, M.K.; writing—original draft preparation, M.K.; writing—review and editing, M.P. and M.K; visualization, M.K.; supervision, M.P.; project administration, M.P. All authors have read and agreed to the published version of the manuscript.

Funding

Maximilian Kertel’s work is funded by the BMW Group.

Data Availability Statement

The data generation procedures of the simulation studies and the proposed algorithm are available at https://github.com/mkrtl/misscop, accessed on 26 October 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Technical Results

Appendix A.1. Proof of Conditional Distribution

Proof of Proposition 1.
We prove the statements in the order of the proposition, which is a multivariate generalization of [47].
1.
We inspect the conditional density function:
$$f(x_T \mid X_S = x_S) = \frac{|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z^T\left(\Sigma^{-1} - I\right) z\right)\prod_{j=1}^{p} f_j(x_j)}{|\Sigma_{S,S}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z_S^T\left(\Sigma_{S,S}^{-1} - I\right) z_S\right)\prod_{j \in S} f_j(x_j)} = \frac{|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z^T \Sigma^{-1} z\right)\exp\left(\frac{1}{2}\, z^T z\right)\prod_{j=1}^{p} f_j(x_j)}{|\Sigma_{S,S}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z_S^T \Sigma_{S,S}^{-1} z_S\right)\exp\left(\frac{1}{2}\, z_S^T z_S\right)\prod_{j \in S} f_j(x_j)} = \frac{|\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z^T \Sigma^{-1} z\right)\exp\left(\frac{1}{2}\, z_T^T z_T\right)\prod_{j \in T} f_j(x_j)}{|\Sigma_{S,S}|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\, z_S^T \Sigma_{S,S}^{-1} z_S\right)}.$$
Using well-known factorization lemmas and using the Schur complement (see, for example, [48] (Section 4.3.4)) applied on $\Sigma^{-1}$, we encounter
$$f(x_T \mid X_S = x_S) = |\tilde\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left(z_T - \tilde\mu\right)^T\tilde\Sigma^{-1}\left(z_T - \tilde\mu\right)\right)\exp\left(\frac{1}{2}\, z_T^T z_T\right)\prod_{j \in T} f_j(x_j). \tag{A1}$$
2.
The distribution of
$$\Phi^{-1}\left(F_T(X_T)\right) \mid X_S = x_S$$
follows with a change-of-variable argument. Using Equation (A1), we observe for any measurable set $A$ that
$$P\left(\Phi^{-1}\left(F_T(X_T)\right) \in A \mid X_S = x_S\right) = \int_{F_T^{-1}\left(\Phi(A)\right)} |\tilde\Sigma|^{-\frac{1}{2}}\exp\left(-\frac{1}{2}\left(z_T - \tilde\mu\right)^T\tilde\Sigma^{-1}\left(z_T - \tilde\mu\right)\right)\exp\left(\frac{1}{2}\, z_T^T z_T\right)\prod_{j \in T} f_j(x_j)\, dx_T = \int_A \phi_{\tilde\mu, \tilde\Sigma}(q_T)\, dq_T,$$
where, in the second equation, we used the transformation $q_T = \Phi^{-1}\left(F_T(x_T)\right)$ and the fact that
$$\left|\det D\left(\Phi^{-1}\circ F_T\right)(x_T)\right| = (2\pi)^{\frac{|T|}{2}}\exp\left(\frac{1}{2}\,\Phi^{-1}\left(F_T(x_T)\right)^T\Phi^{-1}\left(F_T(x_T)\right)\right)\prod_{j \in T} f_j(x_j).$$
3.
This proof is analogous to the one above, and we finally obtain
$$\int h(x_T)\, f(x_T \mid X_S = x_S)\, dx_T = \int h\left(F_T^{-1}\left(\Phi(z_T)\right)\right)\phi_{\tilde\mu, \tilde\Sigma}(z_T)\, dz_T.$$
The result can be generalized to the case in which $S \,\dot\cup\, T \subsetneq \{1, \dots, p\}$. □

Appendix A.2. Closed-Form Solution of the E-Step for θ = θ t

Theorem A1.
We assume w.l.o.g. that $x_\ell = \left(x_{\ell,\mathrm{obs}(\ell)}, x_{\ell,\mathrm{mis}(\ell)}\right)$ and set
$$z_{\theta^t} := \left(z_{\mathrm{obs}(\ell), \theta^t},\, z_{\mathrm{mis}(\ell)}\right) := \left(\Phi^{-1}\left(F_{\mathrm{obs}(\ell)}^{\theta^t}\left(x_{\ell,\mathrm{obs}(\ell)}\right)\right),\, z_{\mathrm{mis}(\ell)}\right).$$
Then, it holds that
$$\mathbb{E}_{\Sigma^t, \theta^t}\left[z_{\theta^t}^T\,\Sigma^{-1} z_{\theta^t} \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \mathrm{tr}\left(\Sigma^{-1} V\right),$$
where $V = \begin{pmatrix} z_{\mathrm{obs}(\ell), \theta^t}\, z_{\mathrm{obs}(\ell), \theta^t}^T & z_{\mathrm{obs}(\ell), \theta^t}\, \tilde\mu^T \\ \tilde\mu\, z_{\mathrm{obs}(\ell), \theta^t}^T & \tilde\Sigma + \tilde\mu\tilde\mu^T \end{pmatrix}$, $\tilde\mu = \Sigma_{\mathrm{mis}(\ell), \mathrm{obs}(\ell)}^{t}\left(\Sigma_{\mathrm{obs}(\ell), \mathrm{obs}(\ell)}^{t}\right)^{-1} z_{\mathrm{obs}(\ell), \theta^t}$, and $\tilde\Sigma = \Sigma_{\mathrm{mis}(\ell), \mathrm{mis}(\ell)}^{t} - \Sigma_{\mathrm{mis}(\ell), \mathrm{obs}(\ell)}^{t}\left(\Sigma_{\mathrm{obs}(\ell), \mathrm{obs}(\ell)}^{t}\right)^{-1}\Sigma_{\mathrm{obs}(\ell), \mathrm{mis}(\ell)}^{t}$.
Proof. 
We define $F^{\theta^t}\left(x_{\mathrm{mis}(\ell)}\right) := F^{\theta^t}\left(x_{\ell,\mathrm{obs}(\ell)}, x_{\mathrm{mis}(\ell)}\right)$. Then,
$$\mathbb{E}_{\Sigma^t, \theta^t}\left[z_{\theta^t}^T\,\Sigma^{-1} z_{\theta^t} \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \mathbb{E}_{\Sigma^t, \theta^t}\left[\Phi^{-1}\left(F^{\theta^t}\left(X_{\mathrm{mis}(\ell)}\right)\right)^T\Sigma^{-1}\Phi^{-1}\left(F^{\theta^t}\left(X_{\mathrm{mis}(\ell)}\right)\right) \,\middle|\, X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right] = \int \Phi^{-1}\left(F^{\theta^t}\left(x_{\mathrm{mis}(\ell)}\right)\right)^T\Sigma^{-1}\Phi^{-1}\left(F^{\theta^t}\left(x_{\mathrm{mis}(\ell)}\right)\right) f^{\theta^t, \Sigma^t}\left(x_{\mathrm{mis}(\ell)} \mid X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right) dx_{\mathrm{mis}(\ell)}.$$
We now apply Proposition 1 and encounter
$$\int \Phi^{-1}\left(F^{\theta^t}\left(x_{\mathrm{mis}(\ell)}\right)\right)^T\Sigma^{-1}\Phi^{-1}\left(F^{\theta^t}\left(x_{\mathrm{mis}(\ell)}\right)\right) f^{\theta^t, \Sigma^t}\left(x_{\mathrm{mis}(\ell)} \mid X_{\mathrm{obs}(\ell)} = x_{\ell,\mathrm{obs}(\ell)}\right) dx_{\mathrm{mis}(\ell)} = \int z_{\theta^t}^T\,\Sigma^{-1} z_{\theta^t}\;\phi_{\tilde\mu, \tilde\Sigma}\left(z_{\mathrm{mis}(\ell)}\right) dz_{\mathrm{mis}(\ell)} = \int \mathrm{tr}\left(z_{\theta^t} z_{\theta^t}^T\,\Sigma^{-1}\right)\phi_{\tilde\mu, \tilde\Sigma}\left(z_{\mathrm{mis}(\ell)}\right) dz_{\mathrm{mis}(\ell)} = \mathrm{tr}\left(\Sigma^{-1}\int z_{\theta^t} z_{\theta^t}^T\,\phi_{\tilde\mu, \tilde\Sigma}\left(z_{\mathrm{mis}(\ell)}\right) dz_{\mathrm{mis}(\ell)}\right).$$
The last integral is understood element-wise. By the first and second moments of $\Phi_{\tilde\mu, \tilde\Sigma}$, it follows that
$$\int z_{\theta^t} z_{\theta^t}^T\,\phi_{\tilde\mu, \tilde\Sigma}\left(z_{\mathrm{mis}(\ell)}\right) dz_{\mathrm{mis}(\ell)} = \int \begin{pmatrix} z_{\mathrm{obs}(\ell), \theta^t} \\ z_{\mathrm{mis}(\ell)} \end{pmatrix}\begin{pmatrix} z_{\mathrm{obs}(\ell), \theta^t} \\ z_{\mathrm{mis}(\ell)} \end{pmatrix}^T\phi_{\tilde\mu, \tilde\Sigma}\left(z_{\mathrm{mis}(\ell)}\right) dz_{\mathrm{mis}(\ell)} = \begin{pmatrix} z_{\mathrm{obs}(\ell), \theta^t}\, z_{\mathrm{obs}(\ell), \theta^t}^T & z_{\mathrm{obs}(\ell), \theta^t}\,\tilde\mu^T \\ \tilde\mu\, z_{\mathrm{obs}(\ell), \theta^t}^T & \tilde\Sigma + \tilde\mu\tilde\mu^T \end{pmatrix}. \qquad \square$$

Appendix A.3. Maximizer of $\operatorname{argmax}_{\Sigma,\,\Sigma_{jj}=1\,\forall j=1,\dots,p}\,\lambda(\theta^t, \Sigma \mid \theta^t, \Sigma^t)$

We are interested in
$$\operatorname*{argmax}_{\Sigma_{jj} = 1\ \forall j = 1, \dots, p} l(\Sigma) := \operatorname*{argmax}_{\Sigma_{jj} = 1\ \forall j = 1, \dots, p}\left(-\log|\Sigma| - \mathrm{tr}\left(\Sigma^{-1} S\right)\right),$$
where $\Sigma, S \in \mathbb{R}^{p \times p}$ are positive definite matrices. Clearly,
$$\Sigma_{jj} = 1 \iff 1 = e_j^T\,\Sigma\, e_j = \mathrm{tr}\left(e_j^T\,\Sigma\, e_j\right) = \mathrm{tr}\left(e_j e_j^T\,\Sigma\right).$$
Using the Lagrangian, we obtain the following function to optimize:
$$L(\Sigma, \lambda) = -\log|\Sigma| - \mathrm{tr}\left(\Sigma^{-1} S\right) + \sum_{j=1}^{p}\lambda_j\left(\mathrm{tr}\left(e_j e_j^T\,\Sigma\right) - 1\right).$$
Applying the identities $\frac{\partial\,\mathrm{tr}(AX)}{\partial X} = A$, $\frac{\partial\,\mathrm{tr}\left(AX^{-1}\right)}{\partial X} = -X^{-1}AX^{-1}$, and $\frac{\partial \log|X|}{\partial X} = X^{-1}$ (for symmetric matrices), we obtain the derivative with respect to $\Sigma$:
$$\frac{\partial L}{\partial \Sigma} = -\Sigma^{-1} + \Sigma^{-1} S\,\Sigma^{-1} + \sum_{j=1}^{p}\lambda_j e_j e_j^T \overset{!}{=} 0.$$
This is equivalent to
$$-K + KSK = -D_\lambda,$$
where $D_\lambda$ is the diagonal matrix with entries $\lambda = (\lambda_1, \dots, \lambda_p)$ and $K := \Sigma^{-1}$. We see that the scaling of $S$ by $a \in \mathbb{R}_{>0}$ leads, in general, to a different solution $K$, and hence, the estimator is not invariant under strictly monotone linear transformations of $S$.
We can also formulate the task as a convex optimization problem:
$$\operatorname*{argmin}_{K,\ \left(K^{-1}\right)_{ii} = 1\ \forall i = 1, \dots, p}\ -\log|K| + \mathrm{tr}(KS).$$

Appendix B. Details of the Simulation Studies

Appendix B.1. Drawing Samples from the Joint Distributions

Appendix B.1.1. Estimators of the Percentile Function

  • In the case of SCOPE, consider the marginal observed data points, which we assume to be ordered, $y_1 \leq \dots \leq y_N$. We use the following linearly interpolated estimator for the percentile function (a code sketch follows this list):
    $$\widehat{F^{-1}}(u) = \begin{cases} y_1 & \text{for } u \leq \frac{1}{N+1}, \\ y_N & \text{for } u > \frac{N}{N+1}, \\ \dfrac{u - \frac{i}{N+1}}{\frac{i+1}{N+1} - \frac{i}{N+1}}\left(y_{i+1} - y_i\right) + y_i & \text{for } u \in \left(\frac{i}{N+1}, \frac{i+1}{N+1}\right]. \end{cases}$$
  • To estimate the percentile function for the mixture models, we choose with equal probability (all Gaussians have equal weight) one component of the mixture and then draw a random number with its mean $\theta_{jk}$ and standard deviation $\sigma_j$, $j = 1, \dots, p$, $k = 1, \dots, g$. In this manner, we generate $N'$ samples $y_1, \dots, y_{N'}$. The estimator for the percentile function is then chosen analogously to the one above. A higher $N'$ leads to a more exact result. We choose $N'$ to be 10,000.
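A hedged Python sketch of both percentile-function estimators; the function names are illustrative.

```python
import numpy as np

def interpolated_percentile_function(y_observed):
    """Linearly interpolated percentile function of Appendix B.1.1: plateaus at
    y_(1) below 1/(N+1) and at y_(N) above N/(N+1), linear in between."""
    y = np.sort(np.asarray(y_observed))
    N = len(y)
    grid = np.arange(1, N + 1) / (N + 1)
    def F_inv(u):
        return np.interp(u, grid, y)   # np.interp clips to y[0] / y[-1] outside the grid
    return F_inv

def mixture_percentile_function(theta_j, sigma_j, n_draws=10_000, seed=0):
    """Percentile-function estimator for a fitted mixture marginal: draw samples
    from the equal-weight mixture and reuse the interpolation above."""
    rng = np.random.default_rng(seed)
    comps = rng.integers(0, len(theta_j), size=n_draws)
    draws = rng.normal(np.asarray(theta_j)[comps], sigma_j)
    return interpolated_percentile_function(draws)
```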

Appendix B.1.2. Sampling

Given an estimator $\hat\rho$ and estimators for the percentile functions $\widehat{F_1^{-1}}, \widehat{F_2^{-1}}$, we obtain $M$ samples from the learned joint distribution with
$$y_\ell = \left(y_{\ell 1}, y_{\ell 2}\right) = \left(\widehat{F_1^{-1}}\left(u_{\ell 1}\right), \widehat{F_2^{-1}}\left(u_{\ell 2}\right)\right) = \left(\widehat{F_1^{-1}}\left(\Phi\left(z_{\ell 1}\right)\right), \widehat{F_2^{-1}}\left(\Phi\left(z_{\ell 2}\right)\right)\right), \quad \ell = 1, \dots, M,$$
where $z_\ell = \left(z_{\ell 1}, z_{\ell 2}\right)$, $\ell = 1, \dots, M$, are draws from a bivariate normal distribution with mean $0$ and covariance $\begin{pmatrix} 1 & \hat\rho \\ \hat\rho & 1 \end{pmatrix}$. In the case of the gold standard, we set $\widehat{F_j^{-1}} = F_j^{-1}$, $j = 1, 2$. We obtain samples of the real underlying distribution by using the correct percentile functions, as in the gold standard, and, additionally, $\hat\rho = \rho$. The procedure for three dimensions is analogous.
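A minimal sketch of this sampling scheme for the two-dimensional case; the function name is illustrative.

```python
import numpy as np
from scipy.stats import norm

def sample_learned_joint(rho_hat, F1_inv, F2_inv, M, seed=0):
    """Draw M samples from a learned two-dimensional Gaussian copula distribution,
    given an estimated correlation rho_hat and estimated percentile functions."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho_hat], [rho_hat, 1.0]], size=M)
    u = norm.cdf(z)
    return np.column_stack([F1_inv(u[:, 0]), F2_inv(u[:, 1])])
```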

Appendix B.2. Missing Mechanism for Section 4.4

The missing mechanism is similar to the two-dimensional case. The marginals are chosen to be $\chi^2$ with six, seven, and five degrees of freedom. The data matrix $D \in \mathbb{R}^{N \times 3}$ keeps $N$ (complete) observations of the random vector. We enforce the following missing data mechanism:
  • Again, we remove every entry in the data matrix $D$ with probability $0 \leq p_{MCAR} < 1$. The resulting data matrix (with missing entries) is denoted as
    $$D^{MCAR} = \left(D_{\ell j}^{MCAR}\right)_{\ell = 1, \dots, N,\ j = 1, 2, 3}.$$
  • If $D_{\ell 1}^{MCAR}$, $D_{\ell 2}^{MCAR}$, and $D_{\ell 3}^{MCAR}$ are observed, we remove $D_{\ell 3}^{MCAR}$ with probability
    $$P\left(R_3 = 0 \mid X_1 = D_{\ell 1}, X_2 = D_{\ell 2}\right) = h\left(D_{\ell 1}, D_{\ell 2}; \beta\right),$$
    where
    $$h\left(D_{\ell 1}, D_{\ell 2}; \beta\right) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \beta_1\,\Phi^{-1}\left(F_1\left(D_{\ell 1}\right)\right) + \beta_2\,\Phi^{-1}\left(F_2\left(D_{\ell 2}\right)\right)\right)\right)}$$
    and $\beta = \left(\beta_0, \beta_1, \beta_2\right)$.
We call the resulting data matrix D M A R . Its missing patterns are, again, non-monotone, and the data are MAR, but not MCAR. In Section 4.4, we set β = ( 0 , 2 , 2 ) .

Appendix B.3. Complementary Figures

Figure A1. Comparison of the algorithms with respect to the correlation ρ. Shown are the boxplots for the Standard Copula Estimator (SCOPE), the proposed EM algorithm (EM), the method based on known marginals (0), and the Markov chain–Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2, where ρ = 0.1, β = (1, 1) are depicted in the left canvas and ρ = 0.5, β = (0, 2) are depicted in the right canvas. The true correlations are indicated by the dashed line.
Figure A2. Comparison of the algorithms with respect to the Cramer–von Mises distance between the estimated and the true first (ω1) and true second marginal distributions (ω2). Shown are the boxplots on a logarithmic scale for the proposed EM algorithm (EM), the Standard Copula Estimator (SCOPE), and the Markov chain–Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2, where ρ = 0.1, β = (1, 1) are depicted in the left canvas and ρ = 0.5, β = (0, 2) are depicted in the right canvas.
Figure A3. Comparison of the algorithms with respect to the correlation (ρ). Shown are the mean (upper canvas) and standard deviation (lower canvas) of the Standard Copula Estimator (SCOPE), the proposed EM algorithm (EM), and the Markov chain–Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2) and for varying sample sizes N = 100, 200, 500, 1000, where the true ρ is 0.5 (dashed line in the upper canvas).
Figure A4. Comparison of the algorithms with respect to the Cramer–von Mises statistic ω1 between the estimated and the true first marginal distribution. Shown are the mean (upper canvas) and standard deviation (lower canvas) of the Standard Copula Estimator (SCOPE), the proposed EM algorithm (EM), and the Markov chain–Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2) and for varying sample sizes of N = 100, 200, 500, 1000.
Figure A5. Comparison of the algorithms with respect to the Cramer–von Mises statistic ω2 between the estimated and the true second marginal distribution. Shown are the mean (upper canvas) and standard deviation (lower canvas) of the Standard Copula Estimator (SCOPE), the proposed EM algorithm (EM), and the Markov chain–Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2) and for varying sample sizes of N = 100, 200, 500, 1000.

References

  1. Thurow, M.; Dumpert, F.; Ramosaj, B.; Pauly, M. Imputing missings in official statistics for general tasks–our vote for distributional accuracy. Stat. J. IAOS 2021, 37, 1379–1390. [Google Scholar] [CrossRef]
  2. Liu, Y.; Dillon, T.; Yu, W.; Rahayu, W.; Mostafa, F. Missing value imputation for industrial IoT sensor data with large gaps. IEEE Internet Things J. 2020, 7, 6855–6867. [Google Scholar] [CrossRef]
  3. Silverman, B. Density Estimation for Statistics and Data Analysis; Routledge: London, UK, 2018. [Google Scholar]
  4. Kertel, M.; Harmeling, S.; Pauly, M. Learning causal graphs in manufacturing domains using structural equation models. arXiv 2022, arXiv:2210.14573. [Google Scholar] [CrossRef]
  5. Genest, C.; Ghoudi, K.; Rivest, L.P. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 1995, 82, 543–552. [Google Scholar] [CrossRef]
  6. Liu, H.; Han, F.; Yuan, M.; Lafferty, J.; Wasserman, L. High-dimensional semiparametric gaussian copula graphical models. Ann. Stat. 2012, 40, 2293–2326. [Google Scholar] [CrossRef]
  7. Titterington, D.; Mill, G. Kernel-based density estimates from incomplete data. J. R. Stat. Soc. Ser. B Methodol. 1983, 45, 258–266. [Google Scholar] [CrossRef]
  8. Dempster, A.; Laird, N.; Rubin, D. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Methodol. 1977, 39, 1–22. [Google Scholar]
  9. Shen, C.; Weissfeld, L. A copula model for repeated measurements with non-ignorable non-monotone missing outcome. Stat. Med. 2006, 25, 2427–2440. [Google Scholar] [CrossRef]
  10. Gomes, M.; Radice, R.; Camarena Brenes, J.; Marra, G. Copula selection models for non-Gaussian outcomes that are missing not at random. Stat. Med. 2019, 38, 480–496. [Google Scholar] [CrossRef]
  11. Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
  12. Cui, R.; Groot, P.; Heskes, T. Learning causal structure from mixed data with missing values using Gaussian copula models. Stat. Comput. 2019, 29, 311–333. [Google Scholar] [CrossRef]
  13. Wang, H.; Fazayeli, F.; Chatterjee, S.; Banerjee, A. Gaussian copula precision estimation with missing values. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, Reykjavik, Iceland, 22–25 April 2014; PMLR: Reykjavik, Iceland, 2014; Volume 33, pp. 978–986. [Google Scholar]
  14. Hamori, S.; Motegi, K.; Zhang, Z. Calibration estimation of semiparametric copula models with data missing at random. J. Multivar. Anal. 2019, 173, 85–109. [Google Scholar] [CrossRef]
  15. Robins, J.M.; Gill, R.D. Non-response models for the analysis of non-monotone ignorable missing data. Stat. Med. 1997, 16, 39–56. [Google Scholar] [CrossRef]
  16. Sun, B.; Tchetgen, E.J.T. On inverse probability weighting for nonmonotone missing at random data. J. Am. Stat. Assoc. 2018, 113, 369–379. [Google Scholar] [CrossRef] [PubMed]
  17. Seaman, S.R.; White, I.R. Review of inverse probability weighting for dealing with missing data. Stat. Methods Med. Res. 2013, 22, 278–295. [Google Scholar] [CrossRef]
  18. Ding, W.; Song, P. EM algorithm in gaussian copula with missing data. Comput. Stat. Data Anal. 2016, 101, 1–11. [Google Scholar] [CrossRef] [Green Version]
  19. Efromovich, S. Adaptive nonparametric density estimation with missing observations. J. Stat. Plan. Inference 2013, 143, 637–650. [Google Scholar] [CrossRef]
  20. Dubnicka, S.R. Kernel density estimation with missing data and auxiliary variables. Aust. N. Z. J. Stat. 2009, 51, 247–270. [Google Scholar] [CrossRef]
  21. Hoff, P. Extending the rank likelihood for semiparametric copula estimation. Ann. Appl. Stat. 2007, 1, 265–283. [Google Scholar] [CrossRef]
  22. Hollenbach, F.; Bojinov, I.; Minhas, S.; Metternich, N.; Ward, M.; Volfovsky, A. Multiple imputation using Gaussian copulas. Sociol. Methods Res. 2021, 50, 1259–1283. [Google Scholar] [CrossRef] [Green Version]
  23. Di Lascio, F.; Giannerini, S.; Reale, A. Exploring copulas for the imputation of complex dependent data. Stat. Methods Appl. 2015, 24, 159–175. [Google Scholar] [CrossRef]
  24. Houari, R.; Bounceur, A.; Kechadi, T.; Tari, A.; Euler, R. A new method for estimation of missing data based on sampling methods for data mining. Adv. Intell. Syst. Comput. 2013, 225, 89–100. [Google Scholar] [CrossRef] [Green Version]
  25. Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris 1959, 8, 229–231. [Google Scholar]
  26. Wei, G.; Tanner, M. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J. Am. Stat. Assoc. 1990, 85, 699–704. [Google Scholar] [CrossRef]
  27. Meng, X.L.; Rubin, D. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 1993, 80, 267–278. [Google Scholar] [CrossRef]
  28. Guo, J.; Levina, E.; Michailidis, G.; Zhu, J. Graphical models for ordinal data. J. Comput. Graph. Stat. 2015, 24, 183–204. [Google Scholar] [CrossRef] [Green Version]
  29. McLachlan, G.; Lee, S.; Rathnayake, S. Finite mixture models. Annu. Rev. Stat. Its Appl. 2019, 6, 355–378. [Google Scholar] [CrossRef]
  30. Hwang, J.; Lay, S.; Lippman, A. Nonparametric multivariate density estimation: A comparative study. IEEE Trans. Signal Process. 1994, 42, 2795–2810. [Google Scholar] [CrossRef] [Green Version]
  31. Scott, D.; Sain, S. Multidimensional density estimation. Handb. Stat. 2005, 24, 229–261. [Google Scholar]
  32. Zuo, Y.; Cui, Y.; Yu, G.; Li, R.; Ressom, H. Incorporating prior biological knowledge for network-based differential gene expression analysis using differentially weighted graphical LASSO. BMC Bioinform. 2017, 18, 99. [Google Scholar] [CrossRef] [Green Version]
  33. Li, Y.; Jackson, S.A. Gene network reconstruction by integration of prior biological knowledge. G3 Genes Genomes Genet. 2015, 5, 1075–1079. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer Series in Statistics; Springer: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  35. Joyce, J.M. Kullback-Leibler divergence. In International Encyclopedia of Statistical Science; Springer: Berlin/Heidelberg, Germany, 2011; pp. 720–722. [Google Scholar]
  36. Contreras-Reyes, J.E.; Arellano-Valle, R.B. Kullback–Leibler divergence measure for multivariate skew-normal distributions. Entropy 2012, 14, 1606–1626. [Google Scholar] [CrossRef] [Green Version]
  37. Honaker, J.; King, G.; Blackwell, M. Amelia II: A program for missing data. J. Stat. Softw. 2011, 45, 1–47. [Google Scholar] [CrossRef]
  38. Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine. WIREs Data Min. Knowl. Discov. 2019, 9, e1312. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  39. Dinu, V.; Zhao, H.; Miller, P.L. Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis. J. Biomed. Inform. 2007, 40, 750–760. [Google Scholar] [CrossRef] [Green Version]
  40. Rubin, D. Multiple imputation after 18+ years. J. Am. Stat. Assoc. 1996, 91, 473–489. [Google Scholar] [CrossRef]
  41. Van Buuren, S. Flexible Imputation of Missing Data; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  42. Ramosaj, B.; Pauly, M. Predicting missing values: A comparative study on non-parametric approaches for imputation. Comput. Stat. 2019, 34, 1741–1764. [Google Scholar] [CrossRef]
  43. Ramosaj, B.; Amro, L.; Pauly, M. A cautionary tale on using imputation methods for inference in matched-pairs design. Bioinformatics 2020, 36, 3099–3106. [Google Scholar] [CrossRef]
  44. Zhao, Y.; Udell, M. Missing value imputation for mixed data via Gaussian copula. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual Event, CA, USA, 6–10 July 2020; Association for Computing Machinery: New York, NY, USA, 2020; pp. 636–646. [Google Scholar] [CrossRef]
  45. Rubin, D.B. Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. J. Am. Stat. Assoc. 2005, 100, 322–331. [Google Scholar] [CrossRef]
  46. Ding, P.; Li, F. Causal inference: A missing data perspective. Stat. Sci. 2017, 33. [Google Scholar] [CrossRef] [Green Version]
  47. Käärik, E.; Käärik, M. Modeling dropouts by conditional distribution, a copula-based approach. J. Stat. Plan. Inference 2009, 139, 3830–3835. [Google Scholar] [CrossRef]
  48. Murphy, K. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
Figure 1. Estimates of the proposed EM algorithm (F̂_i^EM, orange line), the Standard Copula Estimator (F̂_i^SCOPE, blue line; corresponds to the ecdf), and the Markov chain Monte Carlo approach (F̂_i^MCMC, purple line) for the marginals X_i, i = 1, 2, together with the truth (F_i, green line), for a two-dimensional example dataset generated as described in Section 4.2 with N = 200, ρ = 0.5, and β = (0, 2).
Figure 2. Dependency graph for X_1, X_2, and R_2. X_2 is independent of R_2 if either X_1 and X_2 are independent (ρ = 0) or if X_1 and R_2 are independent (β_1 = 0).
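The dependency structure in Figure 2 can be mimicked with a few lines of code. The sketch below is an illustration only: it assumes bivariate Gaussian data with correlation ρ and a logistic model for the missingness indicator R_2 of X_2 driven by the fully observed X_1 through β = (β_0, β_1). The exact simulation design is specified in Section 4.2; the function name simulate_mar and the logistic link are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_mar(n=200, rho=0.5, beta=(0.0, 2.0)):
    """Illustrative MAR mechanism: R_2 depends on the observed X_1 only (cf. Figure 2)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    x = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    # Missingness of X_2 is driven by the fully observed X_1 (MAR, not MCAR).
    p_miss = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x[:, 0])))
    r2 = rng.random(n) < p_miss          # here, R_2 = 1 means X_2 is missing
    x_obs = x.copy()
    x_obs[r2, 1] = np.nan
    return x, x_obs, r2

x_full, x_obs, r2 = simulate_mar()
print(f"share of missing X_2: {r2.mean():.2f}")
```

With ρ = 0 the two variables decouple, and with β_1 = 0 the missingness no longer depends on X_1; in both cases X_2 and R_2 become independent, as stated in the caption.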
Table 1. Comparison of the algorithms with respect to the Cramér–von Mises distance between the estimated and the true first (ω_1) and second (ω_2) marginal distributions, as well as the correlation (ρ). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.3.1.
Setting               Method   Mean                         Standard Deviation
                               ω_1      ω_2      ρ          ω_1      ω_2      ρ
ρ = 0.1, β = (1, 1)   EM       8.55     10.41    0.107      9.30     11.67    0.139
                      0        -        -        0.109      -        -        0.144
                      SCOPE    9.13     12.25    0.105      8.47     11.00    0.144
                      MCMC     18.21    24.99    0.094      16.62    21.89    0.127
ρ = 0.5, β = (0, 2)   EM       8.03     16.48    0.455      8.68     19.47    0.139
                      0        -        -        0.498      -        -        0.138
                      SCOPE    9.06     45.25    0.486      8.25     36.11    0.143
                      MCMC     17.90    59.34    0.393      16.13    57.15    0.131
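The quantities ω_1 and ω_2 in Table 1 (and in the following tables) are Cramér–von Mises distances between an estimated marginal cdf and the true one. A minimal sketch of one common form of this statistic is given below; the exact definition and scaling used in Section 4 of the paper may differ, so the function cramer_von_mises and its normalization should be read as illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from scipy import stats

def cramer_von_mises(F_hat, true_dist, n, grid_size=10_000):
    """omega^2 = n * integral (F_hat(x) - F(x))^2 dF(x), evaluated on a quantile grid.

    F_hat     : callable, estimated cdf
    true_dist : frozen scipy.stats distribution providing .ppf
    n         : sample size used to build F_hat
    """
    # Integrating against dF(x) equals averaging over u ~ Uniform(0, 1)
    # after the substitution x = F^{-1}(u), where F(x) = u.
    u = (np.arange(grid_size) + 0.5) / grid_size
    x = true_dist.ppf(u)
    return n * np.mean((F_hat(x) - u) ** 2)

# Example: ecdf of a complete N(0, 1) sample versus the true N(0, 1) marginal.
rng = np.random.default_rng(1)
sample = rng.normal(size=200)
ecdf = lambda x: np.searchsorted(np.sort(sample), x, side="right") / sample.size
print(cramer_von_mises(ecdf, stats.norm(), n=sample.size))
```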
Table 2. Comparison of the algorithms with respect to the Cramér–von Mises distance between the estimated and the true first (ω_1) and second (ω_2) marginal distributions, as well as the correlation (ρ). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2) and varying sample sizes N = 100, 200, 500, 1000.
N          Method   Mean                         Standard Deviation
                    ω_1      ω_2      ρ          ω_1      ω_2      ρ
N = 100    EM       8.03     16.48    0.455      8.68     19.47    0.139
           0        -        -        0.498      -        -        0.138
           SCOPE    9.06     45.25    0.486      8.25     36.11    0.143
           MCMC     17.90    59.34    0.393      16.13    57.15    0.131
N = 200    EM       4.91     8.53     0.469      5.46     8.88     0.098
           0        -        -        0.500      -        -        0.094
           SCOPE    4.76     37.38    0.493      4.18     25.35    0.096
           MCMC     9.27     42.91    0.370      8.01     36.23    0.089
N = 500    EM       3.01     3.83     0.480      2.92     3.59     0.063
           0        -        -        0.499      -        -        0.060
           SCOPE    2.05     31.92    0.497      1.85     14.95    0.060
           MCMC     4.01     31.41    0.0360     3.49     20.51    0.051
N = 1000   EM       2.25     2.74     0.486      1.92     2.40     0.047
           0        -        -        0.500      -        -        0.042
           SCOPE    1.08     30.60    0.499      0.93     11.13    0.043
           MCMC     1.99     28.13    0.365      1.84     14.49    0.037
Table 3. Comparison of the proposed EM algorithm with respect to the Cramér–von Mises distance between the estimated and the true first (ω_1) and second (ω_2) marginal distributions, as well as the correlation (ρ), for different numbers of mixtures g in Equation (15). Shown are the mean and standard deviation for g = 5, 15, 30 and for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2).
# Mixtures   Mean                         Standard Deviation
             ω_1      ω_2      ρ          ω_1      ω_2      ρ
g = 5        13.82    16.38    0.469      14.17    19.69    0.145
g = 15       8.03     16.48    0.455      8.68     19.47    0.139
g = 30       7.17     18.73    0.454      7.48     20.98    0.140
Table 4. Comparison of the algorithms with respect to the run time in seconds. Shown are the mean and standard deviation of the proposed EM algorithm (EM) with the number of mixtures g set to 5, 15, 30, the Standard Copula Estimator (SCOPE), and the Markov chain Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.2 with ρ = 0.5 and β = (0, 2).
Method         Run Time in Seconds
               Mean      Standard Deviation
EM (g = 5)     21.78     3.27
EM (g = 15)    55.94     11.39
EM (g = 30)    161.57    38.00
SCOPE          0.45      0.11
MCMC           12.98     0.87
Table 5. Comparison of the algorithms with respect to the Cramér–von Mises distance between the estimated and the true first (ω_1), second (ω_2), and third (ω_3) marginal distributions, as well as the correlation matrix (‖Σ̂ − Σ‖_2). Shown are the mean and standard deviation of the proposed EM algorithm (EM), the proposed EM algorithm with prior knowledge on the conditional independencies (KP, EM), the method based on known marginals (0), the Standard Copula Estimator (SCOPE), and the Markov chain Monte Carlo approach (MCMC) for 1000 datasets generated as described in Section 4.4.
Method    Mean                                      Standard Deviation
          ω_1      ω_2      ω_3      ‖Σ̂ − Σ‖_2      ω_1      ω_2      ω_3      ‖Σ̂ − Σ‖_2
EM        12.12    13.38    21.15    0.229          13.89    14.25    22.44    0.113
KP, EM    12.04    13.28    19.66    0.182          13.93    14.37    20.88    0.111
0         -        -        -        0.227          -        -        -        0.108
SCOPE     17.57    17.55    26.69    0.232          16.75    15.55    24.84    0.113
MCMC      36.85    35.70    80.22    0.263          32.82    33.24    78.57    0.140
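The column ‖Σ̂ − Σ‖_2 in Table 5 measures the error between the estimated and the true correlation matrix; it is read here as the spectral norm, which is an assumption (the Frobenius norm is the other common reading, and the paper's simulation section fixes the choice). A one-line check with placeholder matrices:

```python
import numpy as np

sigma_true = np.array([[1.0, 0.5, 0.3],
                       [0.5, 1.0, 0.2],
                       [0.3, 0.2, 1.0]])
sigma_hat = sigma_true + 0.05 * np.eye(3)             # placeholder estimate
err = np.linalg.norm(sigma_hat - sigma_true, ord=2)   # largest singular value of the difference
print(f"spectral-norm error: {err:.3f}")
```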
Table 6. Comparison of the algorithms with respect to the Kullback–Leibler divergence (D_KL) between the true joint distribution (F) and the estimates. Shown are the mean and standard deviation of the proposed EM algorithm (EM) and the proposed EM algorithm with prior knowledge on the conditional independencies (KP, EM) for 1000 datasets generated as described in Section 4.4.
Method    Mean (D_KL(F, ·))   Standard Deviation (D_KL(F, ·))
EM        1.37                0.53
KP, EM    1.26                0.32
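Table 6 evaluates the Kullback–Leibler divergence D_KL(F, ·) between the true joint distribution and the estimates. For the Gaussian dependence part of the model, the divergence between two multivariate normal distributions is available in closed form, and the sketch below implements that formula with placeholder matrices. How D_KL is evaluated for the full semiparametric estimates is described in the paper's simulation section, so this is an illustration under simplifying assumptions rather than the paper's procedure.

```python
import numpy as np

def kl_mvn(mu0, cov0, mu1, cov1):
    """Closed-form KL divergence D_KL(N(mu0, cov0) || N(mu1, cov1))."""
    k = mu0.shape[0]
    cov1_inv = np.linalg.inv(cov1)
    diff = mu1 - mu0
    term_trace = np.trace(cov1_inv @ cov0)
    term_quad = diff @ cov1_inv @ diff
    term_logdet = np.log(np.linalg.det(cov1) / np.linalg.det(cov0))
    return 0.5 * (term_trace + term_quad - k + term_logdet)

# Example: true correlation matrix versus a perturbed estimate (placeholders).
mu = np.zeros(3)
cov_true = np.array([[1.0, 0.5, 0.3],
                     [0.5, 1.0, 0.2],
                     [0.3, 0.2, 1.0]])
cov_est = cov_true + 0.05 * np.eye(3)
print(f"D_KL = {kl_mvn(mu, cov_true, mu, cov_est):.4f}")
```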