Article

Coordinate Descent for Variance-Component Models

1 School of Mathematics and Statistics, University of New South Wales, Sydney, NSW 2052, Australia
2 School of Mathematical and Physical Sciences, Macquarie University, Sydney, NSW 2109, Australia
* Authors to whom correspondence should be addressed.
Algorithms 2022, 15(10), 354; https://doi.org/10.3390/a15100354
Submission received: 30 July 2022 / Revised: 23 September 2022 / Accepted: 23 September 2022 / Published: 28 September 2022
(This article belongs to the Special Issue Algorithms in Monte Carlo Methods)

Abstract

Variance-component models are an indispensable tool for statisticians wanting to capture both random and fixed model effects. They have applications in a wide range of scientific disciplines. While maximum likelihood estimation (MLE) is the most popular method for estimating the variance-component model parameters, it is numerically challenging for large data sets. In this article, we consider the class of coordinate descent (CD) algorithms for computing the MLE. We show that a basic implementation of coordinate descent is numerically costly to implement and does not easily satisfy the standard theoretical conditions for convergence. We instead propose two parameter-expanded versions of CD, called PX-CD and PXI-CD. These novel algorithms not only converge faster than existing competitors (MM and EM algorithms) but are also more amenable to convergence analysis. PX-CD and PXI-CD are particularly well-suited for large data sets—namely, as the scale of the model increases, the performance gap between the parameter-expanded CD algorithms and the current competitor methods increases.

1. Introduction

Linear models that contain both fixed and random effects are referred to as variance-component models or linear mixed models (LMMs). They arise in numerous applications, such as genetics [1], biology, economics, epidemiology and medicine. Broad coverage of existing methodologies and applications of these models can be found in the textbooks [2,3].
In the simplest variance-component setup, we observe a response vector $y \in \mathbb{R}^n$ and a predictor matrix $X \in \mathbb{R}^{n \times p}$, and assume that $y$ is an outcome of a normal random variable $Y \sim N(X\beta, \Omega)$, where the covariance is of the form
$$\Omega = \sum_{i=0}^{m} \gamma_i V_i \in \mathbb{R}^{n \times n}, \qquad \gamma_i \ge 0.$$
The matrices $V_0, \ldots, V_m$ are fixed positive semi-definite matrices, and $V_0$ is non-singular. The unknown mean effects $\beta = (\beta_1, \ldots, \beta_p)$ and variance-component parameters $\gamma = (\gamma_0, \ldots, \gamma_m)$ can be estimated by maximizing the log-likelihood function
$$L(\beta, \gamma) = -\tfrac{1}{2} \ln \det \Omega - \tfrac{1}{2} (y - X\beta)^\top \Omega^{-1} (y - X\beta).$$
If $\Omega$ is known, the maximum likelihood estimator (MLE) for $\beta$ is given by
$$\hat{\beta} = (X^\top \Omega^{-1} X)^{-1} X^\top \Omega^{-1} y.$$
To simplify the MLE for $\gamma$, one can adopt the restricted MLE (REML) method [4] to remove the mean effect from the likelihood by projecting $y$ onto the null space of $X^\top$. Let $\nu = n - p$ and suppose we have the QR decomposition
$$X = \begin{bmatrix} Q_{[p]} & Q_{[\nu]} \end{bmatrix} \begin{bmatrix} R \\ 0_{\nu \times p} \end{bmatrix},$$
where $R$ is a $p \times p$ upper-triangular matrix, $0_{\nu \times p}$ is a $\nu \times p$ zero matrix, $Q_{[p]}$ is an $n \times p$ matrix, $Q_{[\nu]}$ is $n \times \nu$, and $Q_{[p]}$ and $Q_{[\nu]}$ both have orthogonal columns. If we take the Cholesky decomposition $L L^\top$ of the matrix $Q_{[\nu]}^\top V_0 Q_{[\nu]}$, then the transformation $L^{-1} Q_{[\nu]}^\top : \mathbb{R}^n \to \mathbb{R}^\nu$ removes the mean from the response, and we obtain $y' := L^{-1} Q_{[\nu]}^\top y \sim N(0, \Omega')$, where $\Omega' := \gamma_0 I + \sum_{i=1}^{m} \gamma_i V_i' \in \mathbb{R}^{\nu \times \nu}$ and $V_i' := L^{-1} Q_{[\nu]}^\top V_i Q_{[\nu]} L^{-\top}$. After this transformation, the restricted likelihood $-\tfrac{1}{2} \ln \det \Omega' - \tfrac{1}{2} (y')^\top (\Omega')^{-1} y'$ no longer depends on $\beta$.
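As an illustration of this pre-processing step, the transformation can be carried out with standard linear-algebra routines. The sketch below (NumPy, with illustrative function and variable names of our own) follows the QR-plus-Cholesky construction described above; it is a minimal sketch, not the implementation released with the paper.

```python
import numpy as np

def reml_transform(y, X, V):
    """REML pre-processing: remove the fixed effects X*beta from the model.

    y : (n,) response, X : (n, p) design matrix, V : list of (n, n) PSD
    matrices with V[0] non-singular.  Returns the transformed response y'
    and matrices V_i' such that y' ~ N(0, sum_i gamma_i V_i').
    """
    n, p = X.shape
    # Full QR of X; the last n - p columns of Q are orthogonal to the columns of X.
    Q, _ = np.linalg.qr(X, mode="complete")
    Q_nu = Q[:, p:]                                   # n x (n - p)
    # Cholesky factor L with L L^T = Q_nu^T V_0 Q_nu.
    L = np.linalg.cholesky(Q_nu.T @ V[0] @ Q_nu)
    # Apply L^{-1} Q_nu^T via solves instead of explicit inverses.
    y_t = np.linalg.solve(L, Q_nu.T @ y)
    V_t = [np.linalg.solve(L, np.linalg.solve(L, Q_nu.T @ Vi @ Q_nu).T).T
           for Vi in V]                               # V_i' = L^{-1} Q_nu^T V_i Q_nu L^{-T}
    return y_t, V_t                                   # V_t[0] is (numerically) the identity
```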
Henceforth, and without loss of generality, we assume that such a transformation has been performed so that we can focus on minimizing an objective function of the form:
$$-2L(\gamma) = \ln \det \Omega + y^\top \Omega^{-1} y. \tag{2}$$
There is an extensive literature on optimization methods for the log-likelihood expression (2), including Newton's method [5], the Fisher scoring method [6], and the EM and MM algorithms [7,8,9]. Newton's method is known to scale poorly as $n$, or the number of variance components $m + 1$, increases, due to the $O(m n^3) + O(m^3)$ flops required to form and invert the Hessian matrix at each update. Both the EM and MM algorithms have simple updating steps; however, numerical experience shows that they are slow to identify the active set $\{i : \hat{\gamma}_i = 0\}$, where $\hat{\gamma}$ is the MLE.
One class of algorithms yet to be applied to this problem is the class of coordinate-descent (CD) algorithms. These algorithms successively minimize the objective function along coordinate directions and can be effective when each univariate sub-problem is sufficiently simple. Furthermore, only a few assumptions are needed to prove that accumulation points of the iterative sequence are stationary points of the objective function. CD algorithms have been used to solve optimization problems for many years, and their popularity has grown considerably in the past few decades because of their usefulness in data-science and machine-learning tasks. Further in-depth discussions of CD algorithms can be found in [10,11,12].
In this paper, we show that a basic implementation of CD is costly for large-scale problems and does not easily succumb to standard convergence analysis. In contrast, our novel coordinate-descent algorithm called parameter-expanded coordinate descent (PX-CD) is computationally faster and more amenable to theoretical guarantees.
PX-CD is computationally cheaper to run than the basic CD implementation because the first and second derivatives for each sub-problem can be evaluated efficiently with the conjugate-gradient (CG) algorithm [13], whereas the basic CD implementation requires a fresh Cholesky factorization for each coordinate update, each with a complexity of $O(n^3)$. Furthermore, the $V_i$ are often low-rank, and we can take advantage of this by employing the well-known Woodbury matrix identity or a QR transformation within PX-CD to reduce the computational cost of each univariate minimization.
In PX-CD, the extended parameters are treated as a block of coordinates, which is updated at each iteration by searching through a coordinate hyper-plane rather than along single-coordinate directions. We also provide an alternative version of PX-CD, which we call parameter expanded-immediate coordinate descent (PXI-CD), in which the extended coordinate block is updated multiple times within each cycle through the original parameters. We observe numerically that, for large-scale models, the reduction in the number of iterations needed to converge greatly offsets the additional computational cost of each coordinate cycle. As a result, the overall convergence time is better than that of PX-CD.
From a theoretical point of view, we show that the accumulation points of the iterative sequence generated by both the PX-CD and PXI-CD are coordinate-wise minimum points of (2).
We remark that the improved efficiency of the PX-CD algorithm is similar to the well-known superior performance of the PX-DA (parameter-expanded data-augmentation) algorithm [14,15] in the Markov-chain Monte Carlo (MCMC) context—namely, the PX-DA algorithm is often much faster to converge than a basic data-augmentation Gibbs algorithm. This similarity is also the reason for using the same prefix “PX” in our nomenclature.
The remainder of the paper is structured as follows. In Section 2, we describe the basic implementation of CD and provide examples for which it performs unsatisfactorily. In Section 3, we introduce the PX-CD and PXI-CD and show that accumulation points of the iterations are coordinate-wise minima for the optimization. We then discuss their practical implementation and detail how to reduce the computational cost when the V i are low-rank. We also extend the PX-CD algorithm for penalized estimation to perform variable selection. Then, in Section 4, we provide numerical results when V i are computer simulated and when V i are constructed from a real-world genetic data set. We have made our code for these simulations available on GitHub (https://github.com/anantmathur44/CD-for-variance-components) (accessed on 1 July 2022).

2. Basic Coordinate Descent

Recall that, after the REML procedure to remove the mean effect, we have $Y \sim N(0, \Omega)$ and $\Omega = \sum_{i=0}^{m} \gamma_i V_i$, where $V_0 = I_n$. We thus seek to compute:
$$\hat{\gamma} = \arg\min_{\gamma} G(\Omega), \qquad \gamma_i \ge 0, \quad i = 0, 1, \ldots, m, \tag{3}$$
$$G(\Omega) := y^\top \Omega^{-1} y + \ln \det \Omega. \tag{4}$$
CD can be applied to solve this problem by successively minimizing $G$ along coordinate directions. There is significant scope for variation in how the components are selected for updating. In the most conventional version, the algorithm cycles through the parameters in the order $\gamma_0 \to \gamma_1 \to \cdots \to \gamma_m$ and updates each in turn. This version, known as cyclic coordinate descent, is shown in Algorithm 1.
Algorithm 1 Cyclic CD for G ( Ω ) .
If we choose to minimize along one coordinate at a time, then the update of a parameter consists of a line search along the selected coordinate direction. That is, if the selected parameter is component $k$, then the new parameter value is updated as
$$\gamma_k^{(t+1)} \leftarrow \arg\min_{x \ge 0} G\left( \sum_{i=0}^{k-1} \gamma_i^{(t+1)} V_i + \sum_{i=k+1}^{m} \gamma_i^{(t)} V_i + x V_k \right).$$
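Since Algorithm 1 is presented as a figure in the published article, the following schematic sketch (Python, with an abstract `line_search` callable standing in for the univariate minimization of Section 2.1) illustrates the cyclic updating scheme; the function names and stopping rule are ours.

```python
import numpy as np

def cyclic_cd(y, V, gamma_init, line_search, max_cycles=100, tol=1e-10):
    """Cyclic CD for G(Omega) = y^T Omega^{-1} y + ln det(Omega), Omega = sum_k gamma_k V_k.

    line_search(y, Omega_k, V_k) should return argmin_{x >= 0} G(Omega_k + x V_k),
    where Omega_k is Omega with the k-th term removed.
    """
    gamma = np.array(gamma_init, dtype=float)
    Omega = sum(g * Vk for g, Vk in zip(gamma, V))
    for _ in range(max_cycles):
        gamma_old = gamma.copy()
        for k, Vk in enumerate(V):                    # update gamma_0 -> gamma_1 -> ... -> gamma_m
            Omega_k = Omega - gamma[k] * Vk
            gamma[k] = line_search(y, Omega_k, Vk)
            Omega = Omega_k + gamma[k] * Vk
        if np.max(np.abs(gamma - gamma_old)) < tol:   # a simple stopping rule
            break
    return gamma
```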

2.1. Implementation

The non-trivial component of Algorithm 1 is the line search
$$\arg\min_{x \ge 0} G(\Omega_k + x V_k),$$
where $\Omega_k := \Omega - \gamma_k V_k$. When $\Omega_k$ is invertible, this step can be reduced to a one-dimensional algebraic expression that can be solved numerically without repeated evaluations of the terms $y^\top (\Omega_k + x V_k)^{-1} y$ and $\ln \det(\Omega_k + x V_k)$. The simplification is achieved using the generalized eigenvalue decomposition (GEV) [13] to decompose $(V_k, \Omega_k) \to (D_k, U_k)$ such that
$$V_k U_k = \Omega_k U_k D_k \qquad \text{and} \qquad U_k^\top \Omega_k U_k = I_n,$$
where $D_k \in \mathbb{R}^{n \times n}$ is a diagonal matrix with non-negative entries and $U_k \in \mathbb{R}^{n \times n}$ is invertible. Using the above expressions, we obtain the factorization $\Omega_k + x V_k = U_k^{-\top} (I + x D_k) U_k^{-1}$. Therefore, we can express the inverse and log-determinant terms in $G(\Omega_k + x V_k)$ as
$$(\Omega_k + x V_k)^{-1} = U_k (I_n + x D_k)^{-1} U_k^\top,$$
$$\ln \det(\Omega_k + x V_k) = \ln \det(I_n + x D_k) - 2 \ln \det U_k.$$
From (4), we have that,
$$G(\Omega_k + x V_k) = y^\top U_k (I_n + x D_k)^{-1} U_k^\top y + \ln \det(I_n + x D_k) + \text{const}.$$
Let $\alpha_k = U_k^\top y$ and $d_k = \operatorname{diag}(D_k)$. Then, the function to be minimized at the $k$-th component is of the form
$$g_k(x) := G(\Omega_k + x V_k) = \sum_{j=1}^{n} \left[ \frac{\alpha_{k,j}^2}{d_{k,j} x + 1} + \ln(1 + d_{k,j} x) \right] + \text{const}, \tag{7}$$
where $d_{k,j} \ge 0$ for all $j$. Unless $n = 1$, there is in general no closed-form expression for the minimum. We thus resort to a numerical method, such as Newton's method or golden-section search. With the above simplification, the majority of the cost is attributed to the GEV, which has a time complexity of $O(14 n^3)$. Alternatively, one could employ iterative methods without prior simplification of $g_k(x)$ via the GEV. In that case, however, evaluating $g_k(x)$ and its derivatives at each step requires one full Cholesky factorization, costing $O(n^3)$. Either way, for problems where $n$ is large, basic CD is too costly per update.
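For concreteness, the GEV-based line search can be sketched as follows (SciPy's `eigh` solves the generalized eigenvalue problem with the normalization $U_k^\top \Omega_k U_k = I$). This is an illustrative sketch with our own names, not the authors' code; note that, as discussed in Section 2.2, $g_k$ may have several local minima, so a bounded scalar solver is only guaranteed to return one of them.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import minimize_scalar

def basic_cd_line_search(y, Omega_k, V_k, upper=1e6):
    """Approximate argmin_{x >= 0} G(Omega_k + x V_k) via the GEV simplification."""
    # d, U solve V_k U = Omega_k U diag(d) with U^T Omega_k U = I (Omega_k must be PD).
    d, U = eigh(V_k, Omega_k)
    alpha = U.T @ y

    def g(x):                                         # g_k(x) up to an additive constant
        return np.sum(alpha**2 / (d * x + 1.0) + np.log1p(d * x))

    res = minimize_scalar(g, bounds=(0.0, upper), method="bounded")
    return res.x
```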

2.2. Convergence

An interesting question is whether the sequence generated in Algorithm 1 converges to a local minimum of the objective function (3) (which is assumed to have a global minimum). To make a general point about the convergence theory, take $\mathcal{C}$ to be the set of limit points of a coordinate-descent sequence $\{\gamma^{(t)}\}$. It is well known [16,17] that the following existence and uniqueness assumption,
$$g_k(x) = G(\Omega_k + x V_k) \ \text{has a unique (global) minimizer for } x \ge 0, \tag{8}$$
is one of the simplest sufficient conditions ensuring that the set $\mathcal{C}$ is not empty and contains only singletons. Assuming that the existence and uniqueness assumption holds for a given coordinate-descent algorithm and $\gamma^* \in \mathcal{C}$, then, by construction, $\gamma^*$ is a global coordinate-wise minimum of (3); that is, one cannot reduce the value of the objective function by moving along any one of the coordinate directions (i.e., $G(\Omega_k + x V_k) \ge G(\Omega_k + \gamma_k^* V_k)$ for each $k$ and all $x \ge 0$).
Even if $\gamma^* \in \mathcal{C}$ is a coordinate-wise minimum, it may not be a local minimum of (3); it may well be a saddle point of (3). For example, the function $f(\gamma_1, \gamma_2) = \gamma_1^2 + \gamma_2^2 - 5 \gamma_1 \gamma_2$ does not have a local minimum at zero ($f(\gamma, \gamma) = -3\gamma^2$, indicating that a minimum does not even exist); however, it has a coordinate-wise minimum at $(\gamma_1, \gamma_2) = (0, 0)$.
Thus, it is important to remember that the set of local minima of the optimization problem (3) is, in general, a proper subset of $\mathcal{C}$, because $\mathcal{C}$ may also contain coordinate-wise minima that are saddle points of (3). One positive aspect is that the saddle points found by any coordinate-descent algorithm (under the existence and uniqueness assumption) are a subset of all the saddle points of (3), because they are constrained to be coordinate-wise minima. Stated another way, the set $\mathcal{C}$ consists of either local minimizers or saddle points that look like coordinate-wise minimizers.
Either way, there is simply no guarantee that any of the coordinate-descent procedures in this paper will converge to a strict local minimum of (3); see [16,18] for more in-depth discussions. Of course, this issue affects all existing optimization algorithms for (3) (the MM and EM algorithms are simply special cases of coordinate descent, and Newton's method is known to converge only when initialized near a local minimum), and thus it should not be viewed as a particular disadvantage of our proposals.
Unfortunately, the existence and uniqueness assumption (8) cannot be used to establish convergence of the basic CD Algorithm 1, because $g_k(x)$ can exhibit multiple local minima, as illustrated next.
Suppose $n = 2$; then, from Equation (7), we have
$$g_k(x) = \frac{\alpha_{k,1}^2}{d_{k,1} x + 1} + \ln(1 + d_{k,1} x) + \frac{\alpha_{k,2}^2}{d_{k,2} x + 1} + \ln(1 + d_{k,2} x) + \text{const}.$$
In Figure 1, we observe two minimizers for $g_k(x)$ when $\alpha_k = (1.2, 3)$ and $d_k = (10, 0.2)$. This implies that the existence and uniqueness assumption does not hold for the basic implementation of CD, and we cannot ensure that accumulation points of the sequence $\{\gamma^{(t)}\}$ are coordinate-wise minima of $G$.
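The two minimizers in Figure 1 are easy to reproduce numerically; the brackets used below were chosen (by inspecting the figure) to separate the two basins of attraction, and the printed values match the caption up to rounding.

```python
import numpy as np
from scipy.optimize import minimize_scalar

alpha = np.array([1.2, 3.0])
d = np.array([10.0, 0.2])

def g(x):  # g_k(x) up to an additive constant
    return np.sum(alpha**2 / (d * x + 1.0) + np.log1p(d * x))

left = minimize_scalar(g, bounds=(0.0, 0.4), method="bounded")    # first basin
right = minimize_scalar(g, bounds=(0.6, 50.0), method="bounded")  # second basin
print(left.x, left.fun)    # ~0.11, ~10.26
print(right.x, right.fun)  # ~14.35, ~8.66
```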
Example 1 (Sufficient Conditions for a Unique Minimum).
In this example, we show that strong conditions are needed for the existence and uniqueness assumption to hold, making the basic CD Algorithm 1 less than attractive. Suppose $\delta := \frac{1}{n} \sum_{j=1}^{n} \alpha_{k,j}^2 > 1$ and there is a constant $d > 0$ such that $d_{k,j} = d$ for all $j = 1, \ldots, n$. In that case, we have
$$g_k(x) = \frac{n \delta}{d x + 1} + n \ln(1 + d x) + \text{const}.$$
Therefore, the first and second derivatives of $g_k(x)$ are, respectively, given by
$$g_k'(x) = \frac{n d}{1 + d x} \left( 1 - \frac{\delta}{1 + d x} \right) \qquad \text{and} \qquad g_k''(x) = -\frac{n d^2}{(1 + d x)^2} \left( 1 - \frac{2\delta}{1 + d x} \right).$$
Since $d > 0$, it is easy to see that the equation $g_k'(x) = 0$ has a unique (positive) solution, $x_1^* = \frac{\delta - 1}{d}$. Similarly, the solution of $g_k''(x) = 0$ is $x_2^* = \frac{2\delta - 1}{d}$, which is greater than $x_1^*$.
Since $g_k''(x) > 0$ for every $x \in [0, x_2^*)$, $g_k(x)$ is (strictly) convex over $[0, x_2^*)$. As a result, since $0 < x_1^* < x_2^*$, $g_k(x)$ attains a global minimum at $x_1^*$.

3. Parameter-Expanded CD

Since the basic CD Algorithm 1 is both expensive per coordinate update and not amenable to standard convergence analysis [16,17], we consider an alternative called parameter-expanded CD, or PX-CD. We argue that our novel coordinate-descent algorithm is both faster to converge and amenable to simple convergence analysis, because the existence and uniqueness assumption holds. This constitutes our main contribution.
In PX-CD, we use the supporting hyper-plane (first-order Taylor approximation) of the concave matrix function $f(A) = \ln \det A$, where $A \succ 0$. The supporting hyper-plane gives the bound [9]:
$$\ln \det A \le \ln \det C + \operatorname{tr}\!\left( C^{-1} (A - C) \right), \tag{9}$$
where $C \in \mathbb{R}^{n \times n}$ is an arbitrary positive-definite matrix, and equality is achieved if and only if $C = A$. Replacing the log-determinant term in $G$ with the above upper bound, we obtain the surrogate function
$$H(\Omega, C) := y^\top \Omega^{-1} y + \sum_{i=0}^{m} \gamma_i \operatorname{tr}(C^{-1} V_i) + \ln \det(C) - n \ \ge\ G(\Omega).$$
The surrogate function $H$ has $C$ as an extra variable in our optimization, which we set to be of the form
$$C = \sum_{i=0}^{m} \tilde{\gamma}_i V_i,$$
where $\tilde{\gamma} = (\tilde{\gamma}_0, \tilde{\gamma}_1, \ldots, \tilde{\gamma}_m)$ are latent parameters. Similar to the MM algorithmic recipe [9], we then jointly minimize the surrogate function $H$ with respect to both $\gamma$ and $\tilde{\gamma}$ using CD.
The most apparent way of selecting the coordinates is to cyclically update them in the order
$$\gamma_0 \to \gamma_1 \to \cdots \to \gamma_m \to \tilde{\gamma},$$
where the last update is a block update of the entire block $\tilde{\gamma}$. In other words, the expanded parameters $\tilde{\gamma}$ are treated as a block of coordinates that is updated in each cycle by searching through a coordinate hyper-plane rather than along single-coordinate directions. We refer to a full completion of updates in a single ordering as a “cycle” of updates. Suppose the initial guess for the parameters is $(\gamma^{(0)}, \tilde{\gamma}^{(0)})$; then, at the end of cycle $t$, we denote the updated parameters by $(\gamma^{(t)}, \tilde{\gamma}^{(t)})$. In Theorem 1, we state that, under certain conditions, the sequence $\{(\gamma^{(t)}, \tilde{\gamma}^{(t)})\}_{t \ge 0}$ generated by PX-CD has limit points that are coordinate-wise minima of $G$.
Let $\Omega^{(t)} = \sum_{i=0}^{m} \gamma_i^{(t)} V_i$ be the updated covariance matrix after the $m+1$ original parameters have been updated in cycle $t$. Then, since the inequality in (9) achieves equality if and only if $C = \Omega$, the update for the expanded block of parameters $\tilde{\gamma}$ in cycle $t$ is
$$\tilde{\gamma}^{(t)} = \arg\min_{\tilde{\gamma}} H\Big( \Omega^{(t)}, \sum_{i=0}^{m} \tilde{\gamma}_i V_i \Big) = \gamma^{(t)}.$$
In practice, we simply store $C^{(t)} = \Omega^{(t)}$ at the end of each cycle.
Minimizing $H$ with respect to the $k$-th component of the original parameter $\gamma$ yields a function of the form
$$h_k(x) := H(\Omega_k + x V_k, C) = y^\top (\Omega_k + x V_k)^{-1} y + x \operatorname{tr}(C^{-1} V_k) + \text{const}, \qquad x \ge 0.$$
One of the main advantages of the PX-CD procedure over the basic coordinate descent in Algorithm 1 is that the optimization along each coordinate has a unique minimum.
Lemma 1.
$h_k(x)$ has a unique minimizer for $x \ge 0$.
Proof. 
We now show that, on $[0, \infty)$, the function $h_k(x)$ is either strictly convex or a linear function with a strictly positive gradient.
We first consider the case where $\Omega_k$ is invertible. From [13], we have the GEV decomposition $(V_k, \Omega_k) \to (D_k, U_k)$, where $D_k \in \mathbb{R}^{n \times n}$ is a diagonal matrix with non-negative entries and $U_k \in \mathbb{R}^{n \times n}$ is invertible. In a similar fashion to the simplified basic CD expression (7), let $\alpha_k = U_k^\top y$ and $d_k = \operatorname{diag}(D_k)$. Then, $h_k(x)$ can be simplified to
$$h_k(x) = \sum_{j=1}^{n} \frac{\alpha_{k,j}^2}{d_{k,j} x + 1} + x \operatorname{tr}(C^{-1} V_k) + \text{const}. \tag{11}$$
We then obtain the first and second derivatives,
$$h_k'(x) = -\sum_{j=1}^{n} \frac{\alpha_{k,j}^2 d_{k,j}}{(d_{k,j} x + 1)^2} + \operatorname{tr}(C^{-1} V_k), \qquad h_k''(x) = \sum_{j=1}^{n} \frac{2 \alpha_{k,j}^2 d_{k,j}^2}{(d_{k,j} x + 1)^3},$$
where $d_{k,j} \ge 0$. If there exists $j$ such that $\alpha_{k,j} d_{k,j} \ne 0$, then $h_k''(x) > 0$ for $x \ge 0$; hence, $h_k$ is strictly convex and attains a unique global minimizer $x^* \in [0, \infty)$. Suppose instead that $\alpha_{k,j} d_{k,j} = 0$ for all $j = 1, \ldots, n$; then $h_k'(x) = \operatorname{tr}(C^{-1} V_k)$. If we can show that $\operatorname{tr}(C^{-1} V_k) > 0$, then $h_k(x)$ is strictly increasing on $[0, \infty)$, and $x^* = 0$ is the unique global minimizer for $x \in [0, \infty)$.
We note that the matrix $C^{-1}$ is positive-definite, since $C = \sum_{i=0}^{m} \tilde{\gamma}_i V_i$ is invertible and positive semi-definite. Therefore, the symmetric square-root factorization $C^{-1} = C^{-1/2} C^{-1/2}$ exists, and
$$\operatorname{tr}(C^{-1} V_k) = \operatorname{tr}(C^{-1/2} V_k C^{-1/2}),$$
by the cyclic invariance of the trace. The matrix $C^{-1/2} V_k C^{-1/2}$ is positive semi-definite, as $z^\top C^{-1/2} V_k C^{-1/2} z = \| V_k^{1/2} C^{-1/2} z \|_2^2 \ge 0$ for all $z \in \mathbb{R}^n$. Since $C^{-1/2} V_k C^{-1/2}$ is a non-zero positive semi-definite matrix, $\operatorname{tr}(C^{-1} V_k) > 0$, and $h_k(x)$ has a strictly positive slope, which implies that $x^* = 0$ is the unique global minimizer for $x \in [0, \infty)$.
Now consider the case $\dim(\operatorname{span}\{V_1, \ldots, V_m\}) = r < n$. Assuming $\gamma_0 > 0$, $\Omega_k$ is invertible except when $k = 0$. When $\Omega_0$ is singular, a simplified expression of the form (11) may be difficult to find. Instead, we take the singular value decomposition (SVD) of the symmetric matrix $\Omega_0$,
$$\Omega_0 = Q \Lambda Q^\top, \qquad \Lambda = \begin{bmatrix} \operatorname{diag}(\lambda_1, \ldots, \lambda_r) & 0 \\ 0 & 0 \end{bmatrix},$$
where $\lambda_1, \ldots, \lambda_r > 0$ are the real positive eigenvalues of $\Omega_0$, and the matrix $Q \in \mathbb{R}^{n \times n}$ is orthogonal. Then, we can express the inverse as
$$(\Omega_0 + x I)^{-1} = Q (\Lambda + x I_n)^{-1} Q^\top.$$
If we assume $y \notin \operatorname{span}\{V_1, \ldots, V_m\}$, then $\alpha = Q^\top y$ has a non-zero entry $\alpha_j$ for some $j > r$. Then,
$$h_0(x) = \sum_{j=1}^{r} \frac{\alpha_j^2}{\lambda_j + x} + \sum_{j=r+1}^{n} \frac{\alpha_j^2}{x} + x \operatorname{tr}(C^{-1} V_0) + \text{const}$$
and
$$h_0''(x) = \sum_{j=1}^{r} \frac{2 \alpha_j^2}{(\lambda_j + x)^3} + \sum_{j=r+1}^{n} \frac{2 \alpha_j^2}{x^3} > 0$$
when $x > 0$. Therefore, $h_0(x)$ still attains a unique minimizer, as the function is strictly convex for $x > 0$ and tends to $+\infty$ as $x \downarrow 0$.   □
The lemma establishes the analogue of the existence and uniqueness condition (8) for the surrogate, and thus accumulation (limit) points of the CD iterates are coordinate-wise minimum points. The details of the optimization follow.

3.1. Univariate Minimization via Newton’s Method

Unlike the basic CD Algorithm 1, for which each coordinate update costs $O(n^3)$, here we show that a coordinate update for the PX-CD algorithm costs only $O(j n^2)$ for some constant $j$, where typically $j \ll n$.
The function $h_k(x)$ can be minimized via the second-order Newton's method, which numerically finds the root of $h_k'(x)$. The basic algorithm starts with an initial guess $x_0$ of the root, and then
$$x_{n+1} = x_n - h_k'(x_n) \left[ h_k''(x_n) \right]^{-1} \tag{14}$$
are successively better approximations. The algorithm can be terminated once successive iterates are sufficiently close together. The first and second derivatives of $h_k$ are given by
$$h_k'(x) = -y^\top (\Omega_k + x V_k)^{-1} V_k (\Omega_k + x V_k)^{-1} y + \operatorname{tr}(C^{-1} V_k), \tag{15}$$
$$h_k''(x) = 2\, y^\top (\Omega_k + x V_k)^{-1} V_k (\Omega_k + x V_k)^{-1} V_k (\Omega_k + x V_k)^{-1} y, \tag{16}$$
where we used the derivative of a matrix inverse, which implies that
$$\frac{\partial (\Omega_k + x V_k)^{-1}}{\partial x} = -(\Omega_k + x V_k)^{-1} \, \frac{\partial (\Omega_k + x V_k)}{\partial x} \, (\Omega_k + x V_k)^{-1}.$$
As in the basic CD implementation, computing the algebraic expression (11) via the GEV is expensive. Evaluating (15) and (16) by explicitly calculating $(\Omega_k + x V_k)^{-1}$ is also expensive for large $n$, with time complexity $O(n^3)$. Instead, we utilize the conjugate gradient (CG) algorithm [13] to solve linear systems efficiently. At each iteration of Newton's method, we approximately solve
$$(\Omega_k + x V_k)\, b = y, \qquad (\Omega_k + x V_k)\, c = V_k b,$$
and store the solutions in $b$ and $c$, respectively, via the CG algorithm. Generally, the errors $\| b - (\Omega_k + x V_k)^{-1} y \|$ and $\| c - (\Omega_k + x V_k)^{-1} V_k b \|$ can be made small with $l \ll n$ CG iterations, where each iteration requires a matrix-vector multiplication with an $n \times n$ matrix. The CG algorithm has complexity $O(l n^2)$ and can easily be implemented with standard linear algebra packages. With the stored approximate solutions, we evaluate the first and second derivatives as
$$h_k'(x) = -b^\top V_k b + \operatorname{tr}(C^{-1} V_k), \qquad h_k''(x) = 2\, b^\top V_k c.$$
Before initiating Newton's method, we can check whether $k$ is in the active constraint set $\{k : \hat{\gamma}_k = 0\}$. By Lemma 1, if $h_k'(0) \ge 0$, then $h_k(x)$ is non-decreasing on $[0, \infty)$; hence, $h_k(0)$ is the global minimum for $x \in [0, \infty)$, and we set $\gamma_k^{(t+1)} = 0$ if we are in cycle $t+1$ of PX-CD. If $h_k'(0) < 0$, we initiate Newton's method at the current value of the variance component, $x_0 = \gamma_k^{(t)}$. If $\dim(\operatorname{span}\{V_1, \ldots, V_m\}) < n$, we require $\gamma_0 > 0$ so that $\Omega$ is invertible. In this case, $k = 0$ cannot be in the active constraint set, and we immediately initiate Newton's method at the starting point $x_0 = \gamma_0^{(t)}$. In rare cases, $h_k(x)$ is sufficiently flat at $x_n$ that (14) significantly oversteps the location of the minimizer and returns an approximation $x_{n+1} < 0$; in this case, we dampen the step size until $x_{n+1} > 0$.
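Putting the above together, the coordinate update can be sketched as follows (using `scipy.sparse.linalg.cg` for the two linear solves). The function below is an illustrative sketch under the assumptions of this section, not the released implementation; `trace_term` stands for $\operatorname{tr}(C^{-1} V_k)$, which is computed outside the update.

```python
import numpy as np
from scipy.sparse.linalg import cg

def px_cd_update(y, Omega_k, V_k, trace_term, x_start, max_newton=50, tol=1e-8):
    """Minimize h_k(x) = y^T (Omega_k + x V_k)^{-1} y + x * tr(C^{-1} V_k) + const over x >= 0."""
    def derivatives(x):
        A = Omega_k + x * V_k
        b, _ = cg(A, y)                    # b ~ A^{-1} y
        c, _ = cg(A, V_k @ b)              # c ~ A^{-1} V_k b
        return -b @ (V_k @ b) + trace_term, 2.0 * b @ (V_k @ c)   # h_k'(x), h_k''(x)

    # Active-set check (assumes Omega_k is invertible; skipped for k = 0 in the singular case).
    h1, _ = derivatives(0.0)
    if h1 >= 0.0:
        return 0.0
    x = max(x_start, 0.0)
    for _ in range(max_newton):
        h1, h2 = derivatives(x)
        step = h1 / h2
        x_new = x - step
        while x_new < 0.0:                 # dampen the Newton step if it oversteps below zero
            step *= 0.5
            x_new = x - step
        if abs(x_new - x) <= tol * (1.0 + x):
            return x_new
        x = x_new
    return x
```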

3.2. Updating Regime

We now consider an alternative to the cyclic ordering of updates. Suppose we update the block $\tilde{\gamma}$ after every coordinate update; that is, the updating order of one complete cycle is
$$\gamma_0 \to \tilde{\gamma} \to \gamma_1 \to \tilde{\gamma} \to \cdots \to \gamma_m \to \tilde{\gamma}.$$
This ordering regime satisfies the “essentially cyclic” condition, whereby, in every stretch of $2(m+1)$ updates, each component is updated at least once. We refer to CD with this ordering as parameter expanded-immediate coordinate descent (PXI-CD).
In practice, this ordering implies updating the matrix $C$ after every update made to each $\gamma_k$. Since the expression for $h_k(x)$ requires one to evaluate $\operatorname{tr}(C^{-1} V_k)$, and $C$ is updated after every coordinate, we must re-compute $C^{-1}$ for each $k$. This implies that each cycle of PXI-CD is more expensive than a cycle of PX-CD. However, we observe that, in situations where the $V_i$ are full-rank and $n$ is sufficiently large, the number of cycles needed to converge is significantly smaller than that required by PX-CD and basic CD.
This makes PXI-CD the most time-efficient algorithm when the scale of the problem is large. In Section 3.3, we show that, when the $V_i$ are low-rank, re-computing $\operatorname{tr}(C^{-1} V_k)$ comes at no additional cost through the use of the Woodbury matrix identity. However, in this low-rank scenario, the performance gain from PXI-CD is not as significant as when the $V_i$ are full-rank, and PXI-CD and PX-CD show similar performance. Algorithm 2 summarizes both the PX-CD and PXI-CD methods for obtaining $\hat{\gamma}$.
Algorithm 2 PX-CD and PXI-CD for G ( Ω )
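Algorithm 2 appears as an image in the published article, so we include a schematic sketch of its cycle structure (reusing the illustrative `px_cd_update` from Section 3.1); the `immediate` flag switches between PX-CD and PXI-CD. This is a sketch of the logic described in the text, with our own naming, rather than the authors' code.

```python
import numpy as np

def parameter_expanded_cd(y, V, gamma_init, px_cd_update, immediate=False,
                          max_cycles=200, tol=1e-10):
    """PX-CD (immediate=False) and PXI-CD (immediate=True) for G(Omega).

    V[0] is the identity and gamma_init a non-negative starting vector.
    px_cd_update(y, Omega_k, V_k, trace_term, x_start) minimizes h_k as in Section 3.1.
    """
    gamma = np.array(gamma_init, dtype=float)
    Omega = sum(g * Vk for g, Vk in zip(gamma, V))
    C_inv = np.linalg.inv(Omega)               # expanded block: C is stored via its inverse
    for _ in range(max_cycles):
        gamma_old = gamma.copy()
        for k, Vk in enumerate(V):
            Omega_k = Omega - gamma[k] * Vk
            trace_term = np.trace(C_inv @ Vk)  # tr(C^{-1} V_k)
            gamma[k] = px_cd_update(y, Omega_k, Vk, trace_term, gamma[k])
            Omega = Omega_k + gamma[k] * Vk
            if immediate:                      # PXI-CD: refresh C after every coordinate
                C_inv = np.linalg.inv(Omega)
        if not immediate:                      # PX-CD: refresh C once per cycle
            C_inv = np.linalg.inv(Omega)
        if np.max(np.abs(gamma - gamma_old)) < tol:
            break
    return gamma
```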
As mentioned previously, the novel parameter-expanded coordinate-descent algorithms, PX-CD and PXI-CD, are both amenable to standard convergence analysis.
Theorem 1 (PX-CD and PXI-CD Limit Points).
For both PX-CD and PXI-CD in Algorithm 2, let $\{\gamma^{(t)}, \tilde{\gamma}^{(t)}\}_{t \ge 0}$ be the coordinate-descent sequence. Then, either $G(\sum_k \gamma_k^{(t)} V_k) \to -\infty$, or every limit point of $\{\gamma^{(t)}\}_{t \ge 0}$ is a coordinate-wise minimum of (3). If we further assume that $y \notin \operatorname{span}\{V_1, \ldots, V_m\}$, where $\dim(\operatorname{span}\{V_1, \ldots, V_m\}) < n$, then the sequence $\{\gamma^{(t)}\}_{t \ge 0}$ is bounded and $G(\sum_k \gamma_k^{(t)} V_k) \to -\infty$ is ruled out.
Proof. 
Recall that $\Omega = \sum_{i=0}^{m} \gamma_i V_i$ and $C = \sum_{i=0}^{m} \tilde{\gamma}_i V_i$. Denote $(x_1, \ldots, x_{m+1}) := \gamma$ and $x_{m+2} := \tilde{\gamma} \in \mathbb{R}^{m+1}$, as well as $x := (x_1, \ldots, x_{m+1}, x_{m+2}) \in \mathbb{R}^{2(m+1)}$. We can rewrite the optimization problem (3) in the penalized form
$$f(x) := f_0(x) + \sum_{k=1}^{m+1} f_k(x_k) + f_{m+2}(x_{m+2}),$$
where $f_0(x) := H(\Omega, C)$ and
$$f_k(x) := \begin{cases} 0, & x \ge 0, \\ \infty, & x < 0, \end{cases} \quad k = 1, \ldots, m+1, \qquad f_{m+2}(x) := \begin{cases} 0, & x \in [0, \infty)^{m+1}, \\ \infty, & x \notin [0, \infty)^{m+1}. \end{cases}$$
We can then apply the results in [18], which state that every limit point of $\{\gamma^{(t)}\}_{t \ge 0}$ is a coordinate-wise minimum provided that:
  • Each function $x_k \mapsto f(x)$, $k = 1, \ldots, m+1$, and $x_{m+2} \mapsto f(x)$ has a unique minimum. For $k = 1, \ldots, m+1$, this has already been verified in Lemma 1. For $x_{m+2} \mapsto f(x)$, we simply recall that $H(\Omega, C) \ge G(\Omega)$, with equality achieved if and only if $C = \Omega$, or equivalently $x_{m+2} = (x_1, \ldots, x_{m+1})$.
  • Each $f_k$, $k = 1, \ldots, m+2$, is lower semi-continuous. This is clearly true for $k = 1, \ldots, m+1$ because, at the point of discontinuity, we have $\liminf_{x \to 0} f_k(x) \ge f_k(0) = 0$. For $f_{m+2}$, we simply check that $\{x : f_{m+2}(x) \le c\}$ is a closed set for any given $c \in \mathbb{R}$.
  • The domain of $f_0$ is a Cartesian product and $f_0$ is continuous on its domain. Clearly, the domain is the $2(m+1)$-fold Cartesian product $D := [0, \infty) \times \cdots \times [0, \infty)$, and $f_0$ is continuous on its effective domain $\{x \in D : f(x) < \infty\}$.
  • The updating rule is essentially cyclic; that is, there exists a constant $T \ge m+2$ such that every block in $(x_1, \ldots, x_{m+1}, x_{m+2})$ is updated at least once between the $r$-th iteration and the $(r + T - 1)$-th iteration, for all $r$. In our case, each block is updated at least once in one cycle of Algorithm 2, so the essentially cyclic condition is satisfied. In PXI-CD, the block $x_{m+2}$ is in fact updated $(m+1)$ times per cycle.
Thus, we can conclude by Proposition 5.1 in [18] that either $f_0(\gamma^{(t)}) \to -\infty$ or the limit points of $\{\gamma^{(t)}\}_{t \ge 0}$ are coordinate-wise minima of $H(\Omega, C)$.
If we further assume that $y \notin \operatorname{span}\{V_1, \ldots, V_m\}$ and $\dim(\operatorname{span}\{V_1, \ldots, V_m\}) < n$, then we can show that the set $\{\gamma \ge 0 : f_0(\gamma) \le f_0(\gamma^{(0)})\}$ is compact, which ensures that the sequence $\{\gamma^{(t)}\}_{t \ge 0}$ is bounded and rules out the possibility that $\lim_{t \to \infty} f_0(\gamma^{(t)}) = -\infty$. To see that this is the case, note that $G(\Omega)$ provides a lower bound for $H(\Omega, C)$, so it is sufficient to show that $\{\gamma \ge 0 : G(\sum_k \gamma_k V_k) \le c\}$ is compact for any $c \in \mathbb{R}$ under the assumptions that $V_0 := I$ and $y \notin \operatorname{span}\{V_1, \ldots, V_m\}$. These are precisely the conditions of Lemma 3 in [9], which ensure that $\{\gamma \ge 0 : L(\gamma) \ge c\}$ is compact for the log-likelihood $L$, where $-2L(\gamma) = G(\sum_k \gamma_k V_k)$.
Finally, note that, since we update the entire block $\tilde{\gamma}$ simultaneously (in both PX-CD and PXI-CD), a coordinate-wise minimum of $H(\Omega, C)$ is also a coordinate-wise minimum of $G(\Omega)$.   □
We again emphasize that, as with all the competitor methods, the theorem does not guarantee convergence of the coordinate-descent sequence { γ ( t ) } t 0 or that the convergence will be to a local minimum. The only thing we can say for sure is that, when the sequence converges, then the limit will be a coordinate-wise minimum (which could be a saddle point in some special cases). Nevertheless, our numerical experience in Section 4 is that the sequence always converges to a coordinate-wise minimum and that the coordinate-wise minimum is in fact a (local) minimum.

3.3. Linear Mixed Model Implementation

We now show that, for linear mixed models (LMMs), we can reduce the computational complexity of each sub-problem to $O(j d^2)$, for some constants $j \ll n$ and $d < n$. As shown in Section 3.1, solving each univariate sub-problem reduces to running Newton's method, where each Newton update requires solving two $n$-dimensional linear systems.
In settings where $\operatorname{rank}(V_i) < n$ for $i = 1, \ldots, m$, we are able to reduce the dimension of the linear systems that need to be solved. To see this, let us first specify the general variance-component model (also known as the general mixed ANOVA model) [19]. Suppose
$$y = X\beta + Z_1 b_1 + \cdots + Z_m b_m + \epsilon,$$
where $X$ is an $n \times p$ matrix of known fixed numbers with $p \le n$; $\beta$ is a $p \times 1$ vector of unknown constants; $Z_i$ is a full-rank $n \times c_i$ matrix of known fixed numbers with $c_i \le n$; $b_i$ is a $c_i \times 1$ vector of independent variables from $N(0, \gamma_i)$, which are unknown; and $\epsilon$ is an $n \times 1$ vector of independent errors from $N(0, \gamma_0)$, which are also unknown. In this setup, $\gamma_0, \ldots, \gamma_m$ are the variance-component parameters to be estimated, $V_i = Z_i Z_i^\top$ and $\Omega = \gamma_0 I + Z \Sigma Z^\top$, where
$$Z = \begin{bmatrix} Z_1 & Z_2 & \cdots & Z_m \end{bmatrix}, \qquad \Sigma = \operatorname{blockdiag}(\gamma_1 I_{c_1}, \ldots, \gamma_m I_{c_m}).$$
We now provide two methods that take advantage of the $V_i$ being low-rank. Let
$$c := \sum_{i=1}^{m} \operatorname{rank}(V_i) = \sum_{i=1}^{m} c_i.$$
In the first method, we use a QR factorization that reduces the computational complexity when $c < n$, i.e., when the column rank of $Z$ is less than $n$. In the second method, we use the Woodbury matrix identity to reduce the complexity when $c_i < n$ for $i = 1, \ldots, m$, i.e., when the column rank of each matrix $Z_i$ is less than $n$.

3.3.1. QR Method

The following QR factorization can be viewed as a data pre-processing step that allows all PX-CD and PXI-CD computations to be $c$-dimensional instead of $n$-dimensional. The QR factorization needs to be computed only once, at an initial cost of $O(c n^2)$ operations. Let the QR factorization of $Z \in \mathbb{R}^{n \times c}$ be
$$Z = \underbrace{\begin{bmatrix} Q_{[c]} & Q_{[n-c]} \end{bmatrix}}_{Q} \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad R = \begin{bmatrix} R_1 & R_2 & \cdots & R_m \end{bmatrix},$$
where $R$ is a $c \times c$ upper-triangular matrix, $0$ is an $(n-c) \times c$ zero matrix, $Q_{[c]}$ is an $n \times c$ matrix, $Q_{[n-c]}$ is $n \times (n-c)$, and $Q_{[c]}$ and $Q_{[n-c]}$ both have orthogonal columns. The matrix $R$ is partitioned such that the number of columns in $R_i$ equals the number of columns in $Z_i$. Let $\tilde{y} = [\tilde{y}_{[c]}, \tilde{y}_{[n-c]}] = Q^\top y$, where $\tilde{y}_{[c]}$ denotes the first $c$ elements of $\tilde{y}$ and $\tilde{y}_{[n-c]}$ the last $n - c$ elements. Then,
$$H(\Omega, C) = \tilde{y}_{[c]}^\top \tilde{\Omega}^{-1} \tilde{y}_{[c]} + \sum_{i=0}^{m} \gamma_i \operatorname{tr}(\tilde{C}^{-1} \tilde{V}_i) + \ln \det(C) - n + \alpha,$$
where we define $\tilde{V}_i := R_i R_i^\top$, $\tilde{V}_0 := I_c$, $\tilde{\Omega} := \sum_{i=0}^{m} \gamma_i \tilde{V}_i$, $\tilde{C} := \sum_{i=0}^{m} \tilde{\gamma}_i \tilde{V}_i$ and $\alpha := \gamma_0 \tilde{\gamma}_0^{-1} (n - c) + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}$. The details of this derivation are provided in Appendix A. To implement PX-CD or PXI-CD after this transformation, we run Algorithm 2 with inputs $\tilde{V}_0, \ldots, \tilde{V}_m \in \mathbb{R}^{c \times c}$ and $\tilde{y}_{[c]} \in \mathbb{R}^{c \times 1}$. In this simplification of $H$, we have the additional term $\alpha$, which depends on $\gamma_0$. Therefore, when we update the parameter $\gamma_0$ with Newton's method, we must also add
$$\frac{\partial \alpha}{\partial \gamma_0} = \tilde{\gamma}_0^{-1} (n - c) - \gamma_0^{-2}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]} \qquad \text{and} \qquad \frac{\partial^2 \alpha}{\partial \gamma_0^2} = 2 \gamma_0^{-3}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}$$
to the corresponding derivatives derived for Newton’s method in Section 3.1.
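A sketch of this one-off pre-processing step is given below (illustrative names of our own); the reduced quantities are then passed to Algorithm 2, and the derivatives of $\alpha$ above are added whenever $\gamma_0$ is updated.

```python
import numpy as np

def qr_preprocess(y, Z_blocks):
    """QR reduction for the LMM with V_i = Z_i Z_i^T and c = sum_i c_i < n.

    Z_blocks : list of (n, c_i) matrices Z_1, ..., Z_m.
    Returns the c x c matrices V_tilde (with V_tilde[0] = I_c), the reduced
    response y_tilde_c, and ||y_tilde_[n-c]||^2, which enters the alpha term.
    """
    Z = np.hstack(Z_blocks)                        # n x c
    n, c = Z.shape
    Q, R_full = np.linalg.qr(Z, mode="complete")   # Q: n x n, R_full: n x c
    R = R_full[:c, :]                              # c x c upper-triangular block
    y_tilde = Q.T @ y
    y_tilde_c, y_tilde_rest = y_tilde[:c], y_tilde[c:]
    V_tilde = [np.eye(c)]                          # V_tilde_0 = I_c
    start = 0
    for Zi in Z_blocks:                            # V_tilde_i = R_i R_i^T, R_i the i-th column block of R
        ci = Zi.shape[1]
        Ri = R[:, start:start + ci]
        V_tilde.append(Ri @ Ri.T)
        start += ci
    return V_tilde, y_tilde_c, float(y_tilde_rest @ y_tilde_rest)
```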

3.3.2. Woodbury Matrix Identity

Alternatively, if $c$ is large (say $c > n$) but each $c_i < n$ for $i = 1, \ldots, m$, we can use the Woodbury identity to reduce each linear system to $c_k$ dimensions (instead of $n$ dimensions) when updating the component $\gamma_k$. Suppose we are in cycle $t+1$ of either PX-CD or PXI-CD and wish to update the parameter $\gamma_k$ with $k \ne 0$; then we can simplify the optimization
$$\gamma_k^{(t+1)} = \arg\min_{x \ge 0} h_k(x), \qquad h_k(x) = H(\Omega_k + x V_k, C),$$
by viewing $\Omega_k + x V_k = \Omega + (x - \gamma_k^{(t)}) Z_k Z_k^\top$ as a low-rank perturbation of the matrix $\Omega$. The Woodbury identity gives the expression for the inverse,
$$\left( \Omega + (x - \gamma_k^{(t)}) Z_k Z_k^\top \right)^{-1} = \Omega^{-1} - (x - \gamma_k^{(t)})\, \Omega^{-1} Z_k \left( I_{c_k} + (x - \gamma_k^{(t)}) Z_k^\top \Omega^{-1} Z_k \right)^{-1} Z_k^\top \Omega^{-1}, \tag{20}$$
which involves only the unperturbed inverse matrix $\Omega^{-1}$ and the inverse of a smaller $c_k \times c_k$ matrix. In this implementation of PXI-CD and PX-CD, we re-compute and store the matrix $\Omega^{-1}$ after each coordinate update. Let $w := Z_k^\top \Omega^{-1} y$; then the line search along the $k$-th component ($k \ne 0$) of the function $H$ simplifies to
$$h_k(x) = -(x - \gamma_k^{(t)})\, w^\top \left( I_{c_k} + (x - \gamma_k^{(t)}) Z_k^\top \Omega^{-1} Z_k \right)^{-1} w + x \operatorname{tr}\!\left( C^{-1} Z_k Z_k^\top \right) + \text{const}.$$
When implementing PXI-CD, there is no additional cost to using the Woodbury identity, as the update $C \leftarrow \Omega$ is made after every coordinate update, and the trace term in the line search can be evaluated cheaply because $\Omega^{-1}$ is known. We can now implement Newton's method to find the minimum of $h_k(x)$. Let
$$B := Z_k^\top \Omega^{-1} Z_k, \qquad M := I_{c_k} + (x - \gamma_k^{(t)}) B.$$
If we then solve the $c_k$-dimensional linear systems $M d = w$ and $M f = B d$ with CG and store the solutions in the vectors $d$ and $f$, respectively, we can evaluate the first and second derivatives of $h_k(x)$ as
$$h_k'(x) = -w^\top d + (x - \gamma_k^{(t)})\, d^\top B d + \operatorname{tr} B, \qquad h_k''(x) = 2\, d^\top B d - 2 (x - \gamma_k^{(t)})\, d^\top B f,$$
and implement the Newton steps (14). The derivation of these derivative expressions is provided in Appendix B. After each coordinate $k$ is updated, we evaluate and store the updated inverse covariance matrix
$$\Omega^{-1} \leftarrow \left( \Omega + (\gamma_k^{(t+1)} - \gamma_k^{(t)}) Z_k Z_k^\top \right)^{-1},$$
using (20), where we invert only a smaller $c_k \times c_k$ matrix. When $k = 0$, no reduction in complexity is possible, as the perturbation to $\Omega$ is full-rank, and we update $\gamma_0$ as in Section 3.1. If $c = \sum_{i=1}^{m} c_i < n$, then we can use an alternative form of the Woodbury identity,
$$\Omega^{-1} = \gamma_0^{-1} \left( I - W (\gamma_0 I + W^\top W)^{-1} W^\top \right),$$
where $W := Z \Sigma^{1/2}$, to update $\Omega^{-1}$ after $\gamma_0$ has been updated. If $c > n$, we instead invert the full $n \times n$ updated covariance matrix to obtain $\Omega^{-1}$ using a Cholesky factorization, at a cost of $O(n^3)$. This $O(n^3)$ cost for updating $\gamma_0$ is a disadvantage of this implementation when $c > n$; however, numerical simulations suggest that, when $c_i \ll n$ for $i = 1, \ldots, m$, the Woodbury implementation is the fastest.
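The derivative expressions above translate directly into code. The sketch below performs one derivative evaluation and the subsequent refresh of $\Omega^{-1}$, using dense solves of the small $c_k \times c_k$ systems (CG could be substituted as in Section 3.1); it is an illustrative sketch with our own names, and `trace_term` equals $\operatorname{tr} B$ under the PXI-CD update $C \leftarrow \Omega$.

```python
import numpy as np

def woodbury_derivatives(x, gamma_k_old, Z_k, Omega_inv, y, trace_term):
    """h_k'(x) and h_k''(x) for Omega_k + x V_k = Omega + (x - gamma_k_old) Z_k Z_k^T."""
    delta = x - gamma_k_old
    w = Z_k.T @ (Omega_inv @ y)                       # c_k-dimensional
    B = Z_k.T @ Omega_inv @ Z_k                       # c_k x c_k
    M = np.eye(B.shape[0]) + delta * B
    d = np.linalg.solve(M, w)                         # d = M^{-1} w
    f = np.linalg.solve(M, B @ d)                     # f = M^{-1} B d
    h1 = -(w @ d) + delta * (d @ (B @ d)) + trace_term
    h2 = 2.0 * (d @ (B @ d)) - 2.0 * delta * (d @ (B @ f))
    return h1, h2

def woodbury_refresh(Omega_inv, Z_k, gamma_new, gamma_old):
    """Update Omega^{-1} after gamma_k changes, inverting only a c_k x c_k matrix."""
    delta = gamma_new - gamma_old
    U = Omega_inv @ Z_k
    M = np.eye(Z_k.shape[1]) + delta * (Z_k.T @ U)
    return Omega_inv - delta * U @ np.linalg.solve(M, U.T)
```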

3.4. Variable Selection

When the number of variance components is large, performing variable selection can enhance model interpretation and provide more stable parameter estimation. To impose sparsity when estimating $\gamma$, a lasso or ridge penalty can be added to the negative log-likelihood [20]. The MM implementation [9] provides modifications to the MM algorithm such that both lasso- and ridge-penalized expressions can be minimized. We now show that, with PX-CD, we can minimize the penalized negative log-likelihood with the $\|\cdot\|_1$ penalty. Consider the penalized negative log-likelihood expression
$$G(\Omega) + \lambda \sum_{i=0}^{m} \gamma_i, \qquad \gamma_i \ge 0, \quad \lambda > 0.$$
We then have the surrogate function
$$J(\Omega, C) := H(\Omega, C) + \lambda \sum_{i=0}^{m} \gamma_i.$$
If we use PX-CD to minimize $J$, we need to repeatedly minimize the one-dimensional function along each coordinate, $j_k(x) := h_k(x) + \lambda x + \lambda \sum_{i \ne k} \gamma_i$. Here, we implement Newton's method as before, with the only difference being that the first derivative is increased by the constant $\lambda$,
$$j_k'(x) := h_k'(x) + \lambda.$$
It follows from Lemma 1 that $j_k(x)$ is either strictly convex or linear with a strictly positive gradient for $x \in [0, \infty)$. We check whether $h_k'(0) + \lambda \ge 0$ to determine whether $j_k(0)$ is the global minimum for $x \in [0, \infty)$; if it is, we set $\gamma_k^{(t+1)} = 0$ if we are in cycle $t+1$. If $j_k'(0) < 0$, we initiate Newton's method at the current value of the variance component, $x_0 = \gamma_k^{(t)}$. The larger the parameter $\lambda$, the more often this active-constraint condition is met, and therefore the more variance components are set to zero.
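In code, the lasso penalty amounts to a one-line change: since $j_k'(x) = h_k'(x) + \lambda$ for all $x$ and the second derivative is unchanged, the unpenalized update can be reused with the trace term inflated by $\lambda$ (a sketch, building on the illustrative `px_cd_update` of Section 3.1).

```python
def penalized_px_cd_update(y, Omega_k, V_k, trace_term, x_start, lam):
    # j_k'(x) = h_k'(x) + lam, so adding lam to the constant trace term reproduces
    # both the active-set check h_k'(0) + lam >= 0 and the penalized Newton steps.
    return px_cd_update(y, Omega_k, V_k, trace_term + lam, x_start)
```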

4. Numerical Results

In this section, we assess the efficiency of PX-CD and PXI-CD via simulation and compare them against the best current alternative, the MM algorithm [9]. In [9], the MM algorithm is found to outperform both the EM algorithm and the Fisher scoring method; this superior performance was also reported in [21].
In our experiments, we additionally include the expectation-maximization (EM) and Fisher scoring (FS) methods in the small-scale problem only, where $\gamma_0 = 0.1$. We exclude the EM and FS methods from the more difficult problems, as they are too slow to be practical there. The MM, EM and FS algorithms are executed with the Julia implementation of [9]. We provide results in three settings. First, we simulate data from the model of Section 3.3 with $c_i < n$, i.e., the matrices $V_i$ are low-rank. Second, we simulate with $c_i = n$, i.e., the matrices $V_i$ are full-rank. Finally, we simulate data from the model of Section 3.3 with the matrices $V_i$ constructed from a real-world data set containing genetic variants of mice.

4.1. Simulations

For the following simulations, we simulate data from the model of Section 3.3. Since the fixed effects $\beta$ can always be eliminated from the model using REML, we focus solely on the estimation of the variance-component parameters; in other words, the value of $\beta$ in our simulations is irrelevant. In each simulation, we generate the fixed matrices $V_i$ as
$$V_i = \frac{\sum_{j=1}^{r} Z_{i,j} Z_{i,j}^\top}{\left\| \sum_{j=1}^{r} Z_{i,j} Z_{i,j}^\top \right\|_F},$$
where $Z_{i,j} \sim N(0, I_n)$ and $\|\cdot\|_F$ is the Frobenius matrix norm. The rank of each $V_i$ is equal to the parameter $r$, which we vary.
In each simulation, for $k \ne 0$, we draw the $m$ true variance components as $\gamma_k = (1 + \rho)^2$, where $\rho \sim N(0, 1)$. We then simulate the response from $y \sim N(0, \sum_{i=0}^{m} \gamma_i V_i)$ and estimate the vector $\hat{\gamma}$. We vary the value of $\gamma_0$ over the set $\{0.1, 1, 10\}$ and keep $n = 1000$.
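For completeness, the simulated data can be generated along the following lines (our own helper; the seed and exact draws differ from those used in the tables below).

```python
import numpy as np

def simulate_vc_data(n=1000, m=20, r=20, gamma0=0.1, seed=0):
    """Generate rank-r matrices V_1, ..., V_m, components gamma, and y ~ N(0, sum_i gamma_i V_i)."""
    rng = np.random.default_rng(seed)
    V = [np.eye(n)]                                          # V_0 = I_n
    for _ in range(m):
        Z = rng.standard_normal((n, r))                      # columns are the Z_{i,j} ~ N(0, I_n)
        Vi = Z @ Z.T
        V.append(Vi / np.linalg.norm(Vi, "fro"))             # normalize to unit Frobenius norm
    gamma = np.concatenate(([gamma0], (1.0 + rng.standard_normal(m)) ** 2))
    Omega = sum(g * Vk for g, Vk in zip(gamma, V))
    y = np.linalg.cholesky(Omega) @ rng.standard_normal(n)   # y ~ N(0, Omega)
    return y, V, gamma
```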

4.1.1. Low-Rank

We now present the results for the case where the $V_i$ are generated as stated above and $r < n$. As the $V_i$ are low-rank, we run the Woodbury implementation of PXI-CD and exclude PX-CD, since it has the same per-update computational cost as PXI-CD and exhibits almost identical performance. First, PXI-CD is run until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$, and we store the final objective value as $L^*$. We then run the MM, EM and FS algorithms and terminate each once $L(\gamma^{(t)}) > L^*$. All algorithms are initialized at the point $\gamma^{(0)} = \mathbf{1}$. Each simulation scenario is replicated 10 times. The mean running time is reported, with the standard error of the mean running time given in parentheses.
The results of the low-rank simulations are given in Table 1, Table 2 and Table 3. They indicate that, apart from the smallest-scale problem with $r = m = 20$, our PXI-CD algorithm outperforms the MM algorithm, and increasingly so as the scale of the model (both $m$ and $r$) increases.

4.1.2. Full-Rank

We now present the results when $\operatorname{rank}(V_i) = n = 1000$. We implement the standard PX-CD, PXI-CD and MM algorithms, since neither the Woodbury nor the QR method can be used in this setting. Initially, PX-CD is run until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$, and we store the final objective value as $L^*$. The other algorithms terminate once $L(\gamma^{(t+1)}) > L^*$. For the following simulations, we count one iteration of a CD algorithm as a single cycle of updates. Each simulation scenario is replicated 10 times. The mean running time and mean iteration count are reported, with the corresponding standard errors given in parentheses.
The results of the full-rank simulations are given in Table 4, Table 5 and Table 6. PX-CD and PXI-CD both significantly outperform the MM and basic CD algorithms in these examples. We observe that, as the number of components $m$ increases, the problem becomes increasingly difficult for the MM algorithm. An intuitive explanation for this performance gap is that the CD algorithms are able to identify the active constraint set $\{k : \hat{\gamma}_k = 0\}$ in only a few cycles.
When $\gamma_0 = 0.1$ or $\gamma_0 = 1$ and $m$ is large ($m = 50$, $m = 100$), PXI-CD is the fastest algorithm, even though it is computationally the most expensive per cycle. When $\gamma_0 = 10$ and $m = 100$, PXI-CD is again the fastest algorithm. In fact, as the problem size grows, the number of iterations PXI-CD requires to converge drops below that of basic CD. These simulations indicate that PXI-CD is well-suited to problems with large $m$ and $n$ and full-rank $V_i$. The basic CD algorithm, while numerically inferior to the PX-CD and PXI-CD algorithms, still outperforms the MM algorithm in these simulations.

4.2. Genetic Data

We now present simulation results in which the $Z_i$ are constructed from the mouse single nucleotide polymorphism (SNP) array data set available from the OpenMendel project [21] (https://openmendel.github.io/SnpArrays.jl/latest/#Example-data, accessed on 1 July 2022). The data set consists of $Z$, an $n \times c$ matrix of $c$ genetic variants for $n$ individual mice. For this experiment, $c = 10{,}200$ and $n = 500$. We artificially generate $m$ different genetic regions by partitioning the columns of $Z$ into gene matrices $Z_i \in \mathbb{R}^{n \times r}$, $i = 1, \ldots, m$. Then, we compose our fixed matrices $V_1, \ldots, V_m$ as
$$V_i = \frac{Z_i Z_i^\top}{\left\| \sum_{j=1}^{m} Z_j Z_j^\top \right\|_F}.$$
We simulate $\gamma$ and $y$ as in Section 4.1. In this case, $y$ mimics a vector of quantitative trait measurements for the $n$ mice. This data set is well-suited for testing our method when $m$ is large ($m > n$). In these cases, we observe that, when initialized at the same point, the MM and PXI-CD methods may converge to different stationary points. Therefore, we run all algorithms until the relative change $|L(\gamma^{(t+1)}) - L(\gamma^{(t)})| / (|L(\gamma^{(t)})| + 1)$ is less than $10^{-10}$. Since $r < n$, we implement PXI-CD using the Woodbury identity. Each simulation scenario is replicated 10 times. The mean running time and mean iteration count are reported, with the corresponding standard errors given in parentheses.
The results of the genetic study are provided in Table 7. We observe that PXI-CD outperforms the MM algorithm for all values of $m$ and $r$ on this data set, both in the number of iterations and in the running time until convergence. When $m > n$, we observe that the MM and PXI-CD methods converge to noticeably different objective values. We suspect that this is because, when $m > n$, the likelihood in (2) exhibits many more local minima. On average, PXI-CD converges to a better stationary point when $m$ is large and $m > n$.

5. Conclusions

The MLE solution for variance-component models requires the optimization of a non-convex log-likelihood function. In this paper, we showed that a basic implementation of the cyclic CD algorithm is computationally expensive to run and is not amenable to traditional convergence analysis.
To remedy this, we proposed a novel parameter-expanded CD (PX-CD) algorithm, which is both computationally faster and also subject to theoretical guarantees. PX-CD optimizes a higher-dimensional surrogate function that attains a coordinate-wise minimum with respect to each of the variance component parameters. The extra speed is derived from the fact that required quantities (such as first and second-order derivatives) are evaluated via the conjugate-gradient algorithm.
Additionally, we proposed an alternative updating regime called PXI-CD, in which the expanded block of parameters is updated immediately after each coordinate update. This updating regime requires more computation per iteration than PX-CD. However, we observed numerically that, for large-scale models where the number of variance components $m + 1$ is large and the $V_i$ are full-rank, the reduction in the number of iterations needed to converge greatly offsets the additional computational cost per cycle.
Our numerical experiments suggest that PX-CD and PXI-CD outperform the best current alternative, the MM algorithm. When the number of variance components $m$ is large, we observed that PXI-CD was significantly faster than the MM algorithm and tended to converge to better stationary points.
A potential extension of this work is to apply parameter-expanded CD algorithms to the multivariate-response variance-component model. Instead of the univariate response, one considers a multivariate response model with an $n \times d$ response matrix $Y$. In this setup, $\mathbb{E}\,Y = X B$, where $B$ is a $p \times d$ matrix. The $nd \times nd$ covariance matrix is of the form
$$\Omega = \operatorname{cov}(\operatorname{vec}(Y)) = \sum_{i=0}^{m} \Gamma_i \otimes V_i,$$
where the $\Gamma_i$ are unknown $d \times d$ variance components and the $V_i$ are the known $n \times n$ covariance matrices. The challenging aspect of this problem is that the optimization with respect to each parameter $\Gamma_i$ is not univariate but rather a search over positive semi-definite matrices, itself a difficult optimization problem.

Author Contributions

Conceptualization, A.M., S.M. and Z.B.; methodology, A.M., S.M. and Z.B.; software, A.M.; formal analysis, A.M., S.M. and Z.B.; data curation, A.M.; writing—original draft preparation, A.M.; writing—review and editing, A.M., S.M. and Z.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Publicly available data sets were analyzed in this study. The data can be found in the OpenMendel mouse single nucleotide polymorphism (SNP) array data set: https://openmendel.github.io/SnpArrays.jl/latest (accessed on 1 July 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CD      Coordinate descent
PX-CD   Parameter-expanded coordinate descent
PXI-CD  Parameter expanded-immediate coordinate descent

Appendix A. QR Method

In this section, we provide further details on the QR factorization used in Section 3.3.1. Recall that we have the decomposition
$$Z = \underbrace{\begin{bmatrix} Q_{[c]} & Q_{[n-c]} \end{bmatrix}}_{Q} \begin{bmatrix} R \\ 0 \end{bmatrix}, \qquad R = \begin{bmatrix} R_1 & R_2 & \cdots & R_m \end{bmatrix},$$
where $Q$ is an orthogonal matrix and $R$ is partitioned such that the number of columns in $R_i$ equals the number of columns in $Z_i$. Recall that $\tilde{y} = [\tilde{y}_{[c]}, \tilde{y}_{[n-c]}] = Q^\top y$, where $\tilde{y}_{[c]}$ denotes the first $c$ elements of $\tilde{y}$ and $\tilde{y}_{[n-c]}$ the last $n - c$ elements. Then,
$$\Omega = \gamma_0 I + Z \Sigma Z^\top = Q \left( \gamma_0 I + \begin{bmatrix} R \\ 0 \end{bmatrix} \Sigma \begin{bmatrix} R^\top & 0 \end{bmatrix} \right) Q^\top = Q \begin{bmatrix} R \Sigma R^\top + \gamma_0 I_c & 0 \\ 0 & \gamma_0 I_{n-c} \end{bmatrix} Q^\top.$$
Taking the inverse of this matrix yields
$$\Omega^{-1} = Q \begin{bmatrix} (R \Sigma R^\top + \gamma_0 I_c)^{-1} & 0 \\ 0 & \gamma_0^{-1} I_{n-c} \end{bmatrix} Q^\top$$
and
$$y^\top \Omega^{-1} y = \tilde{y}_{[c]}^\top \tilde{\Omega}^{-1} \tilde{y}_{[c]} + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}.$$
We now consider the simplification of the trace terms in $H$. Let
$$\tilde{\Sigma} = \operatorname{blockdiag}(\tilde{\gamma}_1 I_{c_1}, \ldots, \tilde{\gamma}_m I_{c_m}).$$
Then, $C = \tilde{\gamma}_0 I + Z \tilde{\Sigma} Z^\top$ and
$$C^{-1} = Q \begin{bmatrix} (R \tilde{\Sigma} R^\top + \tilde{\gamma}_0 I_c)^{-1} & 0 \\ 0 & \tilde{\gamma}_0^{-1} I_{n-c} \end{bmatrix} Q^\top.$$
If we substitute this expression into the trace term for $k \ne 0$, we obtain
$$\operatorname{tr}(C^{-1} V_k) = \operatorname{tr}(Z_k^\top C^{-1} Z_k) = \operatorname{tr}\!\left( \begin{bmatrix} R_k^\top & 0 \end{bmatrix} \begin{bmatrix} (R \tilde{\Sigma} R^\top + \tilde{\gamma}_0 I_c)^{-1} & 0 \\ 0 & \tilde{\gamma}_0^{-1} I_{n-c} \end{bmatrix} \begin{bmatrix} R_k \\ 0 \end{bmatrix} \right) = \operatorname{tr}\!\left( (R \tilde{\Sigma} R^\top + \tilde{\gamma}_0 I_c)^{-1} R_k R_k^\top \right).$$
When $k = 0$, we have
$$\operatorname{tr}(C^{-1} V_0) = \operatorname{tr}\!\left( Q \begin{bmatrix} (R \tilde{\Sigma} R^\top + \tilde{\gamma}_0 I_c)^{-1} & 0 \\ 0 & \tilde{\gamma}_0^{-1} I_{n-c} \end{bmatrix} Q^\top \right) = \operatorname{tr}\!\left( (R \tilde{\Sigma} R^\top + \tilde{\gamma}_0 I_c)^{-1} \right) + \tilde{\gamma}_0^{-1} (n - c).$$
Recalling the definitions $\tilde{V}_i := R_i R_i^\top$, $\tilde{V}_0 := I_c$, $\tilde{\Omega} := \sum_{i=0}^{m} \gamma_i \tilde{V}_i$ and $\tilde{C} := \sum_{i=0}^{m} \tilde{\gamma}_i \tilde{V}_i$, and combining the above equations, we obtain
$$H(\Omega, C) = y^\top \Omega^{-1} y + \sum_{i=0}^{m} \gamma_i \operatorname{tr}(C^{-1} V_i) + \ln \det(C) - n = \tilde{y}_{[c]}^\top \tilde{\Omega}^{-1} \tilde{y}_{[c]} + \sum_{i=0}^{m} \gamma_i \operatorname{tr}(\tilde{C}^{-1} \tilde{V}_i) + \ln \det(C) - n + \gamma_0 \tilde{\gamma}_0^{-1} (n - c) + \gamma_0^{-1}\, \tilde{y}_{[n-c]}^\top \tilde{y}_{[n-c]}.$$

Appendix B. Woodbury Identity

Recall from Section 3.3.2 the definitions $w := Z_k^\top \Omega^{-1} y$, $B := Z_k^\top \Omega^{-1} Z_k$ and $M := I_{c_k} + (x - \gamma_k^{(t)}) B$, as well as the simplified univariate function
$$h_k(x) = -(x - \gamma_k^{(t)})\, w^\top M^{-1} w + x \operatorname{tr} B + \text{const}.$$
We now derive the first and second derivatives of this function. Differentiation of an invertible symmetric matrix implies that
$$\frac{\partial M^{-1}}{\partial x} = -M^{-1} B M^{-1}. \tag{A4}$$
Then, from the product rule of differentiation,
$$h_k'(x) = (x - \gamma_k^{(t)})\, w^\top M^{-1} B M^{-1} w - w^\top M^{-1} w + \operatorname{tr}(B).$$
If we then approximately solve the linear system $M d = w$ with CG, then
$$h_k'(x) = -w^\top d + (x - \gamma_k^{(t)})\, d^\top B d + \operatorname{tr} B.$$
Using the matrix product rule of differentiation, we have that
$$\frac{\partial (M^{-1} B M^{-1})}{\partial x} = -2\, M^{-1} B M^{-1} B M^{-1}. \tag{A5}$$
Then, using (A4) and (A5) to differentiate $h_k'(x)$, we obtain
$$h_k''(x) = 2\, w^\top M^{-1} B M^{-1} w - 2 (x - \gamma_k^{(t)})\, w^\top M^{-1} B M^{-1} B M^{-1} w.$$
If we solve $M d = w$ and $M j = B d$ with CG, then $j$ approximates the matrix-vector product $M^{-1} B M^{-1} w$, and the second derivative can be evaluated as
$$h_k''(x) = 2\, d^\top B d - 2 (x - \gamma_k^{(t)})\, d^\top B j.$$

References

  1. Kang, H.M.; Sul, J.H.; Service, S.K.; Zaitlen, N.A.; Kong, S.y.; Freimer, N.B.; Sabatti, C.; Eskin, E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010, 42, 348–354.
  2. Searle, S.; Casella, G.; McCulloch, C. Variance Components; Wiley Series in Probability and Statistics; Wiley: Hoboken, NJ, USA, 2009.
  3. Jiang, J.; Nguyen, T. Linear and Generalized Linear Mixed Models and Their Applications; Springer: New York, NY, USA, 2007; Volume 1.
  4. Harville, D.A. Maximum likelihood approaches to variance component estimation and to related problems. J. Am. Stat. Assoc. 1977, 72, 320–338.
  5. Jennrich, R.I.; Sampson, P. Newton–Raphson and related algorithms for maximum likelihood variance component estimation. Technometrics 1976, 18, 11–17.
  6. Longford, N.T. A fast scoring algorithm for maximum likelihood estimation in unbalanced mixed models with nested random effects. Biometrika 1987, 74, 817–827.
  7. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977, 39, 1–22.
  8. Lindstrom, M.J.; Bates, D.M. Newton–Raphson and EM algorithms for linear mixed-effects models for repeated-measures data. J. Am. Stat. Assoc. 1988, 83, 1014–1022.
  9. Zhou, H.; Hu, L.; Zhou, J.; Lange, K. MM algorithms for variance components models. J. Comput. Graph. Stat. 2019, 28, 350–361.
  10. Wright, S.J. Coordinate-descent algorithms. Math. Program. 2015, 151, 3–34.
  11. Nesterov, Y. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 2012, 22, 341–362.
  12. Luo, Z.Q.; Tseng, P. On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 1992, 72, 7–35.
  13. Golub, G.H.; Van Loan, C.F. Matrix Computations; JHU Press: Baltimore, MD, USA, 2013.
  14. Liu, J.S.; Wu, Y.N. Parameter expansion for data augmentation. J. Am. Stat. Assoc. 1999, 94, 1264–1274.
  15. Meng, X.L.; Van Dyk, D.A. Seeking efficient data augmentation schemes via conditional and marginal augmentation. Biometrika 1999, 86, 301–320.
  16. Bezdek, J.C.; Hathaway, R.J. Convergence of alternating optimization. Neural Parallel Sci. Comput. 2003, 11, 351–368.
  17. Bezdek, J.; Hathaway, R. Some Notes on Alternating Optimization. In AFSS International Conference on Fuzzy Systems, Proceedings of the Advances in Soft Computing—AFSS 2002, Calcutta, India, 3–6 February 2002; Springer: Berlin/Heidelberg, Germany, 2002; Volume 2275, pp. 288–300.
  18. Tseng, P. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 2001, 109, 475–494.
  19. Hartley, H.O.; Rao, J.N. Maximum-likelihood estimation for the mixed analysis of variance model. Biometrika 1967, 54, 93–108.
  20. Schelldorfer, J.; Bühlmann, P.; van de Geer, S. Estimation for high-dimensional linear mixed-effects models using l1-penalization. Scand. J. Stat. 2011, 38, 197–214.
  21. Zhou, H.; Sinsheimer, J.S.; Bates, D.M.; Chu, B.B.; German, C.A.; Ji, S.S.; Keys, K.L.; Kim, J.; Ko, S.; Mosher, G.D.; et al. OpenMendel: A cooperative programming project for statistical genetics. Hum. Genet. 2020, 139, 61–71.
Figure 1. Two local minima, $(0.11, 10.26)$ and $(14.35, 8.66)$, obtained for $g_k(x)$ when $\alpha_{k,1} = 1.2$, $\alpha_{k,2} = 3$, $d_{k,1} = 10$ and $d_{k,2} = 0.2$.
Table 1. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 0.1$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 4.20 (0.07) | 17.00 (1.18) | 25.81 (1.57) | 51.8 (5.09) |
| MM | 20 | 1.89 (0.22) | 42.54 (7.69) | 220.06 (61.19) | 845.2 (136.89) |
| EM | 20 | 23.30 (0.52) | - | - | - |
| FS | 20 | 91.80 (0.99) | - | - | - |
| PXI-CD | 50 | 5.42 (0.16) | 21.36 (0.85) | 37.46 (2.72) | 40.21 (2.97) |
| MM | 50 | 95.63 (41.20) | 203.24 (53.20) | 584.76 (53.20) | 1167.22 (114.38) |
| PXI-CD | 100 | 8.96 (0.25) | 30.17 (2.24) | 28.26 (0.73) | 45.65 (2.56) |
| MM | 100 | 112.58 (21.48) | 555.38 (82.01) | 699.89 (90.55) | 1327.54 (53.80) |
| PXI-CD | 150 | 14.13 (0.63) | 33.18 (4.67) | 38.24 (1.54) | 48.97 (2.08) |
| MM | 150 | 122.44 (38.48) | 628.24 (55.61) | 929.67 (75.87) | 1243.97 (84.30) |
Table 2. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 1$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 3.10 (0.08) | 9.52 (0.28) | 16.30 (0.43) | 26.05 (0.77) |
| MM | 20 | 3.11 (0.84) | 117.40 (24.42) | 280.31 (51.43) | 719.32 (158.51) |
| PXI-CD | 50 | 4.08 (0.11) | 14.35 (0.76) | 29.94 (1.33) | 53.10 (3.7) |
| MM | 50 | 157.98 (34.53) | 501.60 (90.15) | 546.56 (91.48) | 1133.83 (129.96) |
| PXI-CD | 100 | 6.00 (0.36) | 25.34 (0.95) | 61.69 (3.38) | 73.23 (10.6) |
| MM | 100 | 103.08 (31.4) | 544.06 (61.35) | 743.79 (104.67) | 1254.97 (87.32) |
| PXI-CD | 150 | 9.18 (0.45) | 42.30 (2.08) | 66.13 (8.3) | 66.67 (10.2) |
| MM | 150 | 176.80 (31.64) | 498.12 (63.39) | 986.87 (61.46) | 1110.90 (105.59) |
Table 3. The running times (s) when $V_i$ are low-rank, $n = 1000$, and $\gamma_0 = 10$.

| Method | r | m = 20 | m = 50 | m = 75 | m = 100 |
|---|---|---|---|---|---|
| PXI-CD | 20 | 3.62 (0.10) | 8.79 (0.20) | 15.60 (0.41) | 23.13 (0.91) |
| MM | 20 | 10.17 (0.84) | 318.53 (90.49) | 648.64 (132.62) | 839.24 (158.52) |
| PXI-CD | 50 | 4.36 (0.09) | 14.39 (0.4) | 24.65 (0.97) | 42.07 (1.13) |
| MM | 50 | 184.03 (38.8) | 473.58 (80.82) | 648.33 (114.98) | 1230.56 (128.79) |
| PXI-CD | 100 | 6.53 (0.22) | 26.07 (1.51) | 48.48 (2.48) | 60.72 (5.33) |
| MM | 100 | 124.68 (19.93) | 511.66 (65.85) | 880.53 (69.53) | 1279.40 (81.93) |
| PXI-CD | 150 | 10.15 (0.37) | 35.34 (1.72) | 66.94 (2.28) | 103.89 (10.36) |
| MM | 150 | 199.80 (33.17) | 512.35 (49.97) | 943.84 (84.24) | 1244.32 (72.13) |
Table 4. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 0.1$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 89.00 (2.90) | 37.47 (1.71) | −1383.44 |
| PXI-CD | 25 | 146.90 (21.17) | 97.49 (13.58) | −1383.44 |
| CD | 25 | 109.40 (17.52) | 286.96 (19.83) | −1383.44 |
| MM | 25 | 3957.90 (300.77) | 281.53 (21.03) | −1383.44 |
| PX-CD | 50 | 182.40 (11.34) | 147.21 (10.64) | −1831.40 |
| PXI-CD | 50 | 73.10 (2.97) | 104.42 (4.64) | −1831.40 |
| CD | 50 | 103.40 (6.18) | 852.41 (70.22) | −1831.40 |
| MM | 50 | 10,240.40 (2140.25) | 1557.79 (343.47) | −1831.40 |
| PX-CD | 100 | 279.10 (15.09) | 376.18 (24.14) | −2143.93 |
| PXI-CD | 100 | 80.70 (2.97) | 211.19 (8.58) | −2143.93 |
| CD | 100 | 164.00 (8.84) | 2060.69 (155.37) | −2143.93 |
| MM | 100 | 12,171.90 (1526.30) | 3482.30 (465.51) | −2143.93 |
Table 5. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 1$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 82.60 (4.05) | 32.00 (1.93) | −1707.53 |
| PXI-CD | 25 | 172.70 (7.08) | 112.31 (5.15) | −1707.53 |
| CD | 25 | 116.90 (6.31) | 279.61 (18.03) | −1707.53 |
| MM | 25 | 4313.90 (347.85) | 303.13 (24.3) | −1707.53 |
| PX-CD | 50 | 192.50 (8.89) | 155.03 (8.11) | −1957.26 |
| PXI-CD | 50 | 103.00 (18.85) | 147.79 (26.11) | −1957.26 |
| CD | 50 | 110.20 (4.53) | 940.96 (58.0) | −1957.26 |
| MM | 50 | 15,860.80 (2872.82) | 2488.70 (484.24) | −1957.26 |
| PX-CD | 100 | 313.60 (13.88) | 423.78 (20.48) | −2203.61 |
| PXI-CD | 100 | 86.80 (3.06) | 226.10 (8.51) | −2203.61 |
| CD | 100 | 185.60 (8.33) | 2422.93 (143.76) | −2203.61 |
| MM | 100 | 12,820.60 (2102.69) | 3792.45 (725.63) | −2203.61 |
Table 6. The convergence results when $V_i$ are full-rank, $n = 1000$, and $\gamma_0 = 10$.

| Method | m | Iterations | Time (s) | Objective |
|---|---|---|---|---|
| PX-CD | 25 | 75.20 (2.88) | 25.74 (1.46) | −2603.51 |
| PXI-CD | 25 | 152.00 (6.98) | 95.59 (5.12) | −2603.51 |
| CD | 25 | 38.50 (1.60) | 172.03 (11.70) | −2603.51 |
| MM | 25 | 3616.90 (513.57) | 254.87 (36.27) | −2603.51 |
| PX-CD | 50 | 143.70 (6.93) | 108.55 (5.89) | −2668.99 |
| PXI-CD | 50 | 177.50 (28.72) | 249.73 (39.38) | −2668.99 |
| CD | 50 | 79.80 (4.59) | 668.65 (40.71) | −2668.99 |
| MM | 50 | 11,306.30 (1824.79) | 1697.89 (275.36) | −2668.99 |
| PX-CD | 100 | 304.00 (12.39) | 412.54 (19.01) | −2731.09 |
| PXI-CD | 100 | 87.40 (2.54) | 230.54 (7.27) | −2731.09 |
| CD | 100 | 181.80 (7.63) | 2697.20 (150.64) | −2731.09 |
| MM | 100 | 11,796.10 (1732.0) | 3358.14 (521.92) | −2731.09 |
Table 7. The running times (s)—mouse data.

| Method | m | r | Iterations | Time (s) | Objective |
|---|---|---|---|---|---|
| PXI-CD | 100 | 102 | 95.1 (9.9) | 23.1 (2.66) | −43.6 |
| MM | 100 | 102 | 1580.2 (194.76) | 108.7 (15.37) | −43.6 |
| PXI-CD | 200 | 51 | 180.0 (13.41) | 58.8 (4.4) | −82.5 |
| MM | 200 | 51 | 2797.8 (401.1) | 466.0 (88.95) | −82.6 |
| PXI-CD | 500 | 20 | 397.8 (53.73) | 258.1 (34.56) | −100.3 |
| MM | 500 | 20 | 2700.8 (366.57) | 1055.4 (179.97) | −106.6 |
| PXI-CD | 1000 | 10 | 434.7 (35.7) | 507.5 (39.91) | −82.1 |
| MM | 1000 | 10 | 3004.8 (343.61) | 2329.4 (296.37) | −91.2 |
