2.1. Log-Linear Graphical Models
Consider a vector of random variables $X = (X_v,\, v \in V)$ indexed by the set $V = \{1, \dots, p\}$ such that each $X_v$ takes values in the finite set $\mathcal{I}_v$ with $|\mathcal{I}_v|$ levels. Then the resulting counts can be presented in a $p$-dimensional contingency table corresponding to $\mathcal{I} = \times_{v \in V}\, \mathcal{I}_v$, where $\mathcal{I}$ is the set of cells $i = (i_v,\, v \in V)$ and $|\mathcal{I}| = \prod_{v \in V} |\mathcal{I}_v|$. The number of observations for cell $i$ is denoted $n(i)$, and the probability of an object being observed in cell $i$ is denoted $p(i)$. If $D \subseteq V$, the set of $D$-marginal cells is $\mathcal{I}_D = \times_{v \in D}\, \mathcal{I}_v$. For a fixed total sample size $N = \sum_{i \in \mathcal{I}} n(i)$, we assume the cell counts follow a multinomial distribution and the cell probabilities are modelled by a hierarchical log-linear model. For simplicity, in this paper, we assume all random variables are binary.
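To fix ideas, here is a minimal R illustration (simulated data; all names are ours) of the cell counts for $p = 3$ binary variables:

```r
# Illustrative only: a 2 x 2 x 2 contingency table of cell counts n(i)
# for p = 3 simulated binary variables.
set.seed(1)
X <- data.frame(
  X1 = rbinom(500, 1, 0.5),
  X2 = rbinom(500, 1, 0.4),
  X3 = rbinom(500, 1, 0.6)
)
table(X)   # the cell counts n(i), one for each of the 2^3 cells
```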
The conditional independencies between the random variables $(X_v,\, v \in V)$ can be read off an undirected graph $G = (V, E)$ with vertex set $V$ and edge set $E$; that is, $X_u$ is independent of $X_v$ given the remaining variables whenever $\{u, v\}$ is not an edge in $E$. A graph is complete if every pair of vertices has an edge. The discrete graphical model for $X$ is said to be decomposable or Markov with respect to $G$ if it corresponds to a chordal or triangulated undirected graph, meaning every cycle of length greater than or equal to 4 has a chord. Furthermore, a collection of random variables $(X_v,\, v \in V)$ with associated graph $G$ is said to be Markov relative to $G$ if, for any triple $(A, B, S)$ of disjoint sets such that $V = A \cup B \cup S$ and $S$ separates $A$ from $B$, we have $X_A \perp\!\!\!\perp X_B \mid X_S$, where $S$ is a complete subset.
For a graph $G$ and any of its decompositions $(A, B, S)$, we call the subsets $A$ and $B$ cliques and the subset $S$ a separator. The advantage of using a decomposable model is that the probability distribution of its variables can be written as a product of factors over the cliques and the separators of the corresponding decomposable graph, which allows for many convenient computational properties. If a graph is non-decomposable, it has been shown that its minimal triangulation can be used as a reasonable proxy. The minimal triangulation of a non-decomposable graph is the decomposable graph obtained by adding the fewest fill-in edges. Since the minimal triangulation of a non-decomposable graph is by definition decomposable, all of the computational advantages apply.
For example, the graph in Figure 1a is the smallest non-decomposable graph, and it has three possible triangulations. Figure 1d is the complete graph on four vertices and is a triangulation of Figure 1a; however, it is not minimal. Figure 1b,c are minimal, since removing the fill-in edge from Figure 1b, or the fill-in edge from Figure 1c, results in a non-decomposable graph. Therefore, if we want to consider the non-decomposable graph in Figure 1a as a candidate model, we calculate the required computations for either of the minimal triangulations in Figure 1b or Figure 1c and use the result to compare the graph in Figure 1a to other competing models.
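As an illustration (our own sketch, not code from the paper), the igraph package in R can test chordality and return a triangulation of the 4-cycle; note that igraph's fill-in is not guaranteed to be minimal in general, although for the 4-cycle it adds a single chord, which is minimal:

```r
library(igraph)

g <- make_ring(4)                        # the 4-cycle: smallest non-decomposable graph
ch <- is_chordal(g, fillin = TRUE, newgraph = TRUE)
ch$chordal                               # FALSE: the 4-cycle has no chord
ch$fillin                                # fill-in edge(s) that triangulate g
is_chordal(ch$newgraph)$chordal          # TRUE: the filled-in graph is decomposable
```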
The implementation of a genetic algorithm requires a measure of fitness. We use an expression proportional to the posterior probability of the graph, which requires an appropriate prior distribution. We choose the Dirichlet distribution on the log-linear parameters because it is the Diaconis-Ylvisaker (DY) conjugate prior (Diaconis and Ylvisaker [13]). The Dirichlet distribution is parametrized by fictive counts (or pseudocounts), denoted $s(i)$ for $i \in \mathcal{I}$, which sum up to the total fictive count. Dawid and Lauritzen [14] develop the hyper Dirichlet conjugate prior, which exhibits the same Markov properties when corresponding to a decomposable model. It can be shown that a posterior probability which is Markov with respect to a decomposable graph $G$ is proportional to a normalizing constant, denoted $I_G(n, s)$, which can be written as a product of gamma functions indexed over the cliques and separators of $G$, that is,
$$I_G(n, s) = \frac{\prod_{C \in \mathcal{C}} \prod_{i_C \in \mathcal{I}_C} \Gamma\bigl(n(i_C) + s(i_C)\bigr)}{\prod_{S \in \mathcal{S}} \Bigl[\prod_{i_S \in \mathcal{I}_S} \Gamma\bigl(n(i_S) + s(i_S)\bigr)\Bigr]^{\nu(S)}}, \tag{1}$$
where $\mathcal{C}$ and $\mathcal{S}$ are the sets of cliques and separators of $G$, $n$ is the vector of true cell counts, $s$ is the vector of fictive cell counts, $n(i_C)$ and $s(i_C)$ are the corresponding marginal counts, and $\nu(S)$ is the multiplicity of separator $S$. If a model corresponds to a non-decomposable graph, then we compute $I_G(n, s)$ for its minimal triangulation.
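A minimal R sketch of Equation (1) on the log scale (our own illustration; the function and argument names are ours, and we assume the fictive counts are spread uniformly within each margin):

```r
# Sketch of log I_G(n, s), assuming the cliques and the separators
# (repeated according to their multiplicities nu(S)) are supplied as
# lists of variable-name vectors, e.g., extracted with igraph.
log_norm_const <- function(data, cliques, separators, s_total = 1) {
  marg_term <- function(vars) {
    tab <- table(data[vars])   # marginal cell counts n(i_D) for D = vars
    sum(lgamma(as.vector(tab) + s_total / length(tab)))
  }
  sum(vapply(cliques, marg_term, numeric(1))) -
    sum(vapply(separators, marg_term, numeric(1)))
}

# e.g., for the chain 1 - 2 - 3, with cliques {1,2}, {2,3} and separator {2}:
# log_norm_const(X, cliques = list(c("X1", "X2"), c("X2", "X3")),
#                separators = list("X2"))
```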
We denote the number of free parameters in a decomposable model by $k$, and it can be expressed as
$$k = \sum_{C \in \mathcal{C}} \Bigl(\prod_{v \in C} |\mathcal{I}_v| - 1\Bigr) - \sum_{S \in \mathcal{S}} \nu(S) \Bigl(\prod_{v \in S} |\mathcal{I}_v| - 1\Bigr). \tag{2}$$
Since we assume that all variables are binary, in our simulations we use $\prod_{v \in C} |\mathcal{I}_v| = 2^{|C|}$ and $\prod_{v \in S} |\mathcal{I}_v| = 2^{|S|}$. However, the algorithm can be implemented for variables with more than 2 levels using Equation (2).
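Under the binary assumption, Equation (2) reduces to a few lines of R (again our own sketch, using the same clique/separator list convention as above):

```r
# Sketch of Eq. (2) for binary variables: each clique C contributes
# 2^|C| - 1 free parameters; each separator S (repeated nu(S) times
# in the list) removes 2^|S| - 1 of them.
n_free_params <- function(cliques, separators) {
  sum(2^lengths(cliques) - 1) - sum(2^lengths(separators) - 1)
}

# e.g., cliques {1,2}, {2,3} with separator {2}: (3 + 3) - 1 = 5
n_free_params(list(c(1, 2), c(2, 3)), list(2))
```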
2.2. Graphical Local Genetic Algorithm
Genetic algorithms (GAs) belong to the class of evolutionary algorithms and are used to solve optimization and search problems. They mimic the evolutionary process of natural selection, popularized by Charles Darwin. In the original algorithm, developed by Holland [8], each candidate solution corresponds to an individual in the population, which is assumed to have one chromosome. A chromosome is represented by a string of 0's and 1's, and each element of the string is called an allele. In the first generation, the population is randomly initiated and the fitness of each candidate in the population is measured. Then two parents are selected from the population, often the most 'fit', and they produce offspring by a crossover operation. The simplest crossover operation is the one-point crossover, where the chromosome of each parent is randomly cut into two segments and switched with one of the segments of the other parent to create offspring. Finally, the offspring are subject to random mutations in one or more of the alleles in their chromosome. Depending on the fitness of the offspring, they may replace existing members of the population or they may simply be added to the population. The algorithm iterates until a predetermined stopping criterion is met. There are many variations of each step in the algorithm. For more details on these variations, see Givens and Hoeting [15].
Since we are interested in graphical models, we will use adjacency matrices instead of strings of 0's and 1's. An undirected graph $G = (V, E)$ with $|V| = p$ can be represented by a $p \times p$ matrix $A = (a_{uv})$, where $a_{uv} = 1$ if $\{u, v\} \in E$ and $u \neq v$, and $a_{uv} = 0$ otherwise. For example, consider a graph $G$ with vertex set $V$ and its cliques, as seen in Figure 2a. In Figure 2b, we have its adjacency matrix, with 1's where the element of the matrix corresponds to an edge in the graph, 0's where the element corresponds to no edge, and 0's on the diagonal.
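For instance (our own sketch), an adjacency matrix can be passed to igraph to recover the graph and its cliques:

```r
library(igraph)

# Adjacency matrix of an illustrative 4-vertex graph: symmetric, 0 diagonal
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 0,
              1, 1, 0, 1,
              0, 0, 1, 0), nrow = 4, byrow = TRUE)
g <- graph_from_adjacency_matrix(A, mode = "undirected")
max_cliques(g)   # the maximal cliques, here {1,2,3} and {3,4}
```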
All of our computations are in R, where it is more practical to use the lgamma function instead of gamma, so we use the logarithm of the normalizing constant (1) to measure the fitness of each model. We use the igraph package in R to obtain the cliques and separators from the adjacency matrix of a decomposable graph, or of a minimal triangulation of a graph, and compute the sum of the logs of the gamma functions indexed according to the graph's factorization. In the case that a candidate model is non-decomposable and we compute the log of the normalizing constant for its minimal triangulation, we do not consider the minimal triangulation to be an updated version of the candidate model; we use the minimal triangulation only for its convenient computational properties. It has been shown that when comparing overfitting models, the log posterior probability will favour the model with fewer superfluous edges; however, when comparing underfitting models, it will favour the candidate model with the most true edges. This causes the genetic algorithm to be prone to keeping false edges if it means having more true edges. Therefore, we add a penalty term to our measure of fitness to prevent the algorithm from selecting a model with too many unnecessary edges. The penalty is a function of $k$, the number of free parameters in the model, as given in Equation (2).
Since the genetic algorithm does not have the same restrictions as some other model selection methods, it is sensitive to its initial conditions and can consider any possible candidate model. Seeing as our goal is to implement the GLGA in the high-dimensional setting, we must direct the algorithm towards the more suitable models. To do so, we use the Bayes factor to compare the presence of each edge versus no edges. The Bayes factor is the ratio of posterior probabilities for two models, and it is commonly used in model selection to compare candidate models. When comparing two models $M_1$ and $M_2$ with equal prior probability, the Bayes factor is a ratio of the normalizing constants (1), that is,
$$BF_{12} = \frac{I_{G_1}(n, s)}{I_{G_2}(n, s)}. \tag{3}$$
To initialize the population, we compute the edgewise Bayes factor for each possible edge; that is, we compare a candidate model with the single edge in question to the model with no edges. We use ranges of the values of the edgewise Bayes factors to determine the probability of including that edge or not in the otherwise randomly generated initial candidate models, where these ranges depend on the sample size. One of the convenient properties of the Bayes factor is that, for two decomposable models, the cliques and separators common to both models cancel out and hence need not be computed. Thus, even if we have a high-dimensional dataset, we only need to compute the Bayes factor for the given edge. For example, even for large $p$, to compute the Bayes factor for an edge $\{u, v\}$, we simply need to compare the existence of the edge versus no edge between $u$ and $v$. Therefore, it is not computationally intensive to compute the Bayes factor for each edge.
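A sketch of the edgewise computation in R (our own illustration; the exact constant terms follow from the form of (1), with uniform fictive counts assumed):

```r
# Sketch: log Bayes factor for the single edge {u, v} versus no edge.
# After cancellation only the (u, v) margin is involved: the one-edge
# model has clique {u, v}; the empty model has cliques {u} and {v}
# joined by the empty separator (the grand total).
log_edge_bf <- function(data, u, v, s_total = 1) {
  n_uv <- table(data[[u]], data[[v]])
  n_u  <- rowSums(n_uv); n_v <- colSums(n_uv); N <- sum(n_uv)
  sum(lgamma(n_uv + s_total / length(n_uv))) -
    (sum(lgamma(n_u + s_total / length(n_u))) +
       sum(lgamma(n_v + s_total / length(n_v))) -
       lgamma(N + s_total))
}
```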
In order to perform the crossover step and the mutation step, we use the upper triangular adjacency matrix. For the crossover step, we randomly select a cut-point and interchange the rows above and below the cut-point between the two parent matrices. We do this three times, using three different cut-points, to create six offspring at each crossover step. In Figure 3a, we have the upper triangular part of the adjacency matrix of a first parent, $G_1$, which represents the graph from Figure 1b. To continue our example, consider a second parent, $G_2$, with the upper triangular adjacency matrix seen in Figure 3b. Say we randomly choose to cut $G_1$ and $G_2$ between row 1 and row 2, as seen in Figure 3a,b. Then, to complete the crossover, we take row 1 from $G_1$ and rows 2–4 from $G_2$ to form one offspring (Figure 3c), and we take row 1 from $G_2$ and rows 2–4 from $G_1$ to form a second offspring (Figure 3d). In the mutation step, since the genetic algorithm tends to pick up extra edges when there are missing true edges, we simply take the upper triangular adjacency matrix for each offspring and, with a small probability, change a 1 to a 0. We do not allow for random mutations from 0 to 1.
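A compact sketch of these two operations (our own illustrative helpers, not the authors' implementation):

```r
# Crossover and mutation on upper triangular adjacency matrices.
crossover_pair <- function(U1, U2, cut) {
  # rows 1..cut from one parent and the remaining rows from the other;
  # the full adjacency matrix of an offspring O is O + t(O)
  O1 <- rbind(U1[1:cut, , drop = FALSE], U2[-(1:cut), , drop = FALSE])
  O2 <- rbind(U2[1:cut, , drop = FALSE], U1[-(1:cut), , drop = FALSE])
  list(O1, O2)
}

mutate <- function(U, prob = 0.0005) {
  # mutation only deletes edges (1 -> 0); 0 -> 1 mutations are not allowed
  ones <- which(upper.tri(U) & U == 1)
  U[ones[runif(length(ones)) < prob]] <- 0
  U
}
```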
Once we have the six offspring obtained from what we refer to as the global crossover, after the mutation step, we perform a local search if certain conditions are met. We implement the crossover-hill-climbing step from Lozano et al. [11], but applied to graphical models. We refer to crossover operations performed during the crossover-hill-climbing step as local crossovers. Let $P$ denote the current best parent and $O$ the current best offspring, with their respective fitness scores. The pseudo code for the crossover-hill-climbing operator can be found in Table 1.
Since the local search can add unnecessary computational expense, Lozano et al. [
11] only carry out the crossover-hill-climbing step with probability 1 if the best new offspring is better than the worst member of the population, or with probability 0.0625 otherwise. We opt to carry out the local search only if the best new offspring is better than the worst member of the population. Lozano et al. [
11] implement specific mating, global crossover, mutation and replacement strategies, which keep the population diverse; however, we choose more targeted strategies. Only our local crossover operation follows their approach. Our global crossover step and our mutation step are similar to those of the generic genetic algorithm, with a mutation probability of 0.0005.
In the low-dimensional case, we initiate a base population of models and then, for each of the global iterations, we add new randomly generated models to add diversity to the population. Thus, the regular GLGA considers a total of 120 models, not including the offspring. We do at most one local iteration because, since we control the initialization of the matrices with the edgewise Bayes factor, the local searches converge quickly. We select the matrix with the highest edgewise values, up until a cut-off determined by the sample size, as one parent; it mates in every iteration with a second parent, the model with the highest fitness score in the current population. We keep an offspring from the global crossover step or the local crossover step only if it has a higher fitness score than the member of the population with the current highest fitness score. The pseudo code for the graphical local genetic algorithm can be found in Table 2. Note that in our simulation results we use fixed values of these tuning parameters: the initial population size, the number of global iterations, the number of models added per iteration, the number of local iterations, and the mutation probability.
2.3. Initiating the Matrices
The advantage of the graphical local genetic algorithm is its flexibility; however, it is sensitive to its initialization. Thus, we use the edgewise Bayes factor (3) and the sample size to guide the search. Once we compute the Bayes factor for each edge, we use the distribution of these edgewise values as a guideline to choose two cut-off points: edges with a value greater than or equal to the first cut-off are initialized with a high probability, and edges with a value less than or equal to the second cut-off are initialized with a low probability. The values of these cut-off points depend on the sample size.
We use the first and third quartiles as guidelines for the cut-off values. Edges corresponding to Bayes factor values greater than the third quartile are included with probability 0.9, and edges corresponding to Bayes factor values between the first and the third quartile are included with probability 0.4. To be conservative, for sample sizes over 1000, we round the third quartile value up and the first quartile value down to the nearest number in the set {0, 5, 10, 100, 500, 1000, 5000, 10,000}. Therefore, there are few edges considered with high probability and few edges that are not considered. In general, for sample sizes over 100,000 we take larger cut-off points, for sample sizes between 5000 and 100,000 we take cut-offs of 100 and 10, and for sample sizes under 1000 we take smaller ones. For example, if we have a sample size of 6000, then we compute (3) for each edge. The edges with corresponding edgewise Bayes factor values greater than or equal to 100 will be initialized with probability 0.9, the edges with values between 10 and 100 will be initialized with probability 0.4, and the edges with values less than 10 will not be considered.
These are general guidelines for fitting a predictive model. If it is desired to obtain a sparse model for interpretation, then the cut-off points can be made more conservative and the edge probabilities can be decreased. A histogram showing the distribution of the edgewise Bayes factor values can help make the decision.
2.4. High-Dimensional Setting
The regular graphical local genetic algorithm we have just described works well for up to 20 variables. For larger numbers of variables, we randomly select overlapping subsets of eight variables and perform the algorithm as usual. We store the resulting top submodels as an array of adjacency matrices, then take the union of all their edges to reconstruct the full model. We choose subsets of eight because the algorithm works well with this number of variables for a variety of edge densities. The number of subsets is chosen depending on the number of variables; however, it is preferable to fit too many submodels than too few. In the high-dimensional case, we do not need to initiate as many matrices and we do not need as many global iterations, because the subsets will be relatively sparse. We initiate a smaller base population of models and then, for each of the global iterations, add newly generated models. Thus, the high-dimensional GLGA considers a total of 25 models, not including the offspring.
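A sketch of the reconstruction step (our own illustrative helper):

```r
# Combine submodels fitted on overlapping subsets into the full graph.
# 'submodels' is a list of adjacency matrices and 'subsets' the list of
# variable indices each submodel was fitted on.
combine_submodels <- function(submodels, subsets, p) {
  A <- matrix(0, p, p)
  for (k in seq_along(submodels)) {
    v <- subsets[[k]]
    A[v, v] <- pmax(A[v, v], submodels[[k]])   # union of the edges
  }
  A
}
```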
Since we are computing the log of the normalizing constant for subsets of the graph, it can occur that false edges are retained in the model. To combat this, for sample sizes of 5000 or more, we follow the general guidelines for cut-off points in Section 2.3, and for sample sizes under 5000, we take the upper cut-off as 10 and the lower cut-off as the second lowest edgewise Bayes factor value. The edges corresponding to high Bayes factor values are included with probability 0.9; however, the edges with corresponding values between the two cut-offs are included with probability 0.1. We lower this second probability since each subset is less dense than the full graph. Furthermore, to reduce superfluous edges, after the final graph is constructed, if there are 3-cycles such that two of the edges correspond to high edgewise Bayes factor values and the third edge corresponds to a lower value, we delete the third edge with probability 0.8, as sketched below. This reduces the number of false edges that accumulate over the course of the algorithm. As stated earlier, it is preferable to fit too many submodels than too few, and this adjustment to remove extra edges means that we do not need to worry about searching too many subsets. In Section 3.1, we use 600 subsets of eight variables to fit two models with 100 variables with two different densities. We use the same set-up to fit both models and obtain favourable results, which demonstrates that this adjustment corrects for taking too many subsets. If we want the resulting model to be sparse for interpretation purposes, then we adjust the initial settings instead of taking fewer subsets.
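The triangle-pruning adjustment can be sketched as follows (our own code; the threshold for a 'high' value would be chosen as in Section 2.3):

```r
# Sketch: prune the weak edge of each 3-cycle in the reconstructed graph.
# 'A' is the adjacency matrix and 'bf' the matrix of edgewise Bayes factors.
prune_triangles <- function(A, bf, high = 100, p_del = 0.8) {
  p <- nrow(A)
  tri <- combn(p, 3)
  for (k in seq_len(ncol(tri))) {
    v <- tri[, k]
    if (sum(A[v, v]) == 6) {                    # all three edges present
      e <- rbind(v[c(1, 1, 2)], v[c(2, 3, 3)])  # the three edges as columns
      w <- bf[t(e)]
      if (sum(w >= high) == 2 && runif(1) < p_del) {
        weak <- e[, which.min(w)]               # delete the weakest edge
        A[weak[1], weak[2]] <- A[weak[2], weak[1]] <- 0
      }
    }
  }
  A
}
```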
2.5. Scalability of Algorithm
Most of the computations are performed on arrays of matrices and can be done quickly. The bottleneck of the algorithm is computing the fitness of each model. Both the log of the normalizing constant and the penalty slow down the computation because we must iterate over all the cliques and separators. Since the fitness of each model can be computed separately, we can use parallel computing to decrease the computing time.
In R, the packages foreach and doParallel allow us to use parallel execution on seven cores of the computer, which significantly reduces the computing time. For example, to compute the log of the normalizing constant for 120 models with eight vertices, the regular 'for' loop takes 4.2213 s and the 'foreach' loop takes 1.3114 s. For 120 models with 20 vertices, the regular 'for' loop takes 1.3405 min and the 'foreach' loop takes 31.5809 s. In the low-dimensional setting, meaning up to 20 variables, the time it takes to run the algorithm is still manageable, with a full run taking 36.1979 s.
We also use parallel computing to compute the edgewise Bayes factor when generating the initial model population. For $p = 100$, there are 4950 edges we need to consider. It takes the regular 'for' loop 2.8219 min to compute the Bayes factor indicating the presence or absence of each edge, and the 'foreach' loop takes 1.4356 min.
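A minimal sketch of this parallel pattern (our own code; log_edge_bf() is the illustrative function from Section 2.2):

```r
library(foreach)
library(doParallel)

registerDoParallel(cores = 7)     # seven cores, as in the timings above

# All p(p-1)/2 candidate edges of the data frame X
edges <- t(combn(ncol(X), 2))
bf <- foreach(k = seq_len(nrow(edges)), .combine = c) %dopar%
  log_edge_bf(X, edges[k, 1], edges[k, 2])

stopImplicitCluster()
```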