Article

Graphical Local Genetic Algorithm for High-Dimensional Log-Linear Models

Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(11), 2514; https://doi.org/10.3390/math11112514
Submission received: 6 April 2023 / Revised: 18 May 2023 / Accepted: 24 May 2023 / Published: 30 May 2023

Abstract

Graphical log-linear models are effective for representing complex structures that emerge from high-dimensional data. Fitting an appropriate model in the high-dimensional setting is challenging, and many existing methods rely on a convenient class of models, called decomposable models, which lend themselves well to a stepwise approach. However, these methods restrict the pool of candidate models over which they can search, and they are difficult to scale. It can be shown that a non-decomposable model can be approximated by the decomposable model given by its minimal triangulation, thus extending the convenient computational properties of decomposable models to any model. In this paper, we propose a local genetic algorithm with a crossover-hill-climbing operator, adapted for log-linear graphical models. We show that the graphical local genetic algorithm can be used successfully to fit non-decomposable models for both a low number of variables and a high number of variables. We use the posterior probability as a measure of fitness and parallel computing to decrease the computation time.

1. Introduction

The most commonly used graphical log-linear models are hierarchical models which are determined by their two-way interactions, meaning that for every higher-order term in the model, the model also contains the corresponding lower-order terms. A log-linear model obtained from a p-dimensional contingency table can be represented by an undirected graph $G = (V, E)$ with vertex set $V = \{1, 2, \ldots, p\}$ and edge set $E \subseteq V \times V$. If the graphical log-linear model corresponds to a chordal (decomposable) graph, then it is called a decomposable model; otherwise, it is called nonchordal (non-decomposable). Frequently, graphical model selection consists of forward or backward elimination procedures on decomposable graphs due to the decomposable chain rule; that is, one can construct either an increasing or decreasing sequence of decomposable graphs differing by one edge (Lauritzen [1]). However, for p variables there are $2^{p(p-1)/2}$ possible models, so these methods become computationally intensive for high-dimensional models.
In Gauraha [2] and Gauraha and Parui [3], the authors present a forward selection method for low-dimensional graphs, using the mutual conditional independence between vertices to reduce the search space and, in turn, the computational complexity. Another popular method is the graphical lasso. The original approach was for Gaussian graphical models, but it has since been extended to log-linear models with many variations. For example, Allen and Liu [4] propose the Poisson graphical lasso, and Dahinden et al. [5] offer a variation of the group lasso in which they learn subsets of the graph and then reconstruct the original graph.
In a high-dimensional setting, Petitjean et al. [6] present their approach, called Chordalysis. It is a forward selection method in which they use data mining techniques to store and reuse the computed marginal likelihood ratios. They demonstrate that their method is efficient and effective for up to 150 variables. However, the efficiency of their algorithm relies on the decomposable property of the candidate graphs, and the sensitivity of the algorithm decreases rapidly as the sample size decreases. Dobra and Mohammadi [7] implement a Birth–Death Markov Chain Monte Carlo (BDMCMC) algorithm using a marginal posterior probability based on the marginal pseudo-likelihood with a Dirichlet prior to define the birth and death probabilities. To speed up their algorithm, they evaluate all of the possible edges using parallel computing.
Model selection for discrete variables can be particularly challenging, because the typical optimization methods borrowed from calculus are not applicable. A popular approach for binary variables is to use a genetic algorithm, first introduced by Holland [8], which imitates Darwinian natural selection. The genetic algorithm is an iterative process where the binary elements represent chromosomes. During each iteration, or generation, two candidate parents are selected from the population using a measure of fitness. Then their chromosomes are combined using a crossover operation to produce offspring which are subject to random chromosomal mutation. Finally, the new offspring are introduced into the population. Since its inception, many variations for each step of the genetic algorithm have been introduced.
Poli and Roverato [9], and Blauth and Pigeot [10] both proposed genetic algorithms for graphical models with a low number of variables—ten variables and six variables, respectively. Poli and Roverato [9] used Akaike's Information Criterion as their measure of fitness together with the elitism variation, meaning the top 5% of candidates of the current population are carried into the next generation. Their main contribution was how they exploited the hierarchical properties of the candidate models in the crossover step: the parents exchanged randomly selected subsets of their corresponding graphs, thus reducing the required computations. Blauth and Pigeot [10] used the Bayesian Information Criterion as their measure of fitness and tournament selection, which considers the ranks of the candidate chromosomes. Other variations of the genetic algorithm include local searches. Lozano et al. [11] give a real-coded local genetic algorithm and García-Martínez et al. [12] give a binary-coded local genetic algorithm. The key in both these contributions is to balance the diversity of the global search while fine-tuning the local search. In the local search, they propose what they call the crossover-hill-climbing operator, meaning the most fit offspring replaces the worst parent and reproduces with the best parent for a predetermined number of iterations. Lozano et al. [11] use negative assortative mating, meaning that the candidate parents selected are the most different from each other. Conversely, García-Martínez et al. [12] use positive assortative mating.
In the following, we present a local genetic algorithm which implements the crossover-hill-climbing operator for log-linear graphical models, using a normalizing constant proportional to the posterior probability as a measure of fitness. Since the genetic algorithm has no stepwise component, we are not constrained to the class of decomposable models. It can be shown that a model corresponding to a non-decomposable graph can be approximated by its minimal triangulation, which is by definition a decomposable graph. This allows us to benefit from convenient properties of decomposable graphs when computing the posterior probability, while also being able to consider a wider variety of candidate models. In order to focus the search, we use what we call the edgewise Bayes factor to initialize the candidate models. In the low-dimensional setting, we perform our algorithm on the entire graph; in the high-dimensional setting, we find appropriate candidate subsets of the graph, then reconstruct the full graph from a predetermined number of subsets. In Section 2, we give an overview of log-linear graphical models and define the posterior probability for a decomposable graph. Then, we describe the graphical local genetic algorithm and how we use adjacency matrices to perform each step of the algorithm. In Section 3, we give our experimental results. We perform simulations for the number of variables $p \in \{8, 20, 100\}$ in Section 3.1 and we apply our algorithm to a real dataset in Section 3.2. In Section 4, we discuss the conclusions drawn from our results. Additional simulation results are included in Appendix A.

2. Materials and Methods

In this section, we first describe the log-linear graphical model and give the necessary background from graph theory to illustrate how we compute an expression proportional to the posterior probability corresponding to both decomposable and non-decomposable graphs. For additional details on graph theory and graphical models, see Lauritzen [1]. Then we explain the global and local components of the Graphical Local Genetic Algorithm (GLGA), and how we implement the algorithm in a high-dimensional setting. Finally, we discuss the complexity of the algorithm and the computing technique we used to speed up certain steps.

2.1. Log-Linear Graphical Models

Consider a vector of random variables $X = (X_v,\, v \in V)$ indexed by the set $V = \{1, 2, \ldots, p\}$, such that each $X_v$ takes values in the finite set $\mathcal{I}_v$ with $|\mathcal{I}_v|$ levels. Then the resulting counts can be presented in a p-dimensional contingency table corresponding to
$$\mathcal{I} = \times_{v \in V}\, \mathcal{I}_v,$$
where $\mathcal{I}$ is the set of cells $i = (i_v,\, v \in V)$ and $i_v \in \mathcal{I}_v$. The number of observations for cell i is denoted $n(i)$ and the probability of an object being observed in cell i is denoted $p(i)$. If $D \subseteq V$, the set of D-marginal cells is $i_D = (i_v,\, v \in D)$. For $N = \sum_{i \in \mathcal{I}} n(i)$, we assume the cell counts follow a multinomial distribution and the cell probabilities are modelled by a hierarchical log-linear model. For simplicity, in this paper, we assume all random variables are binary.
The conditional independencies between the random variables $X_v$ can be read off an undirected graph $G = (V, E)$ with vertex set V and edge set $E \subseteq V \times V$; that is, $X_a$ is independent of $X_b$ given $X_{V \setminus \{a,b\}}$ whenever $(a, b)$ is not an edge in E. A graph is complete if every pair of vertices is joined by an edge. The discrete graphical model for X is said to be decomposable if it corresponds to a chordal, or triangulated, undirected graph, meaning every cycle of length greater than or equal to 4 has a chord. Furthermore, a collection of random variables $(X_v)_{v \in V}$ with associated graph G is said to be Markov relative to G if, for any triple of disjoint sets $(A, B, S)$ with $V = A \cup B \cup S$ and S a complete subset separating A from B,
$$X_A \perp\!\!\!\perp X_B \mid X_S.$$
For a graph G and its decompositions $(A, B, S)$, recursively decomposing G yields its maximal complete subsets, called cliques, and the separating sets S, called separators. The advantage of using a decomposable model is that the probability distribution of its variables can be written as a product of factors over the cliques $C \in \mathcal{C}$ and the separators $S \in \mathcal{S}$ of the corresponding decomposable graph. This allows for many convenient computational properties. If a graph is non-decomposable, it has been shown that its minimal triangulation can be used as a reasonable proxy. The minimal triangulation of a non-decomposable graph is the graph obtained by adding the fewest fill-in edges needed to produce a decomposable graph. Since the minimal triangulation of a non-decomposable graph is by definition decomposable, all of the computational advantages apply.
For example, the graph in Figure 1a is the smallest non-decomposable graph and it has three possible triangulations. Figure 1d is the complete graph on four vertices; it is a triangulation of Figure 1a, but it is not minimal. Figure 1b,c are minimal, since removing the edge $(b, c)$ from Figure 1b, or the edge $(a, d)$ from Figure 1c, results in a non-decomposable graph. Therefore, if we want to consider the non-decomposable graph in Figure 1a as a candidate model, we carry out the required computations on either minimal triangulation, Figure 1b or Figure 1c, and use the result to compare the graph in Figure 1a to other competing models.
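In practice, a triangulation can be obtained programmatically. Below is a minimal R sketch using the igraph package (which we use throughout); note that is_chordal() returns a fill-in based on maximum cardinality search, which is not guaranteed to be minimal, so it is only an approximation to the minimal triangulation discussed here.

```r
library(igraph)

# A: 0/1 adjacency matrix of the candidate graph; here the 4-cycle of Figure 1a
A <- rbind(c(0, 1, 1, 0),
           c(1, 0, 0, 1),
           c(1, 0, 0, 1),
           c(0, 1, 1, 0))
g <- graph_from_adjacency_matrix(A, mode = "undirected")

# Test chordality and, if the graph is non-decomposable, retrieve a
# triangulated graph to use as the computational proxy
ch <- is_chordal(g, fillin = TRUE, newgraph = TRUE)
g_tri <- if (ch$chordal) g else ch$newgraph
```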
The implementation of a genetic algorithm requires a measure of fitness. We use an expression proportional to the posterior probability $f(G \mid x)$, which requires an appropriate prior distribution. We choose the Dirichlet distribution on the log-linear parameters because it is the Diaconis–Ylvisaker (DY) conjugate prior (Diaconis and Ylvisaker [13]). The Dirichlet distribution is parametrized by fictive counts (or pseudocounts), denoted $s(i)$ for $i \in \mathcal{I}$, which sum to $\alpha$. Dawid and Lauritzen [14] develop the hyper Dirichlet conjugate prior, which exhibits the same Markov properties when corresponding to a decomposable model. It can be shown that a posterior probability which is Markov with respect to a decomposable graph G is proportional to a normalizing constant, denoted $I_G(n+s, N+\alpha)$, which can be written as a product of gamma functions indexed over the cliques and separators of G; that is,
$$f(G \mid x) \propto I_G(n+s,\, N+\alpha) = \frac{\prod_{C \in \mathcal{C}} \prod_{i_C \in \mathcal{I}_C} \Gamma\big(n_C(i_C) + s_C(i_C)\big)}{\Gamma(N+\alpha) \prod_{S \in \mathcal{S}} \prod_{i_S \in \mathcal{I}_S} \Gamma\big(n_S(i_S) + s_S(i_S)\big)^{\nu(S)}}, \qquad (1)$$
where n is the vector of true cell counts, s is the vector of fictive cell counts, and $\nu(S)$ is the multiplicity of separator S. If a model corresponds to a non-decomposable graph, then we compute $I_G(n+s, N+\alpha)$ for its minimal triangulation.
We denote the number of free parameters in a decomposable model by k; it can be expressed as
$$k = -1 + \sum_{C \in \mathcal{C}} |\mathcal{I}_C| - \sum_{S \in \mathcal{S}} \nu(S) \cdot |\mathcal{I}_S|. \qquad (2)$$
Since we assume that all variables are binary, in our simulations we use $|\mathcal{I}_C| = 2^{|C|}$ and $|\mathcal{I}_S| = 2^{|S|}$. However, the algorithm can be implemented for variables with more than 2 levels using Equation (2).

2.2. Graphical Local Genetic Algorithm

Genetic algorithms (GAs) belong to the class of evolutionary algorithms and are used to solve optimization and search problems. They mimic the evolutionary process of natural selection, popularized by Charles Darwin. In the original algorithm, developed by Holland [8], each candidate solution corresponds to an individual in the population which is assumed to have one chromosome. A chromosome is represented by a string of 0's and 1's, and each element of the string is called an allele. In the first generation, the population is randomly initialized and the fitness of each candidate in the population is measured. Then two parents are selected from the population, often the most 'fit', and they produce offspring by a crossover operation. The simplest crossover operation is the one-point crossover, where the chromosome of each parent is randomly cut into two segments and one segment is exchanged with the corresponding segment of the other parent to create offspring. Finally, the offspring are subject to random mutations in one or more of the alleles in their chromosome. Depending on the fitness of the offspring, they may replace existing members of the population or they may simply be added to the population. The algorithm iterates until a predetermined stopping criterion is met. There are many variations of each step in the algorithm. For more details on these variations, see Givens and Hoeting [15].
Since we are interested in graphical models, we use adjacency matrices instead of strings of 0's and 1's. An undirected graph $G = (V, E)$ with $|V| = p$ can be represented by a $p \times p$ matrix $A = (a_{ij})$, where $a_{ij} = 1$ if $(i, j) \in E$ and $i \neq j$, and $a_{ij} = 0$ otherwise. For example, consider a graph $G_1$ with vertices $V = \{a, b, c, d\}$ and cliques $\{abc, bcd\}$, as seen in Figure 2a. Figure 2b shows its adjacency matrix, with a 1 wherever the corresponding pair of vertices shares an edge in the graph, a 0 wherever it does not, and 0's along the diagonal.
All of our computations are in R, where it is more practical to use the lgamma function, i.e., $\log \Gamma(\cdot)$, instead of $\Gamma(\cdot)$, so we use the logarithm of the normalizing constant (1) to measure the fitness of each model. We use the igraph package in R to obtain the cliques and separators from the adjacency matrix of a decomposable graph, or of a minimal triangulation of a graph, and compute the sum of the logarithms of the gamma functions indexed according to the graph's factorization. When a candidate model is non-decomposable and we compute the log normalizing constant for its minimal triangulation, we do not consider the minimal triangulation to be an updated version of the candidate model; we use the minimal triangulation only for its convenient computational properties. It has been shown that when comparing overfitting models, the log posterior probability favours the model with fewer superfluous edges; however, when comparing underfitting models, it favours the candidate model with the most true edges. This causes the genetic algorithm to be prone to keeping false edges if doing so yields more true edges. Therefore, we add a penalty term to our measure of fitness to prevent the algorithm from selecting a model with too many unnecessary edges. We use the penalty $k \log(N + \alpha)$, where k is the number of free parameters in the model, given by (2).
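As an illustration, the following is a minimal R sketch of this fitness computation under stated assumptions: `dat` is the 0/1 data matrix with columns named by vertices, `cliques` and `seps` are lists of vertex-name vectors obtained elsewhere (e.g., from a junction tree of the triangulated graph), with one entry per separator occurrence so that multiplicities $\nu(S)$ are handled by repetition, and the fictive counts are taken uniform over the marginal cells. The names `log_norm_const` and `marg_lgamma` are ours, not from the paper's code.

```r
# Sketch: log of the normalizing constant (1) for a decomposable graph,
# assuming uniform fictive counts s(i) = alpha / (number of marginal cells).
log_norm_const <- function(dat, cliques, seps, alpha = 1) {
  N <- nrow(dat)
  marg_lgamma <- function(vs) {
    if (length(vs) == 0) return(lgamma(N + alpha))   # empty separator: one cell
    s <- alpha / 2^length(vs)
    # cross-tabulate the marginal, keeping unobserved cells as zero counts
    df <- as.data.frame(lapply(as.data.frame(dat[, vs, drop = FALSE]),
                               factor, levels = c(0, 1)))
    sum(lgamma(as.vector(table(df)) + s))
  }
  sum(vapply(cliques, marg_lgamma, numeric(1))) -
    lgamma(N + alpha) -
    sum(vapply(seps, marg_lgamma, numeric(1)))
}
```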
Since the genetic algorithm does not have the same restrictions as some other model selection methods, it can consider any possible candidate model, but it is sensitive to its initial conditions. Seeing as our goal is to implement the GLGA in the high-dimensional setting, we must direct the algorithm towards the more suitable models. To do so, we use the Bayes factor to compare the presence of each edge versus no edges. The Bayes factor is the ratio of posterior probabilities of two models, and it is commonly used in model selection to compare candidate models. When comparing two models $G_a$ and $G_b$ with equal prior probability, the Bayes factor is a ratio of the normalizing constants (1); that is,
$$BF_{G_a, G_b} = \frac{I_{G_a}(n+s,\, N+\alpha)}{I_{G_b}(n+s,\, N+\alpha)}. \qquad (3)$$
To initialize the population, we compute the edgewise Bayes factor for each possible edge; that is, we compare the candidate model containing only the single edge in question to the model with no edges. We use ranges of the values of the edgewise Bayes factors to determine the probability of including each edge in the otherwise randomly generated initial candidate models, where these ranges depend on the sample size. One of the convenient properties of the Bayes factor is that, for two decomposable models, the cliques and separators common to both models cancel out and hence need not be computed. Thus, even for a high-dimensional dataset, we only need to compute the Bayes factor for the given edge. For example, even if $p = 100$, to compute the Bayes factor for the edge $\{a, b\}$, we simply compare the model with the edge $\{a, b\}$ present to the model in which a and b are disconnected. Therefore, it is not computationally intensive to compute the Bayes factor for each edge.
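Under the same assumptions as the sketch above, the edgewise Bayes factor reduces to a comparison of the two-variable marginal tables, since all other cliques and separators cancel; `log_edge_bf` is again our own illustrative name.

```r
# Sketch: log of the edgewise Bayes factor (3) for edge {a, b}, reusing
# log_norm_const(); only the marginal on (a, b) matters after cancellation.
log_edge_bf <- function(dat, a, b, alpha = 1) {
  log_norm_const(dat, cliques = list(c(a, b)), seps = list()) -
    log_norm_const(dat, cliques = list(a, b), seps = list(character(0)))
}
```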
In order to perform the crossover step and the mutation step, we use the upper triangular part of the adjacency matrix. For the crossover step, we randomly select a cut-point and interchange the rows above and below the cut-point between the two parent matrices. We do this three times, using three different cut-points, to create six offspring at each crossover step. In Figure 3a, we have the upper triangular adjacency matrix representing the graph $G_1$ from Figure 2a. To continue our example, consider a graph $G_2$ with edge set $\{ac, ad, cd\}$ and the upper triangular adjacency matrix seen in Figure 3b. Say we randomly choose to cut $G_1$ and $G_2$ between row 1 and row 2, as seen in Figure 3a,b. To complete the crossover, we take row 1 from $G_1$ and rows 2–4 from $G_2$ to form one offspring (Figure 3c), and we take row 1 from $G_2$ and rows 2–4 from $G_1$ to form a second offspring (Figure 3d). In the mutation step, since the genetic algorithm tends to pick up extra edges when there are missing true edges, we simply take the upper triangular adjacency matrix of each offspring and, with a small probability, change a 0 to a 1. We do not allow random mutations from 1 to 0.
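The row-swap crossover and the edge-adding mutation can be sketched in R as follows (assumptions: `A1` and `A2` are $p \times p$ 0/1 matrices holding entries only above the diagonal, and the function names are ours).

```r
# Sketch of the one-point row crossover on upper triangular adjacency matrices.
crossover <- function(A1, A2, cut) {
  off1 <- A1; off2 <- A2
  rows <- (cut + 1):nrow(A1)       # rows below the cut-point are exchanged
  off1[rows, ] <- A2[rows, ]
  off2[rows, ] <- A1[rows, ]
  list(off1, off2)
}

# Edge-adding mutation: flip 0 -> 1 above the diagonal with small probability;
# mutations from 1 to 0 are not allowed, as described above.
mutate <- function(A, pmut = 5e-4) {
  flip <- upper.tri(A) & A == 0 & matrix(runif(length(A)), nrow(A)) < pmut
  A[flip] <- 1L
  A
}
```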
Once we have the six offspring obtained from what we refer to as the global crossover and the mutation step, we perform a local search if certain conditions are met. We implement the crossover-hill-climbing step from Lozano et al. [11], applied to graphical models. We refer to crossover operations performed during the crossover-hill-climbing step as local crossover. Let $p_1$ be the current best parent, $p_2$ the current best offspring, $n_t = 3$, and $n_{off} = 6$. The pseudocode for the crossover-hill-climbing operator can be found in Table 1.
Since the local search can add unnecessary computational expense, Lozano et al. [11] carry out the crossover-hill-climbing step with probability 1 if the best new offspring is better than the worst member of the population, and with probability 0.0625 otherwise. We opt to carry out the local search only if the best new offspring is better than the worst member of the population. Lozano et al. [11] implement specific mating, global crossover, mutation, and replacement strategies which keep the population diverse; however, we choose more targeted strategies. Only our local crossover operation follows their approach; our global crossover step and our mutation step are similar to the generic genetic algorithm, with a mutation probability of 0.0005.
In the low-dimensional case, we initialize $m_1 = 20$ models; then, for each of the $n_{it} = 20$ global iterations, we add $m_2 = 5$ new randomly generated models to add diversity to the population. Thus, the regular GLGA considers a total of 120 models, not including the offspring. We perform at most $n_t = 3$ local iterations because, since we control the initialization of the matrices with the edgewise Bayes factor, the local searches converge quickly. As one parent, we select the matrix containing the edges with the highest edgewise values, up to a cut-off determined by the sample size; it mates in every iteration with a second parent, namely the model with the highest fitness score in the current population. We only keep the offspring from the global crossover step and the offspring from the local crossover step if they have higher fitness scores than the member of the population with the current highest fitness score. The pseudocode for the graphical local genetic algorithm can be found in Table 2. Note that in our simulation results we use $m_1 = 20$, $m_2 = 5$, $n_{it} = 20$, $n_t = 3$, and $x = 3$.

2.3. Initiating the Matrices

The advantage of the graphical local genetic algorithm is its flexibility; however, it is sensitive to its initialization. Thus, we use the edgewise Bayes factor (3) and the sample size to guide the search. Once we compute the Bayes factor for each edge, we use the distribution of these edgewise values as a guideline to choose two cut-off points: the first indicates that edges with a corresponding value greater than or equal to it are initialized with a high probability, and the second indicates that edges with a corresponding value less than or equal to it are initialized with a low probability. The values of these cut-off points depend on the sample size.
We use the first and third quartiles as guidelines for the cut-off values. Edges corresponding to Bayes factor values greater than the third quartile are included with probability 0.9, and edges corresponding to Bayes factor values between the first and third quartiles are included with probability 0.4. To be conservative, for sample sizes over 1000, we round the third quartile up and the first quartile down to the nearest number in the set {0, 5, 10, 100, 500, 1000, 5000, 10,000}. Therefore, few edges are considered with high probability and few edges are excluded outright. In general, for sample sizes over 100,000 we take the cut-off points {100, 1000}; between 5000 and 100,000 we take {10, 100}; and under 1000 we take {7, 10}. For example, if we have a sample size of 6000, then we compute (3) for each edge. Edges with edgewise Bayes factor values greater than or equal to 100 are initialized with probability 0.9, edges with values between 10 and 100 are initialized with probability 0.4, and edges with values less than 10 are not considered.
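A minimal R sketch of this initialization, under the assumption that `bf` is a $p \times p$ matrix of edgewise Bayes factor values and `lo`/`hi` are the chosen cut-offs (e.g., 10 and 100 for a sample size of 6000); the function name is ours.

```r
# Sketch: generate one initial adjacency matrix from edgewise Bayes factors.
init_matrix <- function(bf, lo, hi) {
  p <- nrow(bf)
  prob <- matrix(0, p, p)                # edges below `lo` are never included
  prob[bf >= lo & bf < hi] <- 0.4        # middling evidence: include w.p. 0.4
  prob[bf >= hi] <- 0.9                  # strong evidence: include w.p. 0.9
  A <- matrix(0L, p, p)
  up <- which(upper.tri(bf))
  A[up] <- rbinom(length(up), 1, prob[up])
  A                                      # upper triangular 0/1 matrix
}
```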
These are general guidelines for fitting a predictive model. If a sparse model is desired for interpretation, then the cut-off points can be made more conservative and the edge probabilities can be decreased. A histogram of the edgewise Bayes factor values can help guide this decision.

2.4. High-Dimensional Setting

The regular graphical local genetic algorithm we just described works well for up to 20 variables. For a larger number of variables, we randomly select overlapping subsets of eight variables and perform the algorithm as usual. We store the resulting top submodels as an array of adjacency matrices, then take the union of all their edges to reconstruct the full model. We choose subsets of eight because the algorithm works well with this number of variables for a variety of edge densities. The number of subsets is chosen depending on the number of variables; however, it is preferable to fit too many submodels than too few. In the high-dimensional case, we do not need to initialize as many matrices, nor run as many global iterations, because the subsets are relatively sparse. We initialize $m_1 = 10$ models; then, for each of the $n_{it} = 3$ global iterations, we add $m_2 = 5$ newly generated models. Thus, the high-dimensional GLGA considers a total of 25 models, not including the offspring.
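The union step can be sketched as follows (assumption: each fitted submodel has been embedded back into a $p \times p$ matrix over the full vertex set, with zeros outside its subset of eight variables; `reconstruct` is our own name).

```r
# Sketch: reconstruct the full graph as the union of edges over all submodels.
reconstruct <- function(submodels) {
  A <- Reduce(`+`, submodels)   # count how many submodels selected each edge
  (A > 0) * 1L                  # keep an edge if any submodel selected it
}
```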
Since we compute the log of the normalizing constant on subsets of the graph, false edges can be retained in the model. To combat this, for sample sizes of 5000 or more, we follow the general guidelines for cut-off points in Section 2.3, and for sample sizes under 5000, we take the upper cut-off as 10 and the lower cut-off as the second-lowest edgewise Bayes factor value. The edges corresponding to high Bayes factor values are included with probability 0.9; however, the edges with values between the two cut-offs are included with probability 0.1. We lower this second probability because each subset is less dense than the full graph. Furthermore, to reduce superfluous edges, after the final graph is constructed, if there is a 3-cycle in which two of the edges correspond to high edgewise Bayes factor values and the third edge corresponds to a lower value, we delete the third edge with probability 0.8. This reduces the number of false edges that accumulate over the course of the algorithm. As stated earlier, it is preferable to fit too many submodels than too few, and this adjustment to remove extra edges means that we do not need to worry about searching too many subsets. In Section 3.1, we use 600 subsets of eight variables to fit two models with p = 100 variables and two different densities. We use the same setup to fit both models and obtain favourable results, which demonstrates that this adjustment corrects for taking too many subsets. If we want the resulting model to be sparse for interpretation purposes, then we adjust the initial settings instead of taking fewer subsets.

2.5. Scalability of Algorithm

Most of the computations are performed on arrays of matrices and can be done quickly. The bottleneck of the algorithm is computing the fitness of each model. Both the log of the normalizing constant and the penalty slow down the computation because we must iterate over all the cliques and separators. Since the fitness of each model can be computed separately, we can use parallel computing to decrease the computing time.
In R, the packages foreach and doParallel allow us to execute in parallel on seven cores of the computer, which significantly reduces the computing time. For example, to compute the log of the normalizing constant for 120 models with eight vertices, the regular 'for' loop takes 4.2213 s and the 'foreach' loop takes 1.3114 s. For 120 models with 20 vertices, the regular 'for' loop takes 1.3405 min and the 'foreach' loop takes 31.5809 s. In the low-dimensional setting, meaning up to 20 variables, the time it takes to run the algorithm remains manageable, with p = 20 taking 36.1979 s.
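A hedged sketch of this parallel fitness evaluation is given below (assumptions: `models` is a list of adjacency matrices and `fitness()` wraps the log normalizing constant plus the penalty term from Section 2.2; neither name is from the paper's code).

```r
library(foreach)
library(doParallel)

cl <- makeCluster(7)                       # seven worker cores, as in the text
registerDoParallel(cl)
scores <- foreach(A = models, .combine = c, .packages = "igraph") %dopar% {
  fitness(A)                               # each model is scored independently
}
stopCluster(cl)
```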
We also use parallel computing to compute the edgewise Bayes factors when generating the initial model population. For p = 100, there are 4950 edges to consider. It takes the regular 'for' loop 2.8219 min to compute the Bayes factor indicating the presence or absence of each edge, and the 'foreach' loop takes 1.4356 min.

3. Results

In this section, we give the results of experiments with simulated datasets for $p \in \{8, 100\}$ for various sample sizes from 100 to 500,000. Additional simulation results for $p \in \{6, 12, 20, 50\}$ are found in Appendix A. Furthermore, we implement the GLGA on a real-world dataset with p = 32.

3.1. Simulated Data Sets

We test the GLGA using data simulated from known graphs for various p and various sample sizes. We compare the GLGA to the Chordalysis approach of Petitjean et al. [6]. Petitjean et al. [6] demonstrate the advantages of the Chordalysis approach; however, the method can only return a decomposable model and it loses its strength for smaller sample sizes. To evaluate the performance of each algorithm, we use the
$$\text{sensitivity} = \frac{TP}{TP + FN}, \quad \text{and} \quad \text{specificity} = \frac{TN}{TN + FP},$$
where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. The sensitivity is the proportion of true edges correctly identified, and the specificity is the proportion of absent edges correctly identified. Each score is between 0 and 1, and a higher score implies better accuracy. In general, we aim for both scores to be 0.70 or above.
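For concreteness, these scores can be computed directly from adjacency matrices; a minimal sketch, assuming `A` is the true matrix and `Ahat` the fitted one, comparing upper triangular entries only:

```r
# Sketch: sensitivity and specificity of a fitted graph against the truth.
edge_scores <- function(A, Ahat) {
  up <- upper.tri(A)
  tp <- sum(A[up] == 1 & Ahat[up] == 1)
  fn <- sum(A[up] == 1 & Ahat[up] == 0)
  tn <- sum(A[up] == 0 & Ahat[up] == 0)
  fp <- sum(A[up] == 0 & Ahat[up] == 1)
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}
```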
For each graph, we give the average sensitivity and specificity scores, and the standard deviations (Sd.), over 20 runs, and compare the results to the Chordalysis algorithm. Chordalysis returns the same graph on every run, so we have no standard deviation to report for it. For each sample size, the first row of results is from the GLGA and the second row from Chordalysis. All of the graphs we simulate are non-decomposable.
In Table 3, Table 4 and Table 5, we have the results for p = 8, p = 100 with an edge density of 0.02, and p = 100 with an edge density of 0.05, respectively. We begin with the results for the graph with p = 8 in Figure 4, since this is the size of the subsets we take in the high-dimensional setting. We see that the GLGA has sensitivity and specificity scores over 0.7 for sample sizes of 5000 or more. For the smaller sample sizes, it does not give strong results; however, it is able to find the same number of edges as Chordalysis, or more.
In Figure 5, we generated a random graph with 100 vertices and a 0.02 probability of including each edge. This graph is relatively sparse, and many vertices are not connected by any edge. We use 600 subsets of p = 8 to obtain the fitted model. Table 4 shows that the GLGA performed well for sample sizes of 1000 or more. We note that the Chordalysis algorithm failed on the dataset with sample size 500,000 due to lack of memory, which is why its sensitivity score for that simulation is so low. Again, the GLGA loses accuracy for lower sample sizes; however, it can still find more edges than Chordalysis.
Figure 6 shows a randomly generated graph with 100 vertices and a 0.05 probability of including each edge. Even though this is a low probability, a graph with 100 vertices has 4950 possible edges, so there is a noticeable difference in density between this graph and the one in Figure 5. We again use 600 subsets of p = 8 to fit this model, and we see favourable results for sample sizes of 5000 and more. Chordalysis has difficulty finding edges for this denser graph; it tends to be conservative when adding edges, and thus has low sensitivity in this case for every sample size.
In general, the GLGA performs well for sample sizes of 5000 or more, and it outperforms Chordalysis for small sample sizes. Moreover, since Chordalysis must return a decomposable model, only the GLGA is able to select the true model. See Appendix A for further examples.

3.2. Application on a Real Data Set

In this section, we apply the GLGA to a real-world dataset: the Movies Dataset collected by TMDB and GroupLens (https://grouplens.org/datasets/movielens/latest/ (accessed on 5 April 2023)). The original dataset contains over 280,000 movie titles with reviews from over 50,000 individual viewers. The ratings are on a scale from 0 to 5 in intervals of 0.5. We selected 32 movies which were reviewed by the same 353 individuals; if an individual rated a movie 4 or higher, we encoded that observation as '1', meaning they like the movie, and if they rated it 3.5 or lower, we encoded it as '0', meaning they do not.
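A one-line R sketch of this encoding, assuming `ratings` is the 353 × 32 matrix of raw ratings on the 0–5 scale:

```r
likes <- (ratings >= 4) * 1L   # 1 = rated 4 or higher ("likes"), 0 otherwise
```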
Since the sample size is relatively small for this number of variables, and the purpose of the model is interpretation, we use conservative cut-off points and initial edge probabilities when generating the initial populations of submatrices. Edges with a Bayes factor value of 30 or over are initialized with probability 0.8, and edges with values of 15 or lower are initialized with probability 0.05. We used a histogram of the edgewise Bayes factors to decide these cut-offs. Moreover, since in our simulations we used 50 subsets for 20 variables, here we use 60 subsets. Figure 7 shows the graph representing the selected model, with the numeric labels given in the original dataset. A legend with the title, genre, and year of each movie is provided in Appendix B.
The graph in Figure 7 shows connections between movies which are liked by the same viewers. First, we notice that Bridge to Terabithia (1265) is the only movie not connected to any other movie in the graph. This is because it is the only family movie we considered; therefore, we do not expect it to have been viewed by the same demographic as the other movies. The movie with the most connections is Batman Returns (364). This movie is part of the well-known Batman franchise, so it was likely viewed by different demographics. It is considered an action movie, and it is connected to movies in the related genres of drama, thriller, and horror, as well as action. Batman Returns was directed by Tim Burton, who is known for his quirky gothic fantasy and horror style. We notice that Batman Returns is connected to Silent Hill (588) and Mothra vs. Godzilla (1682), which are both categorized as horror. It is also connected to Big Fish (587), a fantasy-drama also directed by Tim Burton.
This type of model can be used to guide movie recommendations based on the movies a user has previously viewed. The user would be recommended the movies connected to a movie they have viewed: if they liked the original movie, they can select one of the recommendations, and if they did not, they know to avoid them. For first-time users, the suggestion system should start with a movie with many connections, because such movies are likely to have been enjoyed by a diverse audience; for example, Batman Returns.

4. Conclusions

Graphical log-linear models are effective for modelling complex interactions between discrete variables; however, model selection for high-dimensional data is a difficult task. In this paper, we introduced the Graphical Local Genetic Algorithm, which extends the graphical genetic algorithm to the high-dimensional setting using the crossover-hill-climbing operator from Lozano et al. [11].
First, we successfully applied the GLGA to graphs of up to 20 variables; then we modified the GLGA by implementing the algorithm on subsets of eight variables and reconstructing the final model from the resulting subgraphs. We are able to fit datasets with up to 100 variables using the GLGA. Previously, the graphical genetic algorithm had only been implemented for graphs with a low number of variables, and many competing model selection methods are stepwise methods which rely on the properties of decomposable graphs. Our simulation results show that the GLGA is flexible: it can fit non-decomposable models with varying densities by taking advantage of the convenient properties of minimal triangulations. Moreover, we used the GLGA to analyse a real-world dataset containing movie reviews for 32 movies from 353 individuals. The resulting model exhibits valuable connections between movies, which can be used in a movie suggestion system.

Author Contributions

Writing—original draft, L.R.; Writing—review & editing, X.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by NSERC grants held by Gao.

Data Availability Statement

The Movies Dataset was collected by TMDB and GroupLens (https://grouplens.org/datasets/movielens/latest/) accessed on 5 April 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BDMCMC Birth-Death Markov Chain Monte Carlo
DY Diaconis-Ylvisaker
GA Genetic Algorithm
GLGA Graphical Local Genetic Algorithm

Appendix A

In Appendix A, we give the rest of our results, using datasets simulated from randomly generated non-decomposable graphs. In the low-dimensional case, we have $p \in \{6, 12, 20\}$, and in the high-dimensional case, we have $p \in \{20, 50\}$. Note that in both cases we use the same graph with 20 variables but different datasets, so the Chordalysis results differ between the two tables. We use 50 subsets of eight variables in the high-dimensional method for p = 20, and 300 subsets of eight for p = 50. In each table, the rows labelled GLGA give the results using the GLGA and the rows labelled Chordalysis those using Chordalysis.
Figure A1. True non-decomposable graph with p = 6.
Table A1. Results from simulated dataset with p = 6, comparing the GLGA and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.63 | 0.0707 | 1.00 | 0.0000
N = 100 | Chordalysis | 0.20 | - | 1.00 | -
N = 500 | GLGA | 0.50 | 0.0000 | 1.00 | 0.0000
N = 500 | Chordalysis | 0.50 | - | 1.00 | -
N = 1000 | GLGA | 0.60 | 0.0000 | 1.00 | 0.0000
N = 1000 | Chordalysis | 0.60 | - | 1.00 | -
N = 5000 | GLGA | 0.78 | 0.0167 | 0.96 | 0.0179
N = 5000 | Chordalysis | 0.60 | - | 1.00 | -
N = 10,000 | GLGA | 0.80 | 0.0000 | 0.80 | 0.0000
N = 10,000 | Chordalysis | 0.80 | - | 0.80 | -
N = 50,000 | GLGA | 0.80 | 0.0000 | 1.00 | 0.0000
N = 50,000 | Chordalysis | 0.90 | - | 0.60 | -
N = 100,000 | GLGA | 0.96 | 0.0110 | 0.92 | 0.0219
N = 100,000 | Chordalysis | 1.00 | - | 0.60 | -
N = 500,000 | GLGA | 1.00 | 0.0000 | 0.92 | 0.0219
N = 500,000 | Chordalysis | 1.00 | - | 0.20 | -
Figure A2. True non-decomposable graph with p = 12.
Table A2. Results from simulated dataset with p = 12, comparing the GLGA and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.57 | 0.0263 | 0.85 | 0.0000
N = 100 | Chordalysis | 0.26 | - | 0.98 | -
N = 500 | GLGA | 0.77 | 0.0141 | 0.91 | 0.0049
N = 500 | Chordalysis | 0.53 | - | 1.00 | -
N = 1000 | GLGA | 0.72 | 0.0288 | 0.96 | 0.0000
N = 1000 | Chordalysis | 0.58 | - | 0.98 | -
N = 5000 | GLGA | 0.78 | 0.0047 | 0.96 | 0.0000
N = 5000 | Chordalysis | 0.68 | - | 0.98 | -
N = 10,000 | GLGA | 0.67 | 0.0047 | 0.98 | 0.0000
N = 10,000 | Chordalysis | 0.63 | - | 0.98 | -
N = 50,000 | GLGA | 0.89 | 0.0000 | 0.98 | 0.0000
N = 50,000 | Chordalysis | 0.68 | - | 0.98 | -
N = 100,000 | GLGA | 0.75 | 0.0115 | 0.97 | 0.0055
N = 100,000 | Chordalysis | 0.79 | - | 0.94 | -
N = 500,000 | GLGA | 0.91 | 0.0047 | 0.79 | 0.0019
N = 500,000 | Chordalysis | 0.89 | - | 0.85 | -
Figure A3. True non-decomposable graph with p = 20.
Table A3. Results from simulated dataset with p = 20, comparing the GLGA and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.65 | 0.0716 | 0.75 | 0.0670
N = 100 | Chordalysis | 0.24 | - | 1.00 | -
N = 500 | GLGA | 0.61 | 0.0161 | 0.97 | 0.0000
N = 500 | Chordalysis | 0.41 | - | 1.00 | -
N = 1000 | GLGA | 0.75 | 0.0161 | 0.96 | 0.0035
N = 1000 | Chordalysis | 0.56 | - | 1.00 | -
N = 5000 | GLGA | 0.82 | 0.0110 | 0.90 | 0.0021
N = 5000 | Chordalysis | 0.59 | - | 1.00 | -
N = 10,000 | GLGA | 0.70 | 0.0026 | 0.96 | 0.0006
N = 10,000 | Chordalysis | 0.53 | - | 1.00 | -
N = 50,000 | GLGA | 0.82 | 0.0026 | 0.94 | 0.0006
N = 50,000 | Chordalysis | 0.76 | - | 0.98 | -
N = 100,000 | GLGA | 0.73 | 0.0026 | 0.96 | 0.0007
N = 100,000 | Chordalysis | 0.71 | - | 0.98 | -
N = 500,000 | GLGA | 0.91 | 0.0000 | 0.93 | 0.0006
N = 500,000 | Chordalysis | 0.74 | - | 0.92 | -
Table A4. Results from simulated dataset with p = 20, comparing the GLGA with 50 subsets of eight variables and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.84 | 0.0572 | 0.68 | 0.0262
N = 100 | Chordalysis | 0.09 | - | 1.00 | -
N = 500 | GLGA | 0.80 | 0.0147 | 0.70 | 0.037
N = 500 | Chordalysis | 0.29 | - | 1.00 | -
N = 1000 | GLGA | 0.89 | 0.0449 | 0.77 | 0.0482
N = 1000 | Chordalysis | 0.56 | - | 1.00 | -
N = 5000 | GLGA | 0.79 | 0.0417 | 0.83 | 0.0154
N = 5000 | Chordalysis | 0.53 | - | 0.99 | -
N = 10,000 | GLGA | 0.71 | 0.0322 | 0.94 | 0.0197
N = 10,000 | Chordalysis | 0.56 | - | 1.00 | -
N = 50,000 | GLGA | 0.85 | 0.0000 | 0.91 | 0.0059
N = 50,000 | Chordalysis | 0.65 | - | 0.99 | -
N = 100,000 | GLGA | 0.75 | 0.0573 | 0.95 | 0.0187
N = 100,000 | Chordalysis | 0.68 | - | 0.97 | -
N = 500,000 | GLGA | 0.78 | 0.0332 | 0.91 | 0.0199
N = 500,000 | Chordalysis | 0.79 | - | 0.94 | -
Figure A4. True non-decomposable graph with p = 50.
Table A5. Results from simulated dataset with p = 50, comparing the GLGA with 300 subsets of eight variables and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.61 | 0.0153 | 0.63 | 0.0035
N = 100 | Chordalysis | 0.053 | - | 1.00 | -
N = 500 | GLGA | 0.60 | 0.0220 | 0.69 | 0.0031
N = 500 | Chordalysis | 0.31 | - | 1.00 | -
N = 1000 | GLGA | 0.76 | 0.0229 | 0.83 | 0.0032
N = 1000 | Chordalysis | 0.37 | - | 1.00 | -
N = 5000 | GLGA | 0.73 | 0.0236 | 0.75 | 0.0019
N = 5000 | Chordalysis | 0.42 | - | 0.99 | -
N = 10,000 | GLGA | 0.61 | 0.0289 | 0.93 | 0.0050
N = 10,000 | Chordalysis | 0.48 | - | 0.99 | -
N = 50,000 | GLGA | 0.81 | 0.0165 | 0.91 | 0.0071
N = 50,000 | Chordalysis | 0.47 | - | 0.98 | -
N = 100,000 | GLGA | 0.62 | 0.0116 | 0.98 | 0.0036
N = 100,000 | Chordalysis | 0.52 | - | 0.98 | -
N = 500,000 | GLGA | 0.77 | 0.0137 | 0.92 | 0.0104
N = 500,000 | Chordalysis | 0.56 | - | 0.94 | -

Appendix B

Table A6. Legend for the labels of the movies in Figure 7.

Label | Movie Title | Genre | Year
111 | Scarface | Crime/Drama | 1983
153 | Lost in Translation | Romance/Drama | 2003
165 | Back to the Future Part II | Sci-fi/Comedy | 1989
231 | Syriana | Drama/Political Thriller | 2005
293 | A River Runs Through It | Drama | 1992
296 | Terminator 3: Rise of the Machines | Action/Sci-fi | 2003
318 | The Million Dollar Hotel | Drama/Mystery | 2000
364 | Batman Returns | Action/Adventure | 1992
377 | Nightmare on Elm Street | Horror/Mystery | 1984
380 | Rain Man | Drama | 1988
480 | Monsoon Wedding | Comedy/Drama/Romance | 2001
500 | Reservoir Dogs | Action/Adventure | 1992
586 | Wag the Dog | Comedy/Political Cinema | 1997
587 | Big Fish | Drama/Fantasy | 2003
588 | Silent Hill | Supernatural/Horror | 2006
590 | The Hours | Drama/Romance | 2002
593 | Solaris | Sci-fi/Drama/Mystery | 1972
595 | To Kill a Mockingbird | Drama/Mystery | 1962
597 | Titanic | Romance/Drama | 1997
608 | Men in Black II | Sci-fi/Action | 2002
648 | Beauty and the Beast | Fantasy/Romance | 1946
780 | The Passion of Joan of Arc | Drama/Silent | 1928
858 | Sleepless in Seattle | Romance/Comedy | 1993
1073 | Arlington Road | Thriller/Crime | 1999
1089 | Point Break | Action/Crime | 1991
1213 | The Talented Mr. Ripley | Thriller/Drama | 1999
1265 | Bridge to Terabithia | Family/Fantasy | 2007
1682 | Mothra vs. Godzilla | Sci-fi/Horror | 1964
1721 | All the Way Boys | Action/Comedy | 1972
2959 | License to Wed | Romance/Comedy | 2007
4993 | 5 Card Stud | Western/Drama | 1968
4995 | Boogie Nights | Comedy/Drama | 1997

References

  1. Lauritzen, S.L. Graphical Models; Oxford University Press: Oxford, UK, 1996. [Google Scholar]
  2. Gauraha, N. Model Selection for Graphical Log-Linear Models: A Forward Model Selection Algorithm based on Mutual Conditional Independence. arXiv 2016, arXiv:1603.03719. [Google Scholar]
  3. Gauraha, N.; Parui, S.K. Mutual Conditional Independence and its Applications to Model Selection in Markov Networks. Ann. Math. Artif. Intell. 2020, 88, 951–972. [Google Scholar] [CrossRef]
  4. Allen, G.I.; Liu, Z. A Log-Linear Graphical Model for Inferring Genetic Networks from High-Throughput Sequencing Data. In Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, Philadelphia, PA, USA, 4–7 October 2012; pp. 1–6. [Google Scholar]
  5. Dahinden, C.; Kalisch, M.; Bühlmann, P. Decomposition and Model Selection for Large Contingency Tables. Biom. J. 2010, 25, 233–252. [Google Scholar] [CrossRef] [PubMed]
  6. Petitjean, F.; Webb, G.I.; Nicholson, A.E. Scaling Log-Linear Analysis to High-Dimensional Data. In Proceedings of the IEEE 13th International Conference on Data Mining, Dallas, TX, USA, 7–10 December 2013; pp. 597–606. [Google Scholar]
  7. Dobra, A.; Mohammadi, A. Loglinear Model Selection and Human Mobility. Ann. Appl. Stat. 2018, 12, 815–845. [Google Scholar] [CrossRef]
  8. Holland, J.H. Adaptation in Natural and Artificial Systems; University of Michigan Press: Ann Arbor, MI, USA, 1975. [Google Scholar]
  9. Poli, I.; Roverato, A. A Genetic Algorithm for Model Selection. J. Ital. Stat. Soc. 1998, 7, 197–208. [Google Scholar] [CrossRef]
  10. Blauth, A.; Pigeot, I. Using Genetic Algorithms for Model Selection in Graphical Models; Collaborative Research Center 386, Discussion Paper 278; LMU: München, Germany, 2002. [Google Scholar]
  11. Lozano, M.; Herrera, F.; Krasnogor, N.; Molina, D. Real-Coded Memetic Algorithms with Crossover Hill-Climbing. Evol. Comput. 2004, 12, 273–302. [Google Scholar] [CrossRef] [PubMed]
  12. García-Martínez, C.; Lozano, M.; Molina, D. A Local Genetic Algorithm for Binary-Coded Problems. In Parallel Problem Solving from Nature; Runarsson, T.P., Beyer, H.-G., Burke, E., Merelo-Guervós, J.J., Whitley, L.D., Yao, X., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 192–201. [Google Scholar]
  13. Diaconis, P.; Ylvisaker, D. Conjugate Priors for Exponential Families. Ann. Stat. 1979, 7, 269–281. [Google Scholar] [CrossRef]
  14. Dawid, A.P.; Lauritzen, S.L. Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models. Ann. Stat. 1993, 21, 1272–1317. [Google Scholar] [CrossRef]
  15. Givens, G.H.; Hoeting, J.A. Genetic Algorithms. In Computational Statistics; Giudici, P., Givens, G.H., Mallick, B.K., Eds.; John Wiley & Sons Inc. Publication: Hoboken, NJ, USA, 2013; pp. 75–84. [Google Scholar]
Figure 1. The smallest non-decomposable graph and its triangulations.
Figure 2. A visual representation of the graph $G_1$ and the corresponding adjacency matrix.
Figure 3. Example of creating two offspring with one cut-point in the crossover step.
Figure 4. True non-decomposable graph with p = 8.
Figure 5. True non-decomposable graph with p = 100.
Figure 6. True non-decomposable graph with p = 100.
Figure 7. Graphical representation of the model selected for the Movies Dataset.
Table 1. Pseudocode for the crossover-hill-climbing step.

Crossover-hill-climbing($p_1$, $p_2$, $n_{off}$, $n_t$)
  • Select parents $p_1$ and $p_2$.
  • Repeat $n_t$ times:
    (a) Generate $n_{off}$ offspring by performing local crossover on $p_1$ and $p_2$.
    (b) Evaluate the fitness of each of the $n_{off}$ offspring.
    (c) Find the offspring with the highest fitness value, $o_{best}$.
    (d) If $o_{best}$ is better than the parent with the lowest fitness score (either $p_1$ or $p_2$), then replace that parent with $o_{best}$.
    (e) If $p_1 = p_2$, then exit the iteration.
  • Return the parent with the highest fitness score (either $p_1$ or $p_2$).
Table 2. Pseudocode for the graphical local genetic algorithm.

Graphical Local Genetic Algorithm
  • Initialize a population of $m_1$ matrices.
  • Repeat $n_{it}$ times:
    (a) Evaluate the fitness of the current population and select the best two matrices from the current population as parent models.
    (b) Perform one-point global crossover x times and mutation to produce 2x offspring.
    (c) Evaluate the fitness of the offspring and find the best offspring, $o_{best}$.
    (d) If the best offspring is better than the worst individual in the general population, then:
      • Find the best parent, $c_{best}$, and perform Crossover-hill-climbing($c_{best}$, $o_{best}$, 2x, $n_t$).
      • Once the termination condition is met, if the final best parent is better than the best individual in the general population, replace the worst individual with the result of the crossover-hill-climbing step.
    (e) Generate $m_2$ new random matrices to diversify the general population, and continue to the next iteration.
Table 3. Results from simulated dataset with p = 8, comparing the GLGA and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.50 | 0.0000 | 0.86 | 0.0000
N = 100 | Chordalysis | 0.07 | - | 1.00 | -
N = 500 | GLGA | 0.65 | 0.0000 | 1.00 | 0.0000
N = 500 | Chordalysis | 0.50 | - | 1.00 | -
N = 1000 | GLGA | 0.50 | 0.0000 | 1.00 | 0.0000
N = 1000 | Chordalysis | 0.50 | - | 1.00 | -
N = 5000 | GLGA | 0.71 | 0.0000 | 0.93 | 0.0175
N = 5000 | Chordalysis | 0.64 | - | 1.00 | -
N = 10,000 | GLGA | 0.77 | 0.0064 | 0.94 | 0.0156
N = 10,000 | Chordalysis | 0.79 | - | 0.86 | -
N = 50,000 | GLGA | 0.83 | 0.0078 | 0.81 | 0.0239
N = 50,000 | Chordalysis | 1.00 | - | 0.57 | -
N = 100,000 | GLGA | 0.83 | 0.0217 | 0.87 | 0.0120
N = 100,000 | Chordalysis | 0.86 | - | 0.71 | -
N = 500,000 | GLGA | 0.97 | 0.0078 | 0.81 | 0.0128
N = 500,000 | Chordalysis | 1.00 | - | 0.50 | -
Table 4. Results from simulated dataset with p = 100 and edge probability 0.02, comparing the GLGA with 600 subsets of eight variables and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.41 | 0.0306 | 0.80 | 0.0024
N = 100 | Chordalysis | 0.04 | - | 1.00 | -
N = 500 | GLGA | 0.66 | 0.0222 | 0.92 | 0.0040
N = 500 | Chordalysis | 0.36 | - | 1.00 | -
N = 1000 | GLGA | 0.78 | 0.0243 | 0.90 | 0.0008
N = 1000 | Chordalysis | 0.60 | - | 1.00 | -
N = 5000 | GLGA | 0.82 | 0.0231 | 0.98 | 0.0007
N = 5000 | Chordalysis | 0.72 | - | 1.00 | -
N = 10,000 | GLGA | 0.78 | 0.0523 | 0.98 | 0.0069
N = 10,000 | Chordalysis | 0.77 | - | 1.00 | -
N = 50,000 | GLGA | 0.88 | 0.0206 | 0.98 | 0.0020
N = 50,000 | Chordalysis | 0.84 | - | 0.99 | -
N = 100,000 | GLGA | 0.80 | 0.0200 | 0.99 | 0.0007
N = 100,000 | Chordalysis | 0.84 | - | 1.00 | -
N = 500,000 | GLGA | 0.88 | 0.0230 | 0.99 | 0.0007
N = 500,000 | Chordalysis | 0.03 | - | 0.98 | -
Table 5. Results from simulated dataset with p = 100 and edge probability 0.05, comparing the GLGA with 600 subsets of eight variables and Chordalysis.

Sample Size | Method | Sensitivity | Sd. | Specificity | Sd.
N = 100 | GLGA | 0.33 | 0.0050 | 0.88 | 0.0056
N = 100 | Chordalysis | 0.06 | - | 0.99 | -
N = 500 | GLGA | 0.53 | 0.0291 | 0.94 | 0.0026
N = 500 | Chordalysis | 0.28 | - | 1.00 | -
N = 1000 | GLGA | 0.61 | 0.0116 | 0.94 | 0.0011
N = 1000 | Chordalysis | 0.38 | - | 1.00 | -
N = 5000 | GLGA | 0.71 | 0.0470 | 0.95 | 0.0077
N = 5000 | Chordalysis | 0.44 | - | 0.99 | -
N = 10,000 | GLGA | 0.79 | 0.0231 | 0.94 | 0.0010
N = 10,000 | Chordalysis | 0.43 | - | 1.00 | -
N = 50,000 | GLGA | 0.82 | 0.0126 | 0.94 | 0.0035
N = 50,000 | Chordalysis | 0.47 | - | 0.99 | -
N = 100,000 | GLGA | 0.72 | 0.0124 | 0.99 | 0.0010
N = 100,000 | Chordalysis | 0.46 | - | 0.99 | -
N = 500,000 | GLGA | 0.75 | 0.0163 | 0.97 | 0.0024
N = 500,000 | Chordalysis | 0.46 | - | 0.99 | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
