# COVID-19 Data Analysis with a Multi-Objective Evolutionary Algorithm for Causal Association Rule Mining

## Abstract

## 1. Introduction

## 2. Background and Basic Concepts

#### 2.1. Association Rule Mining

#### 2.1.1. Absolute Risk (AR)

#### 2.1.2. Probability of Sufficiency (PS)

#### 2.1.3. Population Attributable Fraction (PAF)

#### 2.2. Discrete Multi-Objective Optimization Problems

## 3. Related Previous Works

## 4. Proposal

#### 4.1. Data Preparation

- Uniform support: Each selector had approximately the same support, which was 20%.
- Reduced impact of outliers: Quantile-based discretization accumulates outliers in two ranges, assigning very low values to the first quintile and very high values to the last quintile.
- Pareto principle, or the 80-20 rule: This empirical principle states that 80% of the incidence of a factor is attributable to 20% of the observations [21].
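The quantile-based discretization described above can be sketched in a few lines of Python. This is a minimal illustration, assuming equal-frequency binning into five intervals; the function name `quintile_bins` and the choice of the `inclusive` quantile method are assumptions made for the example, not details taken from the study.

```python
from bisect import bisect_right
from statistics import quantiles


def quintile_bins(values):
    """Return the four quintile cut points and the bin index (0-4) of each value.

    Each bin plays the role of one selector with roughly 20% support.
    """
    cuts = quantiles(values, n=5, method="inclusive")  # four cut points
    # bisect_right counts how many cut points are <= v, which is the bin index.
    bins = [bisect_right(cuts, v) for v in values]
    return cuts, bins


# Hypothetical attribute with 100 distinct values.
ages = list(range(1, 101))
cuts, bins = quintile_bins(ages)
support = {b: bins.count(b) / len(bins) for b in sorted(set(bins))}
print(cuts)
print(support)
```

Because the cut points depend only on ranks, an extreme value always falls into the first or last bin instead of stretching the interval boundaries.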

#### 4.2. Group Modeling

#### 4.3. Query Definition by Optimization Problem

- Support greater than zero: The association rule must be true for at least one record. In the formal definition of optimization problems, this is stated as $supp(A\to C)>0$.
- Absolute positive effect: An absolute risk greater than zero indicates that observing the antecedent increases the probability of observing the consequent; rules in which the antecedent inhibits the consequent are therefore rejected. Formally, in the optimization problems, this is stated as $AR(A\to C)>0$.
- Statistical significance: The odds ratio must be statistically significant; that is, the lower bound of its 95% confidence interval must be greater than or equal to one. Formally, in the optimization problems, this is stated as $C{I}_{inf}^{OR}(A\to C)\ge 1$.
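The three constraints above can be checked from the four cells of a 2×2 contingency table. The sketch below is only illustrative: it assumes the usual Wald (log-based) interval for the odds ratio, which may differ from the interval used in the study, and the function names and the handling of empty cells are hypothetical.

```python
import math


def odds_ratio_ci_lower(a, b, c, d, z=1.96):
    """Lower bound of the Wald 95% confidence interval of the odds ratio.

    a: records with A and C      b: records with A and not C
    c: records with not A and C  d: records with neither
    """
    log_or = math.log((a * d) / (b * c))
    std_err = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * std_err)


def is_feasible(a, b, c, d):
    """Check supp(A->C) > 0, AR(A->C) > 0 and CI_inf^OR(A->C) >= 1."""
    if a == 0:
        return False  # support is zero
    if min(b, c, d) == 0:
        # Assumption of this sketch: the Wald interval is undefined with an
        # empty cell, so such rules are simply treated as infeasible.
        return False
    absolute_risk = a / (a + b) - c / (c + d)
    return absolute_risk > 0 and odds_ratio_ci_lower(a, b, c, d) >= 1


# Hypothetical counts: the antecedent raises the probability of the consequent.
print(is_feasible(80, 120, 70, 730))
# Hypothetical counts: the antecedent inhibits the consequent.
print(is_feasible(10, 190, 200, 600))
```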

**DMOP-1**. Classic association rule mining aims to obtain rules with the highest possible support, confidence, and lift. However, simultaneously optimizing these measures is impossible because a sustained increase or decrease in one does not guarantee behavior in the same direction in the other two. The formal definition of the optimization problem for this query is described by Equation (10):

**DMOP-2**. From a logical perspective, a biconditional expression $(A\leftrightarrow C)$ can be interpreted as “A if and only if C” or “A is a necessary and sufficient condition for C”. Its truth value is equivalent to that of the expression $(A\to C)\wedge (C\to A)$. The sufficiency condition corresponds to the association rule $A\to C$, interpreted as “A is a sufficient condition for C”, and is considered satisfied if the causal effect of $A\to C$ is large enough. To satisfy the necessity condition of the biconditional expression, the causal effect of the reciprocal rule $C\to A$ must be considered as well (see Equation (11)):

**DMOP-3**. In this problem, we seek the rules that maximize the probability of sufficiency, a measure that quantifies the capacity of the antecedent to produce the consequent, and the population attributable fraction, a measure that indicates the proportion of observations of the consequent that were caused by the antecedent (see Equation (12)):
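The objective measures named in DMOP-1 to DMOP-3 can all be derived from simple record counts. The following Python sketch assumes the standard textbook definitions (absolute risk as a difference of conditional probabilities, probability of sufficiency as $AR/(1-P(C|\neg A))$, and population attributable fraction as $(P(C)-P(C|\neg A))/P(C)$); the function and key names are hypothetical, and the exact estimators used in the study may differ.

```python
from typing import Dict, Tuple


def rule_measures(n_ac: int, n_a: int, n_c: int, n: int) -> Dict[str, float]:
    """Measures of the rule A -> C from counts.

    n_ac: records satisfying A and C, n_a: records satisfying A,
    n_c: records satisfying C, n: all records.
    Assumes 0 < n_a < n, 0 < n_c < n and P(C | not A) < 1.
    """
    p_c = n_c / n
    p_c_given_a = n_ac / n_a
    p_c_given_not_a = (n_c - n_ac) / (n - n_a)
    p_a_given_c = n_ac / n_c
    p_a_given_not_c = (n_a - n_ac) / (n - n_c)
    absolute_risk = p_c_given_a - p_c_given_not_a
    return {
        "support": n_ac / n,
        "confidence": p_c_given_a,
        "lift": p_c_given_a / p_c,
        "absolute_risk": absolute_risk,
        # Absolute risk of the reciprocal rule C -> A.
        "reciprocal_absolute_risk": p_a_given_c - p_a_given_not_c,
        "prob_sufficiency": absolute_risk / (1 - p_c_given_not_a),
        "attributable_fraction": (p_c - p_c_given_not_a) / p_c,
    }


def objective_vectors(m: Dict[str, float]) -> Dict[str, Tuple[float, ...]]:
    """Group the measures into the objective vector of each problem."""
    return {
        "DMOP-1": (m["support"], m["confidence"], m["lift"]),
        "DMOP-2": (m["absolute_risk"], m["reciprocal_absolute_risk"]),
        "DMOP-3": (m["prob_sufficiency"], m["attributable_fraction"]),
    }


# Hypothetical counts, not taken from the data set.
measures = rule_measures(n_ac=80, n_a=200, n_c=150, n=1000)
for problem, vector in objective_vectors(measures).items():
    print(problem, [round(v, 4) for v in vector])
```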

#### 4.4. Evolution Proposal and Heuristically Guided Mining

- The non-dominated sorting of solutions based on the Pareto dominance concept assigns a ranking to the non-dominated members of the population.
- A crowding distance strategy for assessing the density of individuals surrounding a particular solution allows for preserving a better population diversity.
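Both mechanisms can be illustrated with a short Python sketch for maximization problems. For brevity it peels fronts one at a time rather than using the faster bookkeeping of the original NSGA-II procedure, so it should be read as a didactic approximation; all names are hypothetical.

```python
def dominates(p, q):
    """True if p is at least as good as q everywhere and better somewhere."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))


def non_dominated_sort(points):
    """Split point indices into successive Pareto fronts (front 0 is best)."""
    remaining = set(range(len(points)))
    fronts = []
    while remaining:
        front = sorted(
            i for i in remaining
            if not any(dominates(points[j], points[i]) for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts


def crowding_distance(points, front):
    """Crowding distance of every index in one front; larger means less crowded."""
    distance = {i: 0.0 for i in front}
    for m in range(len(points[front[0]])):
        ordered = sorted(front, key=lambda i: points[i][m])
        low, high = points[ordered[0]][m], points[ordered[-1]][m]
        # Boundary solutions are always preserved.
        distance[ordered[0]] = distance[ordered[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, len(ordered) - 1):
            gap = points[ordered[k + 1]][m] - points[ordered[k - 1]][m]
            distance[ordered[k]] += gap / (high - low)
    return distance


# Hypothetical two-objective values for five rules.
rules = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9), (0.4, 0.4), (0.2, 0.2)]
fronts = non_dominated_sort(rules)
print(fronts)
print(crowding_distance(rules, fronts[0]))
```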

#### Rule Evolution

- The stop criterion for the evolution process is triggered after 100 generations without improvement in the fitness value of the fittest rule.
- Recombination: This generates new individuals (new rules) from a pair of previously known rules ${A}_{1}\to {C}_{1}$ and ${A}_{2}\to {C}_{2}$, referred to as the ancestor rules. Four new individuals are created using the following recombination modalities:
  - Interchange: The antecedent and consequent from the ancestor rules are interchanged. Two new individuals are created: ${A}_{1}\to {C}_{2}$ and ${A}_{2}\to {C}_{1}$.
  - Set operations: The union (∪), intersection (∩), and symmetric difference (∆) operators are applied to the sets of selectors in the antecedent and consequent of the ancestor rules. For each one of those set operators, the antecedent of new rules results from applying the operator on sets ${A}_{1}$ and ${A}_{2}$, and the consequent results from applying the same operator on sets ${C}_{1}$ and ${C}_{2}$. Selectors with repeated attributes are pruned, as well as all cases that result in an empty antecedent or consequent. At most, three new individuals are created with this recombination process.

- Mutation: Each new rule generated by any recombination method is subjected to either an extension or a contraction transformation to introduce variability into the population. The extension randomly adds a new selector not previously present in the rule, while the contraction randomly prunes a selector from the rule.
- Elitism: The non-dominated sorting and crowding distance methods used by NSGA-II [22] are adopted to select the fittest rule and preserve the diversity of the population.
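The recombination and mutation operators listed above can be sketched as follows. Rules are modeled as pairs of frozen sets of `(attribute, value)` selectors. Several details are assumptions of this sketch rather than facts from the source: the pruning policy for repeated attributes (keep the first selector in sorted order), the random choice of which side of the rule to mutate, and the refusal to contract a side that has a single selector.

```python
import random


def prune_repeated(selectors):
    """Keep one selector per attribute (assumed policy: first in sorted order)."""
    kept = {}
    for attribute, value in sorted(selectors):
        kept.setdefault(attribute, value)
    return frozenset(kept.items())


def interchange(rule1, rule2):
    """Swap consequents: A1 -> C2 and A2 -> C1."""
    (a1, c1), (a2, c2) = rule1, rule2
    return [(a1, c2), (a2, c1)]


def set_operations(rule1, rule2):
    """Apply union, intersection and symmetric difference to both rule sides."""
    (a1, c1), (a2, c2) = rule1, rule2
    children = []
    for op in (frozenset.union, frozenset.intersection,
               frozenset.symmetric_difference):
        antecedent = prune_repeated(op(a1, a2))
        consequent = prune_repeated(op(c1, c2))
        if antecedent and consequent:  # discard rules with an empty side
            children.append((antecedent, consequent))
    return children


def mutate(rule, pools, rng):
    """Extend or contract one side of the rule.

    pools: (antecedent selector pool, consequent selector pool).
    If the chosen transformation is not possible, the rule is left unchanged.
    """
    sides = [set(rule[0]), set(rule[1])]
    side = rng.randrange(2)
    used = {attribute for attribute, _ in sides[side]}
    candidates = sorted(s for s in pools[side] if s[0] not in used)
    if rng.random() < 0.5 and candidates:
        sides[side].add(rng.choice(candidates))                 # extension
    elif len(sides[side]) > 1:
        sides[side].remove(rng.choice(sorted(sides[side])))     # contraction
    return frozenset(sides[0]), frozenset(sides[1])


# Hypothetical ancestor rules and selector pools.
r1 = (frozenset({("AGE", ">53")}),
      frozenset({("DIABETES", "yes"), ("HYPERTENSION", "yes")}))
r2 = (frozenset({("GENDER", "F")}),
      frozenset({("HYPERTENSION", "yes"), ("PNEUMONIA", "yes")}))
pools = ([("AGE", ">53"), ("GENDER", "F"), ("GENDER", "M")],
         [("DIABETES", "yes"), ("HYPERTENSION", "yes"), ("PNEUMONIA", "yes")])
rng = random.Random(0)
offspring = [mutate(child, pools, rng)
             for child in interchange(r1, r2) + set_operations(r1, r2)]
print(len(offspring))
```

In this example the intersection of the two antecedents is empty, so that candidate is discarded and the set operations contribute two children instead of three.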

## 5. Experiments and Results

## 6. Conclusions

- Design and testing of a new evolutionary algorithm for association rule mining with enough flexibility to integrate domain knowledge in order to solve single-objective and multi-objective association rule mining problems;
- The inclusion of a causal model to restate the semantics of the search process by providing a measure of the actionability of mined rules;
- The inclusion of a set of proposed crossover and mutation operators into the mining process.

- Extending the evolution process with logical expressions;
- Incorporating a target group discovery algorithm;
- Considering the opposite optimization criteria to generate interesting rules;
- Including the proposed algorithm in other case studies.

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

1. Sohrabi, C.; Alsafi, Z.; O’Neill, N.; Khan, M.; Kerwan, A.; Al-Jabir, A.; Iosifidis, C.; Agha, R. World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19). Int. J. Surg. **2020**, 76, 71–76.
2. López, L.; Rodó, X. The end of social confinement and COVID-19 re-emergence risk. Nat. Hum. Behav. **2020**, 4, 746–755.
3. Telikani, A.; Gandomi, A.H.; Shahbahrami, A. A survey of evolutionary computation for association rule mining. Inf. Sci. **2020**, 524, 318–352.
4. De Salud, S. COVID-19 Pandemic Data Set from Mexico. 2022. Available online: https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico (accessed on 7 December 2022).
5. Fürnkranz, J.; Gamberger, D.; Lavrač, N. Foundations of Rule Learning; Cognitive Technologies; Springer: Berlin/Heidelberg, Germany, 2012.
6. Pearl, J.; Mackenzie, D. The Book of Why: The New Science of Cause and Effect; Basic Books: New York, NY, USA, 2018.
7. Pearl, J. Causality: Models, Reasoning, and Inference; Cambridge University Press: Cambridge, UK; New York, NY, USA, 2000.
8. Hernán, M.A.; Robins, J.M. Causal Inference: What If; CRC Press: Boca Raton, FL, USA, 2020; p. 311.
9. Mansournia, M.A.; Altman, D.G. Population attributable fraction. BMJ **2018**, 360, k757.
10. Miettinen, O.S. Proportion of disease caused or prevented by a given exposure, trait or intervention. Am. J. Epidemiol. **1974**, 99, 325–332.
11. Cortés-Martínez, K.V.; Estrada-Esquivel, H.; Martínez-Rebollar, A.; Hernández-Pérez, Y.; Ortiz-Hernández, J. The State of the Art of Data Mining Algorithms for Predicting the COVID-19 Pandemic. Axioms **2022**, 11, 242.
12. Flora, J.; Khan, W.; Jin, J.; Jin, D.; Hussain, A.; Dajani, K.; Khan, B. Usefulness of Vaccine Adverse Event Reporting System for Machine-Learning Based Vaccine Research: A Case Study for COVID-19 Vaccines. Int. J. Mol. Sci. **2022**, 23, 8235.
13. Shan, Z.; Miao, W. COVID-19 patient diagnosis and treatment data mining algorithm based on association rules. Expert Syst. **2021**, e12814.
14. Wasiq, K.; Abir, H.; Ahmed, K.S.; Mohammed, A.J.; Raheel, N.; Panos, L. Analysing the impact of global demographic characteristics over the COVID-19 spread using class rule mining and pattern matching. R. Soc. **2021**, 8, 201823.
15. Tandan, M.; Acharya, Y.; Pokharel, S.; Timilsina, M. Discovering symptom patterns of COVID-19 patients using association rule mining. Comput. Biol. Med. **2021**, 131, 104249.
16. Wakabi-Waiswa, P.P.; Baryamureeba, V. Extraction of interesting association rules using genetic algorithms. Adv. Syst. Model. ICT Appl. **2007**. Available online: https://www.researchgate.net/publication/255610299_Extraction_of_interesting_association_rules_using_genetic_algorithms (accessed on 1 November 2022).
17. Anand, R.; Vaid, A.; Singh, P.K. Association rule mining using multi-objective evolutionary algorithms: Strengths and challenges. In Proceedings of the 2009 World Congress on Nature and Biologically Inspired Computing (NaBIC), Coimbatore, India, 9–11 December 2009; pp. 385–390.
18. Martín, D.; Rosete, A.; Alcalá-Fdez, J.; Herrera, F. A multi-objective evolutionary algorithm for mining quantitative association rules. In Proceedings of the 2011 11th International Conference on Intelligent Systems Design and Applications, Cordoba, Spain, 22–24 November 2011; pp. 1397–1402.
19. Luna, J.M.; Cano, A.; Ventura, S. Genetic Programming for Mining Association Rules in Relational Database Environments. In Handbook of Genetic Programming Applications; Springer International Publishing: Cham, Switzerland, 2015; pp. 431–450.
20. Elhilbawi, H.; Eldawlatly, S.; Mahdi, H. The Importance of Discretization Methods in Machine Learning Applications: A Case Study of Predicting ICU Mortality. In Advanced Machine Learning Technologies and Applications; Chang, K.C., Hassanien, A.E., Mincong, T., Eds.; Springer International Publishing: Cham, Switzerland, 2021; Volume 1339, pp. 214–224.
21. Tanabe, K. Pareto’s 80/20 rule and the Gaussian distribution. Phys. A Stat. Mech. Its Appl. **2018**, 510, 635–640.
22. Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. **2002**, 6, 182–197.
23. Shanu, V.; Millie, P.; Vaclav, S. A Comprehensive Review on NSGA-II for Multi-Objective Combinatorial Optimization Problems. IEEE Access **2021**, 9, 57757–57791.

**Figure 2.** Obtained Pareto fronts of DMOP-2 (absolute risk and reciprocal absolute risk) and DMOP-3 (probability of sufficiency and population attributable fraction) for all waves in Scenario A. The antecedent group is age and gender, and the consequent group is comorbidities.

**Figure 3.** Obtained Pareto fronts of DMOP-2 (absolute risk and reciprocal absolute risk) and DMOP-3 (probability of sufficiency and population attributable fraction) for all waves in Scenario B. The antecedent group is comorbidities, and the consequent group is medical care.

**Figure 4.** Obtained Pareto fronts of DMOP-2 (absolute risk and reciprocal absolute risk) and DMOP-3 (probability of sufficiency and population attributable fraction) for all waves in Scenario C. The antecedent group is location, and the consequent group is comorbidities.

ID | Attributes |
---|---|
Comorbidities | Asthma, cardiovascular disease, COVID-19, diabetes, chronic obstructive pulmonary disease (COPD), hypertension, immunosuppression, pneumonia, obesity, chronic kidney disease |
Age and gender | Age, gender |
Location | Location of hospital, sector |
Medical care and outcome | Intubation, in intensive care unit (ICU), deceased, hospitalized |

Attribute Group | No. of Attributes | No. of Selectors | Possible Combinations |
---|---|---|---|
Comorbidities | 13 | 43 | 134,217,727 |
Clinical care | 3 | 66 | 3266 |
Medical care | 4 | 12 | 224 |
Age and gender | 2 | 7 | 17 |

Scenario | Antecedent Group | Consequent Group | Search Space Size |
---|---|---|---|
A | Age and gender | Comorbidities | 2,281,701,359 |
B | Comorbidities | Medical care | 30,064,770,848 |
C | Location | Comorbidities | 438,355,096,382 |
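The search space sizes in this table are consistent with multiplying the numbers of possible combinations of the antecedent and consequent groups. The short check below illustrates the arithmetic; it assumes that the group listed as "Clinical care" in the combinations table (3266 combinations) is the one used as the Location antecedent in Scenario C, which the source does not state explicitly.

```python
# Possible combinations per attribute group, as listed in the combinations table.
combinations = {
    "Comorbidities": 134_217_727,
    "Clinical care": 3_266,   # assumed to correspond to the Location group
    "Medical care": 224,
    "Age and gender": 17,
}

# (antecedent group, consequent group) for each scenario.
scenarios = {
    "A": ("Age and gender", "Comorbidities"),
    "B": ("Comorbidities", "Medical care"),
    "C": ("Clinical care", "Comorbidities"),
}


def search_space_size(antecedent_group, consequent_group):
    """Number of candidate rules: every antecedent paired with every consequent."""
    return combinations[antecedent_group] * combinations[consequent_group]


sizes = {name: search_space_size(*groups) for name, groups in scenarios.items()}
for name, size in sizes.items():
    print(f"Scenario {name}: {size:,}")
```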

Wave | Initial Date | End Date | Records |
---|---|---|---|
1 | 2020-02-16 | 2020-09-26 | 1,955,291 |
2 | 2020-09-27 | 2021-04-17 | 4,604,490 |
3 | 2021-06-06 | 2021-10-23 | 4,220,735 |
4 | 2021-12-19 | 2022-03-05 | 3,027,248 |

Classical Measures | Scenario A | Scenario B | Scenario C |
---|---|---|---|
Support | 0.322 (0) | 0.588 (0) | 0.365 (0.021) |
Confidence | 0.655 (0) | 0.993 (0) | 0.880 (0.119) |
Lift | 7.696 (0) | 33.967 (0) | 41.11 (31.003) |

Causal Measures | Scenario A | Scenario B | Scenario C |
---|---|---|---|
Probability of Sufficiency | 0.272 (0) | 0.869 (0) | 0.823 (0.094) |
Attributable Fraction | 0.739 (0) | 0.891 (0) | 0.245 (0.019) |
Absolute Risk | 0.293 (0) | 0.941 (0) | 0.951 (0.019) |
Reciprocal Absolute Risk | 0.913 (0) | 0.935 (0) | 0.377 (0.031) |

Scenario | DMOP | W1 | W2 | W3 | W4 | A |
---|---|---|---|---|---|---|
A | 1 | 28 (0) | 28 (0) | 29 (0) | 26 (0) | 27 (0) |
A | 2 | 9 (0) | 28 (0) | 14 (0) | 15 (0) | 9 (0) |
A | 3 | 13 (0) | 15 (0) | 13 (0) | 14 (0) | 11 (0) |
B | 1 | 38 (0) | 40 (0) | 39 (0) | 40 (0) | 39 (0) |
B | 2 | 16 (0) | 21 (0) | 24 (0) | 28 (0) | 20 (0) |
B | 3 | 23 (0) | 22 (0) | 23 (0) | 28 (2.34) | 21 (0) |
C | 1 | 16.4 (1.14) | 16.8 (1.789) | 18.2 (1.09) | 10.2 (0.83) | 16 (0.70) |
C | 2 | 9.6 (1.51) | 13.4 (1.14) | 10.8 (1.64) | 3 (0.70) | 8.8 (1.64) |
C | 3 | 10.6 (1.14) | 12.6 (2.40) | 11.4 (0.54) | 8.2 (1.30) | 11.2 (1.09) |

**Table 7.** The best association rules obtained in Scenarios A (age and gender → comorbidities) and B (comorbidities → medical care). The evolutionary algorithm found these rules in the last population generated.

**AGE > 53.0 → DIABETES ∧ HYPERTENSION ∧ PNEUMONIA**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.0178 | 0.0095 | 0.0041 | 0.0027 | 0.0071 |
Confidence | 0.075 | 0.042 | 0.026 | 0.016 | 0.037 |
Lift | 3.3361 | 3.7296 | 5.1588 | 4.9639 | 4.2691 |
Absolute Risk | 0.0686 | 0.04 | 0.0252 | 0.0155 | 0.0354 |
Reciprocal Absolute Risk | 0.5711 | 0.6184 | 0.6439 | 0.6776 | 0.6304 |
Prob. Sufficiency | 0.069 | 0.0401 | 0.0252 | 0.0155 | 0.0355 |
Attributable Fraction | 0.7337 | 0.7878 | 0.7574 | 0.8139 | 0.7725 |

**AGE > 53.0 → DIABETES ∧ HYPERTENSION**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.045 | 0.0345 | 0.0218 | 0.0206 | 0.0288 |
Confidence | 0.188 | 0.154 | 0.141 | 0.121 | 0.15 |
Lift | 2.9746 | 3.2525 | 4.3939 | 3.9651 | 3.6786 |
Absolute Risk | 0.1642 | 0.1373 | 0.1291 | 0.1092 | 0.1354 |
Reciprocal Absolute Risk | 0.5038 | 0.5296 | 0.5402 | 0.5212 | 0.5338 |
Prob. Sufficiency | 0.1683 | 0.1396 | 0.1307 | 0.1106 | 0.1375 |
Attributable Fraction | 0.6201 | 0.6501 | 0.618 | 0.609 | 0.633 |

**AGE > 53 → HYPERTENSION ∧ PNEUMONIA**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.0324 | 0.0173 | 0.0072 | 0.0048 | 0.0129 |
Confidence | 0.136 | 0.077 | 0.047 | 0.028 | 0.067 |
Lift | 3.2169 | 3.6087 | 4.9252 | 4.7907 | 4.1133 |
Absolute Risk | 0.1229 | 0.0721 | 0.044 | 0.027 | 0.0631 |
Reciprocal Absolute Risk | 0.5532 | 0.5971 | 0.6104 | 0.6497 | 0.605 |
Prob. Sufficiency | 0.1245 | 0.0724 | 0.0441 | 0.027 | 0.0633 |
Attributable Fraction | 0.6962 | 0.7529 | 0.7147 | 0.7786 | 0.7357 |

**AGE > 53.0 → HYPERTENSION**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.094 | 0.0766 | 0.0467 | 0.0465 | 0.0624 |
Confidence | 0.393 | 0.342 | 0.303 | 0.273 | 0.327 |
Lift | 2.5102 | 2.7453 | 3.6076 | 3.225 | 3.0653 |
Absolute Risk | 0.3108 | 0.2804 | 0.2589 | 0.2271 | 0.2722 |
Reciprocal Absolute Risk | 0.4279 | 0.4466 | 0.4385 | 0.4142 | 0.4419 |
Prob. Sufficiency | 0.3387 | 0.2988 | 0.2708 | 0.238 | 0.2879 |
Attributable Fraction | 0.4743 | 0.5037 | 0.4748 | 0.457 | 0.488 |

**HYPERTENSION ∧ PNEUMONIA → HOSPITALIZATION**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.0378 | 0.0196 | 0.0087 | 0.0052 | 0.0148 |
Confidence | 0.897 | 0.912 | 0.92 | 0.873 | 0.904 |
Lift | 5.3062 | 10.4193 | 16.4679 | 23.7918 | 11.7312 |
Absolute Risk | 0.7602 | 0.8426 | 0.8721 | 0.8414 | 0.8411 |
Reciprocal Absolute Risk | 0.2186 | 0.2213 | 0.1553 | 0.1396 | 0.1905 |
Prob. Sufficiency | 0.8809 | 0.9055 | 0.9156 | 0.869 | 0.8979 |
Attributable Fraction | 0.1896 | 0.2063 | 0.1481 | 0.1353 | 0.1788 |

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.0337 | 0.0163 | 0.0077 | 0.0042 | 0.0127 |
Confidence | 0.905 | 0.919 | 0.925 | 0.881 | 0.911 |
Lift | 5.3528 | 10.4933 | 16.5565 | 23.9939 | 11.8201 |
Absolute Risk | 0.7645 | 0.846 | 0.876 | 0.8479 | 0.846 |
Reciprocal Absolute Risk | 0.1952 | 0.1848 | 0.1371 | 0.1148 | 0.1639 |
Prob. Sufficiency | 0.8896 | 0.9122 | 0.9207 | 0.8765 | 0.905 |
Attributable Fraction | 0.1685 | 0.1717 | 0.1306 | 0.1111 | 0.1534 |

**PNEUMONIA → HOSPITALIZATION**

Measure | W1 | W2 | W3 | W4 | T |
---|---|---|---|---|---|
Support | 0.1016 | 0.0497 | 0.0257 | 0.0127 | 0.0395 |
Confidence | 0.83 | 0.836 | 0.813 | 0.668 | 0.815 |
Lift | 4.9114 | 9.5507 | 14.5584 | 18.1962 | 10.5679 |
Absolute Risk | 0.7536 | 0.7958 | 0.7819 | 0.6433 | 0.7751 |
Reciprocal Absolute Risk | 0.576 | 0.5571 | 0.4547 | 0.3386 | 0.5021 |
Prob. Sufficiency | 0.8163 | 0.8292 | 0.807 | 0.6594 | 0.8071 |
Attributable Fraction | 0.5453 | 0.5405 | 0.4433 | 0.3324 | 0.487 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Sinisterra-Sierra, S.; Godoy-Calderón, S.; Pescador-Rojas, M.
COVID-19 Data Analysis with a Multi-Objective Evolutionary Algorithm for Causal Association Rule Mining. *Math. Comput. Appl.* **2023**, *28*, 12.
https://doi.org/10.3390/mca28010012
