Predictive Modeling for the Diagnosis of Gestational Diabetes Mellitus Using Epidemiological Data in the United Arab Emirates

Ali, Nasloon; Khan, Wasif; Ahmad, Amir; Masud, Mohammad Mehedy; Adam, Hiba; Ahmed, Luai A.

doi:10.3390/info13100485


by Nasloon Ali ¹,*, Wasif Khan ²,³, Amir Ahmad ², Mohammad Mehedy Masud ²,³, Hiba Adam ¹ and Luai A. Ahmed ¹,⁴

¹ Institute of Public Health, College of Medicine and Health Sciences, United Arab Emirates University, Al Ain 15551, United Arab Emirates

² College of Information Technology, United Arab Emirates University, Al Ain 15551, United Arab Emirates

³ Big Data Analytics Center, United Arab Emirates University, Al Ain 15551, United Arab Emirates

⁴ Zayed Centre for Health Sciences, United Arab Emirates University, Al Ain 15551, United Arab Emirates

* Author to whom correspondence should be addressed.

Information 2022, 13(10), 485; https://doi.org/10.3390/info13100485

Submission received: 6 September 2022 / Revised: 29 September 2022 / Accepted: 3 October 2022 / Published: 10 October 2022

(This article belongs to the Topic Advances in Data Analytics with Applications to Health Care)


Abstract

Gestational diabetes mellitus (GDM) is a common condition with repercussions for both the mother and her child. Machine learning (ML) modeling techniques have been proposed to predict the risk of several medical outcomes. A systematic evaluation of the predictive capacity of maternal factors resulting in GDM in the UAE is warranted. Data on a total of 3858 women who gave birth and had information on their GDM status in a birth cohort were used to fit the GDM risk prediction model. The information used for the predictive modeling came from self-reported epidemiological data collected at early gestation. Three different ML models, random forest (RF), gradient boosting model (GBM), and extreme gradient boosting (XGBoost), were used to predict GDM. Furthermore, to provide local interpretation of each feature in GDM diagnosis, features were studied using Shapley additive explanations (SHAP). Results obtained using ML models show that XGBoost, which achieved an AUC of 0.77, performed better than RF and GBM. Individual feature importance using SHAP values and the XGBoost model shows that previous GDM diagnosis, maternal age, body mass index, and gravidity play a vital role in GDM diagnosis. ML models using self-reported epidemiological data are useful and feasible in prediction models for GDM diagnosis amongst pregnant women. Such data should be periodically collected at early pregnancy for health professionals to intervene at earlier stages to prevent adverse outcomes in pregnancy and delivery. The XGBoost algorithm was the optimal model for identifying the features that predict GDM diagnosis.

Keywords: gestational diabetes mellitus; machine learning; prediction modeling; United Arab Emirates

1. Introduction

Gestational diabetes mellitus (GDM) is a common medical condition during pregnancy and is characterized as any degree of glucose intolerance with onset or first recognition during pregnancy [1]. This definition is applicable regardless of whether insulin or diet modifications are used for treating GDM or whether the condition continues after pregnancy [2].

GDM increases the risk of maternal trauma, preeclampsia and eclampsia, premature rupture of membranes, preterm delivery, and delivery by caesarean section [3,4,5]. In the newborns, there is increased risk of macrosomia, shoulder dystocia, neonatal intensive care unit admission, and perinatal death [3,6,7,8]. Moreover, mothers with GDM have increased risk of type 2 diabetes and cardiovascular diseases later in life [9,10], while their children have an increased risk for obesity, impaired glucose tolerance, metabolic syndromes, and cardiovascular risk profiles during adolescence and early adulthood [11,12,13]. The current evidence indicates that early detection and management of GDM improves outcomes for both mothers and their children [14].

Previous research has shown that prediction modeling has been successful in relating risk factors to a future GDM diagnosis [15]. One model, developed by applying a machine learning (ML) algorithm to data extracted from first-trimester health records to predict the risk of GDM at 24–28 weeks of gestation, achieved an AUC of 0.86 and an accuracy of 62.2%. This algorithm included maternal factors such as age, parity, BMI, and education, as well as hematological and biochemical test results [16]. It is uncertain whether these models are applicable to the local population in the UAE, as such a study has not been conducted here or in the region. Furthermore, there is a lack of evidence on whether relatively feasible and easy-to-collect self-reported epidemiological data would be predictive when used in these models in the local community. Therefore, the purpose of the present study was to develop a simple model incorporating maternal self-reported data and triage results to predict the risk of GDM amongst pregnant women in the UAE.

2. Materials and Methods

This analysis is based on pregnant women from the Emirati population who participated in a prospective cohort study in Al Ain, Abu Dhabi, UAE. Upon recruitment, women completed a baseline questionnaire and were followed up during pregnancy via medical records in the hospitals. The overall study has been described in detail elsewhere [17]. The study was approved by the United Arab Emirates University Human Research Ethics Committee (ERH-2017-5512), the Al Ain Hospital Research Ethics Committee (AAHEC-03-17-058) and the Tawam Hospital Research Ethics Committee (IRR–494). Informed written consent was obtained from each participant prior to data collection.

Data for the current analysis were extracted from the questionnaire administered during the first point of contact with the participants recruited between May 2017 and February 2021. The questionnaire contains questions on the demographics, psychosocial factors, previous pregnancies, and behaviors during the participant’s current pregnancy.

GDM was diagnosed using the mandatory testing and diagnosis standards used in all healthcare facilities in the emirate of Abu Dhabi. Specifically, between weeks 24 and 28 of gestation, pregnant women are required to complete a standardized oral glucose tolerance test (OGTT, fasting and 2 h post-glucose load) for GDM. Diagnosis of GDM was confirmed if fasting plasma glucose was ≥ 5.1 mmol/L, one-hour plasma glucose was ≥ 10.0 mmol/L, or two-hour plasma glucose was ≥ 8.5 mmol/L [18]. Women diagnosed with GDM were categorized into the GDM group and all other women were included in the comparison group. Women with previous type 1 or 2 diabetes mellitus were excluded from this analysis.
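The diagnostic rule above can be written as a small function. This is a minimal sketch, assuming the three readings are available in mmol/L; the function name `has_gdm`, the constant names, and the handling of missing readings (a missing value is simply skipped) are illustrative assumptions, not part of the cited standard.

```python
from typing import Optional

# Cutoffs in mmol/L, as quoted in the paragraph above.
FASTING_CUTOFF = 5.1
ONE_HOUR_CUTOFF = 10.0
TWO_HOUR_CUTOFF = 8.5


def has_gdm(fasting: Optional[float] = None,
            one_hour: Optional[float] = None,
            two_hour: Optional[float] = None) -> bool:
    """Return True if any available OGTT reading meets or exceeds its cutoff.

    Skipping missing readings (None) is an assumption made for this sketch.
    """
    checks = [
        (fasting, FASTING_CUTOFF),
        (one_hour, ONE_HOUR_CUTOFF),
        (two_hour, TWO_HOUR_CUTOFF),
    ]
    return any(value is not None and value >= cutoff
               for value, cutoff in checks)
```

Because the criteria are joined by "or", a single reading at or above its cutoff is enough to place a woman in the GDM group.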

Features selected included maternal age, number of previous pregnancies (gravidity), previous GDM diagnosis, planned pregnancy status, infertility treatment, consanguinity, education, employment, and physical activity during and before the current pregnancy. Feature selection focused on data collected in the questionnaire, in order to investigate the feasibility of prediction via self-reported data; hence, features collected from medical records, such as most anthropometry and biomarker data, were not included. From the medical records, only information about the women’s GDM status and their body mass index (BMI) was used for prediction. The features used in this study have previously been shown to predict GDM diagnosis in pregnancies.

Descriptive statistics were performed to show and compare the distribution of characteristics of the study population by GDM status. Continuous variables are presented as means and standard deviations, while categorical variables are presented as counts and percentages. Student’s t-tests were used to determine differences between group means for continuous variables (e.g., maternal age) and Pearson chi-square tests were used for categorical variables (e.g., maternal education). Statistical analyses were performed using Stata 15.1 (Stata Corp, College Station, TX, USA). A p-value less than or equal to 0.05 defined statistical significance.
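As an illustration of the categorical comparison described above, the following is a minimal sketch of a Pearson chi-square test on a 2×2 table. The authors ran their tests in Stata; this pure-Python version, the function name `chi_square_2x2`, and the example counts are assumptions for illustration only, and the counts are not taken from the study.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column total must be positive")
    statistic = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts (NOT from the study):
# rows = GDM / no GDM, columns = employed / not employed.
stat, p = chi_square_2x2(30, 70, 20, 80)
print(f"chi2 = {stat:.3f}, p = {p:.4f}, significant = {p <= 0.05}")
```

The final comparison mirrors the significance rule stated in the text (p less than or equal to 0.05).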

The proposed methodology can be explained using Algorithm 1.

Algorithm 1 GDM diagnosis using ML model.

Input: $X$ is the total dataset of patients $P$.

Output: diagnosis of GDM for a patient $p_{i} \in P$ such that $i = 0$ is normal and $i = 1$ is GDM.

$X_{imputed} \leftarrow X$. Predict the missing values in $X$ using MissForest.
$X_{train}, X_{test} \leftarrow X_{imputed}$. Divide $X_{imputed}$ into $X_{train}$ and $X_{test}$.
$X_{train}$ is used to train the ML model $f(\cdot)$.
Use the trained model $f(\cdot)$ to predict patient $p_{i}$ in $X_{test}$.
Return $p_{i}$.

To build a GDM prediction model, the performance of three ML-based models, random forest (RF), gradient boosting model (GBM), and extreme gradient boosting (XGBoost), was evaluated in this analysis. RF is an ML model that consists of multiple decision trees (DTs), where each tree makes its own prediction. The predictions of the DTs are then combined by averaging or majority vote to obtain the overall output prediction. With multiple DTs each making their own prediction of the outcome, the final RF prediction for GDM is:

$$RF_{out} = \max_{votes} \{ \mathrm{Count}(Yes_{GDM}), \mathrm{Count}(No_{GDM}) \} \quad (1)$$

where $Yes_{GDM}$ and $No_{GDM}$ are the predictions of the DTs with the presence and absence of GDM, respectively.
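The majority vote in Equation (1) can be sketched in a few lines. This is an illustrative sketch only: the function name `rf_majority_vote`, the 0/1 label encoding, and the tie-breaking rule are assumptions and are not specified in the text.

```python
from collections import Counter
from typing import Sequence


def rf_majority_vote(tree_predictions: Sequence[int]) -> int:
    """Combine per-tree labels (1 = GDM, 0 = no GDM) by majority vote.

    Ties go to class 0 here; that choice is an assumption of this sketch.
    """
    if not tree_predictions:
        raise ValueError("need at least one tree prediction")
    counts = Counter(tree_predictions)
    # Compare Count(Yes_GDM) with Count(No_GDM), as in Equation (1).
    return 1 if counts[1] > counts[0] else 0
```

For example, three trees voting `[1, 1, 0]` yield a GDM prediction, whereas `[0, 0, 1]` does not.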

GBM is an ensemble ML classifier based on the idea of boosting, that is, that weak learners can be turned into strong learners in an iterative way [19,20]. In GBM, gradients of the loss function are used to minimize the loss for the weak learners (DTs). However, GBM suffers from overfitting if the iterative process is not properly regularized [20]. XGBoost [19] is a scalable ML model based on a gradient boosting framework that builds low-depth DTs iteratively to minimize a loss function [21,22]. The training process adds DTs iteratively, each predicting the errors of the previous DTs, before all the DTs are ensembled [23]. An XGBoost model containing n DTs is represented as:

$$y_{pred} = \sum_{k = 1}^{n} f_{k} (y_{actual}) \quad (2)$$

where y_actual is the input sample and y_pred is the predicted value by the model. In XGBoost, training is performed in an additive manner with the aim to optimize the objective function. The objective function O for m samples at the t-th iteration is represented as:

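$$O^{t} = \sum_{i = 1}^{m} l \left( y_{i}, \hat{y}_{i}^{(t - 1)} + f_{t} (x_{i}) \right) + \Omega (f_{t}) \quad (3)$$

In Equation (3), l(.) is the loss function and Ω(f_t) is the regularization function, which can be represented as:

$$\Omega (f_{t}) = \gamma T + \frac{1}{2} \lambda \sum_{j = 1}^{T} \omega_{j}^{2} \quad (4)$$

where T is the number of leaves in the tree f_t, ω_j is the weight of the j-th leaf, and γ and λ are regularization parameters.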
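To make Equations (3) and (4) concrete, the sketch below evaluates the regularized objective for one boosting step on toy inputs. It is a hedged illustration, not the XGBoost implementation: the logistic loss, the function names, and the default values of `gamma` and `lam` are assumptions chosen for the example.

```python
import math
from typing import Sequence


def log_loss(y_true: int, raw_score: float) -> float:
    """Logistic loss for one sample, given a raw (log-odds) score."""
    prob = 1.0 / (1.0 + math.exp(-raw_score))
    return -(y_true * math.log(prob) + (1 - y_true) * math.log(1.0 - prob))


def regularization(leaf_weights: Sequence[float],
                   gamma: float, lam: float) -> float:
    """Equation (4): gamma * (number of leaves) + 0.5 * lambda * sum(w ** 2)."""
    n_leaves = len(leaf_weights)
    return gamma * n_leaves + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y: Sequence[int],
              prev_scores: Sequence[float],
              tree_outputs: Sequence[float],
              leaf_weights: Sequence[float],
              gamma: float = 0.0,
              lam: float = 1.0) -> float:
    """Equation (3): loss of (previous score + new tree output) plus the penalty."""
    loss = sum(log_loss(yi, s + f)
               for yi, s, f in zip(y, prev_scores, tree_outputs))
    return loss + regularization(leaf_weights, gamma, lam)
```

A tree with many leaves or large leaf weights is penalized through the second term, which is how the objective discourages overfitting.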

The Shapley additive explanations (SHAP) method [24] is based on cooperative game theory [25], in which a group of players cooperate towards the outcome of a game and obtain a certain gain. Some players may contribute more than others, so their payoff is larger than that of the others. SHAP values provide a way to distribute the gain according to each player’s contribution. Consider a game, G, consisting of N players. Let C be a coalition of players and v(C) be the cost obtained from the coalition. Then, for each individual player i and the cost function v, the SHAP value Ø can be obtained using:

$$Ø_{i} (v) = \frac{1}{N!} \sum_{π \in C_{N}} \left[ v (C (π, i)) - v (C (π, i) \setminus \{i\}) \right] \quad (5)$$

where π ranges over the set of permutations C_N of the N players and C(π,i) is the set of players in the coalition. The higher the Ø_i(v), the larger the payoff of the individual player. In the GDM model, each feature is analogous to a player. If the SHAP value is higher for a specific feature, that feature contributes more to the diagnosis of GDM.
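The permutation form of the Shapley value can be computed exactly for a handful of players. The sketch below is illustrative only: it enumerates all orderings (feasible only for small N, unlike the tree-specific algorithm used for the actual analysis), and the value function `toy_value` with its numbers is invented for the example rather than derived from the study.

```python
from itertools import permutations
from math import factorial
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(players: Sequence[str],
                   value: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition: FrozenSet[str] = frozenset()
        for player in order:
            with_player = coalition | {player}
            # Marginal contribution of `player` when joining those before it.
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    n_orderings = factorial(len(players))
    return {p: total / n_orderings for p, total in totals.items()}


def toy_value(coalition: FrozenSet[str]) -> float:
    """Made-up value function with one interaction term (age x bmi)."""
    score = 0.0
    if "previous_gdm" in coalition:
        score += 0.30
    if "age" in coalition:
        score += 0.10
    if "age" in coalition and "bmi" in coalition:
        score += 0.04
    return score


phi = shapley_values(["previous_gdm", "age", "bmi"], toy_value)
```

In this toy game the interaction gain of 0.04 is split equally between `age` and `bmi`, and the values sum to the value of the full coalition, which is the additivity property that SHAP relies on.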

The missing data were imputed using missForest [26,27]. The dataset with the features was divided randomly into 70–30 training–testing sets and each experiment was repeated five times. SHAP values were calculated using the Python implementation of TreeExplainer. To evaluate the performance of each ML model for the proposed GDM diagnosis, we first obtained the ROC curve for each ML model, followed by each feature’s importance using SHAP values. Then, we described the positive and negative impact of each feature on the GDM diagnosis using SHAP values (see Section 3). The feature contribution impact plot was drawn with a positive impact represented by a red bar and a negative impact represented by a blue bar. The global feature importance of the ML model across all samples was obtained using a summary graph. Each sample in a summary graph is represented by a dot whose position on the x-axis is determined by its SHAP value, which shows the contribution of that feature to the GDM diagnosis. When multiple samples lie on the same point, they create a density. Finally, dependence plots were obtained for each feature. In each dependence plot, the x-axis represents the feature, its interaction with another dependent feature is represented on the right-side y-axis, and the SHAP values are represented on the left-side y-axis (see Section 3). All the experiments were conducted using Python 3.8 on an Intel(R) Core i9-9900 CPU @ 3.10 GHz with 8 GB RAM.
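The evaluation protocol (random 70–30 splits, repeated five times, scored by area under the ROC curve) can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: the helper names, the use of integer seeds 0 to 4, and the pairwise-ranking formula for AUC are choices made for illustration, and the text does not say how the authors seeded or implemented these steps.

```python
import random
from typing import List, Sequence, Tuple


def train_test_split_70_30(n_samples: int, seed: int) -> Tuple[List[int], List[int]]:
    """Shuffle row indices with a fixed seed and cut them 70/30."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(0.7 * n_samples))
    return indices[:cut], indices[cut:]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the chance that a random positive outranks a random negative.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Five repeated splits, one per (assumed) seed; a model would be fitted on
# each training part and scored with roc_auc on the matching test part.
splits = [train_test_split_70_30(100, seed) for seed in range(5)]
```

Averaging the five test-set AUCs gives a more stable estimate than a single split, at the cost of fitting each model five times.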

3. Results

The baseline characteristics of 3858 women who had information on their OGTT are presented in Table 1. Women who were diagnosed with GDM were older (32.8 vs. 29.9, p < 0.001), more parous (3.4 vs. 2.7, p < 0.001), more likely to be employed (37.5% vs. 31.3%, p = 0.006), and more likely to have self-reported previous GDM (56.0% vs. 15.6%, p < 0.001).

The ROC plot represented in Figure 1 shows that the XGBoost model achieved the best auROC of 0.770 compared to 0.764 and 0.624 achieved by RF and GBM, respectively. Therefore, XGBoost-based GDM models were used for further analysis using the SHAP values.

The impact of each feature towards GDM diagnosis is represented in Figure 2. The figure shows that based on higher SHAP values, previous GDM diagnosis is the most important factor for GDM diagnosis in the current pregnancy compared to any other feature. Maternal age and BMI are the next most important factors in GDM diagnosis. Features such as self-reported physical activity before and after pregnancy, employment, infertility treatment, and planned pregnancy had very little impact on GDM diagnosis. Again, previous GDM diagnosis was identified as the most important feature, and maternal age and BMI were the second most predictive features contributing positively towards GDM diagnosis using the SHAP values. On the other hand, self-reported physical activity before and after pregnancy and higher education have a negative impact on the proposed GDM model.

A summary plot in Figure 3 shows that a history of previous GDM leads to higher SHAP values; thus, it has the most significant influence on GDM diagnosis. Similarly, higher values for maternal age and BMI result in higher SHAP values. Higher education and performing physical activity before and after pregnancy decrease the risk of GDM diagnosis. Employment is also shown to have a negative impact on GDM diagnosis. Other features such as infertility treatment, planned pregnancy, and consanguinity have only a slight positive impact on GDM diagnosis.

Figure 4 represents the dependence plots for the features using the SHAP values for each sample. The plot shows that previous GDM diagnosis has a clear interaction with gravidity, as shown in Figure 4a for women who have been previously diagnosed with GDM and those who have a gravidity of more than four. In this situation, the SHAP values increase and thus the women are at a higher risk of developing GDM. For age (Figure 4b), the SHAP value increases when the woman is older than 35 years. Regardless of previous GDM diagnosis, increased age is a risk factor for developing GDM. Nevertheless, we also found that an age of less than 30 years has a negative impact in the GDM model. BMI (Figure 4c) shows an interaction with age, that is, a pregnant woman whose BMI is more than 30 kg/m² (considered obese) has an increased risk of GDM, especially if she is older. In younger pregnant women, a BMI of less than 25 kg/m² (considered of acceptable weight) is negatively associated with GDM diagnosis. The plot also shows that gravidity increases the risk of developing GDM. Women with lower education (Figure 4e) are at a higher risk of developing GDM, especially if they are older. Primiparity has an interaction with education, as shown in Figure 4f: women who attended higher education and who are not primiparous are at a lower risk of developing GDM. Physical activity before pregnancy (Figure 4g) shows an interaction with previous GDM. Women who do not perform physical activity, on the other hand, have SHAP values > 0, indicating a risk of GDM diagnosis. In Figure 4i, it is shown that unplanned pregnancy in older women is a risk factor for GDM diagnosis. Employment and education have an interaction, which shows that women who are unemployed with lower education may be at a higher risk of GDM diagnosis (Figure 4j). The dependence plot for infertility treatment in Figure 4k shows that younger women who had infertility treatment are at a higher risk of GDM diagnosis.

4. Discussion

In this analysis of self-reported epidemiological data collected during early pregnancy, three ML prediction models were developed: XGBoost, RF, and GBM. Of these, XGBoost was the most predictive (auROC = 0.77), performing better than GBM and RF in the experiments on GDM diagnosis. The models showed that previous GDM diagnosis, maternal age, body mass index, and gravidity play a vital role in future GDM diagnosis.

The high predictivity of the XGBoost model in this analysis is consistent with the findings of a recent study [28], which found that the XGBoost model had a higher AUC than the logistic model (0.742 vs. 0.663, p = 0.001). XGBoost is an optimized gradient tree boosting system, built as an ensemble of multiple decision trees, that also controls overfitting [29]. XGBoost can create diverse and accurate DTs, which may be the reason for its better performance [28,30]. Moreover, it handles non-linear relationships in the data and is robust to outliers. However, the black-box nature of DT ensemble algorithms [31] makes it challenging to provide local interpretations of how each feature leads to GDM diagnosis. Therefore, to provide the local interpretation of each feature, we used SHAP with the XGBoost algorithm for GDM diagnosis. Individual feature importance using SHAP values shows that previous GDM diagnosis is the most important factor for GDM diagnosis, followed by the age of the pregnant woman. The incorporation of SHAP values with the XGBoost model enabled these local interpretations of each feature that contributes towards GDM diagnosis. XGBoost and RF, which are the main parts of the proposed GDM model, can easily be generalized to similar populations. The GBM model is prone to noise, requires expensive parameter tuning, and may suffer from overfitting; therefore, the performance of GBM was poor compared to RF and XGBoost. The problem of overfitting can be addressed by using the optimized objective function obtained from boosted trees in the XGBoost model.

The features included in this analysis are easily collected from pregnant women at antenatal care visits. As such, prediction of future GDM diagnosis using self-reported epidemiological data at early pregnancy is highly feasible. This is crucial for many reasons. The easy identification of women at risk of GDM is an important step towards better antenatal care and interventions during pregnancy and even before conception [32]. Nutritional management should remain the focus, as it has shown the best prognosis for better neonatal and maternal outcomes in women with GDM [33]. The early recognition of such women using simple predictors makes management more practicable for both the women and the caregivers. Furthermore, practical interventions can be set up to ensure women with GDM are proactive in their approach to dealing with the diagnosis. Many applications exist in the areas of diabetes-related dietary and physical activity management, and a customized application can be explored based on the factors highly associated with the diagnosis in each population.

A systematic review [11] reported that the risk for recurrent GDM in a subsequent pregnancy was as high as 30–84% in women with prior GDM, and the variations in the GDM recurrence rate were dependent on the presence of other risk factors. The combination of high BMI with high abdominal circumference and elevated fasting glucose was associated with a 13-fold increased risk of GDM as compared to women who did not have this combination of symptoms [34]. According to Liu et al. [35], BMI and maternal age were two of the most used features for GDM prediction. Our findings showed that regardless of previous GDM history, the risk of GDM increases after the age of 35, with women under the age of 30 having a lower risk. Advanced maternal age is an independent risk factor for GDM [36]. Physical activity during pregnancy has also been shown to predict GDM in other populations. In the same meta-analysis mentioned above, physical activity both during and prior to pregnancy was associated with lower odds of GDM in pregnancy. The identification of women with sedentary behavior or poor activity via simple data such as those of this study allows for early intervention and better follow-up. The predictive models also showed interactions between socio-economic factors such as education and employment in decreasing the probability of future GDM diagnosis. The clear interaction between older age and lower education and its association with GDM diagnosis identifies a clearly defined population who should be targeted for intervention. The same applies to women with previous infertility treatments who are inactive or sedentary. Such interactive models allow for defined populations to be targeted, helping to ensure that higher-risk populations do not eventually get diagnosed with GDM so long as appropriate interventions are provided.

One of the strengths of this study is the focus on self-reported epidemiological data in predicting GDM diagnosis among pregnant women. Other studies show prediction models work well with data that are more invasive to collect, such as those from complete blood counts and anthropometry [37]. The focus of this study, however, was to give importance to features that can be easily collected from pregnant women at early gestation. This allows for close monitoring as early as possible. Women can also be told of their risk of being diagnosed and given appropriate interventions, as mentioned earlier. There are some limitations to this study. Firstly, self-reported data suffer from their own set of biases. From the ML perspective, ML models usually require a sufficient amount of training data for better prediction. Although there is a reasonable amount of data for the GDM model, only a limited number of risk factors were used in this analysis. Other authors have used a large number of features and a sufficient amount of training data. For instance, Artzi et al., Qiu et al., and Wu et al. used 2355, 50, and 50 features, respectively [16,38,39]. Another major issue is the class imbalance, as GDM cases were fewer than non-cases, which biases the algorithm. Therefore, to balance the classes, data balancing algorithms such as SMOTE and GANs will be used in the future. Furthermore, the prediction of the proposed XGBoost model can be further improved by incorporating more robust optimization techniques such as Bayesian particle swarm optimization.

5. Conclusions

ML models using self-reported epidemiological data are useful and feasible in prediction models for GDM diagnosis amongst pregnant women. Such data should be periodically collected at early pregnancy for health professionals to intervene at earlier stages to prevent adverse outcomes in pregnancy and delivery. The XGBoost algorithm was the optimal model for identifying the features that predict GDM diagnosis. SHAP values using XGBoost further identify the interactions of some variables in determining GDM diagnosis.

Author Contributions

Conceptualization, L.A.A., N.A., W.K., A.A. and M.M.M.; methodology, N.A. and W.K.; software, N.A. and W.K.; validation, L.A.A., A.A. and M.M.M.; formal analysis, N.A. and W.K.; investigation, N.A. and W.K.; resources, N.A., W.K., A.A. and L.A.A.; data curation, N.A.; writing—original draft preparation, N.A. and W.K.; writing—review and editing, all authors; visualization, W.K.; supervision, L.A.A. and A.A.; project administration, L.A.A. and A.A.; funding acquisition, L.A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant from the Zayed Center for Health Sciences, United Arab Emirates University (31R239).

Informed Consent Statement

Informed written consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study can be made available on request from the Mutaba’ah study. Approval from research ethics committee may be required.

Acknowledgments

We thank the women who took part in the overall study.

Conflicts of Interest

The authors declare no conflict of interest.

References

Buchanan, T.A.; Xiang, A.; Kjos, S.L.; Watanabe, R. What is gestational diabetes? Diabetes Care 2007, 30 (Suppl. S2), S105–S111. [Google Scholar] [CrossRef] [PubMed] [Green Version]
McIntyre, H.D.; Colagiuri, S.; Roglic, G.; Hod, M. Diagnosis of GDM: A suggested consensus. Best Pract. Res. Clin. Obstet. Gynaecol. 2015, 29, 194–205. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Keller, J.D.; Lopez-Zeno, J.A.; Dooley, S.L.; Socol, M.L. Shoulder dystocia and birth trauma in gestational diabetes: A five-year experience. Am. J. Obstet. Gynecol. 1991, 165, 928–930. [Google Scholar] [CrossRef]
Catalano, P.M.; McIntyre, H.D.; Cruickshank, J.K.; McCance, D.R.; Dyer, A.R.; Metzger, B.E.; Lowe, L.P.; Trimble, E.R.; Coustan, D.R.; Hadden, D.R.; et al. The hyperglycemia and adverse pregnancy outcome study: Associations of GDM and obesity with pregnancy outcomes. Diabetes Care 2012, 35, 780–786. [Google Scholar] [CrossRef] [Green Version]
Lao, T.; Ho, L. Does maternal glucose intolerance affect the length of gestation in singleton pregnancies? J. Soc. Gynecol. Investig. 2003, 10, 366–371. [Google Scholar] [CrossRef]
He, X.-J.; Qin, F.-Y.; Hu, C.-L.; Zhu, M.; Tian, C.-Q.; Li, L. Is gestational diabetes mellitus an independent risk factor for macrosomia: A meta-analysis? Arch. Gynecol. Obstet. 2015, 291, 729–735. [Google Scholar] [CrossRef]
Gasim, T. Gestational diabetes mellitus: Maternal and perinatal outcomes in 220 Saudi women. Oman Med. J. 2012, 27, 140. [Google Scholar] [CrossRef]
Billionnet, C.; Mitanchez, D.; Weill, A.; Nizard, J.; Alla, F.; Hartemann, A.; Jacqueminet, S. Gestational diabetes and adverse perinatal outcomes from 716,152 births in France in 2012. Diabetologia 2017, 60, 636–644. [Google Scholar] [CrossRef] [Green Version]
Bellamy, L.; Casas, J.-P.; Hingorani, A.D.; Williams, D. Type 2 diabetes mellitus after gestational diabetes: A systematic review and meta-analysis. Lancet 2009, 373, 1773–1779. [Google Scholar] [CrossRef]
Kessous, R.; Shoham-Vardi, I.; Pariente, G.; Sherf, M.; Sheiner, E. An association between gestational diabetes mellitus and long-term maternal cardiovascular morbidity. Heart 2013, 99, 1118–1121. [Google Scholar] [CrossRef]
Kim, S.Y.; England, J.L.; Sharma, J.A.; Njoroge, T. Gestational diabetes mellitus and risk of childhood overweight and obesity in offspring: A systematic review. Exp. Diabetes Res. 2011, 2011, 541308. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Vohr, B.R.; Boney, C.M. Gestational diabetes: The forerunner for the development of maternal and childhood obesity and metabolic syndrome? J. Matern.-Fetal Neonatal Med. 2008, 21, 149–157. [Google Scholar] [CrossRef] [PubMed]
Lee, H.; Jang, H.C.; Park, H.K.; Cho, N.H. Early manifestation of cardiovascular disease risk factors in offspring of mothers with previous history of gestational diabetes mellitus. Diabetes Res. Clin. Pract. 2007, 78, 238–245. [Google Scholar] [CrossRef] [PubMed]
Buckley, B.S.; Harreiter, J.; Damm, P.; Corcoy, R.; Chico, A.; Simmons, D.; Vellinga, A.; Dunne, F. Gestational diabetes mellitus in Europe: Prevalence, current screening practice and barriers to screening. A review. Diabet. Med. 2012, 29, 844–854. [Google Scholar] [CrossRef]
Smirnakis, K.V.; Plati, A.; Wolf, M.; Thadhani, R.; Ecker, J.L. Predicting gestational diabetes: Choosing the optimal early serum marker. Am. J. Obstet. Gynecol. 2007, 196, 410.e1–410.e7. [Google Scholar] [CrossRef]
Qiu, H.; Yu, H.Y.; Wang, L.Y.; Yao, Q.; Wu, S.N.; Yin, C.; Fu, B.; Zhu, X.J.; Zhang, Y.L.; Xing, Y.; et al. Electronic health record driven prediction for gestational diabetes mellitus in early pregnancy. Sci. Rep. 2017, 7, 16417. [Google Scholar] [CrossRef] [Green Version]
Al Haddad, A.; Ali, N.; Elbarazi, I.; Elabadlah, H.; Al-Maskari, F.; Narchi, H.; Brabon, C.; Ghazal-Aswad, S.; AlShalabi, F.M.; Zampelas, A.; et al. Mutaba’ah—Mother and Child Health Study: Protocol for a prospective cohort study investigating the maternal and early life determinants of infant, child, adolescent and maternal health in the United Arab Emirates. BMJ Open 2019, 9, e030937. [Google Scholar] [CrossRef] [Green Version]
Department of Health. HAAD Standard for Routine Antenatal Screening and Care; HAAD/ANSC/SD; Department of Health: Abu Dhabi, United Arab Emirates, 2011; pp. 1–8.
Chen, T.; Carlos, G. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967. [Google Scholar] [CrossRef]
Wang, F.; Ross, C.L. Machine learning travel mode choices: Comparing the performance of an extreme gradient boosting model with a multinomial logit model. Transp. Res. Rec. 2018, 2672, 35–45. [Google Scholar] [CrossRef] [Green Version]
Zhang, D.; Qian, L.; Mao, B.; Huang, C.; Huang, B.; Si, Y. A data-driven design for fault detection of wind turbines using random forests and XGboost. Ieee Access. 2018, 6, 21020–21031. [Google Scholar] [CrossRef]
Lundberg, S.M.; Erion, G.; Chen, H.; DeGrave, A.; Prutkin, J.M.; Nair, B.; Katz, R.; Himmelfarb, J.; Bansal, N.; Lee, S.I. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell. 2020, 2, 56–67. [Google Scholar] [CrossRef] [PubMed]
Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4768–4777. [Google Scholar]
Sundararajan, M.; Najmi, A. The many Shapley values for model explanation. In Proceedings of the International Conference on Machine Learning, Virtual Event, 13–18 July 2020; pp. 9269–9278. [Google Scholar]
Stekhoven, D.J.; Bühlmann, B. MissForest—Non-parametric missing value imputation for mixed-type data. Bioinformatics 2012, 28, 112–118. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Sun, J.; Hong, M.; Xie, S.; Wu, L.; Wang, Y.; Mei, W.; Zhang, J. The interactive effect of pre-pregnancy overweight and obesity and hypertensive disorders of pregnancy on the weight status in infancy. Sci. Rep. 2019, 9, 15960.
Liu, J.; Wu, J.; Liu, S.; Li, M.; Hu, K.; Li, K. Predicting mortality of patients with acute kidney injury in the ICU using XGBoost model. PLoS ONE 2021, 16, e0246306.
Shi, X.; Wong, Y.D.; Li, M.Z.F.; Palanisamy, C.; Chai, C. A feature learning approach based on XGBoost for driving assessment and risk prediction. Accid. Anal. Prev. 2019, 129, 170–179.
Sahin, E.K. Assessing the predictive capability of ensemble tree methods for landslide susceptibility mapping using XGBoost, gradient boosting machine, and random forest. SN Appl. Sci. 2020, 2, 1–17.
Shrikumar, A.; Greenside, P.; Shcherbina, A.; Kundaje, A. Not just a black box: Learning important features through propagating activation differences. arXiv 2016. Available online: https://arxiv.org/abs/1605.01713 (accessed on 5 September 2022).
Qiu, J.; Liu, Y.; Zhu, W.; Zhang, C. Comparison of effectiveness of routine antenatal care with a midwife-managed clinic service in prevention of gestational diabetes mellitus in early pregnancy at a hospital in China. Med. Sci. Monit. Int. Med. J. Exp. Clin. Res. 2020, 26, e925991-1–e925991-8.
Lamminpää, R.; Vehviläinen-Julkunen, K.; Schwab, U. A systematic review of dietary interventions for gestational weight gain and gestational diabetes in overweight and obese pregnant women. Eur. J. Nutr. 2018, 57, 1721–1736.
Popova, P.; Kravchuk, E.; Gerasimov, A.; Shelepova, E.; Tsoi, U.; Grineva, E. The new combination of risk factors determining a high risk of gestational diabetes mellitus. In Proceedings of the 15th International & 14th European Congress of Endocrinology, Florence, Italy, 5–9 May 2012.
Liu, R.; Zhan, Y.; Liu, X.; Zhang, Y.; Gui, L.; Qu, Y.; Nan, H.; Jiang, Y. Stacking Ensemble Method for Gestational Diabetes Mellitus Prediction in Chinese Pregnant Women: A Prospective Cohort Study. J. Healthc. Eng. 2022, 1, 1–14.
Marozio, L.; Picardo, E.; Filippini, C.; Mainolfi, E.; Berchialla, P.; Cavallo, F.; Tancredi, A.; Benedetto, C. Maternal age over 40 years and pregnancy outcome: A hospital-based survey. J. Matern.-Fetal Neonatal Med. 2019, 32, 1602–1608.
Sweeting, A.N.; Wong, J.; Appelblom, H.; Ross, G.P.; Kouru, H.; Williams, P.F.; Sairanen, M.; Hyett, J.A. A novel early pregnancy risk prediction model for gestational diabetes mellitus. Fetal Diagn. Ther. 2019, 45, 76–84.
Artzi, N.S.; Shilo, S.; Hadar, E.; Rossman, H.; Barbash-Hazan, S.; Ben-Haroush, A.; Balicer, R.D.; Feldman, B.; Wiznitzer, A.; Segal, E. Prediction of gestational diabetes based on nationwide electronic health records. Nat. Med. 2020, 26, 71–76.
Wu, Y.; Ma, S.; Wang, Y.; Chen, F.; Zhu, F.; Sun, W.; Shen, W.; Zhang, J.; Chen, H. A risk prediction model of gestational diabetes mellitus before 16 gestational weeks in Chinese pregnant women. Diabetes Res. Clin. Pract. 2021, 179, 109001.

Figure 1. ROC curves for GDM prediction using the three prediction models. XGBoost shows the best performance.

Figure 2. Feature contributions for the GDM model using SHAP values. Red indicates features that contributed positively to the GDM prediction, while blue indicates features that contributed negatively.

Figure 3. SHAP summary plot showing the impact of each feature on the model output.

Figure 4. SHAP dependence plots showing each feature's value versus its SHAP value.

Table 1. Baseline characteristics of 3858 women by gestational diabetes mellitus (GDM) status in Al Ain, UAE.

Characteristic	All (n = 3858)	No GDM (n = 2977)	GDM (n = 881)	p-Value
Age (years) ^a*	31.1 ± 6.08	30.2 ± 5.90	33.8 ± 5.84	<0.001
BMI at pregnancy (kg/m²) ^a*	28.6 ± 5.82	28.0 ± 5.68	30.7 ± 5.79	<0.001
Number of pregnancies ^a*	2.91 ± 2.42	2.71 ± 2.34	3.61 ± 2.58	<0.001
Primiparity **				<0.001
Yes	718 (19.8%)	609 (21.7%)	109 (13.4%)
No	2911 (80.2%)	2200 (78.3%)	711 (86.2%)
Previous GDM diagnosis **				<0.001
Yes	875 (24.8%)	452 (16.6%)	423 (52.6%)
No	2657 (75.2%)	2276 (83.4%)	381 (47.4%)
Planned pregnancy **				0.684
Yes	1894 (53.1%)	1467 (53.3%)	427 (52.5%)
No	1674 (46.9%)	1287 (46.7%)	387 (47.5%)
Infertility treatment **				<0.001
Yes	320 (9.1%)	219 (8.0%)	101 (12.6%)
No	3212 (90.9%)	2509 (92.0%)	703 (87.4%)
Consanguinity **				0.727
Yes	1001 (84.2%)	794 (84.4%)	207 (83.5%)
No	188 (15.8%)	147 (15.6%)	41 (16.5%)
Education **				0.011
High school and below	1798 (50.8%)	1420 (52.0%)	378 (46.8%)
Above high school	1742 (49.2%)	1313 (48.0%)	429 (53.2%)
Employed **				0.004
Not employed	2426 (68.4%)	1906 (69.7%)	520 (64.3%)
Employed	1119 (31.6%)	830 (30.3%)	289 (35.7%)
Physical activity prior to current pregnancy **				0.582
Yes	1437 (44.3%)	1098 (44.1%)	339 (45.2%)
No	1805 (55.7%)	1394 (55.9%)	411 (54.8%)
Physical activity during current pregnancy **				0.026
Yes	1548 (46.8%)	1335 (52.2%)	323 (43.2%)
No	1759 (53.2%)	1225 (47.8%)	424 (56.8%)

^a Continuous values are expressed as mean ± standard deviation; categorical values are expressed as n (%). *, Student’s t-test; **, Pearson’s chi-square test.
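
As an illustration of how the categorical p-values in Table 1 can be checked, the following minimal sketch recomputes Pearson's chi-square test for two rows of the table using only the Python standard library. It assumes the test was run on the 2 × 2 counts shown, without a continuity correction; the paper does not state which software or correction was used. The function name `pearson_chi2_2x2` and the variable names are hypothetical and not taken from the study.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 count table.

    `table` is [[a, b], [c, d]]; returns (chi2 statistic, p-value).
    Assumption: an uncorrected test, which may differ from the authors' setup.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square upper tail equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Counts copied from Table 1: rows = Yes / No, columns = No GDM / GDM.
previous_gdm = [[452, 423], [2276, 381]]
planned_pregnancy = [[1467, 427], [1287, 387]]

for label, counts in [("Previous GDM diagnosis", previous_gdm),
                      ("Planned pregnancy", planned_pregnancy)]:
    chi2, p = pearson_chi2_2x2(counts)
    print(f"{label}: chi2 = {chi2:.2f}, p = {p:.3g}")
```

Under these assumptions, the previous GDM diagnosis row gives a p-value far below 0.001 and the planned pregnancy row gives a p-value close to the reported 0.684, in line with the table.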

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

