# Nonparametric Generation of Synthetic Data Using Copulas


## Abstract


## 1. Introduction

Synthetic data (**SD**) is data generated through mathematical models that preserve the statistical properties of real data (**RD**), such as the marginal and joint distributions of the data variables [1,2]. In recent years, research on synthetic data generation has become more relevant as important applications have been demonstrated, such as anonymizing information, with special interest in health [3,4,5]; balancing classes when training machine learning (**ML**) models [6,7,8,9]; and increasing the amount of data to improve the generalizability of deep learning models [10,11,12], among others.

**SD** makes it possible to address class imbalance in the **ML** algorithms used for classification: the **RD** are oversampled, which yields new individuals from the minority classes. One of the most widely used techniques for this is the Synthetic Minority Oversampling Technique (**SMOTE**). Originally introduced by Chawla et al. [13], **SMOTE** finds the nearest neighbors of a random sample from the class of interest, randomly selects one of those neighbors, and generates a new sample on the line segment joining the random sample and that neighbor. The main disadvantage of this method is that it uses local information for data synthesis instead of considering the complete distribution of the minority classes [6]. Applications of this method and its effect on the performance of **ML** models can be found in [8,9].
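
The interpolation step described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the reference implementation of Chawla et al. [13]; the function name `smote_sample`, the toy `minority` points and the choice `k=3` are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=3, rng=random):
    """Generate one synthetic point by SMOTE-style interpolation."""
    # Pick a random minority sample as the base point.
    base = rng.choice(minority)
    # Find its k nearest neighbours (excluding itself) by Euclidean distance.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    # Choose one neighbour and a random position on the joining segment.
    neighbour = rng.choice(neighbours)
    lam = rng.random()
    return tuple(b + lam * (q - b) for b, q in zip(base, neighbour))


# Hypothetical minority class: five points in the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
rng = random.Random(0)
synthetic = [smote_sample(minority, k=3, rng=rng) for _ in range(10)]
```

Each synthetic point depends only on one sample and one of its neighbors, which is the "local information" limitation noted above.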

Another technique for generating **SD** is Generative Adversarial Networks (**GAN**), initially introduced by Goodfellow et al. [14]. A **GAN** couples two neural network architectures, one called the Generator and the other the Discriminator. The Generator produces **SD** from the **RD**, and the Discriminator classifies whether the data it receives are real or synthetic. The ultimate goal is for the synthetic samples to be of such good quality that the Discriminator cannot distinguish them from real ones [10]. Although they were introduced only in 2014, **GAN** have improved remarkably, to the point that even a human being can find it difficult to distinguish between real images and synthetic images generated by the method. Their main drawback is that they are challenging to train [15]. Numerous studies present applications of **GAN**: Porcu et al. [11] use them to improve the generalization capacity of facial recognition **ML** models, Andreini et al. [12] use them to generate retinal images, and Poudevigne-Durance et al. [16] present a modification that allows the **GAN** to better manage records with missing data, among others.

Another approach to generating **SD** is the use of copulas; these are multivariate distribution functions that can explain the dependency relationships among the variables of a data set [17]. Recent work on generating **SD** from copulas is found in Patki et al. [18], where Gaussian copulas are used to generate **SD** in the context of a relational database. Sun et al. [18], in turn, employ vine copulas to produce **SD** that can be used to fit **ML** models, and Nejad et al. [19] use Archimedean copulas to generate a synthetic population and thus carry out an emergency planning study. Although these references offer effective methods for generating **SD**, they share a common limitation: they are based on parametric versions of copula theory. That is, these methods assume a specific functional form for the copula of the data under study. Although this assumption is prevalent in the literature [20,21,22], it can be problematic since it imposes restrictions on the copula that may not hold in practice. Furthermore, it can be challenging to verify the validity of such an assumption, and a wrong choice of parametric copula could therefore lead to distorted results.

Other approaches to **SD** generation can be found in Reiter [23], who uses classification and regression trees for synthesis; Ping et al. [24], who use Bayesian neural networks; Rankin et al. [3], who apply both previous methods to preserve the privacy of data on the health status of patients; and Yale et al. [4], who fit a multivariate Gaussian distribution by maximum likelihood to generate new data, among others.

**SD** generation methods have proven their practical utility in a wide range of scenarios. For instance, Wang et al. [25] used synthetic data to address the challenge of insufficient data for training machine learning models in crowd analysis. In the work of Boikov et al. [26], synthetic data were used in the automated recognition of defective parts in steel production, which allowed the training of two deep learning models: one for classification and one for segmentation. Shamsolmoali et al. [27] used synthetic data to improve the generalizability of a road segmentation model. Farajzadeh-Zanjani et al. [28] used synthetic data to address the class imbalance problem and improve the training of attack detection models in electrical networks. These examples demonstrate the utility of synthetic data to address various challenges in machine learning and artificial intelligence.

A key step after generation is to verify that the **SD** follows the same distribution as the **RD**. Examples of the above can be found in Yale et al. [4], who verify whether the confidence intervals of each simulated and real variable overlap and check whether the PCA projections of **RD** and **SD** are similar. Hernandez et al. [29] use Student’s t-test to verify the equality of the means of the variables and the Kolmogorov–Smirnov test to check whether the marginal distributions of each variable are equal. Gonzalez-Abril et al. [30] use Pearson’s chi-square test for categorical variables and the Kolmogorov–Smirnov test for continuous variables, among others.
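
As an illustration of the marginal checks mentioned above, the two-sample Kolmogorov–Smirnov statistic can be computed directly from the two empirical distribution functions. The sketch below uses only the standard library; the samples are plain normal draws standing in for one real and two synthetic variables (an assumption for the example, not data from any of the cited studies), and the threshold uses the usual large-sample constant 1.358 for a 5% significance level.

```python
import bisect
import math
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: max |F_a(x) - F_b(x)|."""
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # Both empirical CDFs are step functions, so the maximum gap
    # is attained at one of the pooled observations.
    for x in sa + sb:
        fa = bisect.bisect_right(sa, x) / len(sa)
        fb = bisect.bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


def ks_critical_value(n, m, c_alpha=1.358):
    """Large-sample rejection threshold; c_alpha = 1.358 corresponds to alpha = 0.05."""
    return c_alpha * math.sqrt((n + m) / (n * m))


rng = random.Random(0)
real = [rng.gauss(0.0, 1.0) for _ in range(1000)]
synthetic_good = [rng.gauss(0.0, 1.0) for _ in range(1000)]     # same marginal
synthetic_shifted = [rng.gauss(1.0, 1.0) for _ in range(1000)]  # shifted marginal

d_good = ks_statistic(real, synthetic_good)
d_shifted = ks_statistic(real, synthetic_shifted)
threshold = ks_critical_value(len(real), len(synthetic_good))
print(f"same: D = {d_good:.3f}, shifted: D = {d_shifted:.3f}, threshold = {threshold:.3f}")
```

A test of this kind is applied one variable at a time, so it says nothing about how the variables depend on each other.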

The studies above assess whether the **SD** follows the exact behavior of the **RD**. However, to our knowledge, no article verifies, using a multivariate homogeneity test, whether the **SD** has the same distribution as the **RD**. For the authors of this work, this represents a great opportunity, since a multivariate homogeneity test is the most appropriate way to verify that the marginal distributions, the joint distributions, and the dependency relationships (linear and nonlinear) of the **SD** are the same as those of the **RD**. Comparing each variable marginally is not enough to verify that the **SD** distribution equals the **RD** distribution: even if the marginal distributions are equal, the joint distributions need not be. An example of the latter can be found in the work of Matejka and Fitzmaurice [33], who generate different bivariate datasets, all with the same arithmetic means, standard deviations, and Pearson correlation coefficients but with different dependency structures [34]. Figure 1 presents some of the samples generated by these authors.
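
The point that equal marginals do not imply equal joint distributions can be checked with a small toy construction. This sketch is not the simulated-annealing procedure of Matejka and Fitzmaurice; it simply pairs the same values in two different ways, and all names in it are hypothetical.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
x = [rng.random() for _ in range(2000)]

y_dependent = list(x)      # dataset A: y equals x, perfect dependence
y_shuffled = list(x)
rng.shuffle(y_shuffled)    # dataset B: same values, pairing destroyed

# Both y columns contain exactly the same values, so every marginal
# comparison (means, quantiles, Kolmogorov-Smirnov) sees no difference.
same_marginals = sorted(y_dependent) == sorted(y_shuffled)
corr_a = pearson(x, y_dependent)
corr_b = pearson(x, y_shuffled)
print(same_marginals, round(corr_a, 3), round(corr_b, 3))
```

Datasets A and B pass any variable-by-variable comparison, yet one lies on a line and the other fills the unit square, which is why a multivariate test is needed.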

This work proposes a nonparametric method for generating **SD** through copulas, which are functions that explain the dependency structure of the **RD** [35]. The proposed method uses the global information of the data distribution, and not only local information as **SMOTE** does, which allows it to better fit the dependency relationships that exist in a data set. One advantage of the proposed method is that, since it is nonparametric, it uses empirical copulas, which avoids having to make assumptions about the parametric distribution that the **RD** follows. In addition, this method is easier to use, interpret, and implement than **GAN**, since the algorithm introduced here carries out simple calculations while achieving excellent results.

In addition, a data-depth plot (**DD-plot**) is used to evaluate whether the **SD** comes from the same generating process as the **RD**. This is a notable contribution since, in the scientific literature, many articles focus on comparing marginal distributions of **SD** with **RD** and on examining linear correlation coefficients, among other techniques. However, these measurements are not enough since they do not take into account the equality of the multivariate joint distributions. Our study fills this gap by conducting a comprehensive evaluation of the quality of the **SD**.

This article presents a method for generating **SD** and demonstrates that the data generated by the method respect the dependency structures of the **RD**, without the need to know the functional forms of such structures. The article is organized as follows: Section 2 presents the formal development of the method, describes the simulations carried out to exemplify it, explains the **DD-plot**, and describes a sensitivity analysis performed on the algorithm. Section 3 presents the results of the simulations and analyzes the goodness of the method. Finally, Section 4 presents the conclusions.

## 2. Methodology

#### 2.1. Mathematical Framework

**Proposition 1.**

**Proof.**

**Proposition 2.**

**Proof.**

**Example 1.**

**Example 2.**

**Algorithm 1:** Synthetic Data Generation Algorithm

The computational complexity of the **Synthetic Data Generation Algorithm** is $\mathcal{O}(p\,n\log(n) + N\,p\log(t))$ and, since $t$ is bounded by $n$, the expression can be simplified to $\mathcal{O}(\max(N, n)\,p\log(n))$.
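
To make the idea of sampling from an empirical copula concrete, the sketch below shows one simple way to do it. It is not a reproduction of Algorithm 1: it assumes a histogram (checkerboard-style) estimate of the copula on a grid with `t` bins per variable, and the names `fit`, `quantile`, `sample` and the reading of `t` as a number of bins are all assumptions made for illustration. Each column is sorted once, rows are mapped to grid cells through their ranks, and new points are drawn uniformly inside a randomly chosen occupied cell before being mapped back through the empirical quantiles.

```python
import random


def fit(data, t):
    """Sort each column and record, for every row, its grid cell in rank space."""
    n, p = len(data), len(data[0])
    sorted_cols = [sorted(row[j] for row in data) for j in range(p)]
    cells = [[0] * p for _ in range(n)]
    for j in range(p):
        order = sorted(range(n), key=lambda i: data[i][j])
        for rank, i in enumerate(order):
            # Pseudo-observation (rank + 0.5) / n falls in bin floor(u * t).
            cells[i][j] = min(t - 1, int((rank + 0.5) / n * t))
    return sorted_cols, cells


def quantile(sorted_col, u):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = u * (len(sorted_col) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_col) - 1)
    return sorted_col[lo] + (pos - lo) * (sorted_col[hi] - sorted_col[lo])


def sample(sorted_cols, cells, t, size, rng):
    """Draw synthetic rows: pick an occupied cell, jitter inside it, invert the margins."""
    out = []
    for _ in range(size):
        cell = rng.choice(cells)  # cells are drawn in proportion to their counts
        u = [(b + rng.random()) / t for b in cell]
        out.append(tuple(quantile(col, ui) for col, ui in zip(sorted_cols, u)))
    return out


# Toy real data on a noisy cubic curve (hypothetical example).
rng = random.Random(1)
xs = [rng.uniform(-2.0, 2.0) for _ in range(500)]
real = [(x, x ** 3 + rng.gauss(0.0, 0.1)) for x in xs]

sorted_cols, cells = fit(real, t=50)
synthetic = sample(sorted_cols, cells, t=50, size=1000, rng=rng)
```

In this sketch the sorting in `fit` is the step that costs on the order of $p\,n\log(n)$; the sampling cost depends on how the quantile lookup is implemented and is not meant to match the bound stated above.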

#### 2.2. Experiments

The first group of experiments generates **SD** from realizations of **RD** following a distribution $\mathcal{F}$. The generated data and the real data are compared using scatter plots only, since these cases are presented as an illustration. The real data ${x}^{(1)},\dots,{x}^{(n)}\in {\mathbb{R}}^{p}$, $p = 2, 3$, are sampled from $\mathcal{F}$. The patterns from which these real data are obtained are identified as (**a**) cubic function, (**b**) two-dimensional spiral, (**c**) Batman logo, (**d**) petals, and (**e**) three-dimensional spiral. These cases were chosen because they present patterns with complex geometries, which allows us to demonstrate that the data generation method works correctly even in complex cases.

The **RD** are compared with the synthetic ones using scatter plots, probability density plots, and the multivariate homogeneity test explained in Section 2.3. This case is presented to show that the method maintains good results even in high-dimensional data sets.

#### 2.3. Homogeneity Test

The homogeneity test relies on the **DD-plot**, which plots the depth of each point of the combined sample under the two corresponding empirical distributions. If the distributions are identical, i.e., the samples come from the same population, the **DD-plot** is a segment of the straight line joining the points $(0,0)$ and $(1,1)$ in ${\mathbb{R}}^{2}$. Any deviation of the plot from this straight line is a sign that the distributions are not identical.

In practice, the empirical version of the **DD-plot** must be used; for example, if the samples from $F$ and $G$ are the sets of observations $\{{X}_{1},{X}_{2},\dots ,{X}_{n}\}(\equiv \mathbf{X})$ and $\{{Y}_{1},{Y}_{2},\dots ,{Y}_{m}\}(\equiv \mathbf{Y})$, respectively, the **DD-plot** is defined as follows:
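
A standard way to write the empirical DD-plot, following the data-depth literature [36], is sketched below. The symbols $D_{F_n}(x)$ and $D_{G_m}(x)$, denoting the depth of a point $x$ with respect to the samples $\mathbf{X}$ and $\mathbf{Y}$, are notation assumed here; the depth function itself (for example Mahalanobis or halfspace depth) is left unspecified.

$$
DD(F_n, G_m) = \left\{ \left( D_{F_n}(x),\, D_{G_m}(x) \right) : x \in \mathbf{X} \cup \mathbf{Y} \right\}
$$

When both samples come from the same distribution, $D_{F_n}(x) \approx D_{G_m}(x)$ for every pooled point, so the set concentrates around the diagonal of the unit square.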

#### 2.4. Sensitivity Analysis

In this section, we explain the simulation analysis carried out to study the effect of $t$ on the quality of the **SD**.

The **SD** was visually compared using scatterplots.

## 3. Results and Analysis

The first group of experiments comprises the (**a**) cubic function, (**b**) two-dimensional spiral, (**c**) Batman logo, (**d**) petals, and (**e**) three-dimensional spiral patterns. Blue dots represent real data, while red dots correspond to synthetically generated data. As can be seen, the plots of the **SD** are very similar to the plots of the real observations. Whatever geometry needs to be replicated, the data augmentation method can learn the underlying dependency relationships and probability distributions. This first group of experiments demonstrates graphically and simply the goodness of the synthesis method.

The **SD** generation method respects the marginal distributions of each variable. Moreover, the scatter plots show that the proposed method also respects the bivariate dependency relationships, since the geometric shape and the scale are identical to those of the **RD**. In practice, these facts indicate that synthetic data are similar to real data in univariate and bivariate parameters. Univariate parameters, such as mean, median, standard deviation, interquartile range, kurtosis, and skewness, are identical in **SD** and **RD**. Bivariate parameters, such as covariance, Pearson’s correlation, and Spearman’s correlation, are also identical for both data sets. This indicates that the **SD** accurately captures the relationships and patterns present in the **RD**.

The homogeneity test uses the **DD-plot** explained in Section 2.3, where ${F}_{n}$ corresponds to the original data and ${G}_{m}$ to the generated samples. The depth values form a straight line; fitting a simple linear regression model to them yields a confidence interval for ${R}^{2}$ of $[0.9922, 0.9999]$, a confidence interval for the intercept ${\beta}_{0}$ of $[-0.0957, 0.0618]$, and a confidence interval for the slope ${\beta}_{1}$ of $[0.9069, 1.1087]$. Therefore, the homogeneity test confirms that the generated samples come from the same multivariate distribution as the original data: ${R}^{2}$ is very close to one and, at a 95% confidence level, it cannot be rejected that ${\beta}_{0}$ is zero and ${\beta}_{1}$ is one, so the line the depth values form is a segment of the straight line joining the points $(0,0)$ and $(1,1)$ in ${\mathbb{R}}^{2}$.
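
The check described above can be sketched as follows. Several choices here are assumptions made for illustration: the depth function is the Mahalanobis depth, the data are bivariate, and the "synthetic" sample is simply a second draw from the same distribution standing in for generated data. The sketch also returns only point estimates of the intercept, slope and $R^2$, not the confidence intervals reported in the text.

```python
import random
import statistics


def mahalanobis_depth(sample):
    """Build the Mahalanobis depth function of a bivariate sample."""
    n = len(sample)
    mx = statistics.fmean(p[0] for p in sample)
    my = statistics.fmean(p[1] for p in sample)
    sxx = sum((p[0] - mx) ** 2 for p in sample) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in sample) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in sample) / (n - 1)
    det = sxx * syy - sxy ** 2

    def depth(point):
        dx, dy = point[0] - mx, point[1] - my
        # Squared Mahalanobis distance using the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        return 1.0 / (1.0 + d2)

    return depth


def dd_plot(x_sample, y_sample):
    """Depth of every pooled point with respect to each of the two samples."""
    depth_f = mahalanobis_depth(x_sample)
    depth_g = mahalanobis_depth(y_sample)
    return [(depth_f(p), depth_g(p)) for p in x_sample + y_sample]


def fit_line(points):
    """Ordinary least squares fit; returns (intercept, slope, r_squared)."""
    us = [u for u, _ in points]
    vs = [v for _, v in points]
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    suv = sum((u - mu) * (v - mv) for u, v in points)
    slope = suv / suu
    return mv - slope * mu, slope, suv ** 2 / (suu * svv)


def correlated_normal(n, rng):
    """Draw n points from a bivariate normal with correlation 0.8."""
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out.append((z1, 0.8 * z1 + 0.6 * z2))
    return out


rng = random.Random(7)
real = correlated_normal(2000, rng)
synthetic = correlated_normal(2000, rng)  # stand-in for generated data
intercept, slope, r2 = fit_line(dd_plot(real, synthetic))
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}, R^2 = {r2:.4f}")
```

An intercept near zero, a slope near one and an $R^2$ near one correspond to the DD-plot lying on the diagonal from $(0,0)$ to $(1,1)$.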

Increasing the number of bins yields **SD** that more closely reflects the dependency structure of the raw data. Conversely, using a low number of bins fails to capture this structure accurately. However, a very high number of partitions can produce synthetic data with less variability.

## 4. Conclusions

## Abbreviations

Abbreviation | Meaning |
---|---|
DD-plot | Data-depth plot |
GAN | Generative adversarial networks |
ML | Machine learning |
RD | Real data |
SD | Synthetic data |
SMOTE | Synthetic minority oversampling technique |

## References

1. Liang, Y.; Nobakht, B.; Lindsay, G. The application of synthetic data generation and data-driven modelling in the development of a fraud detection system for fuel bunkering. Meas. Sens. **2021**, 18, 100225.
2. Dilmegani, C. What is Synthetic Data? What Are Its Use Cases & Benefits? 2023. Available online: https://research.aimultiple.com/synthetic-data/ (accessed on 1 January 2023).
3. Rankin, D.; Black, M.; Bond, R.; Wallace, J.; Mulvenna, M.; Epelde, G. Reliability of Supervised Machine Learning Using Synthetic Data in Health Care: Model to Preserve Privacy for Data Sharing. JMIR Med. Inform. **2020**, 8, e18910.
4. Yale, A.; Dash, S.; Dutta, R.; Guyon, I.; Pavao, A.; Bennett, K.P. Generation and evaluation of privacy preserving synthetic health data. Neurocomputing **2020**, 416, 244–255.
5. Yoon, J.; Drumright, L.N.; van der Schaar, M. Anonymization Through Data Synthesis Using Generative Adversarial Networks (ADS-GAN). IEEE J. Biomed. Health Inform. **2020**, 24, 2378–2388.
6. Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. **2018**, 91, 464–471.
7. Ahmed, J.; Green II, R.C. Predicting severely imbalanced data disk drive failures with machine learning models. Mach. Learn. Appl. **2022**, 9, 100361.
8. Moreno-Barea, F.J.; Franco, L.; Elizondo, D.; Grootveld, M. Application of data augmentation techniques towards metabolomics. Comput. Biol. Med. **2022**, 148, 105916.
9. Temraz, M.; Keane, M.T. Solving the class imbalance problem using a counterfactual method for data augmentation. Mach. Learn. Appl. **2022**, 9, 100375.
10. Lashgari, E.; Liang, D.; Maoz, U. Data augmentation for deep-learning-based electroencephalography. J. Neurosci. Methods **2020**, 346, 108885.
11. Porcu, S.; Floris, A.; Atzori, L. Evaluation of Data Augmentation Techniques for Facial Expression Recognition Systems. Electronics **2020**, 9, 1892.
12. Andreini, P.; Ciano, G.; Bonechi, S.; Graziani, C.; Lachi, V.; Mecocci, A.; Sodi, A.; Scarselli, F.; Bianchini, M. A Two-Stage GAN for High-Resolution Retinal Image Generation and Segmentation. Electronics **2021**, 11, 60.
13. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. **2002**, 16, 321–357.
14. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks: An overview. IEEE Signal Process. Mag. **2018**, 35, 53–65.
15. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. Commun. ACM **2020**, 63, 139–144.
16. Poudevigne-Durance, T.; Jones, O.D.; Qin, Y. MaWGAN: A Generative Adversarial Network to Create Synthetic Data from Datasets with Missing Data. Electronics **2022**, 11, 837.
17. Sklar, A. Fonctions de Répartition à n Dimensions et Leurs Marges. Publ. L’Institut Stat. L’Université Paris **1959**, 8, 229–231.
18. Patki, N.; Wedge, R.; Veeramachaneni, K. The Synthetic Data Vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; pp. 399–410.
19. Nejad, M.M.; Erdogan, S.; Cirillo, C. A statistical approach to small area synthetic population generation as a basis for carless evacuation planning. J. Transp. Geogr. **2021**, 90, 102902.
20. Li, Z.; Zhao, Y.; Fu, J. SynC: A Copula based Framework for Generating Synthetic Data from Aggregated Sources. In Proceedings of the 2020 International Conference on Data Mining Workshops (ICDMW), Sorrento, Italy, 17–20 November 2020; Volume 2020, pp. 571–578.
21. Benali, F.; Bodénès, D.; Labroche, N.; de Runz, C. MTCopula: Synthetic Complex Data Generation Using Copula. In Proceedings of the 23rd International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data (DOLAP), Nicosia, Cyprus, 23 March 2021; Volume 2840, pp. 51–60.
22. Endres, M.; Mannarapotta Venugopal, A.; Tran, T.S. Synthetic Data Generation: A Comparative Study. In Proceedings of the International Database Engineered Applications Symposium, Budapest, Hungary, 22–24 August 2022; ACM: New York, NY, USA, 2022; pp. 94–102.
23. Reiter, J.P. Using CART to generate partially synthetic, public use microdata. J. Off. Stat. **2005**, 21, 441–462.
24. Ping, H.; Stoyanovich, J.; Howe, B. DataSynthesizer: Privacy-Preserving Synthetic Datasets. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, 27–29 June 2017; ACM: New York, NY, USA, 2017; pp. 1–5.
25. Wang, Q.; Gao, J.; Lin, W.; Yuan, Y. Pixel-Wise Crowd Understanding via Synthetic Data. Int. J. Comput. Vis. **2021**, 129, 225–245.
26. Boikov, A.; Payor, V.; Savelev, R.; Kolesnikov, A. Synthetic Data Generation for Steel Defect Detection and Classification Using Deep Learning. Symmetry **2021**, 13, 1176.
27. Shamsolmoali, P.; Zareapoor, M.; Zhou, H.; Wang, R.; Yang, J. Road Segmentation for Remote Sensing Images Using Adversarial Spatial Pyramid Networks. IEEE Trans. Geosci. Remote Sens. **2021**, 59, 4673–4688.
28. Farajzadeh-Zanjani, M.; Hallaji, E.; Razavi-Far, R.; Saif, M.; Parvania, M. Adversarial Semi-Supervised Learning for Diagnosing Faults and Attacks in Power Grids. IEEE Trans. Smart Grid **2021**, 12, 3468–3478.
29. Hernandez, M.; Epelde, G.; Beristain, A.; Álvarez, R.; Molina, C.; Larrea, X.; Alberdi, A.; Timoleon, M.; Bamidis, P.; Konstantinidis, E. Incorporation of Synthetic Data Generation Techniques within a Controlled Data Processing Workflow in the Health and Wellbeing Domain. Electronics **2022**, 11, 812.
30. Gonzalez-Abril, L.; Angulo, C.; Ortega, J.A.; Lopez-Guerra, J.L. Statistical Validation of Synthetic Data for Lung Cancer Patients Generated by Using Generative Adversarial Networks. Electronics **2022**, 11, 3277.
31. Dankar, F.K.; Ibrahim, M.K.; Ismail, L. A Multi-Dimensional Evaluation of Synthetic Data Generators. IEEE Access **2022**, 10, 11147–11158.
32. Hernandez, M.; Epelde, G.; Alberdi, A.; Cilla, R.; Rankin, D. Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions. Methods Inf. Med. **2023**.
33. Matejka, J.; Fitzmaurice, G. Same Stats, Different Graphs. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017; ACM: New York, NY, USA, 2017; Volume 2017, pp. 1290–1294.
34. Matejka, J.; Fitzmaurice, G. Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, Denver, CO, USA, 6–11 May 2017.
35. Nelsen, R.B. An Introduction to Copulas; Springer Series in Statistics; Springer: New York, NY, USA, 2006.
36. Liu, R.Y.; Parelius, J.M.; Singh, K. Multivariate analysis by data depth: Descriptive statistics, graphics and inference, (with discussion and a rejoinder by Liu and Singh). Ann. Stat. **1999**, 27, 783–858.
37. Wasserman, L. All of Nonparametric Statistics; Springer Texts in Statistics; Springer New York: New York, NY, USA, 2006.
38. Cortez, P.; Cerdeira, A.; Almeida, F.; Matos, T.; Reis, J. Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst. **2009**, 47, 547–553.
39. Hollander, M.; Wolfe, D.A.; Chicken, E. Density Estimation. In Nonparametric Statistical Methods; John Wiley & Sons: New York, NY, USA, 2015; pp. 609–628.
40. Silverman, B. Density Estimation for Statistics and Data Analysis; Routledge: New York, NY, USA, 2017; Available online: https://www.taylorfrancis.com/books/mono/10.1201/9781315140919/density-estimation-statistics-data-analysis-bernard-silverman (accessed on 1 January 2023).

**Table 1.** Population parameters of a standard normal distribution and their point estimates considering the raw data and synthetic data generated from those raw data.

Standard Normal | ${\mathit{P}}_{25}$ | Mean | Std | Median | ${\mathit{P}}_{75}$ |
---|---|---|---|---|---|
Theoretical Value | −0.6745 | 0 | 1 | 0 | 0.6745 |
Raw Data | −0.6504 | −0.0023 | 0.9998 | −0.0186 | 0.6368 |
Synthetic Data | −0.6425 | −0.0028 | 1.0054 | −0.0119 | 0.6502 |

Standard Normal | ${\mathit{P}}_{25}$ | Mean | Std | Median | ${\mathit{P}}_{75}$ |
---|---|---|---|---|---|
Mean Value | −0.6522 | −0.0027 | 1.0026 | −0.0135 | 0.6389 |
Confidence Interval | $[-0.6814, -0.6207]$ | $[-0.0121, 0.0062]$ | $[0.9935, 1.0126]$ | $[-0.0485, 0.0220]$ | $[0.6038, 0.6706]$ |

**Table 3.** Population parameters of an exponential distribution and their point estimates considering the raw data and synthetic data generated from those raw data.

Exponential | ${\mathit{P}}_{25}$ | Mean | Std | Median | ${\mathit{P}}_{75}$ |
---|---|---|---|---|---|
Theoretical Value | 1.4384 | 5 | 5 | 3.4657 | 6.9315 |
Raw Data | 1.9618 | 5.1615 | 4.6496 | 3.8424 | 7.1929 |
Synthetic Data | 1.7346 | 5.2074 | 4.5679 | 4.0704 | 7.5514 |
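
The "Theoretical Value" rows of Tables 1 and 3 can be reproduced from closed-form quantile functions. The short check below assumes, as the Mean and Std columns indicate, an exponential distribution with mean 5, whose quantile function is $-5\ln(1-q)$; the helper name `exponential_quantile` is introduced here for illustration.

```python
import math
from statistics import NormalDist


def exponential_quantile(q, mean):
    """Quantile of an exponential distribution with the given mean."""
    return -mean * math.log(1.0 - q)


# Standard normal quartiles (Table 1).
normal_p25 = NormalDist(0.0, 1.0).inv_cdf(0.25)
normal_p75 = NormalDist(0.0, 1.0).inv_cdf(0.75)

# Exponential quartiles and median for mean 5 (Table 3).
exp_p25 = exponential_quantile(0.25, 5.0)
exp_median = exponential_quantile(0.50, 5.0)
exp_p75 = exponential_quantile(0.75, 5.0)

print(round(normal_p25, 4), round(normal_p75, 4))
print(round(exp_p25, 4), round(exp_median, 4), round(exp_p75, 4))
```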

Exponential | ${\mathit{P}}_{25}$ | Mean | Std | Median | ${\mathit{P}}_{75}$ |
---|---|---|---|---|---|
Mean Value | 1.8482 | 5.2006 | 4.5643 | 3.9670 | 7.4234 |
Confidence Interval | $[1.7104, 1.9759]$ | $[5.1474, 5.2500]$ | $[4.5147, 4.6143]$ | $[3.8131, 4.1276]$ | $[7.1650, 7.6706]$ |

**Table 5.** Summary statistics for selected plots of Figure 7.

Bins | Variable | Mean | Std | ${\mathit{P}}_{25}$ | ${\mathit{P}}_{50}$ | ${\mathit{P}}_{75}$ |
---|---|---|---|---|---|---|
bins = 5 | X | 54.9 | 17.9 | 40.9 | 53.9 | 66.8 |
bins = 5 | Y | 46.3 | 26.9 | 24.3 | 44.7 | 65.8 |
bins = 100 | X | 54.9 | 16.6 | 45.5 | 52.9 | 64.5 |
bins = 100 | Y | 46.6 | 28.0 | 23.5 | 41.5 | 71.3 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Restrepo, J.P.; Rivera, J.C.; Laniado, H.; Osorio, P.; Becerra, O.A. Nonparametric Generation of Synthetic Data Using Copulas. *Electronics* **2023**, *12*, 1601.
https://doi.org/10.3390/electronics12071601
