# Variational Autoencoders for Data Augmentation in Clinical Studies


## Abstract


## Featured Application

**Variational autoencoders, which are a type of neural network, are introduced in this study as a means to virtually increase the sample size of clinical studies and reduce costs, time, dropouts, and ethical concerns. The efficiency of variational autoencoders in data augmentation is demonstrated through simulations of several scenarios.**


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Strategy of the Analysis

- (a) Perform the clinical study using a limited number of volunteers;
- (b) Using the results from "a", apply a VAE in the next step in order to create virtual subjects and increase the statistical power.

- i. Create N virtual subjects (e.g., N = 100) using Monte Carlo simulations. This is considered the "original" dataset.
- ii. Set the average endpoint value equal to 100 units and conduct sampling assuming a log-normal distribution [4]. Several levels of variability (e.g., 10%, 20%, 40%) are used for the random creation of virtual subjects.
- iii. Assume two treatments: Test (T) and Reference (R), as in the case of bioequivalence studies. Several levels of the T/R ratio are explored.
- iv. Use these virtual subjects to form a clinical trial; for the purposes of this work, a parallel clinical design was used. Half of the subjects are considered to receive one treatment (e.g., T) and the other half the other (e.g., R).
- v. Draw a random sample from the original dataset (steps "i" and "ii") to create the sub-sample.
- vi. Apply a VAE to the sub-sample created in the previous step (i.e., "v"). This leads to the creation of the "generated sample of subjects".
- vii. Record the success or failure of the study separately for the "original dataset", the "sub-sample", and the "generated dataset".
- viii. Repeat steps "i"–"vii" many times (e.g., 500) to obtain robust estimates of the percentage of acceptance (i.e., % success) for each of the three datasets.
- ix. Compare the performances obtained in step "viii".
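The data-generation and subsampling steps above (i–v) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name, the 50% subsample fraction, and the parameterization of the log-normal sampler are all assumptions.

```python
import math
import random

def simulate_trial(n=100, mean=100.0, cv=0.20, t_over_r=1.0, frac=0.5, seed=0):
    """Sketch of steps i-v: create the 'original' dataset for the reference (R)
    and test (T) arms from a log-normal distribution, then draw a random
    sub-sample from each arm. Names and defaults are assumptions."""
    rng = random.Random(seed)

    def lognormal(arith_mean, cv, size):
        # Choose log-scale parameters so the arithmetic mean and CV match.
        sigma2 = math.log(1.0 + cv ** 2)
        mu = math.log(arith_mean) - sigma2 / 2.0
        return [rng.lognormvariate(mu, math.sqrt(sigma2)) for _ in range(size)]

    half = n // 2                                   # parallel design: 50/50 split
    r_group = lognormal(mean, cv, half)             # Reference arm (steps i-ii)
    t_group = lognormal(mean * t_over_r, cv, half)  # Test arm (step iii)
    r_sub = rng.sample(r_group, int(frac * half))   # step v: random sub-sample
    t_sub = rng.sample(t_group, int(frac * half))
    return r_group, t_group, r_sub, t_sub
```

Step vi (training the VAE on the sub-sample) is deliberately omitted here; it is described in Sections 2.3–2.5.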

#### 2.2. Neural Networks—Autoencoders

#### 2.3. Variational Autoencoders

#### 2.4. Tuning of Hyperparameters


#### 2.5. Monte Carlo Simulations

Initially, the original dataset of the reference (R) group was randomly generated from a log-normal distribution with a mean of μ_R (i.e., the average endpoint value) and a standard deviation of σ_R. Then, a random subsampling procedure was performed on the original R group, whereby a proportion (gradually decreasing from 90% to 10% with a step of 10%) of the data, termed the "subsample size", was selected from the distribution. Subsequently, the subsample was utilized to train the VAE model, followed by sampling from the inferred latent distribution and generating a total of 100 virtual subjects for the R group. The same procedure was repeated for the test (T) group of subjects; in this case, random generation was based on a mean endpoint value of μ_T and a standard deviation of σ_T. The aforementioned procedure is schematically shown in Figure 2.
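The acceptance criterion used to score each simulated study is not spelled out in this section; the sketch below assumes the standard bioequivalence rule from the cited EMA/FDA guidelines [5,6], i.e., the 90% confidence interval of the T/R geometric mean ratio must lie entirely within 80–125%. A normal quantile stands in for the t-quantile to keep the example dependency-free, so this is an approximation, not the authors' exact procedure.

```python
import math
from statistics import NormalDist, mean, stdev

def accepts_equivalence(test, ref, lo=0.80, hi=1.25, alpha=0.05):
    """Judge study 'success' as in standard bioequivalence testing: accept if
    the 90% CI of the test/reference geometric mean ratio lies within 80-125%.
    Illustrative sketch; a normal quantile replaces the t-quantile."""
    log_t = [math.log(x) for x in test]
    log_r = [math.log(x) for x in ref]
    diff = mean(log_t) - mean(log_r)       # log of the geometric mean ratio
    se = math.sqrt(stdev(log_t) ** 2 / len(log_t)
                   + stdev(log_r) ** 2 / len(log_r))
    z = NormalDist().inv_cdf(1.0 - alpha)  # two one-sided 5% tests -> 90% CI
    ci_lo = math.exp(diff - z * se)
    ci_hi = math.exp(diff + z * se)
    return lo <= ci_lo and ci_hi <= hi
```

Applying this check to the original, subsampled, and generated datasets in each Monte Carlo repetition yields the percentages of acceptance compared in the Results.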

## 3. Results

## 4. Discussion

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Sakpal, T.V. Sample size estimation in clinical trial. Perspect. Clin. Res. **2010**, 1, 67–69.
2. Wang, X.; Ji, X. Sample Size Estimation in Clinical Research: From Randomized Controlled Trials to Observational Studies. Chest **2020**, 158, S12–S20.
3. Malone, H.E.; Nicholl, H.; Coyne, I. Fundamentals of estimating sample size. Nurse Res. **2016**, 23, 21–25.
4. Karalis, V. Modeling and Simulation in Bioequivalence. In Modeling in Biopharmaceutics, Pharmacokinetics and Pharmacodynamics: Homogeneous and Heterogeneous Approaches, 2nd ed.; Iliadis, A., Macheras, P., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 227–255.
5. European Medicines Agency; Committee for Medicinal Products for Human Use (CHMP). Guideline on the Investigation of Bioequivalence; CPMP/EWP/QWP/1401/98 Rev. 1/Corr**; CHMP: London, UK, 20 January 2010. Available online: https://www.ema.europa.eu/en/documents/scientific-guideline/guideline-investigation-bioequivalence-rev1_en.pdf (accessed on 29 May 2023).
6. Food and Drug Administration (FDA). Guidance for Industry: Bioavailability and Bioequivalence Studies Submitted in NDAs or INDs—General Considerations. Draft Guidance; U.S. Department of Health and Human Services, FDA, Center for Drug Evaluation and Research (CDER): December 2013. Available online: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/bioavailability-and-bioequivalence-studies-submitted-ndas-or-inds-general-considerations (accessed on 29 May 2023).
7. Askin, S.; Burkhalter, D.; Calado, G.; El Dakrouni, S. Artificial Intelligence Applied to clinical trials: Opportunities and challenges. Health Technol. **2023**, 13, 203–213.
8. Harrer, S.; Shah, P.; Antony, B.; Hu, J. Artificial Intelligence for Clinical Trial Design. Trends Pharmacol. Sci. **2019**, 40, 577–591.
9. Delso, G.; Cirillo, D.; Kaggie, J.D.; Valencia, A.; Metser, U.; Veit-Haibach, P. How to Design AI-Driven Clinical Trials in Nuclear Medicine. Semin. Nucl. Med. **2021**, 51, 112–119.
10. The Alan Turing Institute. Statistical Machine Learning for Randomised Clinical Trials (MRC CTU). Available online: https://www.turing.ac.uk/research/research-projects/statistical-machine-learning-randomised-clinical-trials-mrc-ctu (accessed on 29 May 2023).
11. Chollet, F. Deep Learning with Python, 2nd ed.; Manning; Simon and Schuster: New York, NY, USA, 2021.
12. Atienza, R. Advanced Deep Learning with Keras: Apply Deep Learning Techniques, Autoencoders, GANs, Variational Autoencoders, Deep Reinforcement Learning, Policy Gradients, and More; Packt Publishing: Birmingham, UK, 2018.
13. Kingma, D.; Welling, M. An Introduction to Variational Autoencoders (Foundations and Trends in Machine Learning); Now Publishers Inc.: Hanover, MA, USA, 2019.
14. Henderson, H. Artificial Intelligence: Mirrors for the Mind (Milestones in Discovery and Invention), 1st ed.; Chelsea House Pub: Singapore, 2007; 176p.
15. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson: London, UK, 2021; 1136p.
16. Yang, Y.; Ye, Z.; Su, Y.; Zhao, Q.; Li, X.; Ouyang, D. Deep learning for in vitro prediction of pharmaceutical formulations. Acta Pharm. Sin. B **2019**, 9, 177–185.
17. Goceri, E. Medical image data augmentation: Techniques, comparisons and interpretations. Artif. Intell. Rev. **2023**, 1–45.
18. Kebaili, A.; Lapuyade-Lahorgue, J.; Ruan, S. Deep Learning Approaches for Data Augmentation in Medical Imaging: A Review. J. Imaging **2023**, 9, 81.
19. Pleouras, D.; Sakellarios, A.; Rigas, G.; Karanasiou, G.S.; Tsompou, P.; Karanasiou, G.; Kigka, V.; Kyriakidis, S.; Pezoulas, V.; Gois, G.; et al. A Novel Approach to Generate a Virtual Population of Human Coronary Arteries for In Silico Clinical Trials of Stent Design. IEEE Open J. Eng. Med. Biol. **2021**, 20, 201–209.
20. Khan, A.R.; Khan, S.; Harouni, M.; Abbasi, R.; Iqbal, S.; Mehmood, Z. Brain tumor segmentation using K-means clustering and deep learning with synthetic data augmentation for classification. Microsc. Res. Tech. **2021**, 84, 1389–1399.
21. Maqsood, S.; Damaševičius, R.; Maskeliūnas, R. Hemorrhage Detection Based on 3D CNN Deep Learning Framework and Feature Fusion for Evaluating Retinal Abnormality in Diabetic Patients. Sensors **2021**, 21, 3865.
22. Chen, X.; Lian, C.; Wang, L.; Deng, H.; Kuang, T.; Fung, S.H.; Gateno, J.; Shen, D.; Xia, J.J.; Yap, P.T. Diverse data augmentation for learning image segmentation with cross-modality annotations. Med. Image Anal. **2021**, 71, 102060.
23. Barile, B.; Marzullo, A.; Stamile, C.; Durand-Dubief, F.; Sappey-Marinier, D. Data augmentation using generative adversarial neural networks on brain structural connectivity in multiple sclerosis. Comput. Methods Programs Biomed. **2021**, 206, 106113.
24. Athalye, C.; Arnaout, R. Domain-guided data augmentation for deep learning on medical imaging. PLoS ONE **2023**, 18, e0282532.
25. Pesteie, M.; Abolmaesumi, P.; Rohling, R.N. Adaptive Augmentation of Medical Data Using Independently Conditional Variational Auto-Encoders. IEEE Trans. Med. Imaging **2019**, 38, 2807–2820.
26. Chadebec, C.; Thibeau-Sutre, E.; Burgos, N.; Allassonniere, S. Data Augmentation in High Dimensional Low Sample Size Setting Using a Geometry-Based Variational Autoencoder. IEEE Trans. Pattern Anal. Mach. Intell. **2023**, 45, 2879–2896.
27. Lopez-Martin, M.; Carro, B.; Sanchez-Esguevillas, A.; Lloret, J. Conditional Variational Autoencoder for Prediction and Feature Recovery Applied to Intrusion Detection in IoT. Sensors **2017**, 17, 1967.
28. Yao, R.; Bekhor, S. A Variational Autoencoder Approach for Choice Set Generation and Implicit Perception of Alternatives in Choice Modeling. Transp. Res. Part B Methodol. **2022**, 158, 273–294.
29. Mak, H.W.L.; Han, R.; Yin, H.H.F. Application of Variational AutoEncoder (VAE) Model and Image Processing Approaches in Game Design. Sensors **2023**, 23, 3457.
30. Staffini, A.; Svensson, T.; Chung, U.; Svensson, A.K. A Disentangled VAE-BiLSTM Model for Heart Rate Anomaly Detection. Bioengineering **2023**, 10, 683.

**Figure 1.** Visual representation of a variational autoencoder. The process of encoding involves compressing data from their original space to a latent space, while the decoding process involves decompressing the data. The methodology involves the utilization of neural networks as both an encoder and a decoder, with the aim of acquiring an optimal encoding–decoding scheme through an iterative optimization process. Variational autoencoders aim to establish a mapping between the input data and a probability distribution across the latent space.
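The mapping to a probability distribution described in the caption is typically implemented with the reparameterization trick: the encoder outputs a mean and a log-variance, and latent samples are drawn as z = μ + σ·ε with ε ~ N(0, 1), so the stochastic draw remains differentiable with respect to the encoder outputs. A minimal, illustrative sketch (using the one-dimensional latent space employed in this study; the function name is an assumption):

```python
import math
import random

def sample_latent(mu, log_var, rng=None):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).
    Keeping the randomness in eps makes z differentiable w.r.t. (mu, log_var)."""
    rng = rng or random.Random(0)
    eps = rng.gauss(0.0, 1.0)
    return mu + math.exp(0.5 * log_var) * eps
```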

**Figure 2.** Schematic representation of the analysis strategy in this study. Initially, two datasets were randomly generated for the test (T) and reference (R) groups. Then followed subsampling to draw parts of the original population. Finally, the variational autoencoder was applied to the subsampled data in order to produce the generated datasets. The aim of the generated datasets was to exhibit the same properties as the original data. In this study, comparisons were made among the three datasets (original vs. subsampled vs. generated), as well as between the T and R groups of all datasets.

**Figure 3.** Distribution of the generated data for both the R and T groups using the “softplus” (**a**) and linear (**b**) activation functions for the output layer.

**Figure 4.** Distribution of the generated data for both the R and T groups using variational autoencoders with 100 (**a**), 500 (**b**), 1000 (**c**), 5000 (**d**), and 10,000 (**e**) epochs.

**Figure 5.** Distribution of the generated data for both the R and T groups using variational autoencoders with 2 (**a**), 3 (**b**), and 4 (**c**) hidden layers for the encoder and the decoder.

**Figure 6.** Probability of accepting equivalence between the original and the generated datasets for three levels of variability (CV): (**a**) 10%, (**b**) 20%, and (**c**) 40%. The results are shown separately for the test and reference groups, as well as for the two types of activation functions (“softplus” and linear) used for the hidden layers.

**Figure 7.** Probability of accepting equivalence between the test and reference groups for the original (**a**), subsampled (**b**), and generated (**c**) datasets. Three levels of variability (coefficient of variation, CV) were used: 10%, 20%, and 40%. In all cases, the “softplus” activation was used for the hidden layers, while both the test and reference groups were assumed to exhibit identical average performances.

**Figure 8.** Probability of accepting equivalence between the test and reference groups for several ratios (1, 1.10, 1.25, 1.50) of the average test (T)/reference (R) performance. The comparisons were made separately for the original (**a**), subsampled (**b**), and generated (**c**) datasets produced by the variational autoencoder. In all cases, the “softplus” activation function was used for the hidden layers, and two levels of variability (coefficient of variation, CV) were used: 10% and 20%.

**Table 1.** Hyperparameter tuning during the development of the variational autoencoders. In all cases, the latent space dimension was equal to 1.

| Number of Epochs | Activation Function (Hidden Layers) | Activation Function (Output Layer) | Loss Weight (KL Part) | Loss Weight (Reconstruction Part) | Hidden Layers (Encoder) | Hidden Layers (Decoder) | Neurons in Hidden Layers (Encoder) | Neurons in Hidden Layers (Decoder) |
|---|---|---|---|---|---|---|---|---|
| 100 | softplus | softplus | 1 | 1 | 2 | 2 | 32-16 | 16-32 |
| 500 | linear | linear | | | 3 | 3 | 64-32-16 | 16-32-64 |
| 1000 | | | | | 4 | 4 | 128-64-32-16 | 16-32-64-128 |
| 5000 | | | | | | | | |
| 10,000 | | | | | | | | |
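Table 1 tunes, among other settings, the relative weights of the two parts of the VAE loss. For a Gaussian approximate posterior N(μ, σ²) and a standard normal prior, the KL part has the closed form 0.5(μ² + σ² − 1 − ln σ²). The sketch below shows the weighted objective, assuming a mean-squared-error reconstruction term (the exact reconstruction loss is not stated in this excerpt) and the one-dimensional latent space noted in the caption; the function name and defaults are illustrative.

```python
import math

def vae_loss(x, x_hat, mu, log_var, w_kl=1.0, w_rec=1.0):
    """Weighted VAE objective: w_rec * reconstruction error (MSE assumed here)
    plus w_kl * KL divergence between N(mu, sigma^2) and the N(0, 1) prior,
    using the closed form for a one-dimensional latent space."""
    rec = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    kl = 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)
    return w_rec * rec + w_kl * kl
```

Increasing `w_kl` pulls the latent distribution toward the prior (smoother sampling, blurrier reconstructions), while increasing `w_rec` favors reconstruction fidelity; the table's grid explores this trade-off.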


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Papadopoulos, D.; Karalis, V.D.
Variational Autoencoders for Data Augmentation in Clinical Studies. *Appl. Sci.* **2023**, *13*, 8793.
https://doi.org/10.3390/app13158793
