1. Introduction
In big data analysis, the major tasks are based on text and image data [1,2]. Most machine learning research on generative modeling has focused on image data, and there has been comparatively little research on generative models for continuous data analysis. In general, text data preprocessed from documents are numerical, so generative models for continuous data need to be studied. In this paper, we focus on developing a generative model for text data analysis. In text data analysis, we have to preprocess the text documents using text mining techniques [3]. The text documents are transformed into a document–keyword matrix as structured data for statistical analysis [4]. The matrix consists of documents and keywords in its rows and columns, respectively, and each matrix element represents the frequency of a keyword occurring in a document. Most of the elements are zero because the number of keywords is much larger than the number of documents [3,4]. This is the zero-inflated sparsity problem that arises in text data analysis [4], and it deteriorates the performance of text data analysis [4]. So, we have to overcome the zero-inflated problem to improve the performance of statistical models.
Most previous research on zero-inflated data analysis focused on developing new analytical methods for the zero-inflated data itself; these methods have analytical limitations because the data still suffer from the zero-inflated problem. Unlike the previous studies, we do not develop new methods to analyze the zero-inflated data directly. The goal of our research is to solve the zero-inflated problem by using a generative model to learn the zero-inflated data, generate synthetic data that describe the original data well, and analyze the synthetic data. In this paper, we use the generative adversarial network (GAN) as the generative model to generate synthetic data from the original data. So far, the GAN has mainly been used as a generative model for image data; here, we study how to use the GAN to generate synthetic data for numerical data containing many zeros.
In the next section, we illustrate zero-inflated data analysis. The proposed method is presented in Section 3, and our experimental results are given in Section 4, where we also discuss theoretical and practical issues related to the proposed method and the experimental results. In Section 5, we discuss our research findings and their contribution relative to related works. The conclusions of our research and possible future works are described in Section 6.
2. Zero-Inflated Data Analysis
To date, various studies addressing the zero-inflated problem have been performed in diverse fields such as statistics and machine learning [5,6,7,8]. Park and Jun (2023) used a compound Poisson model to solve the zero-inflated problem in patent data analysis [4]. They separated the given data into zero and non-zero parts and applied compound Poisson and gamma distributions, showing better performance than traditional models such as the generalized linear model and the zero-inflated Poisson model. Lu et al. (2014) and Neelon and Chung (2017) considered Bayesian inference for analyzing zero-inflated count data [5,6]. Using data augmentation by Markov chain Monte Carlo (MCMC), Lu et al. (2014) applied the Bayesian architecture of prior, likelihood, and posterior to the zero-inflated Poisson model [5]. Neelon and Chung (2017) developed a Bayesian latent factor zero-inflated Poisson model for the analysis of zero-inflated count data [6]; they also used MCMC data augmentation for Bayesian inference. In statistics, many methods related to zero inflation have been provided [7,9,10,11]. Hilbe (2014) introduced the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models to deal with the sparsity of zero inflation [9,10]. In machine learning, Sidumo et al. (2023) used random forests, support vector machines, K-nearest neighbors, and artificial neural networks for zero-inflated data analysis [7].
Zero-inflated data are data that contain an excessive number of zero values. In general, this zero excess is a major cause of deteriorating model performance, and many methods have been studied to solve this problem [9,10,11]. A basic approach to zero-inflated data analysis is to separate the given data into a zero part and a non-zero part as follows [9,11]:

$$P(X = 0) = \pi + (1 - \pi)P(0), \qquad P(X = x) = (1 - \pi)P(x), \quad x > 0 \tag{1}$$
This model is defined by two parts, $P(X = 0)$ and $P(X > 0)$; $P(X < 0)$ is not considered because all values of $X$ are greater than or equal to zero. In Equation (1), $P(x)$ is a probability density function and $\pi$ is the proportion of zeros in the given (preprocessed) data. Using this equation, we can reduce the influence of zeros on the entire data set, because the weight of $P(x)$ decreases as $\pi$ increases. Depending on the probability distribution chosen for $P(x)$, we can build various zero-inflated models: a model with a Poisson distribution is ZIP, and a model with a negative binomial distribution is ZINB. In ZINB regression, we assume two data-generating processes [8].
The zero values follow a Bernoulli distribution with success probability $\pi$, and the count values follow a negative binomial distribution with success probability $r/(r + \mu)$ and mean $\mu$, where success means that a zero value is observed. Zero values can be generated by both processes, so we estimate the probability of zero by combining the probabilities of the two. First, $X$ is zero with probability $\pi$. Also, $X$ takes the value $x$ with probability $1 - \pi$ and is distributed as negative binomial with $\mu$ and $r$, where $r$ is the over-dispersion parameter. Therefore, the ZINB distribution for the zero value is defined as Equation (2) [8,10]:

$$P(X = 0) = \pi + (1 - \pi)\left(\frac{r}{r + \mu}\right)^{r} \tag{2}$$
The parameters of this distribution are $\mu$ (mean) and $r$ (over-dispersion parameter). Equation (2) is the zero part of Equation (1) under ZINB. Next, Equation (3) gives the non-zero part of Equation (1) under the ZINB model [8]:

$$P(X = x) = (1 - \pi)\,\frac{\Gamma(x + r)}{\Gamma(r)\,x!}\left(\frac{r}{r + \mu}\right)^{r}\left(\frac{\mu}{r + \mu}\right)^{x}, \quad x = 1, 2, \ldots \tag{3}$$
We can predict the count values for the count component using this equation. Using Equations (2) and (3), we can analyze the zero-inflated data. The zero-inflated data analysis with ZIP proceeds in the same way as with ZINB.
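As an illustration of Equations (1)–(3), the following is a minimal sketch that evaluates the ZINB probability mass using SciPy's negative binomial in the mean/dispersion parameterization described above; the function name and parameter values are ours and purely illustrative.

```python
# A sketch of the ZINB probabilities in Equations (2) and (3), using SciPy's
# negative binomial with n = r and success probability p = r / (r + mu);
# the parameter values below are illustrative only.
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(x, pi, mu, r):
    """P(X = x) under ZINB with zero proportion pi, mean mu, dispersion r."""
    p = r / (r + mu)                  # negative binomial success probability
    nb = nbinom.pmf(x, r, p)          # negative binomial component
    return np.where(x == 0, pi + (1 - pi) * nb, (1 - pi) * nb)

x = np.arange(6)
print(zinb_pmf(x, pi=0.6, mu=2.0, r=1.5))  # probability mass piles up at zero
```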
3. Proposed Method
In this paper, we propose a new method to solve the sparsity problem in text data analysis. The sparsity problem occurs because the proportion of zeros in the preprocessed text document data is too large, which degrades the results of statistical data analysis. Using synthetic data generated from the original data, we try to overcome the zero-inflated problem in the preprocessed text data; we call the preprocessed data the original data. There are many generative models for generating synthetic data [12,13,14,15,16,17,18,19,20,21,22]. Among them, we use the GAN, because this model has a discriminator as well as a generator, which suits sparse data with zero inflation.
Figure 1 shows the entire procedure for the proposed method.
In Figure 1, the text documents come from various data sources such as the Internet, social network services, sensor data, relational databases, and patents. The text data are preprocessed into structured data such as the document–keyword matrix using text mining techniques. The preprocessing is carried out using the following process; a minimal sketch follows the steps.
(Step 1) Collecting patent documents.
(Step 2) Patent text data import and export.
(2-1) Corpus: collection of patent documents.
(2-2) Transformation: stemming, stop-word removal, etc.
(Step 3) Create document–keyword matrix.
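The steps above can be sketched as follows, using scikit-learn's CountVectorizer; the three short "patent abstracts" are hypothetical placeholders for the real corpus, and stemming is omitted for brevity.

```python
# A minimal sketch of Steps 1-3, assuming scikit-learn; the documents are
# hypothetical stand-ins for the collected patent corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "augmented reality display device for a wearable headset",
    "virtual reality rendering method and display system",
    "haptic feedback device for extended reality interaction",
]

vectorizer = CountVectorizer(stop_words="english")  # stop-word removal (Step 2-2)
dtm = vectorizer.fit_transform(docs)                # document-keyword matrix (Step 3)

print(vectorizer.get_feature_names_out())  # keyword columns
print(dtm.toarray())                       # most entries are zero (zero inflation)
```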
Figure 2 shows the document–keyword matrix preprocessed from patent documents related to extended reality technology.
The matrix consists of documents and keywords in its rows and columns, respectively, and each element represents the frequency of a keyword occurring in a document. In general, a significant number of elements have a frequency value of zero, because the number of keywords is much larger than the number of documents in the document–keyword matrix. Therefore, the matrix has a very sparse structure with zero inflation, and the statistical analysis results for this matrix suffer from the zero-inflation problem. So, we propose a method that generates synthetic data from the original data (the document–keyword matrix) using a generative model based on the GAN. The GAN is composed of two networks, a generator and a discriminator. The generator creates a new document–keyword matrix from the probability distribution estimated from the original data, the preprocessed document–keyword matrix. The discriminator determines how similar the synthetic data produced by the generator are to the original data. Based on the cost function, the generator and discriminator keep competing until a predetermined quality of synthetic data is reached. The data generated from the GAN model, finally built through repeated learning and updating, resolve the sparsity problem of zero excess. Finally, statistical analysis is performed using these data.
Figure 3 represents the synthetic data generated from the original data by the GAN.
In Figure 3, we can see that the zero-inflated problem in the document–keyword matrix has been solved. Next, we describe the proposed method step by step in detail. First, we show our generative model based on the GAN for synthetic data generation from the original document–keyword matrix. The document–keyword matrix is represented by the following data:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \tag{4}$$
In Equation (4), $p$ and $n$ are the numbers of keywords and documents, respectively. Also, $x_{ij}$ is the frequency value of the $j$th keyword occurring in the $i$th document, and each $X_j$ is a column vector. For learning the generative model, we select one keyword and use it as the target variable in the GAN model; the target variable is also used as the response variable in our statistical model. So, the data of Equation (4) are rearranged into the following data architecture:

$$\left(X_1, X_2, \ldots, X_{p-1}, Y\right) \tag{5}$$
In Equation (5), $X_1, X_2, \ldots, X_{p-1}$ are the input variables and $Y$ is the output variable of our model. In general, we choose $Y$ as the keyword with the largest frequency value. Therefore, the data of Equation (5) come from the preprocessed document–keyword matrix and are used as the original data in the GAN, as sketched below.
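The rearrangement in Equation (5) amounts to splitting one keyword column off as the response; a minimal sketch, using random Poisson counts as a hypothetical stand-in for the preprocessed matrix, is:

```python
# A minimal sketch of the rearrangement in Equation (5); the random Poisson
# counts are illustrative placeholders for the document-keyword matrix.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(100, 20))  # 100 documents x 20 keywords, mostly zeros

target = counts.sum(axis=0).argmax()       # keyword with the largest frequency -> Y
Y = counts[:, target]                      # response variable
X = np.delete(counts, target, axis=1)      # remaining p - 1 input variables
print(X.shape, Y.shape)                    # (100, 19) (100,)
```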
Figure 4 shows our GAN model for synthetic data generation.
From the noise, the generator makes the synthetic data, i.e., a new document–keyword matrix. The generator builds a model $G(z)$ that can generate the new document–keyword matrix. In the discriminator, the synthetic data are compared with the original data, the preprocessed document–keyword matrix. The discriminator predicts $Y$ from the given $X$ in the data of Equation (5) and estimates $D(x)$. The generator and discriminator play a minimax game against each other: the discriminator wants to maximize the reward $V(D, G)$ as follows [23]:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{6}$$
In Equation (6), $D$ and $G$ are the discriminator and generator, respectively. $D(x)$ and $G(z)$ are the outputs of the discriminator and generator, $p_{data}$ is the probability distribution of the original data, and $p_{z}$ is the probability distribution of the noise, such as the standard normal distribution. Contrary to the discriminator, the generator tries to minimize $V(D, G)$. The goal of this game is to reach a state where the distribution of the original data and the distribution of the synthetic data are the same. At this time, the output of the discriminator $D(x)$ is 0.5 for all $x$; that is, the discriminator cannot distinguish between the original data and the synthetic data, so the quality of the synthetic data is almost the same as that of the original data. Finally, the synthetic data reduce the zero-inflated problem in the document–keyword matrix preprocessed from the original data.
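To make the adversarial training concrete, the following is a minimal sketch of a GAN for tabular document–keyword data, written in PyTorch. The layer sizes, optimizer settings, Softplus output (keeping generated frequencies non-negative), and the placeholder batch are our illustrative assumptions, not the authors' exact configuration.

```python
# A minimal GAN sketch for the architecture of Figure 4 and Equation (6).
import torch
import torch.nn as nn

n_features, noise_dim = 50, 16  # hypothetical column count and noise size

generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features), nn.Softplus(),   # non-negative synthetic frequencies
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),             # D(x): probability x is original
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    m = real.size(0)
    # Discriminator: maximize log D(x) + log(1 - D(G(z))) from Equation (6).
    fake = generator(torch.randn(m, noise_dim)).detach()
    d_loss = (bce(discriminator(real), torch.ones(m, 1))
              + bce(discriminator(fake), torch.zeros(m, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: minimize log(1 - D(G(z))), via the usual non-saturating form.
    g_loss = bce(discriminator(generator(torch.randn(m, noise_dim))),
                 torch.ones(m, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

real_batch = torch.randn(32, n_features).clamp(min=0)  # placeholder count batch
print(train_step(real_batch))
```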
Using this synthetic data, we perform statistical analysis with the following model [24,25]:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \epsilon \tag{7}$$
This equation is the linear regression model used in our experiments on practical and simulated data sets. In Equation (7), $Y$ and $X_1, \ldots, X_{p-1}$ are the response and explanatory variables, $\beta_0, \beta_1, \ldots, \beta_{p-1}$ are the model parameters, and $\epsilon$ is the error term, distributed as $N(0, \sigma^2)$. In the experiments in this paper, we use the statistical model of Equation (7) to compare the performance of the original data and the synthetic data. We use five measures, the prediction sum of squares (PRESS), R-squared, log-likelihood, Akaike information criterion (AIC), and Bayesian information criterion (BIC), for performance evaluation between the original and synthetic data in statistical analysis [9,10,11,24,25]. The PRESS statistic has been used for model validation based on the quality of prediction. The statistic is defined as follows [25]:

$$\text{PRESS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^{2} \tag{8}$$
In Equation (8), $y_i$ is the observation withheld from model building, and $\hat{y}_{(i)}$ represents its predicted value from the model fitted without it. That is, for $n$ data points, $(n - 1)$ are used to estimate the model parameters, and the remaining one is used to evaluate the prediction. Therefore, the smaller the value of the PRESS statistic, the better the performance of the model; a minimal computation is sketched below.
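For ordinary least squares, PRESS can be computed without refitting the model $n$ times by using the standard leave-one-out identity $y_i - \hat{y}_{(i)} = e_i/(1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal of the hat matrix; the sketch below uses synthetic placeholder data.

```python
# A minimal sketch of Equation (8) via the leave-one-out identity
# y_i - yhat_(i) = e_i / (1 - h_ii), so the model is fitted only once.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # design matrix
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=100)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # fitted coefficients
resid = y - X @ beta                          # ordinary residuals e_i
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
press = np.sum((resid / (1 - np.diag(H))) ** 2)
print(press)                                  # smaller PRESS = better prediction
```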
The second measure of model evaluation is R-squared, defined as Equation (9) [25,26]:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} \tag{9}$$
where $y_i$ and $\hat{y}_i$ are the real and predicted values, respectively, and $\bar{y}$ is the mean of all $y$ values. R-squared, which has a value between 0 and 1, indicates the explanatory power of the constructed model; the larger this value, the higher the explanatory power, as illustrated below.
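A minimal numeric check of Equation (9); the observed and predicted values are illustrative placeholders.

```python
# Computing R-squared directly from Equation (9).
import numpy as np

y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])       # observed values
y_hat = np.array([1.8, 0.4, 2.7, 1.1, 3.9])   # model predictions
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)                                     # closer to 1 = more explanatory power
```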
We also consider the log-likelihood function to infer the parameters of the statistical model. For the parameter vector $\beta$, the likelihood is defined as Equation (10) [26]:

$$L(\beta) = \prod_{i=1}^{n} f\left(y_i \mid x_i; \beta\right) \tag{10}$$
The likelihood is the joint density function of $Y$ given $X$ for $\beta$. In general, for the convenience of mathematical calculation, we use the log-likelihood, the logarithm of the likelihood, in Equation (11) [26]:

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \log f\left(y_i \mid x_i; \beta\right) \tag{11}$$
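For the Gaussian linear model of Equation (7), the log-likelihood of Equation (11) takes the closed form sketched below; the data and error variance are illustrative placeholders.

```python
# A minimal sketch of Equations (10) and (11) under y_i ~ N(y_hat_i, sigma2).
import numpy as np

def gaussian_log_likelihood(y, y_hat, sigma2):
    """Equation (11): sum of log N(y_i | y_hat_i, sigma2) densities."""
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - np.sum((y - y_hat) ** 2) / (2 * sigma2))

y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])
y_hat = np.array([1.8, 0.4, 2.7, 1.1, 3.9])
print(gaussian_log_likelihood(y, y_hat, sigma2=0.25))  # larger = better fit
```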
We can judge that the larger the log-likelihood, the better the performance of the model. Next, we use the AIC to compare the model performance between the original and synthetic data. The AIC is represented by Equation (12) [24,26]:

$$\text{AIC} = -2\,\ell(\hat{\beta}) + 2p \tag{12}$$
In Equation (12), $p$ is the number of variables in the statistical model. The AIC is based on the log-likelihood function for evaluating the performance of statistical models [24,25,26,27]. The better the performance of the model, the smaller the AIC value. We consider the BIC as our final measure for model comparison, defined as Equation (13) [24,26]:

$$\text{BIC} = -2\,\ell(\hat{\beta}) + p\,\log n \tag{13}$$
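Equations (12) and (13) follow directly from the log-likelihood; the sketch below uses illustrative values (e.g., a log-likelihood from the Gaussian sketch above).

```python
# A minimal sketch of Equations (12) and (13); loglik, p, and n are
# illustrative inputs, not values from the paper's experiments.
import numpy as np

def aic(loglik, p):
    """Equation (12): smaller AIC indicates a better model."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Equation (13): like AIC, but the penalty grows with data size n."""
    return -2.0 * loglik + p * np.log(n)

print(aic(-120.5, p=4))
print(bic(-120.5, p=4, n=100))
```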
Compared to AIC, BIC is a model evaluation measure that reflects the data size $n$ as a penalty; that is, as $n$ increases, the penalty term of BIC also increases. Like AIC, a smaller BIC indicates a better model. Finally, we show the step-by-step process of our method, from the given original data to the performance evaluation of the statistical model on the original and synthetic data, in Figure 5.
The original data with the zero-inflated problem are transformed into synthetic data without the problem using generative modeling based on the GAN, so we can overcome the sparsity of zero inflation and fit statistical models such as linear regression. To compare the original and proposed synthetic data, we evaluate model performance using five measures: PRESS, R-squared, log-likelihood, AIC, and BIC. In the next section, a performance comparison experiment on the statistical analysis results of the original and synthetic data is performed using these five evaluation measures.
5. Discussion
The GAN was trained on the original data and estimated the mean and variance of a normal distribution to represent the given data. Next, the generator of the GAN constructed the synthetic data by generating samples from the estimated normal distribution, while the discriminator performed a discrimination task so that the samples became similar to the original data. In this process, the zero values are replaced with very small non-zero numbers generated from the normal distribution, and the zero-inflated problem is solved. For this reason, we chose the GAN as the generative model for settling the zero-inflated problem in this paper. In contrast to existing research, which developed methods to analyze the original zero-inflated data directly, we generated synthetic data from the original data with zero inflation and analyzed the synthetic data. This is because there are limits to analyzing data with the zero-inflated problem as is. So, we used generative modeling to change the zero values into very small continuous values through the GAN. The GAN is a representative model for generating synthetic data from original data; in particular, it is an efficient generative model for zero-inflated data because it uses random noise that follows a normal distribution. In our experimental results, we showed the improved performance of the proposed method using five measures for statistical model evaluation: PRESS, R-squared, log-likelihood, AIC, and BIC. We verified that the synthetic data are better than the original data on all evaluation measures. The proposed method can be applied to various big data analyses, because zero values appear in most big data, such as sensor data or bio-data. In this paper, the evaluation measures focus on count data, such as the frequency values of keywords occurring in documents, and they may not be suitable for other kinds of big data. So, we need to develop new measures to evaluate the performance difference between synthetic and original data. We will also expand the analysis methods for zero-inflated data from statistics to machine learning in our future work.
6. Conclusions
We proposed a method to overcome the zero-inflated problem in text data analysis. This problem causes the deterioration of model performance in statistical analysis, because numerous zero values dominate the entire data set. To solve it, we proposed a generative model using the GAN and applied the GAN model to analyze text data with the sparsity of zero inflation. For the statistical analysis of text data, we preprocessed the text data and built a document–keyword matrix as structured data. Each element of this matrix is the frequency of a keyword occurring in a document. In general, the matrix is very sparse, because most of the elements have the value zero. Therefore, we generated synthetic data similar to the original data using the GAN, a representative generative model, to solve this sparsity problem. In this process, the zero values were replaced with very small non-negative numbers. To show the validity of our research, we performed statistical analysis on both the original and the synthetic data and compared their performance using five representative model evaluation measures: PRESS, R-squared, log-likelihood, AIC, and BIC. For this evaluation, we used two data sets, patent documents and simulated data. We found that the performance of the synthetic data was better than that of the original data in both experiments, verifying the improved performance of the proposed method.
In our future work, we will study more advanced models based on generative and statistical modeling. For example, we will consider the Bayesian GAN, a combined model consisting of Bayesian learning and generative modeling [34], as an approach to overcoming the zero-inflated problem. The main finding of our research is how to change the zero values of text data into very small non-zero numbers using a generative model based on the GAN. Therefore, this paper is expected to contribute to solving the zero-inflation problem that occurs in the analysis of various big data, including text data. For example, the zero-inflated sparsity occurring in sensor data from the Internet of Things can be overcome by using the method proposed in this paper. Also, in the academic field, our research can serve as a starting point for studies on solving the zero-inflated problem using synthetic data, because most existing studies analyzed the original zero-inflated data directly. In the future, we expect the emergence of various new studies that solve the problem of zero-inflated sparsity through the generation of synthetic data using new generative models.