1. Introduction
In big data analysis, the major tasks are based on text and image data [1,2]. Most machine learning research on generative modeling has focused on image data, and there has been comparatively little research on generative models for continuous data analysis. In general, text data preprocessed from documents are numerical, so generative models for continuous data need to be studied. In this paper, we focus on developing a generative model for text data analysis. In text data analysis, we have to preprocess the text documents using text mining techniques [3]. The text documents are transformed into a document–keyword matrix as structured data for statistical analysis [4]. The matrix consists of documents and keywords in its rows and columns, respectively, and each matrix element represents the frequency of a keyword occurring in a document. Most of the elements are zero because the number of keywords is much larger than the number of documents [3,4]. This is the zero-inflated sparsity problem that arises in text data analysis [4], and it deteriorates the performance of text data analysis [4]. So, we have to overcome the zero-inflated problem to improve the performance of statistical models.
Most previous research on zero-inflated data analysis focused on developing new analytical methods for the zero-inflated data itself; these methods have analytical limitations because the data still suffer from the zero-inflated problem. Unlike the previous studies, we do not develop new methods to analyze the zero-inflated data directly. The goal of our research is to solve the zero-inflated problem by using a generative model to learn the zero-inflated data, generate synthetic data that describe the original data well, and analyze the synthetic data. In this paper, we use the generative adversarial network (GAN) as the generative model to generate synthetic data from the original data. So far, the GAN has mainly been used as a generative model for image data; here, we study how to use the GAN to generate synthetic data for numerical data containing many zeros.
In the next section, we illustrate zero-inflated data analysis. The proposed method is presented in Section 3, and our experimental results are given in Section 4, where we also discuss theoretical and practical issues related to the proposed method and the experimental results. In Section 5, we discuss our research findings and their contribution relative to related works. The conclusions of our research and possible future works are described in Section 6.
2. Zero-Inflated Data Analysis
To date, various studies addressing the zero-inflated problem have been performed in diverse fields such as statistics and machine learning [5,6,7,8]. Park and Jun (2023) used a compound Poisson model to solve the zero-inflated problem in patent data analysis [4]. They separated the given data into zero and non-zero parts and applied compound Poisson and gamma distributions, showing better performance than traditional models such as the generalized linear model and the zero-inflated Poisson model. Lu et al. (2014) and Neelon and Chung (2017) considered Bayesian inference for analyzing zero-inflated count data [5,6]. Using data augmentation by Markov chain Monte Carlo (MCMC), Lu et al. (2014) applied the Bayesian architecture of prior, likelihood, and posterior to the zero-inflated Poisson model [5]. Neelon and Chung (2017) developed a Bayesian latent factor zero-inflated Poisson model for the analysis of zero-inflated count data [6]; they also used MCMC data augmentation for Bayesian inference. In statistics, many methods related to zero inflation have been provided [7,9,10,11]. Hilbe (2014) introduced the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models to deal with the sparsity of zero inflation [9,10]. In machine learning, Sidumo et al. (2023) used random forests, support vector machines, K-nearest neighbors, and artificial neural networks for zero-inflated data analysis [7].
Zero-inflated data are data that contain an excessive number of zero values. In general, this zero excess is a major cause of deteriorating model performance, and many methods have been studied to solve this problem [9,10,11]. A basic approach to zero-inflated data analysis is to separate the given data into a zero part and a non-zero part as follows [9,11]:

$$P(X = 0) = \pi + (1 - \pi)P(0), \qquad P(X = x) = (1 - \pi)P(x), \quad x > 0 \tag{1}$$
This model is defined by two parts, $P(X = 0)$ and $P(X > 0)$; $P(X < 0)$ is not considered because all values of $X$ are greater than or equal to zero. In Equation (1), $P(x)$ is a probability density function and $\pi$ is the proportion of zeros in the given (preprocessed) data. Using this equation, we can reduce the influence of zeros on the entire data set, because the weight of $P(x)$ decreases as $\pi$ increases. Depending on the probability distribution chosen for $P(x)$, we can build various zero-inflated models: a model with a Poisson distribution is ZIP, and a model with a negative binomial distribution is ZINB. In ZINB regression, we assume two data-generating processes [8].
The zero values follow a Bernoulli distribution with success probability $\pi$, and the count values follow a negative binomial distribution with success probability $r/(r + \mu)$ and mean $\mu$, where success means that a zero value is observed. Zero values can be generated by both processes, so we estimate the probability of zero by combining the probabilities of the two. First, $X$ is zero with probability $\pi$. Also, $X$ takes the value $x$ with probability $1 - \pi$ and is distributed as negative binomial with $\mu$ and $r$, where $r$ is the over-dispersion parameter. Therefore, the ZINB distribution for the zero value is defined as Equation (2) [8,10]:

$$P(X = 0) = \pi + (1 - \pi)\left(\frac{r}{r + \mu}\right)^{r} \tag{2}$$
The parameters of this distribution are $\mu$ (mean) and $r$ (over-dispersion parameter). Equation (2) is the zero part of Equation (1) under ZINB. Next, Equation (3) gives the non-zero part of Equation (1) under the ZINB model [8]:

$$P(X = x) = (1 - \pi)\,\frac{\Gamma(x + r)}{\Gamma(r)\,x!}\left(\frac{r}{r + \mu}\right)^{r}\left(\frac{\mu}{r + \mu}\right)^{x}, \quad x = 1, 2, \ldots \tag{3}$$
We can predict the count values for the count component using this equation. Using Equations (2) and (3), we can analyze the zero-inflated data. The zero-inflated data analysis with ZIP proceeds in the same way as with ZINB.
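As an illustration of Equations (1)–(3), the following is a minimal sketch that evaluates the ZINB probability mass using SciPy's negative binomial in the mean/dispersion parameterization described above; the function name and parameter values are ours and purely illustrative.

```python
# A sketch of the ZINB probabilities in Equations (2) and (3), using SciPy's
# negative binomial with n = r and success probability p = r / (r + mu);
# the parameter values below are illustrative only.
import numpy as np
from scipy.stats import nbinom

def zinb_pmf(x, pi, mu, r):
    """P(X = x) under ZINB with zero proportion pi, mean mu, dispersion r."""
    p = r / (r + mu)                  # negative binomial success probability
    nb = nbinom.pmf(x, r, p)          # negative binomial component
    return np.where(x == 0, pi + (1 - pi) * nb, (1 - pi) * nb)

x = np.arange(6)
print(zinb_pmf(x, pi=0.6, mu=2.0, r=1.5))  # probability mass piles up at zero
```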
3. Proposed Method
In this paper, we propose a new method to solve the sparsity problem in text data analysis. The sparsity problem occurs because the proportion of zeros in the preprocessed text document data is too large, which degrades the results of statistical data analysis. Using synthetic data generated from the original data, we try to overcome the zero-inflated problem in the preprocessed text data; we call the preprocessed data the original data. There are many generative models for generating synthetic data [12,13,14,15,16,17,18,19,20,21,22]. Among them, we use the GAN, because this model has a discriminator as well as a generator, which suits sparse data with zero inflation.
Figure 1 shows the entire procedure for the proposed method.
In Figure 1, the text documents come from various data sources such as the Internet, social network services, sensor data, relational databases, and patents. The text data are preprocessed into structured data such as the document–keyword matrix using text mining techniques. The preprocessing is carried out using the following process; a minimal sketch follows the steps.
(Step 1) Collecting patent documents.
(Step 2) Patent text data import and export.
(2-1) Corpus: collection of patent documents.
(2-2) Transformation: stemming, stop-word removal, etc.
(Step 3) Create document–keyword matrix.
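The steps above can be sketched as follows, using scikit-learn's CountVectorizer; the three short "patent abstracts" are hypothetical placeholders for the real corpus, and stemming is omitted for brevity.

```python
# A minimal sketch of Steps 1-3, assuming scikit-learn; the documents are
# hypothetical stand-ins for the collected patent corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "augmented reality display device for a wearable headset",
    "virtual reality rendering method and display system",
    "haptic feedback device for extended reality interaction",
]

vectorizer = CountVectorizer(stop_words="english")  # stop-word removal (Step 2-2)
dtm = vectorizer.fit_transform(docs)                # document-keyword matrix (Step 3)

print(vectorizer.get_feature_names_out())  # keyword columns
print(dtm.toarray())                       # most entries are zero (zero inflation)
```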
Figure 2 shows the document–keyword matrix preprocessed from patent documents related to extended reality technology.
The matrix consists of documents and keywords in its rows and columns, respectively, and each element represents the frequency of a keyword occurring in a document. In general, a significant number of elements have a frequency value of zero, because the number of keywords is much larger than the number of documents in the document–keyword matrix. Therefore, the matrix has a very sparse structure with zero inflation, and the statistical analysis results for this matrix suffer from the zero-inflation problem. So, we propose a method that generates synthetic data from the original data (the document–keyword matrix) using a generative model based on the GAN. The GAN is composed of two networks, a generator and a discriminator. The generator creates a new document–keyword matrix from the probability distribution estimated from the original data, the preprocessed document–keyword matrix. The discriminator determines how similar the synthetic data produced by the generator are to the original data. Based on the cost function, the generator and discriminator keep competing until a predetermined quality of synthetic data is reached. The data generated from the GAN model, finally built through repeated learning and updating, resolve the sparsity problem of zero excess. Finally, statistical analysis is performed using these data.
Figure 3 represents the synthetic data generated from the original data by the GAN.
In Figure 3, we can see that the zero-inflated problem in the document–keyword matrix has been solved. Next, we describe the proposed method step by step in detail. First, we show our generative model based on the GAN for synthetic data generation from the original document–keyword matrix. The document–keyword matrix is represented by the following data:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \tag{4}$$
In Equation (4), $p$ and $n$ are the numbers of keywords and documents, respectively. Also, $x_{ij}$ is the frequency value of the $j$th keyword occurring in the $i$th document, and each $X_j$ is a column vector. For learning the generative model, we select one keyword and use it as the target variable in the GAN model; the target variable is also used as the response variable in our statistical model. So, the data of Equation (4) are rearranged into the following data architecture:

$$\left(X_1, X_2, \ldots, X_{p-1}, Y\right) \tag{5}$$
In Equation (5), $X_1, X_2, \ldots, X_{p-1}$ are the input variables and $Y$ is the output variable of our model. In general, we choose $Y$ as the keyword with the largest frequency value. Therefore, the data of Equation (5) come from the preprocessed document–keyword matrix and are used as the original data in the GAN, as sketched below.
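The rearrangement in Equation (5) amounts to splitting one keyword column off as the response; a minimal sketch, using random Poisson counts as a hypothetical stand-in for the preprocessed matrix, is:

```python
# A minimal sketch of the rearrangement in Equation (5); the random Poisson
# counts are illustrative placeholders for the document-keyword matrix.
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(0.3, size=(100, 20))  # 100 documents x 20 keywords, mostly zeros

target = counts.sum(axis=0).argmax()       # keyword with the largest frequency -> Y
Y = counts[:, target]                      # response variable
X = np.delete(counts, target, axis=1)      # remaining p - 1 input variables
print(X.shape, Y.shape)                    # (100, 19) (100,)
```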
Figure 4 shows our GAN model for synthetic data generation.
From the noise, the generator makes the synthetic data, i.e., a new document–keyword matrix. The generator builds a model $G(z)$ that can generate the new document–keyword matrix. In the discriminator, the synthetic data are compared with the original data, the preprocessed document–keyword matrix. The discriminator predicts $Y$ from the given $X$ in the data of Equation (5) and estimates $D(x)$. The generator and discriminator play a minimax game against each other: the discriminator wants to maximize the reward $V(D, G)$ as follows [23]:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right] \tag{6}$$
In Equation (6), $D$ and $G$ are the discriminator and generator, respectively. $D(x)$ and $G(z)$ are the outputs of the discriminator and generator, $p_{data}$ is the probability distribution of the original data, and $p_{z}$ is the probability distribution of the noise, such as the standard normal distribution. Contrary to the discriminator, the generator tries to minimize $V(D, G)$. The goal of this game is to reach a state where the distribution of the original data and the distribution of the synthetic data are the same. At this time, the output of the discriminator $D(x)$ is 0.5 for all $x$; that is, the discriminator cannot distinguish between the original data and the synthetic data, so the quality of the synthetic data is almost the same as that of the original data. Finally, the synthetic data reduce the zero-inflated problem in the document–keyword matrix preprocessed from the original data.
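To make the adversarial training concrete, the following is a minimal sketch of a GAN for tabular document–keyword data, written in PyTorch. The layer sizes, optimizer settings, Softplus output (keeping generated frequencies non-negative), and the placeholder batch are our illustrative assumptions, not the authors' exact configuration.

```python
# A minimal GAN sketch for the architecture of Figure 4 and Equation (6).
import torch
import torch.nn as nn

n_features, noise_dim = 50, 16  # hypothetical column count and noise size

generator = nn.Sequential(
    nn.Linear(noise_dim, 64), nn.ReLU(),
    nn.Linear(64, n_features), nn.Softplus(),   # non-negative synthetic frequencies
)
discriminator = nn.Sequential(
    nn.Linear(n_features, 64), nn.LeakyReLU(0.2),
    nn.Linear(64, 1), nn.Sigmoid(),             # D(x): probability x is original
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    m = real.size(0)
    # Discriminator: maximize log D(x) + log(1 - D(G(z))) from Equation (6).
    fake = generator(torch.randn(m, noise_dim)).detach()
    d_loss = (bce(discriminator(real), torch.ones(m, 1))
              + bce(discriminator(fake), torch.zeros(m, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: minimize log(1 - D(G(z))), via the usual non-saturating form.
    g_loss = bce(discriminator(generator(torch.randn(m, noise_dim))),
                 torch.ones(m, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

real_batch = torch.randn(32, n_features).clamp(min=0)  # placeholder count batch
print(train_step(real_batch))
```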
Using this synthetic data, we perform statistical analysis with the following model [24,25]:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{p-1} X_{p-1} + \epsilon \tag{7}$$
This equation is the linear regression model used in our experiments on practical and simulated data sets. In Equation (7), $Y$ and $X_1, \ldots, X_{p-1}$ are the response and explanatory variables, $\beta_0, \beta_1, \ldots, \beta_{p-1}$ are the model parameters, and $\epsilon$ is the error term, distributed as $N(0, \sigma^2)$. In the experiments in this paper, we use the statistical model of Equation (7) to compare the performance of the original data and the synthetic data. We use five measures, the prediction sum of squares (PRESS), R-squared, log-likelihood, Akaike information criterion (AIC), and Bayesian information criterion (BIC), for performance evaluation between the original and synthetic data in statistical analysis [9,10,11,24,25]. The PRESS statistic has been used for model validation based on the quality of prediction. The statistic is defined as follows [25]:

$$\text{PRESS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_{(i)}\right)^{2} \tag{8}$$
In Equation (8), $y_i$ is the observation withheld from model building, and $\hat{y}_{(i)}$ represents its predicted value from the model fitted without it. That is, for $n$ data points, $(n - 1)$ are used to estimate the model parameters, and the remaining one is used to evaluate the prediction. Therefore, the smaller the value of the PRESS statistic, the better the performance of the model; a minimal computation is sketched below.
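For ordinary least squares, PRESS can be computed without refitting the model $n$ times by using the standard leave-one-out identity $y_i - \hat{y}_{(i)} = e_i/(1 - h_{ii})$, where $h_{ii}$ is the $i$th diagonal of the hat matrix; the sketch below uses synthetic placeholder data.

```python
# A minimal sketch of Equation (8) via the leave-one-out identity
# y_i - yhat_(i) = e_i / (1 - h_ii), so the model is fitted only once.
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])  # design matrix
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(scale=0.5, size=100)

beta = np.linalg.lstsq(X, y, rcond=None)[0]   # fitted coefficients
resid = y - X @ beta                          # ordinary residuals e_i
H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
press = np.sum((resid / (1 - np.diag(H))) ** 2)
print(press)                                  # smaller PRESS = better prediction
```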
The second measure of model evaluation is R-squared, defined as Equation (9) [25,26]:

$$R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} \tag{9}$$
where $y_i$ and $\hat{y}_i$ are the real and predicted values, respectively, and $\bar{y}$ is the mean of all $y$ values. R-squared, which has a value between 0 and 1, indicates the explanatory power of the constructed model; the larger this value, the higher the explanatory power, as illustrated below.
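A minimal numeric check of Equation (9); the observed and predicted values are illustrative placeholders.

```python
# Computing R-squared directly from Equation (9).
import numpy as np

y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])       # observed values
y_hat = np.array([1.8, 0.4, 2.7, 1.1, 3.9])   # model predictions
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)                                     # closer to 1 = more explanatory power
```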
We also consider the log-likelihood function to infer the parameters of the statistical model. For the parameter vector $\beta$, the likelihood is defined as Equation (10) [26]:

$$L(\beta) = \prod_{i=1}^{n} f\left(y_i \mid x_i; \beta\right) \tag{10}$$
The likelihood is the joint density function of $Y$ given $X$ for $\beta$. In general, for the convenience of mathematical calculation, we use the log-likelihood, the logarithm of the likelihood, in Equation (11) [26]:

$$\ell(\beta) = \log L(\beta) = \sum_{i=1}^{n} \log f\left(y_i \mid x_i; \beta\right) \tag{11}$$
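For the Gaussian linear model of Equation (7), the log-likelihood of Equation (11) takes the closed form sketched below; the data and error variance are illustrative placeholders.

```python
# A minimal sketch of Equations (10) and (11) under y_i ~ N(y_hat_i, sigma2).
import numpy as np

def gaussian_log_likelihood(y, y_hat, sigma2):
    """Equation (11): sum of log N(y_i | y_hat_i, sigma2) densities."""
    n = len(y)
    return (-0.5 * n * np.log(2 * np.pi * sigma2)
            - np.sum((y - y_hat) ** 2) / (2 * sigma2))

y = np.array([2.0, 0.0, 3.0, 1.0, 4.0])
y_hat = np.array([1.8, 0.4, 2.7, 1.1, 3.9])
print(gaussian_log_likelihood(y, y_hat, sigma2=0.25))  # larger = better fit
```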
We can judge that the larger the log-likelihood, the better the performance of the model. Next, we use the AIC to compare the model performance between the original and synthetic data. The AIC is represented by Equation (12) [24,26]:

$$\text{AIC} = -2\,\ell(\hat{\beta}) + 2p \tag{12}$$
In Equation (12), $p$ is the number of variables in the statistical model. The AIC is based on the log-likelihood function for evaluating the performance of statistical models [24,25,26,27]. The better the performance of the model, the smaller the AIC value. We consider the BIC as our final measure for model comparison, defined as Equation (13) [24,26]:

$$\text{BIC} = -2\,\ell(\hat{\beta}) + p\,\log n \tag{13}$$
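Equations (12) and (13) follow directly from the log-likelihood; the sketch below uses illustrative values (e.g., a log-likelihood from the Gaussian sketch above).

```python
# A minimal sketch of Equations (12) and (13); loglik, p, and n are
# illustrative inputs, not values from the paper's experiments.
import numpy as np

def aic(loglik, p):
    """Equation (12): smaller AIC indicates a better model."""
    return -2.0 * loglik + 2.0 * p

def bic(loglik, p, n):
    """Equation (13): like AIC, but the penalty grows with data size n."""
    return -2.0 * loglik + p * np.log(n)

print(aic(-120.5, p=4))
print(bic(-120.5, p=4, n=100))
```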
Compared to AIC, BIC is a model evaluation measure that reflects the data size $n$ as a penalty; that is, as $n$ increases, the penalty term of BIC also increases. Like AIC, a smaller BIC indicates a better model. Finally, we show the step-by-step process of our method, from the given original data to the performance evaluation of the statistical model on the original and synthetic data, in Figure 5.
The original data with the zero-inflated problem are transformed into synthetic data without the problem using generative modeling based on the GAN, so we can overcome the sparsity of zero inflation and fit statistical models such as linear regression. To compare the original and proposed synthetic data, we evaluate model performance using five measures: PRESS, R-squared, log-likelihood, AIC, and BIC. In the next section, a performance comparison experiment on the statistical analysis results of the original and synthetic data is performed using these five evaluation measures.
5. Discussion
The GAN was trained on the original data and estimated the mean and variance of a normal distribution to represent the given data. Next, the generator of the GAN constructed the synthetic data by generating samples from the estimated normal distribution, while the discriminator performed a discrimination task so that the samples became similar to the original data. In this process, the zero values are replaced with very small non-zero numbers generated from the normal distribution, and the zero-inflated problem is solved. For this reason, we chose the GAN as the generative model for settling the zero-inflated problem in this paper. In contrast to existing research, which developed methods to analyze the original zero-inflated data directly, we generated synthetic data from the original data with zero inflation and analyzed the synthetic data. This is because there are limits to analyzing data with the zero-inflated problem as is. So, we used generative modeling to change the zero values into very small continuous values through the GAN. The GAN is a representative model for generating synthetic data from original data; in particular, it is an efficient generative model for zero-inflated data because it uses random noise that follows a normal distribution. In our experimental results, we showed the improved performance of the proposed method using five measures for statistical model evaluation: PRESS, R-squared, log-likelihood, AIC, and BIC. We verified that the synthetic data are better than the original data on all evaluation measures. The proposed method can be applied to various big data analyses, because zero values appear in most big data, such as sensor data or bio-data. In this paper, the evaluation measures focus on count data, such as the frequency values of keywords occurring in documents, and they may not be suitable for other kinds of big data. So, we need to develop new measures to evaluate the performance difference between synthetic and original data. We will also expand the analysis methods for zero-inflated data from statistics to machine learning in our future work.
6. Conclusions
We proposed a method to overcome the zero-inflated problem in text data analysis. This problem causes the deterioration of model performance in statistical analysis, because numerous zero values dominate the entire data set. To solve it, we proposed a generative model using the GAN and applied the GAN model to analyze text data with the sparsity of zero inflation. For the statistical analysis of text data, we preprocessed the text data and built a document–keyword matrix as structured data. Each element of this matrix is the frequency of a keyword occurring in a document. In general, the matrix is very sparse, because most of the elements have the value zero. Therefore, we generated synthetic data similar to the original data using the GAN, a representative generative model, to solve this sparsity problem. In this process, the zero values were replaced with very small non-negative numbers. To show the validity of our research, we performed statistical analysis on both the original and the synthetic data and compared their performance using five representative model evaluation measures: PRESS, R-squared, log-likelihood, AIC, and BIC. For this evaluation, we used two data sets, patent documents and simulated data. We found that the performance of the synthetic data was better than that of the original data in both experiments, verifying the improved performance of the proposed method.
In our future work, we will study more advanced models based on generative and statistical modeling. For example, we will consider the Bayesian GAN, a combined model consisting of Bayesian learning and generative modeling [34], as an approach to overcoming the zero-inflated problem. The main finding of our research is how to change the zero values of text data into very small non-zero numbers using a generative model based on the GAN. Therefore, this paper is expected to contribute to solving the zero-inflation problem that occurs in the analysis of various big data, including text data. For example, the zero-inflated sparsity occurring in sensor data from the Internet of Things can be overcome by using the method proposed in this paper. Also, in the academic field, our research can serve as a starting point for studies on solving the zero-inflated problem using synthetic data, because most existing studies analyzed the original zero-inflated data directly. In the future, we expect the emergence of various new studies that solve the problem of zero-inflated sparsity through the generation of synthetic data using new generative models.