Article

A Noise-Aware Multiple Imputation Algorithm for Missing Data

School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(1), 73; https://doi.org/10.3390/math11010073
Submission received: 8 November 2022 / Revised: 20 December 2022 / Accepted: 21 December 2022 / Published: 25 December 2022
(This article belongs to the Special Issue Numerical Analysis with Applications in Machine Learning)

Abstract

Missing data are a common and inevitable phenomenon. In practice, datasets usually contain noise introduced for various reasons, and most existing missing data imputation algorithms are affected by this noise, which reduces imputation accuracy. This paper proposes NPMI, a noise-aware multiple imputation algorithm for missing data in static datasets. First, different multiple imputation models are constructed according to the missing data mechanism. Second, a method is given to determine the imputation order when multiple variables are missing. Third, a random sample consensus algorithm is proposed to estimate the initial parameter values of the multiple imputation model, reducing the influence of noisy data and improving the algorithm's robustness. Experiments on two real datasets and two synthetic datasets verify the accuracy and efficiency of the proposed NPMI algorithm, and the results are analyzed.

1. Introduction

Data imputation is very important for practical applications [1,2,3,4]. However, data quality is not always high, because datasets often contain a substantial amount of noisy data in addition to missing data. The performance of many existing missing value imputation algorithms is degraded by noise. For example, under heavy noise, the regression models or neural networks used to impute missing data can easily overfit, resulting in significant error. Therefore, it is necessary to design an imputation algorithm that can guarantee imputation accuracy even in the presence of heavy noise.
In this paper, we propose a noise-aware multiple imputation algorithm that can fill in missing data with high quality, even for data containing a large amount of noise. To ensure imputation accuracy, different multiple imputation models must be constructed based on the missing data mechanism, and the imputation order of different missing variables must be determined in the multivariable case. The research faces the following three challenges:
Firstly, how can we construct different multiple imputation models according to the missing data mechanism? Most multiple imputation methods proposed to date assume that the data are missing at random and cannot impute data directly when the missing data mechanism is not considered.
Secondly, how can we determine the imputation order when multiple variables are missing? The imputation order has a significant influence on imputation accuracy.
Thirdly, how can we ensure the robustness of the multiple imputation algorithm? The algorithm should ensure high imputation accuracy in the case of a large amount of noise.
We design a noise-aware multiple imputation algorithm to solve the above three challenges.

2. Related Work

Many methods for imputing missing data have been proposed [5], and they can be divided into two categories. The first is based on statistics: simple methods include mean imputation [6] and class-mean imputation [7]. Class-mean imputation refines mean imputation by first grouping the attributes and then using the within-group mean to impute missing data. Other commonly used statistical imputation methods include linear regression and logistic regression [8,9,10]. Based on the correlation between attributes, regression imputation uses the complete data to build a model between the missing and complete variables and predicts the values of the missing variables. Another statistics-based method is multiple imputation, first proposed by Rubin in 1987, which fills in each missing datum with a series of possible values; it makes up for the defects of single imputation by accounting for the uncertainty of missing data.
The other category is based on machine learning: the problem of missing data imputation has gradually attracted attention in machine learning and data mining. Methods proposed so far include KNN [11,12,13], kernel methods [14,15], K-means [16,17], decision trees [18,19,20], regression [21], naive Bayes [22,23], Bayesian networks [24], and neural networks [25,26,27,28]. In neural network imputation, the attributes to be imputed are the output variables and related attributes are the input vectors; classification or fitting functions are constructed through training and can then be used for classification or regression. Some methods, such as frequent pattern mining and rule discovery, embed missing data imputation directly into the data mining algorithm so that the imputation can be better combined with the application.
It is notable that even though many missing value (MV) imputation methods have been proposed, there are few existing multiple imputation approaches. Studies have shown that, in most cases, multiple imputation effectively reflects the uncertainty of the data while retaining the major advantages of single imputation methods.
On the other hand, few MV imputation methods consider the existence of noise in the data. However, in many real-world datasets, noise may be embedded in the data alongside missing values. The works [29,30] investigate the impact of noise and propose imputation algorithms for noisy data. Nevertheless, they are not multiple imputation methods and thus suffer from the problems of single imputation mentioned above. Therefore, in this paper, we aim to propose an effective noise-aware multiple imputation method for noisy data.

3. Construct Multiple Imputation Model

3.1. Problem Definition

In order to describe the problem, some definitions are given.
Definition 1. Static Dataset.
A dataset in which data are fixed and do not have updates.
Definition 2. Missing Data.
Missing data, also known as missing values, are variables or features that have no recorded value in an observation. People generally use “unknown”, “missing”, or “blank” to represent missing data; a computer commonly marks missing data with “NA”, “NULL”, or other symbols.
Definition 3. Outlier.
A datum that is significantly different from the other data objects, as if it were produced by a different mechanism; that is, an individual value in a sample that deviates markedly from the rest of the observed values in that sample.
Definition 4. Problem Definition.
$Y$ is an $n \times p$ dimensional static dataset containing $p$ variables with missing data. The observed part of the dataset is denoted $Y_{obs}$ and the missing part $Y_{mis}$. The problem of missing data imputation on static datasets is then to estimate the missing values in $Y_{mis}$ based on the observed data $Y_{obs}$ by approximating the posterior distribution of $Y_{mis}$, i.e., $Y_{mis} \sim P(Y_{mis} \mid Y_{obs})$.
The missing data pattern and mechanism are essential for the imputation problem because different mechanisms call for different imputation methods. Research on missing data imputation therefore needs to build multiple imputation models and form different imputation algorithms according to the specific missing data pattern and mechanism. Most existing imputation methods assume that the data are missing at random (MAR) and are classified according to the missing data pattern; they cannot impute the missing data directly, and it is often unknown whether there is any correlation between the variables in the dataset, or in some cases there is none. Therefore, in this section, different imputation models are given for two settings: the missing at random (MAR) mechanism and the case in which the missing data mechanism is ignored.

3.2. Monte Carlo Markov Chain Method

Under the missing at random (MAR) mechanism, the Monte Carlo Markov chain (MCMC) method [31] is used to construct the multiple imputation model. MCMC is a collection of methods for exploring posterior distributions based on Bayesian theory. It obtains samples of a distribution by establishing a Markov chain whose stationary distribution is the target distribution, and it conducts subsequent statistical analysis on these samples. For imputation, MCMC forms a Markov chain that makes the joint distribution of the missing data and parameters converge, thereby simulating the posterior distribution of the missing data; imputation values are then drawn at random from this posterior distribution. In practice, the simulation of complex distributions is generally realized by data augmentation (DA). It is notable that even though Bayesian methods are powerful for computing the posterior distribution of data, they cannot be used directly on data with MVs. To tackle this issue, we adopt MCMC to make the estimated posterior distribution of the data converge to a stable state. Moreover, Bayesian methods usually depend on a strong assumption of independence among the attributes of the data, which is almost never satisfied in real-world applications; therefore, plain Bayesian methods struggle to achieve optimal performance.
The MCMC algorithm proceeds as follows: firstly, the EM method is used to estimate the initial parameters of the model, which serve as the initial values for DA. Then, the DA algorithm continuously alternates the imputation (I) and posterior (P) steps; when the chain is long enough, a Markov chain is formed. By the Markov property, the missing values obtained by DA converge to the distribution function of the missing values. Once the Markov chain with stationary distribution is established, samples are drawn from it to obtain posterior distribution samples for statistical analysis. The concrete steps of the MCMC algorithm for multiple imputation are given next.
Suppose $Y = (Y_{obs}, Y_{mis})$ is an $n \times p$ dimensional matrix containing missing data, where $Y_{obs} = (y_1^{(obs)}, y_2^{(obs)}, \ldots, y_m^{(obs)})$ is the observed data and $Y_{mis} = (y_1^{(mis)}, y_2^{(mis)}, \ldots, y_m^{(mis)})$ is the missing data. When the dataset $Y$ contains missing data, the posterior probability distribution of the parameters given the observed data, $p(\theta \mid Y_{obs})$, is difficult to simulate. However, if the missing data $Y_{mis}$ are filled in with their estimated values, the posterior probability distribution for the completed dataset, $p(\theta \mid Y_{obs}, Y_{mis})$, becomes easy to simulate. This section estimates the missing data under the assumption that $Y = (Y_{obs}, Y_{mis})$ follows a $p$-dimensional normal distribution with mean $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ and covariance $\Sigma = (\sigma_{jp})$. Graham et al. verified that even when the data are not multivariate normal, the imputation values obtained under the multivariate normality assumption are still very close to the real values. According to Bayesian theory, the posterior probability of the missing data given the observed data is
$$P(Y_{mis} \mid Y_{obs}) = \int f(Y_{mis} \mid Y_{obs}, \theta)\, k(\theta \mid Y_{obs})\, d\theta \qquad (1)$$
where $k(\theta \mid Y_{obs})$ denotes the posterior density function of the parameters given the observed data:
$$k(\theta \mid Y_{obs}) \propto \int L(\theta \mid Y)\, \pi(\theta)\, dY_{mis} \qquad (2)$$
In Equation (2), $\pi(\theta)$ is the prior probability of the parameters calculated from the observed data. Since Equation (1) cannot be evaluated directly by integration, the MCMC method is used to simulate the multiple imputation distribution $P(Y_{mis} \mid Y_{obs})$, and the $m$ completed datasets are then obtained by randomly sampling $Y_{mis}$. Before performing multiple imputation, the initial values of the mean vector $\mu$ and covariance matrix $\Sigma$ need to be estimated.
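To make the I-step/P-step alternation concrete, the following is a minimal Python sketch of DA-style imputation for a bivariate normal with missing values in one column. It is a simplification of the scheme above: the P-step redraws only the mean (under a flat prior) and keeps the covariance at its completed-data estimate, and the crude mean-fill start stands in for the EM initialization; all names are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def da_impute(Y, n_iter=200, n_draws=5):
    """Y: (n, 2) array; np.nan marks the missing entries in column 1."""
    miss = np.isnan(Y[:, 1])
    Y = Y.copy()
    Y[miss, 1] = np.nanmean(Y[:, 1])      # crude start (EM would refine this)
    mu, S = Y.mean(axis=0), np.cov(Y, rowvar=False)
    draws = []
    for t in range(n_iter):
        # I-step: draw Y_mis from the conditional normal P(Y_mis | Y_obs, theta).
        beta = S[0, 1] / S[0, 0]
        cond_mean = mu[1] + beta * (Y[miss, 0] - mu[0])
        cond_var = S[1, 1] - beta * S[0, 1]
        Y[miss, 1] = rng.normal(cond_mean, np.sqrt(cond_var))
        # P-step (simplified): update the covariance from the completed data and
        # draw the mean from its completed-data posterior under a flat prior.
        S = np.cov(Y, rowvar=False)
        mu = rng.multivariate_normal(Y.mean(axis=0), S / len(Y))
        if t >= n_iter - n_draws:         # keep the last few states as imputations
            draws.append(Y[miss, 1].copy())
    return draws                          # n_draws candidate imputations of Y_mis
```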

3.3. Sample-Based Regression Prediction Method

When it is impossible to determine whether there is a correlation between the variables in a dataset, or when there is clearly no correlation, the relationships between variables cannot be used to establish a multiple imputation model. In this case, the problem can be approached from the perspective of samples: similar samples have similar attribute values, so the value of a missing datum can be estimated from the observed values of the same attribute in similar samples. Therefore, when the missing data mechanism is negligible, the multiple imputation model can be established using relationships between similar samples.
We propose a method to establish the multiple imputation model according to the similarity between samples, namely, sample-based regression prediction (SRP). This method has three steps. Firstly, k-means clustering is performed on the samples, and the K-nearest neighbor (KNN) algorithm is used to obtain K complete nearest-neighbor samples for each sample with missing data in a cluster. Then, a regression model for the missing attributes is established using the K complete samples similar to the missing sample: the missing variables of the missing sample are the explained variables, the complete variables of the similar complete samples are the explanatory variables, and the initial parameters of the model are calculated by least squares using the other complete samples in the cluster. Finally, the regression model is used to construct the multiple imputation model, and imputation steps are given to obtain the M imputation values of the missing data. The detailed steps by which the SRP algorithm implements multiple imputation are given next.

3.3.1. K-Nearest Neighbor Sample Selection

The K-nearest neighbor algorithm finds the K nearest neighbors of a target sample through distance calculation and is widely used in various fields. In this section, the dataset is first clustered, and the KNN algorithm then selects K complete samples within each cluster; the K nearest neighbors and the missing sample in the same cluster are assumed to follow the same distribution. The Euclidean distance between two samples $X = (x_1, x_2, \ldots, x_p)$ and $Y = (y_1, y_2, \ldots, y_p)$ is defined as
$$d(X, Y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$$
where $x_i$ and $y_i$ are the $i$-th attribute values of samples $X$ and $Y$, respectively, and $p$ is the number of attributes. The Euclidean distance reflects the difference between samples: the greater the distance, the greater the difference between two samples, that is, the less similar their attribute values; conversely, the smaller the distance, the higher the similarity and the more similar the attribute values of the two samples. According to the Euclidean distance, the K complete samples most similar to a target sample containing missing data can be obtained.
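As an illustration, the following sketch selects the K most similar complete samples for a target sample by the Euclidean distance above, restricted to the attributes observed in the target; the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def k_nearest_complete(target, complete_samples, k):
    """target: (p,) with np.nan at missing attributes; complete_samples: (m, p)."""
    obs = ~np.isnan(target)                     # compare only on observed attributes
    diffs = complete_samples[:, obs] - target[obs]
    dists = np.sqrt((diffs ** 2).sum(axis=1))   # Euclidean distance as defined above
    return complete_samples[np.argsort(dists)[:k]]
```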

3.3.2. Imputation Process

The sample set is $N = (X_{mis}, X)$, where $X_{mis}$ denotes the sample containing missing data and $X = (X_1, X_2, \ldots, X_K)$ denotes the most similar samples without missing data. Each sample has $p$ attributes, of which $n$ have no missing values and $p - n$ have missing values. A regression model is then established:
$$X_{mis} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \varepsilon$$
The parameters $\beta$ and the residual variance $\sigma^2$ in the model need to be estimated. The initial values of the regression coefficients are estimated by least squares and denoted $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_K)$, with residual variance estimate $\hat{\sigma}^2$. The posterior distributions of $\beta$ and $\sigma^2$ can be derived from $\hat{\beta}$ and $\hat{\sigma}^2$; for $\sigma^2$, a $\chi^2$ distribution is constructed as follows:
$$\frac{(n-k-1)\,\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-k-1)$$
If $\sigma^2$ is given, the posterior distribution of $\beta$ is $\beta \mid \sigma^2 \sim N(\hat{\beta}, \sigma^2 (X'X)^{-1})$.
The steps to obtain each imputation value are as follows:
(1) New parameters $\beta^*$ and $\sigma^{*2}$ are drawn from the posterior distributions of the regression parameters $\beta$ and $\sigma^2$: a random variable $g$ is drawn from the $\chi^2$ distribution with $n-k-1$ degrees of freedom to obtain the new variance
$$\sigma^{*2} = \hat{\sigma}^2 (n-k-1)/g$$
Then, $K+1$ random numbers are drawn from a normal distribution with mean $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_K)$ and covariance $\sigma^{*2}(X'X)^{-1}$, denoted $\beta^* = (\beta^*_0, \beta^*_1, \ldots, \beta^*_K)$.
(2) For each missing value, the predicted value is
$$X_{mis} = \beta^*_0 + \beta^*_1 X_1 + \beta^*_2 X_2 + \cdots + \beta^*_K X_K$$
(3) Steps (1) and (2) above are repeated $M$ times to generate the $M$ imputation values of the missing data and obtain $M$ completed datasets. Each completed dataset is used to re-estimate the regression model parameters, yielding $M$ groups of regression coefficient estimates, denoted $\hat{\beta}_1 = (\hat{\beta}_{0,1}, \hat{\beta}_{1,1}, \ldots, \hat{\beta}_{K,1})$, $\hat{\beta}_2 = (\hat{\beta}_{0,2}, \hat{\beta}_{1,2}, \ldots, \hat{\beta}_{K,2})$, $\ldots$, $\hat{\beta}_m = (\hat{\beta}_{0,m}, \hat{\beta}_{1,m}, \ldots, \hat{\beta}_{K,m})$. Finally, the mean $\bar{\beta} = \sum_{i=1}^{m} \hat{\beta}_i / m$ is taken as the estimate of the regression model parameters, and the final imputation value is recalculated.
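The posterior draws in steps (1) and (2) can be sketched as follows, assuming $X$ is the $n \times (K+1)$ design matrix built from the K similar complete samples (intercept column included), $y$ holds the corresponding values of the missing attribute, and $x_{new}$ is the row to impute. This is a hedged sketch of the standard Bayesian linear regression draws described above, not the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

def srp_draw(X, y, x_new, M=5):
    """Return M posterior-predictive imputations for the row x_new (length K+1)."""
    n, k1 = X.shape                            # k1 = K + 1 coefficients, so dof = n - K - 1
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y               # least-squares estimate beta_hat
    resid = y - X @ beta_hat
    dof = n - k1
    sigma2_hat = resid @ resid / dof           # residual variance estimate
    draws = []
    for _ in range(M):
        g = rng.chisquare(dof)                 # step (1): sigma*^2 = sigma_hat^2 * dof / g
        sigma2_star = sigma2_hat * dof / g
        beta_star = rng.multivariate_normal(beta_hat, sigma2_star * XtX_inv)
        draws.append(float(x_new @ beta_star)) # step (2): predicted missing value
    return draws
```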

3.4. Reducing the Impact of Noisy Data

In practice, datasets inevitably contain noise for various reasons, and most missing data imputation algorithms are sensitive to it: imputation accuracy decreases as noise increases, and some imputation algorithms fail outright when the proportion of noise is large. Therefore, this section proposes using the random sample consensus (RANSAC) [32] algorithm to estimate the initial parameter values of the multiple imputation model, improving the robustness of the multiple imputation algorithm and preserving imputation accuracy even when the data contain a large amount of noise.
The input of the RANSAC algorithm is a set of samples $S$, a parameterized model $M$ that can be fitted to the observed data, and some additional parameter settings, such as the number of iterations. In general, the algorithm reaches a good global solution within a certain number of iterations. The RANSAC algorithm consists of the following five steps (a minimal sketch follows the list):
(1) Let the number of samples in the sample set $S$ satisfy $Num(S) > n$, and let $M$ denote the model the samples are assumed to satisfy. Select a subset $S'$ of $n$ samples from $S$ by random sampling ($n$ is the minimum number of samples needed to determine the parameters of model $M$), and initialize model $M$ with $S'$.
(2) Form the consensus set $S^*$ from $S'$ together with the samples in the remainder set $C = S \setminus S'$ whose error with respect to model $M$ is less than a threshold $t$; $S^*$ is the set of inliers of model $M$.
(3) If $Num(S^*) \geq N$ (where the threshold $N$ is the minimum number of consensus-set points for a model to be considered correct), recalculate the model parameters with $S^*$.
(4) If $Num(S^*) < N$, randomly select a new subset $S'$ and repeat the above steps.
(5) After a certain number of samplings, the largest consensus set $S^*$ obtained is selected and used to re-estimate the model. The algorithm ends.
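Below is a compact sketch of the five steps, instantiated for the simple task of robustly estimating a location parameter from noisy one-dimensional data (the paper applies RANSAC to the initial parameters of the imputation model; the threshold values and all names here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def ransac_mean(S, n, t, N, iters):
    """S: 1-D sample array; t: inlier threshold; N: minimum consensus size;
    n: random subset size; iters: number of sampling rounds."""
    best = np.empty(0)
    for _ in range(iters):
        subset = S[rng.choice(len(S), size=n, replace=False)]
        model = subset.mean()                  # step (1): fit the model to S'
        consensus = S[np.abs(S - model) < t]   # step (2): build the consensus set S*
        if len(consensus) >= N and len(consensus) > len(best):
            best = consensus                   # steps (3)-(4): keep or resample
    if len(best) == 0:
        return S.mean()                        # fallback: no consensus set was found
    return best.mean()                         # step (5): re-estimate on the largest S*
```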

4. Noise-Aware Multiple Imputation Algorithm

The algorithm presented in this section integrates the methods described above, aiming to achieve both accuracy and robustness for multiple imputation on static datasets. This multiple imputation algorithm, which constructs an appropriate multiple imputation model according to whether the missing data mechanism is considered, determines the imputation order of different missing variables in the multivariable case, and reduces the impact of noisy data, is called the noise-aware multiple imputation algorithm, abbreviated NPMI. The specific process of the NPMI algorithm is presented in Algorithm 1.
The whole process of the NPMI algorithm can be divided into the following four steps:
(1) Step M: Imputation model construction.
The key to a multiple imputation algorithm is how the imputation model is constructed: different imputation models yield different imputation algorithms, and an appropriate model reflects both the uncertainty of the missing data and the uncertainty caused by the missing data. The NPMI algorithm selects the multiple imputation model based on whether the missing data mechanism is considered. When the mechanism is missing at random (MAR), the Monte Carlo Markov chain (MCMC) method introduced in Section 3.2 is used to simulate the posterior distribution of the missing data, obtaining the distribution $P(Y_{mis}, \theta \mid Y_{obs})$ and thereby constructing the multiple imputation model. If the missing data mechanism is ignored, it is unknown whether the dataset's variables are correlated, and attribute correlations may not exist at all; in that case, we use the relationships between samples to establish the multiple imputation model, as described in Section 3.3, constructing the sample-based regression prediction (SRP) multiple imputation model.
(2) Step P: Parameter initialization.
In the traditional multiple imputation algorithm, the EM algorithm is used to estimate the initial parameters of the model, i.e., the mean vector $\mu = (\mu_1, \mu_2, \ldots, \mu_p)$ and covariance $\Sigma = (\sigma_{jp})$ of the $p$-dimensional normal distribution. However, the initial parameter values estimated by EM may be only locally optimal, and their accuracy decreases significantly when the dataset contains a lot of noise. Therefore, the NPMI algorithm adopts the random sample consensus (RANSAC) algorithm introduced in Section 3.4 to estimate the initial values of the parameters of the multiple imputation model determined in the M step, ensuring that relatively accurate initial parameter values are obtained even under heavy noise.
(3) Step S: Imputation sequence determination.
The NPMI algorithm treats imputed missing data as complete data, that is, the imputed values are used as observed values in subsequent imputation. Different imputation orders therefore produce different imputation results and greatly affect imputation accuracy. In this paper, we use mutual information to measure the dependency between each missing attribute and the complete attributes: the lower the mutual information of a missing attribute, the weaker its dependency on the complete attributes, and vice versa. Therefore, we first impute the attribute with the highest mutual information, then treat it as a new complete attribute for use in later imputations, repeating this process until all missing attributes are imputed.
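A sketch of this ordering rule follows, using a simple histogram-based mutual information estimate computed on the rows where each missing attribute is observed; the paper does not specify its MI estimator, so this choice and all names are assumptions.

```python
import numpy as np

def mutual_info(x, y, bins=10):
    """Histogram estimate of I(x; y) for two 1-D continuous arrays."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                               # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def imputation_order(Y, missing_cols, complete_cols):
    """Rank missing columns by their total MI with the complete columns."""
    scores = {}
    for j in missing_cols:
        obs = ~np.isnan(Y[:, j])               # rows where column j is observed
        scores[j] = sum(mutual_info(Y[obs, j], Y[obs, c]) for c in complete_cols)
    return sorted(scores, key=scores.get, reverse=True)  # highest MI imputed first
```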
(4) Step I: The missing data imputation.
When the M step adopts the Monte Carlo Markov chain method of Section 3.2 to construct the multiple imputation model, imputation step I obtains the imputation value of the missing data $Y_{mis}^{(t+1)}$ from the conditional distribution of the missing data given the parameters and observed variables, $P(Y_{mis} \mid Y_{obs}, \theta^{(t)})$, that is,
$$Y_{mis}^{(t+1)} \sim P(Y_{mis} \mid Y_{obs}, \theta^{(t)}).$$
A Markov chain converging to the target distribution is thus formed. Imputation values for the missing data are independently drawn from the posterior distribution of the missing data; the draw is repeated $M$ times, generating $M$ groups of imputation values together with point estimates of the population mean and variance. The $M$ completed datasets are then analyzed and combined to obtain the final population parameters and imputation values.
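The combination of the $M$ completed-dataset analyses is sketched below using the standard multiple imputation pooling rules attributed to Rubin; the paper states only that the $M$ datasets are analyzed and combined (and pools coefficients by their mean), so the variance pooling here is the conventional choice rather than a detail taken from the text.

```python
import numpy as np

def pool(est, var):
    """est[j], var[j]: point estimate and its variance from the j-th completed dataset."""
    est, var = np.asarray(est, float), np.asarray(var, float)
    m = len(est)
    q_bar = est.mean()                  # pooled point estimate (the paper's beta_bar)
    w = var.mean()                      # within-imputation variance
    b = est.var(ddof=1)                 # between-imputation variance
    total = w + (1 + 1 / m) * b         # total variance of the pooled estimate
    return q_bar, total
```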
When the M step adopts the sample-based regression prediction of Section 3.3 to construct the multiple imputation model, imputation step I obtains the imputation value of the missing data from the regression model established from the K similar complete samples. For each missing datum,
$$Y_{mis} = \beta^*_0 + \beta^*_1 X_1 + \beta^*_2 X_2 + \cdots + \beta^*_K X_K$$
The new regression parameters $\beta^*$ and residual variance $\sigma^{*2}$ are drawn from the corresponding posterior distributions. The draw is repeated to generate the imputation values of the missing data and obtain the imputed complete dataset. Each completed dataset is used to re-estimate the regression model parameters, yielding the group regression coefficient estimates
$$\hat{\beta}_1 = (\hat{\beta}_{0,1}, \hat{\beta}_{1,1}, \ldots, \hat{\beta}_{K,1}),\quad \hat{\beta}_2 = (\hat{\beta}_{0,2}, \hat{\beta}_{1,2}, \ldots, \hat{\beta}_{K,2}),\ \ldots,\ \hat{\beta}_m = (\hat{\beta}_{0,m}, \hat{\beta}_{1,m}, \ldots, \hat{\beta}_{K,m})$$
Finally, the mean $\bar{\beta} = \sum_{i=1}^{m} \hat{\beta}_i / m$ is taken as the estimate of the regression model parameters, and the final imputation value is recalculated.
Based on the above introduction, our algorithm has four steps. The first step constructs an appropriate imputation model based on the missing mechanism; its time complexity is $O(1)$. In the second step, the model parameters are initialized; because of the noise in the data, we adopt the RANSAC algorithm here. The time complexity of RANSAC is $O(k_R \cdot n)$, where $k_R$ is the number of iterations and $n$ is the data size. Since the EM algorithm is iterative, the time complexity of step 2 is $O(k_E \cdot k_R \cdot n)$, where $k_E$ is the number of EM iterations. The third step determines the imputation order by computing the mutual information between $Y_{obs}$ and $Y_{mis}$ for each missing attribute; its time complexity is $O(d_m \cdot n)$, where $d_m$ is the number of missing attributes. Finally, the fourth step obtains $M$ candidate estimates and derives the final imputation results, with time complexity $O(Mn)$. Based on this analysis, the time complexity of our algorithm is $O(mn)$, where $m = \max(k_E \cdot k_R, d_m, M)$. Because our model involves many iterative computations, especially the use of RANSAC in the parameter initialization step, the time cost of our algorithm is inevitably higher than that of existing multiple imputation methods, as shown in our experiments (Section 5.3).
Algorithm 1 The NPMI algorithm.
Input: The static dataset with missing data $Y = (Y_{obs}, Y_{mis})$; the multiple imputation model $P$; the imputation multiplicity $M$; the value $K$ for the K-nearest neighbor algorithm; the minimum number of samples $n$ required by the RANSAC algorithm to determine the initial model parameters; the number of RANSAC iterations $i$; and the missing data mechanism $D$.
Output: Complete static dataset $Y'$.
1: if $D = MAR$ then $P = MCMC(Y_{obs})$ else $P = SRP(Y_{obs})$
2: Initialize $(\mu, \Sigma) = Ransac(i, n)$
3: if $Y_{mis}$ has multiple variables then
4:   for each $Y_{mis}^{(i)} \in Y_{mis}$ do
5:     calculate the mutual information $I(Y_{mis}^{(i)}, Y_{obs})$ between $Y_{mis}^{(i)}$ and $Y_{obs}$
6:   select $Y_{mis}^{(max)} = \arg\max I(Y_{mis}^{(i)}, Y_{obs})$
7:   if $Y_{mis}^{(max)}$ has consecutive missing values then impute them from the endpoints toward the middle; otherwise the imputation order within the attribute is not considered
8:   for each missing datum in $Y_{mis}^{(max)}$ do
9:     calculate the imputation value according to the model $P$
10:  add the imputed $Y_{mis}^{(max)}$ to $Y_{obs}$
11: else if $Y_{mis}$ is the only missing variable and is continuously missing then
12:   impute the consecutive missing values in $Y_{mis}$ from the endpoints toward the middle
13:   for each missing datum in $Y_{mis}$ do
14:     calculate the imputation value according to the model $P$
15:   add the imputed $Y_{mis}$ to $Y_{obs}$
16: repeat until there are no variables left in $Y_{mis}$
17: return $Y'$

5. Experiments

In this section, the proposed algorithms are evaluated experimentally on real and synthetic datasets. Since different missing rates and noise levels may affect the imputation results, different missing rates are simulated on the real datasets, and different missing rates and noise levels are simulated on the synthetic datasets. The algorithm is implemented in Java. The experiments were run on a PC with a dual-core Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz, 4 GB of RAM, and 64-bit Windows 10.

5.1. Experimental Dataset

This experiment verifies the algorithm proposed in this section on two real and two synthetic datasets. The real datasets are the Weather dataset and the Sensor dataset; the synthetic datasets are the Air Quality dataset and the Test dataset. The four datasets are described in detail as follows:
(1) Weather dataset: This dataset comes from the weather data provided in the literature [33]. Specifically, it is composed of weather data (temperature, humidity, pressure, wind speed, and other attributes) for 30 cities in the United States, collected from multiple websites since March 2010 with a sampling interval of 45 min. The data provided by one of the sources are complete and reliable and contain no missing data.
(2) Sensor dataset: Continuous data on temperature, humidity, node voltage, light intensity, and other attributes obtained from 54 Mica2 sensor nodes deployed in the same indoor space in the Intel Berkeley laboratory, with a sampling interval of 30 s. Since the Sensor dataset contains a small amount of genuinely missing data whose true values are unknown, these records are first deleted to obtain a complete dataset, which is then aggregated by hour. The processed dataset is used as the experimental dataset in this section to evaluate the performance of the different imputation algorithms.
(3) Air Quality dataset: This dataset comes from the chemical monitoring data in the literature [34], that is, air quality monitoring data collected by sensor equipment deployed in a severely polluted area of Italy. Each observation contains multiple attributes, such as carbon monoxide (CO), nitrogen dioxide (NO2), and total nitrogen oxides (NOx) concentrations. The missing data in this dataset are real, and the true values corresponding to the missing values are known, as they are provided by other, more reliable data sources. A total of 8985 observations were obtained, of which 1595 contain missing data.
(4) Test dataset: A multivariate normal distribution is simulated, and values are randomly drawn from it to produce a synthetic dataset with eight variables. The sample sizes are 500, 1000, 3000, 5000, and 8000, respectively. The sample mean and covariance matrix of the multivariate normal distribution follow Johnson et al. [35].
The real and synthetic datasets above are then processed. For the real datasets, different proportions of missing data are simulated: observed data are randomly labeled as missing at rates of 5%, 10%, 20%, 30%, 40%, and 50%. On the one hand, this makes it easier to compare the strengths and weaknesses of different imputation methods; on the other hand, Barzi et al. noted that when the missing rate exceeds 60%, the data have completely lost their usable value. For the synthetic datasets, different levels of noise are simulated: Gaussian noise is randomly injected into part of the data at noise ratios of 5%, 10%, 20%, 30%, 40%, and 50%. When the synthetic dataset (Test) is used to verify the efficiency of the algorithm, missing rates of 5%, 10%, 20%, 30%, and 50% are also simulated. It should be noted that when different proportions of missing data are simulated for the Test dataset, missing data occur only in variables $Y_1$-$Y_4$; variables $Y_5$-$Y_8$ have no missing data.
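The preprocessing described above can be sketched as follows: mask a fraction of the observed cells as missing (keeping the ground truth for evaluation) and inject Gaussian noise into a fraction of the remaining cells. The noise scale and the uniformly random cell selection are assumptions; the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate(Y, miss_rate, noise_rate, noise_scale=1.0):
    """Mask miss_rate of the cells as missing; add Gaussian noise to noise_rate of the rest."""
    Y = Y.astype(float).copy()
    cells = np.argwhere(~np.isnan(Y))
    n_miss = int(miss_rate * len(cells))
    picked = cells[rng.choice(len(cells), size=n_miss, replace=False)]
    truth = Y[picked[:, 0], picked[:, 1]].copy()   # ground truth kept for RMSE etc.
    Y[picked[:, 0], picked[:, 1]] = np.nan
    rest = np.argwhere(~np.isnan(Y))
    n_noise = int(noise_rate * len(rest))
    hit = rest[rng.choice(len(rest), size=n_noise, replace=False)]
    Y[hit[:, 0], hit[:, 1]] += rng.normal(0.0, noise_scale, size=n_noise)
    return Y, picked, truth
```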

5.2. Evaluation Metrics

To evaluate the performance of the missing data imputation algorithms, that is, to compare the NPMI algorithm proposed in this section against other imputation algorithms, evaluation metrics are given for the following aspects. Note that for each of the following evaluations, the experiment reports the average of 200 independent simulations.
(1) Accuracy of missing data imputation. In general, the mean absolute deviation (MAD), mean absolute percentage error (MAPE), and root mean square error (RMSE) of the imputed values can be used to evaluate imputation accuracy. The larger the MAD, MAPE, or RMSE, the larger the difference between the imputed and true values; the smaller the value, the better the imputed values match the true values and the better the imputation of the missing data. MAD, MAPE, and RMSE are calculated as follows:
$$\mathrm{MAD} = \frac{\sum_{i=1}^{n} |x_i - \hat{x}_i|}{n}$$
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \left| \frac{x_i - \hat{x}_i}{x_i} \right|$$
$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \hat{x}_i)^2}{n}}$$
where $x_i$ is the true value of a missing datum, $\hat{x}_i$ is its estimated value, and $n$ is the number of missing data. In this paper, the root mean square error (RMSE) is used to evaluate the accuracy of missing data imputation.
(2) The deviation of the distribution of the missing variables after imputation. The mean relative deviations (MRD) of the distribution of the missing variables after imputation are calculated: the larger the deviation, the more the distribution of the imputed variables differs from the true distribution of the variables; the smaller the deviation, the better the completed variables reflect the true distribution characteristics, that is, the better the imputation of the missing data. The MRD values are calculated as follows:
$$\mathrm{MRD}_{\mu} = \frac{|\bar{\mu} - \mu|}{\mu}$$
$$\mathrm{MRD}_{\Sigma} = \frac{|\bar{\Sigma} - \Sigma|}{\Sigma}$$
where $\bar{\mu}$ is the mean of the missing variable after imputation, $\mu$ is its true mean, $\bar{\Sigma}$ is the variance of the missing variable after imputation, and $\Sigma$ is its true variance.
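For concreteness, the metrics above translate directly into code; here `x_true` and `x_hat` denote the true and imputed values of the missing cells, and the MRD helper divides by the magnitude of the true parameter (the formulas divide by the parameter itself, which is assumed positive).

```python
import numpy as np

def mad(x_true, x_hat):
    return np.abs(x_true - x_hat).mean()

def mape(x_true, x_hat):
    return np.abs((x_true - x_hat) / x_true).mean()

def rmse(x_true, x_hat):
    return np.sqrt(((x_true - x_hat) ** 2).mean())

def mrd(est, true):
    """Relative deviation; covers both MRD_mu and MRD_Sigma."""
    return abs(est - true) / abs(true)
```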
(3) Efficiency of the algorithm. The estimation efficiency of various imputation algorithms was evaluated in milliseconds (ms) by comparing the average execution time of each missing data imputation algorithm over multiple runs.

5.3. Result Analysis

The NPMI algorithm proposed in this section is compared with the multiple imputation algorithm of [9], the regression imputation algorithm of [12], and the EM imputation algorithm of [36]. To comprehensively compare the proposed algorithm with the others, the experiments analyze the imputation accuracy, the deviation of the distribution of the missing variables after imputation, and the imputation efficiency. Imputation accuracy and distribution deviation are evaluated on the real datasets (Weather and Sensor) and the synthetic datasets (Air Quality and Test): the algorithms are executed on the four datasets under different missing proportions and noise levels, and the imputation accuracy and relative deviation are compared. The evaluation of imputation efficiency is carried out only on the synthetic dataset (Test): the four algorithms are executed under different data sizes, missing ratios, and noise levels, and their execution times are compared.
(1) Imputation accuracy.
Varying the proportion of missing data: Figure 1 compares the imputation accuracy of each algorithm on the real datasets (Weather and Sensor) under different proportions of missing data.
As shown in Figure 1, the RMSE of the NPMI algorithm proposed in this section is generally the smallest on the two real datasets (Weather and Sensor), that is, its imputation accuracy is the highest. The ordinary MI algorithm performs second best, followed by EM imputation, and regression imputation performs worst. This is because NPMI considers the distribution of the data and the uncertainty of the missing data, while EM imputation is an iterative method that can reach relatively good accuracy after several iterations. Regression imputation, however, requires linear relationships between variables, whereas the relationships between complete and missing variables in real datasets are more complex and cannot be accurately described by linear models; in some cases the variables are clearly not linearly related, so the imputation accuracy of regression imputation is unsatisfactory.
On the above two datasets, as the proportion of missing data increases, the RMSE of all four imputation algorithms increases, that is, imputation accuracy decreases. This is because the more data are missing, the more effective sample information is lost and the less objective the sample becomes. The accuracy of NPMI and MI is relatively less affected by the missing proportion, because multiple imputation reflects the overall distribution of the sample more faithfully and is closer to the posterior distribution of the missing data. However, as the proportion of missing data increases, the accuracy of MI degrades faster than that of NPMI: with more missing data, more variables are missing, and MI does not consider the imputation order, whereas the proposed NPMI algorithm does. Therefore, the imputation accuracy of the NPMI algorithm is relatively less affected by the missing proportion.
Varying the noise ratio: Figure 2 compares the imputation accuracy of each algorithm on the synthetic datasets (Air Quality and Test) under different noise ratios.
As shown in Figure 2, on the synthetic datasets (Air Quality and Test) with different noise ratios, the RMSE of the proposed NPMI algorithm is generally the smallest, that is, its imputation accuracy is the highest. As the proportion of noise increases, the RMSEs of all four imputation algorithms increase, that is, imputation accuracy decreases, because noise degrades the data quality of the whole dataset: the higher the proportion of noise, the worse the data quality.
As seen in Figure 2b, the NPMI algorithm is especially stable on the synthetic dataset (Test), because that dataset is ideal and fully satisfies the multivariate normal distribution. However, NPMI also performs relatively well on the other, non-normal datasets, in line with the finding of Graham et al. that, even for nonmultivariate normal data, the imputation values obtained under the multivariate normality assumption are very close to the real values.
(2) The deviation of the distribution of missing variables after imputation.
Varying the proportion of missing data: Figure 3 compares the deviation of the distribution of the missing variables after imputation by each algorithm on the real datasets (Weather and Sensor) under different proportions of missing data.
As shown in Figure 3, the relative deviations $\mathrm{MRD}_{\mu}$ and $\mathrm{MRD}_{\Sigma}$ of the distribution of the missing variables on the two real datasets (Weather and Sensor) are generally the smallest for the proposed NPMI algorithm; that is, the variables completed by NPMI best reflect the distribution characteristics of the missing variables in the real situation. NPMI better reflects the uncertainty caused by the missing data, and the distribution of the missing variables after imputation is close to their real distribution. The distribution deviation after imputation with the MI algorithm is also relatively good. The distribution deviations of the other two algorithms are larger, because those algorithms consider only possible values for the missing data without considering the uncertainty of the missing data or the overall uncertainty it causes. Therefore, the distributions of the missing variables filled by these two methods differ considerably from the real distributions.
Varying the noise ratio: Figure 4 compares the deviation of the distribution of the missing variables after imputation by each algorithm on the synthetic datasets (Air Quality and Test) under different noise ratios.
As shown in Figure 4, the relative deviations $\mathrm{MRD}_{\mu}$ and $\mathrm{MRD}_{\Sigma}$ of the distribution of the missing variables on the two synthetic datasets (Air Quality and Test) are generally the smallest for the proposed NPMI algorithm; that is, the distribution of the missing data filled by NPMI is close to the real distribution of the missing variables and best reflects their distribution characteristics in the real situation. The relative deviation after imputation by the MI algorithm is better than that of the other two algorithms, and the relative deviation after imputation by the regression algorithm is the largest.
(3) Algorithm efficiency.
The four algorithms are executed on the synthetic dataset (Test) with different data sizes, proportions of missing data, and proportions of noise. The average execution efficiency of each imputation algorithm is compared in Figure 5.
Figure 5a compares the execution efficiency of each algorithm on the synthetic dataset (Test) for different data volumes. Under the same missing rate, the time cost of all four imputation algorithms rises with the data volume. Regression imputation is the most efficient, followed by EM imputation; MI imputation is close to the proposed NPMI imputation, which consumes the most time. This is because solving the model parameters in linear regression imputation is relatively simple, whereas EM and MI imputation require several iterations. In addition to the iterations required by MI, NPMI uses the random sample consensus method, itself iterative, in the model initialization step, so it consumes more time; on the whole, however, the time cost of NPMI is acceptable.
Figure 5b compares the execution efficiency of each algorithm on the synthetic dataset (Test) for different proportions of missing data. With the same amount of data, the time cost of all four imputation algorithms generally rises with the proportion of missing data, and the execution time of NPMI imputation is the longest.
Figure 5c compares the execution efficiency of each algorithm on the synthetic dataset (Test) for different noise ratios. With the same amount of data and the same proportion of missing data, the time cost of the four imputation algorithms remains roughly constant as the proportion of noise increases; that is, the execution time of the algorithms does not change with the noise ratio.

6. Conclusions

For static datasets, a noise-aware multiple imputation algorithm, NPMI, was proposed, which improves on the standard multiple imputation algorithm. The algorithm comprises four steps: i. the M step constructs different multiple imputation models based on whether the missing data mechanism is considered, which reasonably reflects the uncertainty of the missing data and the uncertainty it causes; ii. the P step initializes the parameters, using the random sample consensus algorithm to estimate the initial parameter values of NPMI, which reduces the influence of noisy data on parameter estimation and ensures the accuracy of the model parameters; iii. the S step determines the imputation order, giving a criterion for ordering the imputation of different missing data when multivariate data are missing, namely, calculating the mutual information between each missing variable and the complete variables; iv. the I step calculates the imputation values of the missing data according to the given imputation model. Extensive experiments were carried out on real and synthetic datasets. The results show that the NPMI algorithm has better accuracy and effectiveness than existing imputation algorithms for static datasets.

Author Contributions

Conceptualization, F.L. and H.S.; methodology, F.L. and H.S.; software, H.S. and F.L.; formal analysis, F.L. and H.S.; writing—original draft preparation, F.L.; writing—review and editing, F.L., Y.G. and G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The research and publication of this paper were funded by the Fundamental Research Funds for the Central Universities (N2216017).

Data Availability Statement

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lindig, S.; Louwen, A.; Moser, D.; Topic, M. Outdoor PV system monitoring—Input data quality, data imputation and filtering approaches. Energies 2020, 13, 5099. [Google Scholar] [CrossRef]
  2. Hemanth, G.R.; Raja, S.C. Proposing suitable data imputation methods by adopting a Stage wise approach for various classes of smart meters missing data–Practical approach. Expert Syst. Appl. 2022, 187, 115911. [Google Scholar] [CrossRef]
  3. Dang, H.A.; Jolliffe, D.; Carletto, C. Data gaps, data incomparability, and data imputation: A review of poverty measurement methods for data-scarce environments. J. Econ. Surv. 2019, 33, 757–797. [Google Scholar] [CrossRef] [Green Version]
  4. Seo, B.; Shin, J.; Kim, T.; Youn, B.D. Missing data imputation using an iterative denoising autoencoder (IDAE) for dissolved gas analysis. Electr. Power Syst. Res. 2022, 212, 108642. [Google Scholar] [CrossRef]
  5. Kelkar, B.A. Missing Data Imputation: A Survey. Int. J. Decis. Support Syst. Technol. 2022, 14, 1–20. [Google Scholar] [CrossRef]
  6. Wang, Z.; Sha, E.H.M.; Hu, X. Combined partitioning and data padding for scheduling multiple loop nests. In Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, Atlanta, GA, USA, 16–17 November 2001; pp. 67–75. [Google Scholar]
  7. Samsudin, N.A.; Mustapha, A.; Arbaiy, N.; Hamid, I.R.A. Extended local mean-based nonparametric classifier for cervical cancer screening. In International Conference on Soft Computing and Data Mining; Springer: Cham, Switzerland, 2017; pp. 386–395. [Google Scholar]
  8. Rao, Q. Empirical likelihood-based inference in linear models with missing data. Scand. J. Stat. 2002, 29, 563–576. [Google Scholar]
  9. Lai, P.; Wang, Q. Semiparametric efficient estimation for partially linear single-index models with responses missing at random. J. Multivar. Anal. 2014, 128, 33–50. [Google Scholar] [CrossRef]
  10. Jing, X.Y.; Qi, F.; Wu, F.; Xu, B. Missing data imputation based on low-rank recovery and semi-supervised regression for software effort estimation. In Proceedings of the 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), Austin, TX, USA, 14–22 May 2016; pp. 607–618. [Google Scholar]
  11. Oehmcke, S.; Zielinski, O.; Kramer, O. kNN ensembles with penalized DTW for multivariate time series imputation. In Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, Canada, 24–29 July 2016; pp. 2774–2781. [Google Scholar]
  12. Qin, Y.; Zhang, S.; Zhang, C. Combining kNN imputation and bootstrap calibrated: Empirical likelihood for incomplete data analysis. In Exploring Advances in Interdisciplinary Data Mining and Analytics: New Trends; IGI Global: Hershey, PA, USA, 2012; pp. 278–289. [Google Scholar]
  13. Ban, T.; Zhang, R.; Pang, S.; Sarrafzadeh, A.; Inoue, D. Referential knn regression for financial time series forecasting. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2013; pp. 601–608. [Google Scholar]
  14. Zhang, S.; Jin, Z.; Zhu, X. Missing data imputation by utilizing information within incomplete instances. J. Syst. Softw. 2011, 84, 452–459. [Google Scholar] [CrossRef]
  15. Zhu, X.; Zhang, S.; Jin, Z.; Zhang, Z.; Xu, Z. Missing value estimation for mixed-attribute data sets. IEEE Trans. Knowl. Data Eng. 2010, 23, 110–121. [Google Scholar] [CrossRef]
  16. Liao, Z.; Lu, X.; Yang, T.; Wang, H. Missing data imputation: A fuzzy K-means clustering algorithm over sliding window. In Proceedings of the 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14–16 August 2009; Volume 3, pp. 133–137. [Google Scholar]
  17. Guru, D.S.; Kumar, N.V.; Suhil, M. Feature selection of interval valued data through interval K-means clustering. Int. J. Comput. Vis. Image Process. (IJCVIP) 2017, 7, 64–80. [Google Scholar] [CrossRef]
  18. Fu, Z.; Golden, B.L.; Lele, S.; Raghavan, S.; Wasil, E.A. A genetic algorithm-based approach for building accurate decision trees. INFORMS J. Comput. 2003, 15, 3–22. [Google Scholar] [CrossRef]
  19. Rahman, G.; Islam, Z. A decision tree-based missing value imputation technique for data pre-processing. In Proceedings of the Ninth Australasian Data Mining Conference-Volume 121, Ballarat, Australia, 1 December 2011; pp. 41–50. [Google Scholar]
  20. Zhang, S.; Qin, Z.; Ling, C.X.; Sheng, S. “Missing is useful”: Missing values in cost-sensitive decision trees. IEEE Trans. Knowl. Data Eng. 2005, 17, 1689–1693. [Google Scholar] [CrossRef] [Green Version]
  21. Zhang, A.; Song, S.; Sun, Y.; Wang, J. Learning individual models for imputation. In Proceedings of the 2019 IEEE 35th International Conference on Data Engineering (ICDE), Macao, China, 8–11 April 2019; pp. 160–171. [Google Scholar]
  22. Oba, S.; Sato, M.A.; Takemasa, I.; Monden, M.; Matsubara, K.I.; Ishii, S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19, 2088–2096. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Hruschka, E.R.; Hruschka, E.R.; Ebecken, N.F. A Bayesian imputation method for a clustering genetic algorithm. J. Comput. Methods Sci. Eng. 2011, 11, 173–183. [Google Scholar] [CrossRef]
  24. Hruschka, E.R.; Hruschka, E.R.; Ebecken, N.F. Bayesian networks for imputation in classification problems. J. Intell. Inf. Syst. 2007, 29, 231–252. [Google Scholar] [CrossRef]
  25. Ravi, V.; Krishna, M. A new online data imputation method based on general regression auto associative neural network. Neurocomputing 2014, 138, 106–113. [Google Scholar] [CrossRef]
  26. Vilardell, M.; Buxó, M.; Clèries, R.; Martínez, J.M.; Garcia, G.; Ameijide, A.; Font, R.; Civit, S.; Marcos-Gragera, R.; Vilardell, M.L.; et al. Missing data imputation and synthetic data simulation through modeling graphical probabilistic dependencies between variables (ModGraProDep): An application to breast cancer survival. Artif. Intell. Med. 2020, 107, 101875. [Google Scholar] [CrossRef]
  27. Ozturk, A. Accuracy improvement in air-quality forecasting using regressor combination with missing data imputation. Comput. Intell. 2021, 37, 226–252. [Google Scholar] [CrossRef]
  28. Luo, Y.; Cai, X.; Zhang, Y.; Xu, J. Multivariate time series imputation with generative adversarial networks. Adv. Neural Inf. Process. Syst. 2018, 31, 1603–1614. [Google Scholar]
  29. Zhu, B.; He, C.; Liatsis, P. A robust missing value imputation method for noisy data. Appl. Intell. 2012, 36, 61–74. [Google Scholar] [CrossRef]
  30. Ma, Q.; Gu, Y.; Lee, W.C.; Yu, G.; Liu, H.; Wu, X. REMIAN: Real-time and error-tolerant missing value imputation. ACM Trans. Knowl. Discov. Data (TKDD) 2020, 14, 1–38. [Google Scholar] [CrossRef]
  31. Kass, R.E.; Carlin, B.P.; Gelman, A.; Neal, R.M. Markov chain Monte Carlo in practice: A roundtable discussion. Am. Stat. 1998, 52, 93–100. [Google Scholar]
  32. Raguram, R.; Frahm, J.M.; Pollefeys, M. A comparative analysis of RANSAC techniques leading to adaptive real-time random sample consensus. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2008; pp. 500–513. [Google Scholar]
  33. Dong, X.L.; Berti-Equille, L.; Hu, Y.; Srivastava, D. Global detection of complex copying relationships between sources. Proc. VLDB Endow. 2010, 3, 1358–1369. [Google Scholar] [CrossRef]
  34. De Vito, S.; Massera, E.; Piga, M.; Martinotto, L.; Di Francia, G. On field calibration of an electronic nose for benzene estimation in an urban pollution monitoring scenario. Sens. Actuators B Chem. 2008, 129, 750–757. [Google Scholar] [CrossRef]
  35. Johnson, R. Practical Multivariate Statistical Analysis, 4th ed.; Tsinghua University Press: Beijing, China, 2001. [Google Scholar]
  36. Bernaards, C.A.; Sijtsma, K. Influence of imputation and EM methods on factor analysis when item nonresponse in questionnaire data is nonignorable. Multivar. Behav. Res. 2000, 35, 321–364. [Google Scholar] [CrossRef]
Figure 1. Accuracy of NPMI, MI, regression, and EM-based imputation.
Figure 2. Accuracy of NPMI, MI, regression, and EM-based imputation.
Figure 3. Deviation of the distribution of missing variables for NPMI, MI, regression, and EM.
Figure 4. Deviation of the distribution of missing variables for NPMI, MI, regression, and EM.
Figure 5. Execution efficiency of NPMI, MI, regression, and EM-based imputation.