To verify the effectiveness of our method, four comparative experiments are designed. In the first experiment, three statistical imputation techniques provided by the FATE platform are considered for comparison: MAX, MIN, and MEAN. In the second experiment, the proposed vertical federated imputation approach is compared with typical centralized imputation methods. Then, in the third experiment, the lossless property of the federated method is verified. Finally, in the fourth experiment, the KNN imputation method is applied to a regression task to verify its contribution to downstream modeling.
4.3.1. Federated Comparative Experiment
In this experiment, the proposed vertical federated imputation method is compared with existing missing-value imputation methods. To the best of our knowledge, vertical federated imputation has not been studied before, and constant-value imputation methods such as MAX, MIN, and MEAN are currently the most frequently used in the vertical federated setting. These three methods are therefore used for comparison. To generate missing data, the missing completely at random (MCAR) mechanism is adopted to randomly produce incomplete data sets with missing rates of 10%, 20%, and 30%. This process is repeated 10 times to obtain an averaged evaluation. For the imputed data, validity is tested by comparing the imputed values against the original values.
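The MCAR removal step can be sketched as follows. This is an illustrative NumPy implementation rather than the platform's own code, and the matrix sizes, seeds, and function name are our own assumptions:

```python
import numpy as np

def mcar_mask(n_samples, n_features, missing_rate, seed=None):
    """Boolean MCAR mask: each entry is removed independently with
    probability equal to the target missing rate, regardless of value."""
    rng = np.random.default_rng(seed)
    return rng.random((n_samples, n_features)) < missing_rate

# Example: remove roughly 20% of a complete 100 x 5 matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
mask = mcar_mask(*X.shape, missing_rate=0.2, seed=42)
X_missing = X.copy()
X_missing[mask] = np.nan
```

Repeating this with 10 different seeds and averaging the evaluation metrics reproduces the repeated-trial protocol described above.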
The experimental results are shown in Figure 4, where the first row shows the results with a 10% missing rate, and the remaining rows show the results with 20% and 30% missing rates. The 'missing value index' in Figure 4 refers to the positional index of an element within the matrix, i.e., the ordinal position of a value equal to 1.
Overall, among the methods studied, the proposed federated KNN imputation method yields imputed values that are closer to the original distribution of the data. As more values are removed, the data matrix becomes sparser (i.e., fewer complete values are available for training), which in turn degrades imputation performance. That is, all techniques produce more accurate imputations on denser data matrices, which is consistent with previous research [34]. Root mean squared error (RMSE) is selected as the performance metric to further compare the imputation effect of the proposed method against the other three methods. Each experiment is conducted 10 times; the results are listed in Table 3 and displayed in Figure 5, Figure 6 and Figure 7.
We perform 10 evaluations using different random seeds to remove different parts of the original values. As can be seen from Figure 5, Figure 6 and Figure 7, the RMSE values of the proposed method are significantly lower than those of the other three methods. Among the remaining three methods, MAX and MIN have the highest RMSE values across all missing rates, and although MEAN performs better than MAX and MIN, our method still achieves lower RMSE values than MEAN at every missing rate.
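To make the metric precise, RMSE here is computed only over the entries that were artificially removed and then imputed. A minimal sketch (the function name and toy values are ours):

```python
import numpy as np

def imputation_rmse(X_true, X_imputed, mask):
    """RMSE over only the entries removed by the mask (mask == True)."""
    diff = X_true[mask] - X_imputed[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

# Toy example: MEAN imputation of a single masked entry.
X_true = np.array([[1.0, 2.0], [3.0, 5.0], [5.0, 6.0]])
mask = np.array([[False, False], [False, True], [False, False]])
X_imp = X_true.copy()
observed_col = np.where(mask[:, 1], np.nan, X_true[:, 1])
X_imp[1, 1] = np.nanmean(observed_col)       # mean of 2.0 and 6.0 -> 4.0
print(imputation_rmse(X_true, X_imp, mask))  # |5.0 - 4.0| -> 1.0
```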
Constant-value imputation methods fill all missing values with fixed values that satisfy certain conditions. These methods are insensitive to variation in the feature values and do not consider correlations between samples and features. Compared with constant-value imputation, the superiority of the vertical federated KNN imputation method lies in the following aspects. Firstly, it leverages the similarity between samples to impute missing values, providing potentially more accurate estimates than simple constant imputation because it considers the overall data structure across the different parties. Secondly, it takes into account not only information from the feature with missing values but also relationships among the other features, which helps capture complex structure and patterns in the data. Lastly, it is a non-parametric method that makes no assumptions about the distribution of the data, which is advantageous when dealing with diverse types of data and problems.
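To make the contrast concrete, the underlying idea can be sketched as a minimal single-machine KNN imputer. This is an illustration only, not the privacy-preserving federated protocol, and the function name and toy data are ours:

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill each missing entry with the mean of that feature over the k
    fully observed rows nearest to the incomplete row, using Euclidean
    distance over the row's observed features."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]          # fully observed rows
    for row in X:                                   # row is a view into X
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.sqrt(((complete[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = complete[np.argsort(d)[:k]]
        row[miss] = nearest[:, miss].mean(axis=0)   # fills X in place
    return X

# KNN tracks local structure where a constant fill cannot:
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [9.0, 90.0], [10.0, 100.0], [10.5, np.nan]])
print(knn_impute(X, k=2)[5, 1])   # 95.0, vs. a column mean of 50.0
```

A constant method such as MEAN would fill the last entry with the global column mean (50.0), while the two nearest neighbours put it near 95.0, illustrating why neighbour-based imputation is sensitive to feature variation.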
4.3.2. Centralized Comparative Experiment
In this experiment, we compare the proposed method with traditional centralized imputation algorithms, including MEAN, LR, KNN, and RF, to verify how much the heterogeneous data sets from the two parties enhance the imputation effect. For the centralized algorithms, only party A's incomplete data set is used for imputation; for the proposed federated algorithm, the data sets of both party A and party B are employed to impute party A's incomplete data set. The imputation quality on party A's data set is then compared across the five methods. The missing rate is increased from 5% to 40% in intervals of 5%, and each experiment is conducted three times at each missing rate, with RMSE as the performance metric. The average RMSE over the three repeated experiments is given in Table 4, and the deviation is shown in Figure 8.
The results of the vertical federated KNN imputation method are notably superior to those of the other traditional data imputation methods at all missing rates. The performance of both the proposed method and the centralized KNN imputation algorithm gradually decreases once the missing rate exceeds 10%. As the missing rate increases further, the metrics of all methods deteriorate and gradually converge, which is attributed to the increasing difficulty of imputation at higher missing rates.
Centralized algorithms can only be applied to one party's data set, and their effectiveness is reduced when that data set has few features. Compared with the centralized imputation algorithms, the proposed algorithm expands the feature space by using multi-party data sets, providing a stronger basis for measuring sample similarity. By combining data sets from different sources, the proposed federated missing-data imputation algorithm can capture a broader range of data features and patterns, which helps enhance the model's generalization performance.
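The benefit of the expanded feature space can be illustrated with synthetic data: a hypothetical shared latent factor drives both parties' features, and nearest neighbours found over the combined features recover a held-out value more accurately than neighbours found over one party's features alone. All dimensions, noise levels, and names below are illustrative assumptions, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.normal(size=(n, 1))              # shared latent factor
A = z + 0.8 * rng.normal(size=(n, 2))    # party A: 2 noisy features
B = z + 0.3 * rng.normal(size=(n, 4))    # party B: 4 cleaner features
target = A[:, 0]                         # pretend this column of A is missing

def knn_rmse(features, target, k=5):
    """Leave-one-out: estimate each target value as the mean over its k
    nearest neighbours in the given feature space, then report RMSE."""
    d = np.sqrt(((features[None, :, :] - features[:, None, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)          # exclude each point itself
    idx = np.argsort(d, axis=1)[:, :k]
    return float(np.sqrt(np.mean((target[idx].mean(1) - target) ** 2)))

rmse_a_only = knn_rmse(A[:, 1:], target)             # party A's features alone
rmse_joint = knn_rmse(np.hstack([A[:, 1:], B]), target)
print(rmse_a_only, rmse_joint)
```

With these settings, the joint feature space yields the lower RMSE, because party B's cleaner features locate neighbours that are closer in the latent factor.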
4.3.4. Contribution to Regression
Another way to verify the effectiveness of the federated KNN method is to use the imputed data for modeling tasks and compare the resulting modeling performance with that of other imputation methods. In this experiment, the federated KNN imputation method serves as the experimental group, while the MAX, MIN, and MEAN imputation methods serve as the comparison group; each is used in a vertical federated linear regression modeling task.
The imputed data set is used for linear regression modeling to predict motor speed, with 80% of the data set used for training and the rest for testing. During testing, there are essentially two ways to impute the test set. The first is to apply exactly the same process to the test set as was applied to the training set; the other is to impute the test set using the already-imputed training set. Since the test set may be too small to support imputation on its own, the second approach is adopted. Explained variance and RMSE are selected as the regression performance metrics. For each method, 10 randomized runs are conducted, and the average metrics over these 10 runs are taken as the final values. The vertical federated linear regression algorithm implemented inside FATE is adopted, and the whole training and testing process is shown in Figure 10.
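The second strategy, imputing test rows from the already-imputed training set, can be sketched as follows. This is illustrative NumPy with names of our own choosing, not the platform implementation:

```python
import numpy as np

def impute_test_from_train(X_train_imputed, X_test, k=3):
    """Fill each missing test entry with the mean over the k training rows
    nearest to that test row (distance over the row's observed features).
    The test set never has to impute from itself."""
    X_test = X_test.astype(float).copy()
    for row in X_test:                      # row is a view into X_test
        miss = np.isnan(row)
        if not miss.any():
            continue
        obs = ~miss
        d = np.sqrt(((X_train_imputed[:, obs] - row[obs]) ** 2).sum(axis=1))
        nearest = X_train_imputed[np.argsort(d)[:k]]
        row[miss] = nearest[:, miss].mean(axis=0)
    return X_test
```

This mirrors the training-time KNN search but restricts the neighbour pool to the completed training set, so even a single test row can be imputed.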
Both processes use the Reader component to read the data, and the processed data enters the DataTransform component for format conversion. After the conversion, the two parties' data are aligned by sample ID through the Intersection component. The KnnImputation component is then employed to impute any missing values. Following imputation, the HeteroLinR component performs vertical federated regression modeling. Finally, the results are passed to the Evaluation component to compute the explained variance and RMSE. The experimental results are as follows.
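For orientation, the component chain above might be wired with FATE's PipeLine API roughly as follows. Module paths follow FATE 1.x and may differ across versions, the party IDs are placeholders, and KnnImputation is the custom component introduced in this work, whose registration path depends on the local FATE installation:

```python
from pipeline.backend.pipeline import PipeLine
from pipeline.component import (Reader, DataTransform, Intersection,
                                HeteroLinR, Evaluation)
from pipeline.interface import Data

# Placeholder party IDs; real jobs also bind each Reader to a data table.
pipeline = PipeLine().set_initiator(role="guest", party_id=9999) \
                     .set_roles(guest=9999, host=10000)

reader_0 = Reader(name="reader_0")
data_transform_0 = DataTransform(name="data_transform_0")
intersection_0 = Intersection(name="intersection_0")
hetero_linr_0 = HeteroLinR(name="hetero_linr_0")
evaluation_0 = Evaluation(name="evaluation_0", eval_type="regression")

pipeline.add_component(reader_0)
pipeline.add_component(data_transform_0, data=Data(data=reader_0.output.data))
pipeline.add_component(intersection_0, data=Data(data=data_transform_0.output.data))
# The custom KnnImputation component slots in here, between the
# intersection and the regression, consuming intersection_0.output.data.
pipeline.add_component(hetero_linr_0, data=Data(train_data=intersection_0.output.data))
pipeline.add_component(evaluation_0, data=Data(data=hetero_linr_0.output.data))
```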
In Figure 11, the left-hand chart compares explained variances: the horizontal axis represents the missing rate, the vertical axis the explained variance, and for each missing rate the values of the four methods are plotted together as a group. Within each group, the FKNN method consistently achieves the maximum value. The right-hand chart compares RMSE values in the same layout, and within each group the FKNN method consistently achieves the minimum value.
In Figure 12, Figure 13 and Figure 14, the left side comprises box plots of explained variance, while the right side shows box plots of RMSE. The upper and lower edges of each box mark the middle 50% range of the ten runs, with the midpoint indicating the mean; the individual runs are plotted as dots, and the methods are distinguished by color. FKNN clearly achieves the best mean performance, although the gap between the MEAN method and the FKNN method is relatively small, for the following reasons.
In this experiment, it is assumed that Party A's data contain missing values while Party B's data are complete. According to the results, even though the imputed Party A data exhibit a lower imputation RMSE, the contribution to regression performance is limited. This is because, by construction, Party A's data set contributes less to the label prediction while Party B's data set contributes more. The experiment is therefore modified so that Party B's data set also contains missing values. Under the same missing rate as Party A, the experiment is conducted once more, and the results are illustrated as follows.
In Figure 15, the differences between the four methods, in terms of both explained variance and RMSE, have increased compared to Figure 11, making the distinctions more pronounced. For example, at the 10% missing rate, the gap in explained variance between FKNN and MEAN is about 0.04, compared with 0.01 in Figure 11. In Figure 16, Figure 17 and Figure 18, FKNN consistently achieves the best mean performance. Compared with Figure 12, Figure 13 and Figure 14, the interquartile range of each box plot is wider and the distribution of the 10 dots is more dispersed. This is because the missing data from Party B have a significant impact on the linear regression model. Overall, the proposed federated KNN imputation method makes a significant contribution to linear regression modeling tasks.
Based on the results, it is evident that compared to the scenario where only Party A’s data contain missing values, the imputed Party B’s data show a more noticeable improvement in regression model performance. This indicates, from the perspective of predicting the guest label, that the quality of Party B’s data is superior to that of Party A’s data. This aligns with the original intention of vertical federated modeling, which is to leverage different features from other parties to obtain a better machine learning model.
Taking the results of the above two experiments together, the proposed missing-value imputation method achieves the maximum explained variance and the minimum RMSE at the listed 10%, 20%, and 30% missing rates, compared with the other three baseline methods. At the 10% missing rate, all four methods achieve their respective maximum explained variance and minimum RMSE.
Overall, data imputation contributes significantly to regression modeling, with the proposed FKNN imputation method demonstrating the most evident modeling contribution. This contribution also depends on the features themselves: since the features of Party B contribute more to the regression model than those of Party A, imputing Party B's missing data enhances the performance of the regression model more effectively.