Article

Windowing as a Sub-Sampling Method for Distributed Data Mining

by
David Martínez-Galicia
1,*,
Alejandro Guerra-Hernández
1,
Nicandro Cruz-Ramírez
1,
Xavier Limón
2 and
Francisco Grimaldo
3
1
Centro de Investigación en Inteligencia Artificial, Universidad Veracruzana, Sebastián Camacho No 5, Xalapa, Veracruz, Mexico 91000, Mexico
2
Facultad de Estadística e Informática, Universidad Veracruzana, Av. Xalapa s/n, Xalapa, Veracruz, Mexico 91000, Mexico
3
Departament d’Informàtica, Universitat de València, Avinguda de la Universitat, s/n, Burjassot-València, 46100 València, Spain
*
Author to whom correspondence should be addressed.
Math. Comput. Appl. 2020, 25(3), 39; https://doi.org/10.3390/mca25030039
Submission received: 31 May 2020 / Revised: 27 June 2020 / Accepted: 29 June 2020 / Published: 30 June 2020
(This article belongs to the Special Issue New Trends in Computational Intelligence and Applications)

Abstract
Windowing is a sub-sampling method, originally proposed to cope with large datasets when inducing decision trees with the ID3 and C4.5 algorithms. The method exhibits a strong negative correlation between the accuracy of the learned models and the number of examples used to induce them, i.e., the higher the accuracy of the obtained model, the fewer examples used to induce it. This paper contributes to a better understanding of this behavior in order to promote windowing as a sub-sampling method for Distributed Data Mining. For this, the generalization of the behavior of windowing beyond decision trees is established, by corroborating the observed negative correlation when adopting inductive algorithms of different nature. Then, focusing on decision trees, the windows (samples) and the obtained models are analyzed in terms of Minimum Description Length (MDL), Area Under the ROC Curve (AUC), Kullback–Leibler divergence, and the similitude metric Sim1, and compared to those obtained when using traditional methods: random, balanced, and stratified samplings. It is shown that the aggressive sampling performed by windowing, which can reduce the window to as little as 3% of the original dataset, induces models that are significantly more accurate than those obtained from the traditional sampling methods, among which only balanced sampling is comparable in terms of AUC. Although the considered informational properties did not correlate with the obtained accuracy, they provide clues about the behavior of windowing and suggest further experiments to enhance such understanding and the performance of the method, e.g., studying the evolution of the windows over time.

1. Introduction

Windowing is a sub-sampling method that enabled the decision tree inductive algorithms ID3 [1,2,3] and C4.5 [4,5] to cope with large datasets, i.e., those whose size precludes loading them in memory. Algorithm 1 defines the method: First, a window is created by extracting a small random sample of the available examples in the full dataset. The main step consists of inducing a model with that window and of testing it on the remaining examples, such that all misclassified examples are moved to the window. This step iterates until a stop condition is reached, e.g., all the available examples are correctly classified or a desired level of accuracy is reached.
Algorithm 1 Windowing.
Require: Examples {The original training set}
Ensure: Model {The induced model}
 1: Window ← sample(Examples)
 2: Examples ← Examples \ Window
 3: repeat
 4:   stopCond ← true
 5:   model ← induce(Window)
 6:   for example ∈ Examples do
 7:     if classify(model, example) ≠ class(example) then
 8:       Window ← Window ∪ {example}
 9:       Examples ← Examples \ {example}
10:       stopCond ← false
11:     end if
12:   end for
13: until stopCond
14: return model
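To make the procedure concrete, the following is a minimal, single-site sketch of Algorithm 1 in Python. It uses scikit-learn's DecisionTreeClassifier as a stand-in for the decision tree inducers named above (an assumption of this sketch: the experiments in this paper actually run J48 through Weka and JaCa-DDM), and it simplifies the stop condition to "no misclassified examples remain, or a maximum number of rounds is exhausted":

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def windowing(X, y, model=None, init_frac=0.20, max_rounds=15, seed=0):
    """Sketch of Algorithm 1: grow the window with misclassified examples."""
    model = model if model is not None else DecisionTreeClassifier(random_state=seed)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    cut = max(1, int(init_frac * len(y)))
    window, rest = list(idx[:cut]), list(idx[cut:])   # initial random window
    for _ in range(max_rounds):
        model.fit(X[window], y[window])               # induce on the window
        if not rest:                                  # nothing left to test against
            break
        preds = model.predict(X[rest])
        missed = [i for i, p in zip(rest, preds) if p != y[i]]
        if not missed:                                # stop: all remaining examples correct
            break
        window += missed                              # move misclassified examples in
        rest = [i for i, p in zip(rest, preds) if p == y[i]]
    return model, window
```

Note that the final window doubles as the sub-sample whose size is reported later (Table 4).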
Although Wirth and Catlett [6] published an early critique of windowing, pointing to its computational cost and its inability to deal with noisy domains, Fürnkranz [7] argues that the method still offers three advantages: (a) it copes well with memory limitations, considerably reducing the number of examples required to induce a model of acceptable accuracy; (b) it offers an efficiency gain by reducing the time of convergence, especially when using a separate-and-conquer inductive algorithm, such as Foil [8], instead of divide-and-conquer algorithms such as ID3 and C4.5; and (c) it offers an accuracy gain, especially on noiseless datasets, possibly explained by the fact that learning from a subset of examples may often result in a less over-fitted theory.
Even though lack of memory is usually not an issue nowadays, similar concerns arise when mining big and/or distributed data, i.e., the impossibility or inconvenience of using all the available examples to induce models. Windowing has been used as the core of a set of strategies for Distributed Data Mining (DDM) [9], obtaining good accuracy results, consistent with the expected achievable accuracy and number of examples required by the method. On the contrary, efficiency suffers for large datasets, as the cost of testing the models on the remaining examples is not negligible (i.e., the for loop in Algorithm 1, line 6), although it can be alleviated by using GPUs [10]. More relevant for this paper is the fact that these windowing-based strategies based on J48, the Weka [11] implementation of C4.5, show a strong negative correlation (−0.8175845) between the accuracy of the learned decision trees and the number of examples used to induce them, i.e., the higher the accuracy obtained, the fewer the examples used to induce the model. The windows in this method can be seen as samples, and reducing the size of the training sets, even up to 95% of the available training data, still enables accuracy values above 95%.
These promising results encourage the adoption of windowing as a sub-sampling method for Distributed Data Mining. However, they raise some issues that must be solved for such adoption. The first one is the generalization of windowing beyond decision trees: does windowing behave similarly when using different models and inductive algorithms? The first contribution of this paper is to corroborate the correlation between accuracy and the size of the window, i.e., the number of examples used to induce the model, when using inductive algorithms of different nature, showing that the advantages of windowing as a sub-sampling method can be generalized beyond decision trees. The second issue is the need for a deeper understanding of the behavior of windowing: how is it that such a large reduction in the number of training examples maintains acceptable levels of accuracy? This is particularly interesting as we have pointed out that high levels of accuracy correlate with smaller windows. The second contribution of the paper is thus to approach such a question in terms of the informational properties of both the windows and the models obtained by the method. Unfortunately, these properties do not correlate with the obtained accuracy of windowing, and they suggest the study of the evolution of the windows over time as future work. Finally, a comparison with traditional methods such as random, stratified, and balanced samplings provides a better understanding of windowing and evaluates its adoption as an alternative sampling method. Under equal conditions, i.e., the same original full dataset and sample size, windowing shows to be significantly more accurate than the traditional samplings and comparable to balanced sampling in terms of AUC. The paper is organized as follows: Section 2 introduces the adopted materials and methods; Section 3 presents the obtained results; and Section 4 discusses conclusions and future work.

2. Materials and Methods

This section describes the implementation of windowing used in this work, as included in JaCa-DDM; the datasets used in experimentation; and the experiments themselves.

2.1. Windowing in JaCa-DDM

Because of our interest in Distributed Data Mining settings, JaCa-DDM (https://github.com/xl666/jaca-ddm) was adopted to run our experiments. This tool [9] defines a set of windowing-based strategies using J48, the Weka [11] implementation of C4.5, as the inductive algorithm. Among them, the Counter strategy is the most similar to the original formulation of windowing, with the following exceptions:
  • The dataset may be distributed in different sites, instead of the traditional approach based on a single dataset in a single site.
  • The loop for collecting the misclassified examples to be added to the window is performed by a set of agents using copies of the model distributed among the available sites, in a round-robin fashion.
  • The initial window is a stratified sample, instead of a random one.
  • An auto-adjustable stop criterion is combined with a configurable maximum number of iterations.
The configuration of the strategy (Table 1), used for all the experiments reported in this paper, is adopted from the literature [10].
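To make the round-robin collection concrete, the following is a minimal sketch (ours, not JaCa-DDM's actual agent code) in which copies of the current model are tested against the example partitions held at each simulated site, and the misclassified examples are gathered for the next window:

```python
import numpy as np

def collect_counterexamples(model, sites):
    """Visit the sites in turn (round-robin) and test a copy of the model
    on each local partition; return the misclassified examples.

    `sites` is a list of (X_i, y_i) partitions, one per (simulated) site.
    """
    missed_X, missed_y = [], []
    for X_i, y_i in sites:
        mask = model.predict(X_i) != y_i   # local test of the model copy
        missed_X.append(X_i[mask])
        missed_y.append(y_i[mask])
    return np.vstack(missed_X), np.concatenate(missed_y)
```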

2.2. Datasets

Table 2 lists the datasets selected from the UCI [12] and MOA [13] repositories to conduct our experiments. They vary in the number of instances, attributes, and class values, as well as in the type of the attributes. Some of them are affected by missing values. The literature [10] reports experiments on larger datasets, up to 4.8 × 10⁶ instances, exploiting GPUs. However, datasets with higher dimensions are problematic, e.g., imdb-D with 1002 attributes does not converge using the Counter strategy.

2.3. Experiments

Two experiments were designed to address the issues approached by this work, i.e., the generalization of windowing beyond decision trees; a deeper understanding of its behavior in informational terms; and the comparison with traditional sampling methods. All of them were executed on an Intel Core i5-8300H at 2.3 GHz (up to 3.9 GHz) with 8 GB of DDR4 RAM. Eight distributed sites were simulated on this machine. JaCa-DDM also allows the adoption of real distributed sites over a network, but the aspects of windowing studied here are not affected by simulating distribution.

2.3.1. On the Generalization of Windowing

The first experiment seeks to corroborate the correlation between the accuracy of the learned model and the number of instances used to induce it, providing practical evidence about the generalization of windowing. For this, different Weka classifiers are adopted to replace J48; JaCa-DDM allows easy replacement and configuration of the classifier artifacts of the system, namely:
  • Naive Bayes: A probabilistic classifier based on Bayes' theorem with a strong assumption of independence among attributes [14].
  • jRip: An inductive rule learner based on RIPPER that builds a set of rules while minimizing the amount of error [15].
  • Multilayer-perceptron: A multi-layer perceptron trained by backpropagation, with sigmoid nodes except for numeric classes, in which case the output nodes become unthresholded linear units [16].
  • SMO: An implementation of John Platt's sequential minimal optimization algorithm for training a support vector classifier [17].
All classifiers are induced by running a 10-fold stratified cross-validation on each dataset, then observing the average accuracy of the obtained models and the average percentage of the original dataset used to induce the model, i.e., 100% means the full original dataset was used to create the window.
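As an illustration, the following is a minimal sketch of this experimental loop, reusing the windowing() function sketched in Section 1 and assuming the closest scikit-learn counterparts of the Weka classifiers (GaussianNB for Naive Bayes, MLPClassifier for the multilayer perceptron, SVC for SMO; scikit-learn ships no RIPPER implementation, so jRip has no direct stand-in):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Assumed stand-ins for the Weka classifiers listed above.
classifiers = {"NB": GaussianNB(),
               "MP": MLPClassifier(max_iter=500),
               "SMO": SVC()}

def windowing_cv(X, y, base_model, n_splits=10, seed=0):
    """10-fold stratified CV: mean accuracy and mean window size (% of training fold)."""
    accs, fracs = [], []
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr, te in skf.split(X, y):
        model, window = windowing(X[tr], y[tr], clone(base_model))  # sketch from Section 1
        accs.append((model.predict(X[te]) == y[te]).mean())
        fracs.append(100.0 * len(window) / len(tr))
    return np.mean(accs), np.mean(fracs)
```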

2.3.2. On the Properties of Samples and Models Obtained by Windowing

The second experiment pursues a deeper understanding of the informational properties of the computed models, as well as those of the samples obtained by Windowing, i.e., the final windows. For this, given the positive results of the first experiment, we focus exclusively on decision trees (J48), for which different metrics to evaluate performance, complexity and data compression are well known. They include:
  • The model accuracy, defined as the percentage of correctly classified instances:
$$\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
    where $TP$, $TN$, $FP$, and $FN$ respectively stand for the true positive, true negative, false positive, and false negative classifications on the test data.
  • The AUC, defined as the probability that a random instance is correctly classified [18]:
$$AUC = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$
    Although this measure was conceived for binary classification problems, Provost and Domingos [19] propose an implementation for multi-class problems, based on the weighted average of the AUC of every class computed in a one-against-all fashion, where the weight of each AUC is the class's appearance frequency in the data, $p(c_i)$:
$$AUC_{total} = \sum_{c_i \in C} AUC(c_i) \cdot p(c_i)$$
  • The MDL principle states that the best model to infer from a dataset is the one that minimizes the sum of the length of the model, $L(H)$, and the length of the data when encoded using the theory as a predictor for the data, $L(D|H)$ [20]:
$$MDL = L(H) + L(D|H)$$
    For decision trees, Quinlan [21] proposes the following definitions:
    • The number of bits needed to encode a tree is:
$$L(H) = n_{nodes}\,(1 + \ln(n_{attributes})) + n_{leaves}\,(1 + \ln(n_{classes}))$$
      where $n_{nodes}$, $n_{attributes}$, $n_{leaves}$, and $n_{classes}$ stand for the number of nodes, attributes, leaves, and classes. This encoding uses a recursive top-down, depth-first procedure, where a tree that is not a leaf is encoded as a sequence of a 1, the attribute code at its root, and the respective encodings of its subtrees. If a tree or subtree is a leaf, its encoding is a sequence of a 0 and the class code.
    • The number of bits needed to encode the data using the decision tree is:
$$L(D|H) = \sum_{l \in Leaves} \left( \log_2(b + 1) + \log_2 \binom{n}{k} \right)$$
      where $n$ is the number of instances, $k$ is the number of positive instances for binary classification, and $b$ is a known a priori upper bound on $k$, typically $b = n$. For non-binary classification, Quinlan proposes an iterative approach where exceptions are sorted by their frequency and then codified with the previous formula.
  • The Kullback–Leibler divergence ($D_{KL}$) [22], defined as:
$$D_{KL}(P \,\|\, Q) = \sum_{x \in X} P(x) \log_2 \frac{P(x)}{Q(x)}$$
    where $P$ and $Q$ are the probability distributions of the full dataset and the window, respectively, both defined on the same probability space $X$, and $x$ represents a class in the distribution. Instead of using a model to represent a conditional distribution of variables, as usual, we focus on the class distribution, computed as the marginal probability. Values closer to zero reflect higher similarity.
  • $Sim1$ [23], a similarity measure between datasets, defined as:
$$sim1(D_i, D_j) = \frac{|Item(D_i) \cap Item(D_j)|}{|Item(D_i) \cup Item(D_j)|}$$
    where $D_i$ is the window, $D_j$ is the full dataset, and $Item(D)$ denotes the set of attribute-value pairs occurring in $D$. Values closer to one reflect higher similarity. A sketch for computing the last two metrics is given after this list.
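The class-distribution divergence and the attribute-value similarity translate into a few lines of code. The following is a minimal sketch under the definitions above (function names are ours, not JaCa-DDM's; numeric attributes are assumed to be discretized so that attribute-value pairs are well defined for $Sim1$):

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-12):
    """D_KL(P || Q) over class counts, in bits; eps is an assumption of this
    sketch, avoiding degenerate ratios when a class is absent from a sample."""
    p = p_counts / p_counts.sum()
    q = q_counts / q_counts.sum()
    return float(np.sum(p * np.log2((p + eps) / (q + eps))))

def sim1(window_rows, dataset_rows):
    """Sim1: Jaccard similarity of the sets of (attribute, value) pairs."""
    items = lambda rows: {(j, v) for row in rows for j, v in enumerate(row)}
    a, b = items(window_rows), items(dataset_rows)
    return len(a & b) / len(a | b)
```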
These metrics are used to compare the sample (the window) and the model computed by windowing against those obtained as follows, once a random sample of the original dataset is reserved as a test set:
  • Without sampling, using all the available data to induce the model.
  • By Random sampling, where any instance has the same selection probability [24].
  • By Stratified random sampling, where the instances are subdivided by their class into subgroups, and the number of instances selected per subgroup is proportional to the subgroup's frequency in the dataset, so that the original class distribution is preserved [24].
  • By Balanced random sampling, where, as in stratified random sampling, the instances are subdivided by their class into subgroups, but the number of instances selected per subgroup is the sample size divided by the number of subgroups, which yields the same number of instances per class [24]. A sketch of these three samplings follows this list.
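A minimal sketch of the three samplings, assuming a numpy random Generator `rng` and a label vector `y` (helper names are ours):

```python
import numpy as np

def random_sampling(y, m, rng):
    """Uniform sample of m indices: every instance is equally likely."""
    return rng.choice(len(y), size=m, replace=False)

def stratified_sampling(y, m, rng):
    """Per-class quotas proportional to class frequency (rounding may make
    the total differ from m by a few instances)."""
    picks = [rng.choice(np.flatnonzero(y == c),
                        size=int(round(m * np.mean(y == c))), replace=False)
             for c in np.unique(y)]
    return np.concatenate(picks)

def balanced_sampling(y, m, rng):
    """Equal per-class quotas: m divided by the number of classes; a class
    smaller than its quota contributes all of its instances."""
    classes = np.unique(y)
    quota = m // len(classes)
    picks = [rng.choice(idx, size=min(quota, len(idx)), replace=False)
             for idx in (np.flatnonzero(y == c) for c in classes)]
    return np.concatenate(picks)
```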
Ten repetitions of 10-fold stratified cross-validation are run on each dataset. For a fair comparison, all the samples have the size of the window being compared. The statistical validity of the results is established following the method proposed by Demšar [25], which enables the comparison of multiple algorithms on multiple data sets and is based on the Friedman test with a corresponding post-hoc test. Let $R_i^j$ be the rank of the $j$-th of $k$ algorithms on the $i$-th of $N$ data sets. The Friedman test [26,27] compares the average ranks of the algorithms, $R_j = \frac{1}{N} \sum_i R_i^j$. Under the null hypothesis, which states that all the algorithms are equivalent and so their ranks $R_j$ should be equal, the Friedman statistic:
$$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$$
is distributed according to a $\chi^2$ distribution with $k-1$ degrees of freedom when $N$ and $k$ are big enough ($N > 10$ and $k > 5$). For a smaller number of algorithms and data sets, exact critical values have been computed [28]. Iman and Davenport [29] showed that Friedman's $\chi_F^2$ is undesirably conservative and derived an adjusted statistic:
$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2}$$
which is distributed according to the F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. If the null hypothesis of similar performances is rejected, the Nemenyi post-hoc test is performed for pairwise comparisons. The performance of two classifiers is significantly different if their corresponding average ranks differ by at least the critical difference:
$$CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$$
where the critical values $q_\alpha$ are based on the Studentized range statistic divided by $\sqrt{2}$.
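These statistics translate directly into code. The following is a minimal sketch of the three formulas above (the default $q_\alpha = 2.728$ is the $k = 5$, $\alpha = 0.05$ entry from Demšar's tables, matching the five methods compared in Section 3):

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

def friedman_nemenyi(scores, q_alpha=2.728):
    """scores: (N datasets x k methods) array, higher is better.
    Returns Friedman's chi^2, Iman-Davenport's F_F with its p-value,
    and the Nemenyi critical difference."""
    N, k = scores.shape
    ranks = np.apply_along_axis(lambda r: rankdata(-r), 1, scores)  # rank 1 = best
    R = ranks.mean(axis=0)                               # average rank per method
    chi2 = 12 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4)
    F = (N - 1) * chi2 / (N * (k - 1) - chi2)            # Iman-Davenport correction
    p = f_dist.sf(F, k - 1, (k - 1) * (N - 1))
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))        # Nemenyi critical difference
    return chi2, F, p, cd
```

For metrics where lower is better (e.g., MDL), the negation inside rankdata would be dropped.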
For the comparison of multiple classifiers, the results of the post-hoc tests can be visually represented with a simple Critical Difference diagram. This type of visualization is described in the Statistical Tests subsection of Section 3.

3. Results

Results are organized according to the following issues:
  • Generalization of the behavior of windowing, i.e., high accuracy correlating with fewer training examples used to induce the model, when inductive algorithms other than J48 are adopted.
  • Informational properties of the samples obtained by different methods, based on the Kullback–Leibler divergence and the attribute-value similitude.
  • Properties of the models induced with the samples, in terms of their size, complexity, and data compression, which supplies information about their data fitting capacity.
  • Predictive performance of the induced models in terms of accuracy and the AUC.
  • Statistical tests about significant gains produced by windowing using the former metrics.

3.1. Windowing Generalization

Figure 1 shows a strong negative correlation between the number of training instances used to induce the models, expressed as a percentage of the totality of available examples, and the accuracy of the induced model. Such correlation exists independently of the adopted inductive algorithm. These results are consistent with the behavior of windowing when using J48, as reported in the literature [9], and corroborate that, under windowing, the models with higher accuracy generally use fewer examples to be induced.
However, accuracy is affected by the adopted inductive algorithm, e.g., Hypothyroid is approached very well by jRip (accuracy of 99.23 ± 0.48) requiring few examples (5% of the full dataset), while Multilayer-Perceptron is not quite as successful in this case (accuracy of 92.26 ± 2.75) and requires more examples (24%). This behavior is also observed between SMO and jRip for Waveform5000. These observations motivated analyzing the properties of the samples and induced models, as described in the following subsections. Table 3 shows the accuracy results in detail and Table 4 shows the number of examples used to induce the models; best results are highlighted in gray. Appendix A shows the accuracy values of models induced without windowing under a 10-fold cross-validation. Windowing accuracies are comparable to those obtained without windowing; Table 7 also corroborates this for the J48 classifier.
Large datasets such as Adult, Letter, Poker-lsn, Splice, and Waveform5000 did not finish in reasonable time when using jRip, Multilayer-Perceptron, and SMO, with and without windowing. In such cases, results are reported as not available (na). This might be solved by running the experiments on a real cluster of 8 nodes, instead of simulating the sites on a single machine as done here, but it is not relevant for the purposes of this work. In the following results, the Poker-lsn dataset was excluded because the cross-validation runs do not finish in reasonable time; this might be solved with more computational power. The results were kept this way because they illustrate that some classifiers exhibit a computational cost that precludes convergence.

3.2. Samples Properties

For each dataset considered in this work, Table 5 shows some properties of the samples obtained by the following methods: windowing, as described before; the Full-Dataset under a 10-fold cross-validation (90% of all available data); and the random, stratified, and balanced samplings. Properties include the size of the sample in terms of the number of instances; the standard deviation of the class distribution (St. Dv. C.D.); and two measures of similarity between the samples and the original dataset: the Kullback–Leibler divergence and the metric $sim1$. With the exception of Full-Dataset, the size of every sample is determined by the windowing method and its auto-stop criterion. For the sake of fairness, windowing is executed first, and the size of the sample obtained in this way is adopted for the rest of the sampling methods. Reductions in the size of the training set are as large as 97% of the available data (Hypothyroid).
According to the Kullback–Leibler divergence, windowing is the method that skews the original class distribution the most in non-balanced datasets. It is also observed that the class distribution in the windows is more balanced, and its effectiveness probably depends on the number of available examples of the minority classes. For instance, Full-Dataset shows an unbalanced class distribution in Hypothyroid (St. Dv. C.D. = 0.449), while windowing obtained a coefficient of 0.293. Windowing cannot completely balance the number of examples per class, since the percentage of available examples of the minority classes is around 5%. The random sampling, the Full-Dataset, and the stratified sampling do not tend to modify the class distribution. However, there does not seem to be a correlation between this coefficient and the obtained accuracy.
Full-Dataset is, without surprise, the sample that gathers the most attribute-value pairs from the original data, since it uses 90% of the available data. It is included in the results exclusively for comparison with the rest of the sampling methods. Table 5 also shows that windowing tends to collect more information content than all the samplings in most of the datasets, probably as a result of its heuristic nature. There are some datasets, like Breast and German, where all the techniques obtain a $Sim1$ value of one. Unfortunately, as in the previous case, this notion of similarity does not seem to correlate with the observed accuracy either; for instance, as mentioned, for Breast and German all the sampling methods gather all the original attribute-value pairs ($Sim1 = 1.0$), but while the accuracy obtained for Breast is around 95%, for German it is around 71%. In concordance with these results, the window for Breast uses 17% of the available examples, while German uses 64% (Table 5).

3.3. Model Complexity and Data Compression

Table 6 shows the results for the MDL, calculated using the test dataset. Regarding the number of bits required to encode a tree, $L(H)$, windowing and Full-Dataset tend to induce more complex models, i.e., trees with more nodes. This is probably because windowing favors the search for more difficult patterns in the set of available instances, which require more complex models to be expressed. Regarding the number of bits required to encode the test data given the induced decision tree, $L(D|H)$, a better compression is achieved using windowing and Full-Dataset than when using the traditional samplings. Large differences in data compression under windowing are exhibited in datasets like Mushroom, Segment, and Waveform-5000. One possible explanation is that the instances gathered by the traditional sampling techniques do not capture the nature of the data, because of their random selection and the small number of instances in the sample.
The sum of the former metrics, the MDL, reports larger models in most of the datasets when using windowing and Full-Dataset. This result does not represent an advantage in itself, but properties such as predictive performance also play an important role in model selection.

3.4. Predictive Performance

Table 7 shows the predictive performance in terms of accuracy and AUC. Even though the random, stratified, and balanced samplings usually induce simpler models, their decision trees do not seem to be more general than the windowing and Full-Dataset counterparts. In other words, the predictive ability of the decision trees induced with the traditional samplings is, most of the time, lower than that of the models induced using windowing and Full-Dataset. Models induced with windowing have practically the same accuracy as those obtained with Full-Dataset and sometimes even show a higher accuracy, e.g., Waveform-5000. In terms of AUC, windowing and Full-Dataset were the best methods, with balanced sampling pretty close to their performance.

3.5. Statistical Tests

The figures in this section visualize the results of the post-hoc Nemenyi test for the metrics previously shown in Table 5, Table 6 and Table 7. This compact, information-dense visualization, called a Critical Difference diagram, consists of a main axis on which the average rank of each method is plotted, along with a line that represents the Critical Difference (CD). Methods separated by a distance shorter than the CD are connected by a black line and are statistically indistinguishable, i.e., the evidence is not sufficient to conclude that their performances differ. In contrast, methods separated by a distance larger than the CD have a statistically significant difference in performance. The best performing methods are those with the lower rank values, shown on the left of the figure.
Figure 2 shows the results for the number of bits required to encode the induced models ( L ( H ) ) presented in Table 6. The groups of connected algorithms are not significantly different. In this case, the complexity of the models induced using windowing does not show significant differences with the complexity of the models induced using the Full-Dataset or balanced sampling.
Figure 3 shows the results in terms of data compression given the decision tree, $L(D|H)$. When the compressibility provided by the models is verified on a stratified sample of unseen data, windowing and Full-Dataset tend to compress significantly better than the traditional sampling methods. However, windowing tends to generate more complex models, probably because its heuristic behavior enables the search for more difficult patterns in the data.
Figure 4 shows the results in terms of MDL on the test set. Windowing and Full-Dataset do not show significant differences, nor are they statistically different from the traditional sampling methods. That is, the induced decision trees generally need the same number of bits to be represented.
Figure 5 shows the results for accuracy. Windowing performs very well, being almost as accurate as Full-Dataset, without significant differences. Both methods are strictly better than the random, balanced, and stratified samplings. When considering the AUC in Figure 6, the results are very similar, but the balanced sampling does not show significant differences with windowing and the Full-Dataset. Recall that both windowing and balanced sampling tend to balance the class distribution of the instances.
In terms of class distribution (Figure 7), windowing is known to be the method that tends to skew the distribution the most, given that the counterexamples added to the window in each iteration most probably belong to the current minority class. As expected, the balanced and the random sampling methods also skew the class distribution, showing no significant differences with windowing. According to the percentage of attribute-value pairs given by $Sim1$ (Figure 8), windowing and the traditional sampling methods cannot obtain the full set of attribute-value pairs included in the original dataset. Despite this, windowing is still very competent when it comes to prediction.

4. Conclusions

The generalization of the behavior of windowing beyond decision trees and the J48 algorithm has been corroborated. Independently of the inductive method used with windowing, high accuracies correlate with aggressive samplings that keep as little as 3% of the original datasets. This result motivated the study of the properties of the samples and models proposed in this work. Unfortunately, the Kullback–Leibler divergence and $sim1$ do not seem to correlate with accuracy, although the former is indicative of the balancing effect performed by windowing. MDL provided useful information in the sense that, although all methods generate models of similar complexity, it is important to identify which component of the MDL is more relevant in each case. For example, less complex decision trees, such as those induced by the random, balanced, and stratified samplings, are more general but less accurate. In contrast, decision trees with better data compression, such as those induced using windowing and Full-Dataset, tend to be larger but more accurate. The key factor that makes the difference is the significant reduction of the instances used for induction. Recall that determining the size of the samples is done automatically in windowing, based on the auto-stop condition of the method, whereas when using traditional sampling methods the size must be figured out by the user of the technique. To the best of our knowledge, this is the first comparative study of windowing in this respect. This work suggests future lines of research on windowing, including:
  • Adopting metrics for detecting relevant, noisy, and redundant instances to enhance the quality and size of the obtained samples, in order to improve the performance of the obtained models. Maillo et al. [30] review multiple metrics to describe the redundancy, complexity, and density of a problem, and also propose two big data metrics. These kinds of metrics may be helpful to select instances that provide quality information.
  • Studying the evolution of the windows over time, which can offer more insights about the behavior of windowing. The main difficulty here is adapting some of the used metrics, e.g., MDL, to models that are not decision trees.
  • Dealing with datasets of higher dimensions. Melgoza-Gutiérrez et al. [31] propose an agents & artifacts-based method to distribute vertical partitions of datasets and deal with the growing time complexity when datasets have a large number of attributes. It is expected that the understanding of windowing achieved here will contribute to combining these approaches.
  • Applying windowing to real problems. Limón et al. [10] apply windowing to the segmentation of colposcopic images presenting possible precancerous cervical lesions. Windowing is exploited there to distribute the computational cost of processing a dataset of 1.4 × 10⁶ instances and 30 attributes. The exploitation of windowing to cope with learning problems of a distributed nature remains to be explored.

Author Contributions

Conceptualization, D.M.-G. and A.G.-H.; methodology, D.M.-G., A.G.-H. and N.C.-R.; software, A.G.-H., X.L. and D.M.-G.; validation, A.G.-H., N.C.-R., X.L. and F.G.; formal analysis, D.M.-G. and A.G.-H.; investigation, A.G.-H. and D.M.-G.; resources, X.L.; writing—original draft preparation, D.M.-G.; writing—review and editing, A.G.-H., N.C.-R., X.L. and F.G.; visualization, D.M.-G.; project administration, A.G.-H. All authors have read and agreed to the published version of the manuscript.

Funding

The first author was funded by a scholarship from Consejo Nacional de Ciencia y Tecnología (CONACYT), Mexico, CVU:895160. The last author was supported by project RTI2018-095820-B-I00 (MCIU/AEI/FEDER, UE).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Appendix A. Results of Accuracy without Using Windowing

Table A1. Average accuracy without using windowing under a 10-fold cross validation (na = not available).

Dataset | J48 | NB | jRip | MP | SMO
Adult | 85.98 ± 0.28 | 83.24 ± 0.19 | 84.65 ± 0.16 | na | na
Australian | 87.10 ± 0.65 | 85.45 ± 1.57 | 84.44 ± 1.78 | 83.10 ± 1.28 | 86.71 ± 1.43
Breast | 96.16 ± 0.38 | 97.84 ± 0.51 | 95.03 ± 0.89 | 96.84 ± 0.77 | 96.67 ± 0.40
Credit-g | 73.59 ± 2.11 | 75.59 ± 1.04 | 73.45 ± 1.96 | 73.10 ± 0.72 | 76.66 ± 2.87
Diabetes | 72.95 ± 0.77 | 75.83 ± 1.17 | 78.27 ± 1.81 | 74.51 ± 1.46 | 78.02 ± 1.79
Ecoli | 84.44 ± 1.32 | 83.5 ± 1.64 | 82.25 ± 3.11 | 83.69 ± 1.44 | 83.93 ± 1.31
German | 73.89 ± 1.59 | 76.94 ± 2.29 | 70.06 ± 0.90 | 70.26 ± 0.96 | 74.55 ± 1.76
Hypothyroid | 99.48 ± 0.20 | 95.72 ± 0.68 | 99.60 ± 0.15 | 94.38 ± 0.25 | 94.01 ± 0.48
Kr-vs-kp | 99.31 ± 0.06 | 87.68 ± 0.43 | 99.37 ± 0.29 | 99.06 ± 0.13 | 96.67 ± 0.37
Letter | 87.81 ± 0.10 | 64.33 ± 0.28 | 86.34 ± 0.22 | na | na
Mushroom | 100.0 ± 0.00 | 95.9 ± 0.32 | 100.0 ± 0.00 | 100.0 ± 0.00 | 100.0 ± 0.00
Poker-lsn | 99.79 ± 0.00 | 59.33 ± 0.03 | na | na | na
Segment | 96.02 ± 0.29 | 79.95 ± 0.69 | 95.25 ± 0.52 | 95.61 ± 0.91 | 92.97 ± 0.36
Sick | 98.88 ± 0.29 | 93.13 ± 0.43 | 98.19 ± 0.22 | 95.81 ± 0.45 | 93.70 ± 0.56
Splice | 93.81 ± 0.39 | 95.05 ± 0.36 | 94.19 ± 0.27 | na | 93.46 ± 0.48
Waveform5000 | 75.58 ± 0.37 | 80.25 ± 0.33 | 79.54 ± 0.37 | na | 86.81 ± 0.21

References

  1. Quinlan, J.R. Induction over Large Data Bases; Technical Report STAN-CS-79-739; Computer Science Department, School of Humanities and Sciences, Stanford University: Stanford, CA, USA, 1979. [Google Scholar]
  2. Quinlan, J.R. Learning efficient classification procedures and their application to chess en games. In Machine Learning; Michalski, R.S., Carbonell, J.G., Mitchell, T.M., Eds.; Morgan Kaufmann: San Francisco, CA, USA, 1983; Volume I, Chapter 15; pp. 463–482. [Google Scholar] [CrossRef]
  3. Quinlan, J.R. Induction of Decision Trees. Mach. Learn. 1986, 1, 81–106. [Google Scholar] [CrossRef] [Green Version]
  4. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, CA, USA, 1993; Volume 1. [Google Scholar]
  5. Quinlan, J.R. Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 1996, 4, 77–90. [Google Scholar] [CrossRef] [Green Version]
  6. Wirth, J.; Catlett, J. Experiments on the Costs and Benefits of Windowing in ID3. In Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI, USA, 12–14 June 1988; Laird, J.E., Ed.; Morgan Kaufmann: San Mateo, CA, USA, 1988; pp. 87–99. [Google Scholar]
  7. Fürnkranz, J. Integrative windowing. J. Artif. Intell. Res. 1998, 8, 129–164. [Google Scholar] [CrossRef] [Green Version]
  8. Quinlan, J.R. Learning Logical Definitions from Relations. Mach. Learn. 1990, 5, 239–266. [Google Scholar] [CrossRef] [Green Version]
  9. Limón, X.; Guerra-Hernández, A.; Cruz-Ramírez, N.; Grimaldo, F. Modeling and implementing distributed data mining strategies in JaCa-DDM. Knowl. Inf. Syst. 2019, 60, 99–143. [Google Scholar] [CrossRef]
  10. Limón, X.; Guerra-Hernández, A.; Cruz-Ramírez, N.; Acosta-Mesa, H.G.; Grimaldo, F. A Windowing Strategy for Distributed Data Mining Optimized through GPUs. Pattern Recognit. Lett. 2017, 93, 23–30. [Google Scholar] [CrossRef]
  11. Witten, I.H.; Frank, E.; Hall, M.A. Data Mining: Practical Machine Learning Tools and Techniques; Morgan Kaufmann Publishers: Burlington, MA, USA, 2011. [Google Scholar]
  12. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 29 June 2020).
  13. Bifet, A.; Holmes, G.; Kirkby, R.; Pfahringer, B. MOA: Massive Online Analysis. J. Mach. Learn. Res. 2010, 11, 1601–1604. [Google Scholar]
  14. John, G.H.; Langley, P. Estimating Continuous Distributions in Bayesian Classifiers. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QC, Canada, 18–20 August 1995; Morgan Kaufmann: San Mateo, CA, USA, 1995; pp. 338–345. [Google Scholar]
  15. Cohen, W.W. Fast Effective Rule Induction. In Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; Morgan Kaufmann: San Francisco, CA, USA, 1995; pp. 115–123. [Google Scholar]
  16. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations; MIT Press: Cambridge, MA, USA, 1986; pp. 318–362. [Google Scholar]
  17. Platt, J. Fast Training of Support Vector Machines Using Sequential Minimal Optimization. In Advances in Kernel Methods: Support Vector Learning; Schoelkopf, B., Burges, C., Smola, A., Eds.; MIT Press: Cambridge, MA, USA, 1998. [Google Scholar]
  18. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  19. Provost, F.; Domingos, P. Well-Trained PETs: Improving Probability Estimation Trees (2000). Available online: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.309 (accessed on 29 June 2020).
  20. Rissanen, J. Stochastic Complexity and Modeling. Ann. Stat. 1986, 14, 1080–1100. [Google Scholar] [CrossRef]
  21. Quinlan, J.R.; Rivest, R.L. Inferring decision trees using the minimum description length principle. Inf. Comput. 1989, 80, 227–248. [Google Scholar] [CrossRef] [Green Version]
  22. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  23. Zhang, S.; Zhang, C.; Wu, X. Knowledge Discovery in Multiple Databases; Springer-Verlag London, Limited: London, UK, 2004. [Google Scholar]
  24. Ros, F.; Guillaume, S. Sampling Techniques for Supervised or Unsupervised Tasks; Springer: Cham, Switzerland, 2019. [Google Scholar] [CrossRef]
  25. Demšar, J. Statistical Comparisons of Classifiers over Multiple Data Sets. J. Mach. Learn. Res. 2006, 7, 1–30. [Google Scholar]
  26. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  27. Friedman, M. A Comparison of Alternative Tests of Significance for the Problem of m Rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  28. Zar, J.H. Biostatistical Analysis, 5th ed.; Prentice-Hall, Inc.: Upper Saddle River, NJ, USA, 2007. [Google Scholar]
  29. Iman, R.L.; Davenport, J.M. Approximations of the critical region of the fbietkan statistic. Commun. Stat. Theory Methods 1980, 9, 571–595. [Google Scholar] [CrossRef]
  30. Maillo, J.; Triguero, I.; Herrera, F. Redundancy and Complexity Metrics for Big Data Classification: Towards Smart Data. IEEE Access 2020, 8, 87918–87928. [Google Scholar] [CrossRef]
  31. Melgoza-Gutiérrez, J.; Guerra-Hernández, A.; Cruz-Ramírez, N. Collaborative Data Mining on a BDI Multi-Agent System over Vertically Partitioned Data. In Proceedings of the 13th Mexican International Conference on Artificial Intelligence, Tuxtla Gutiérrez, Mexico, 16–22 November 2014; Gelbukh, A., Castro-Espinoza, F., Galicia-Haro, S.N., Eds.; IEEE Computer Society: Los Alamitos, CA, USA, 2014; pp. 215–220. [Google Scholar]
Figure 1. Correlation between accuracy and percentage of used training examples when windowing: J48 = −0.98, NB = −0.96, jRip = −0.98, MP = −0.98, and SMO = −0.99. In general, the models with higher accuracy use fewer examples to be induced.
Figure 2. Demšar test regarding the required bits to encode trees, $L(H)$.
Figure 3. Demšar test regarding the required bits to encode the test data given the decision tree, $L(D|H)$.
Figure 4. Demšar test regarding the MDL computed on the test dataset.
Figure 5. Demšar test regarding the accuracy over the test dataset.
Figure 6. Demšar test regarding the AUC over the test dataset.
Figure 7. Demšar test regarding the Kullback–Leibler divergence.
Figure 8. Demšar test regarding $Sim1$.
Table 1. Configuration of the Counter strategy, adopted from Limón et al. [10].

Parameter | Value
Classifier | J48
Pruning | True
Number of nodes | 8
Maximum number of rounds | 15
Initial percentage for the window | 0.20
Validation percentage for the test | 0.25
Change step of accuracy every round | 0.35
Table 2. Datasets, adopted from UCI and MOA.

Dataset | Instances | Attributes | Attribute Type | Missing Values | Classes
Adult | 48842 | 15 | Mixed | Yes | 2
Australian | 690 | 15 | Mixed | No | 2
Breast | 683 | 10 | Numeric | No | 2
Diabetes | 768 | 9 | Mixed | No | 2
Ecoli | 336 | 8 | Numeric | No | 8
German | 1000 | 21 | Mixed | No | 2
Hypothyroid | 3772 | 30 | Mixed | Yes | 4
Kr-vs-kp | 3196 | 37 | Numeric | No | 2
Letter | 20000 | 17 | Mixed | No | 26
Mushroom | 8124 | 23 | Nominal | Yes | 2
Poker-lsn | 829201 | 11 | Mixed | No | 10
Segment | 2310 | 20 | Numeric | No | 7
Sick | 3772 | 30 | Mixed | Yes | 2
Splice | 3190 | 61 | Nominal | No | 3
Waveform5000 | 5000 | 41 | Numeric | No | 3
Table 3. Average windowing accuracy under a 10-fold cross validation (na = not available).

Dataset | J48 | NB | jRip | MP | SMO
Adult | 86.17 ± 0.55 | 84.54 ± 0.62 | na | na | na
Australian | 85.21 ± 4.77 | 85.79 ± 4.25 | 85.94 ± 3.93 | 81.74 ± 6.31 | 85.80 ± 4.77
Breast | 94.42 ± 3.97 | 97.21 ± 2.34 | 95.31 ± 2.75 | 95.45 ± 3.14 | 96.33 ± 3.12
Diabetes | 73.03 ± 3.99 | 76.03 ± 4.33 | 71.74 ± 7.67 | 72.12 ± 4.00 | 76.04 ± 3.51
Ecoli | 82.72 ± 6.81 | 83.93 ± 7.00 | 81.22 ± 6.63 | 82.12 ± 7.49 | 84.53 ± 4.11
German | 71.10 ± 5.40 | 75.20 ± 2.82 | 70.20 ± 3.85 | 69.60 ± 4.84 | 75.80 ± 3.12
Hypothyroid | 99.46 ± 0.17 | 95.36 ± 0.99 | 99.23 ± 0.48 | 92.26 ± 2.75 | 94.30 ± 0.53
Kr-vs-kp | 99.15 ± 0.66 | 96.65 ± 0.84 | 98.46 ± 0.95 | 98.72 ± 0.54 | 96.62 ± 0.75
Letter | 85.79 ± 1.24 | 69.28 ± 1.26 | 85.31 ± 1.06 | na | na
Mushroom | 100.00 ± 0.00 | 99.80 ± 0.16 | 100.00 ± 0.00 | 100.00 ± 0.00 | 100.0 ± 0.00
Poker-lsn | 99.75 ± 0.07 | 60.02 ± 0.42 | na | na | na
Segment | 96.53 ± 1.47 | 84.24 ± 1.91 | 95.54 ± 1.55 | 96.10 ± 1.15 | 92.42 ± 1.87
Sick | 98.64 ± 0.53 | 96.34 ± 1.44 | 97.93 ± 0.95 | 96.32 ± 1.04 | 96.71 ± 0.77
Splice | 94.04 ± 0.79 | 95.32 ± 1.07 | 92.75 ± 2.11 | na | 92.41 ± 1.34
Waveform5000 | 73.06 ± 2.55 | 82.36 ± 1.64 | 77.02 ± 1.59 | na | 85.94 ± 1.32
Table 4. Average size of the final window (the sample) under a 10-fold cross validation, as a fraction of the full dataset used for induction (na = not available).

Dataset | J48 | NB | jRip | MP | SMO
Adult | 0.30 ± 0.01 | 0.21 ± 0.00 | na | na | na
Australian | 0.31 ± 0.02 | 0.25 ± 0.01 | 0.33 ± 0.02 | 0.39 ± 0.04 | 0.27 ± 0.01
Breast | 0.17 ± 0.01 | 0.06 ± 0.00 | 0.14 ± 0.01 | 0.11 ± 0.01 | 0.09 ± 0.01
Diabetes | 0.54 ± 0.05 | 0.40 ± 0.02 | 0.52 ± 0.04 | 0.48 ± 0.03 | 0.42 ± 0.02
Ecoli | 0.38 ± 0.03 | 0.27 ± 0.01 | 0.40 ± 0.03 | 0.31 ± 0.03 | 0.29 ± 0.02
German | 0.56 ± 0.04 | 0.43 ± 0.01 | 0.59 ± 0.02 | 0.58 ± 0.02 | 0.47 ± 0.02
Hypothyroid | 0.05 ± 0.00 | 0.12 ± 0.01 | 0.05 ± 0.00 | 0.24 ± 0.01 | 0.12 ± 0.01
Kr-vs-kp | 0.08 ± 0.01 | 0.16 ± 0.01 | 0.13 ± 0.00 | 0.08 ± 0.00 | 0.12 ± 0.00
Letter | 0.35 ± 0.02 | 0.38 ± 0.00 | 0.39 ± 0.01 | na | na
Mushroom | 0.03 ± 0.00 | 0.04 ± 0.00 | 0.03 ± 0.00 | 0.02 ± 0.00 | 0.02 ± 0.00
Poker-lsn | 0.05 ± 0.00 | 0.59 ± 0.00 | na | na | na
Segment | 0.16 ± 0.01 | 0.22 ± 0.01 | 0.19 ± 0.01 | 0.14 ± 0.01 | 0.18 ± 0.00
Sick | 0.07 ± 0.00 | 0.10 ± 0.01 | 0.08 ± 0.00 | 0.11 ± 0.01 | 0.10 ± 0.00
Splice | 0.26 ± 0.01 | 0.11 ± 0.00 | 0.25 ± 0.01 | na | 0.19 ± 0.00
Waveform5000 | 0.59 ± 0.02 | 0.22 ± 0.01 | 0.52 ± 0.00 | na | 0.26 ± 0.01
Table 5. Samples properties.

Dataset | Method | Instances | St. Dv. C.D. | KL Div | Sim1
Adult | Windowing | 14502.840 ± 574.266 | 0.083 ± 0.004 | 0.128 ± 0.004 | 0.386 ± 0.012
Adult | Full-Dataset | 43957.800 ± 0.402 | 0.369 ± 0.000 | 0.000 ± 0.000 | 0.935 ± 0.001
Adult | Random-sampling | 14502.840 ± 574.266 | 0.374 ± 0.049 | 0.005 ± 0.005 | 0.418 ± 0.013
Adult | Stratified-sampling | 14502.840 ± 574.266 | 0.369 ± 0.000 | 0.000 ± 0.000 | 0.418 ± 0.013
Adult | Balanced-sampling | 14502.840 ± 574.266 | 0.000 ± 0.000 | 0.206 ± 0.000 | 0.400 ± 0.013
Australian | Windowing | 215.440 ± 14.363 | 0.031 ± 0.020 | 0.017 ± 0.008 | 0.999 ± 0.006
Australian | Full-Dataset | 621.000 ± 0.000 | 0.078 ± 0.001 | 0.000 ± 0.000 | 0.999 ± 0.005
Australian | Random-sampling | 215.440 ± 14.363 | 0.080 ± 0.047 | 0.004 ± 0.005 | 0.986 ± 0.016
Australian | Stratified-sampling | 215.440 ± 14.363 | 0.078 ± 0.004 | 0.000 ± 0.000 | 0.986 ± 0.016
Australian | Balanced-sampling | 215.440 ± 14.363 | 0.001 ± 0.002 | 0.009 ± 0.000 | 0.987 ± 0.016
Breast | Windowing | 109.210 ± 14.732 | 0.043 ± 0.030 | 0.086 ± 0.031 | 1.000 ± 0.000
Breast | Full-Dataset | 614.700 ± 0.461 | 0.212 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000
Breast | Random-sampling | 109.210 ± 14.732 | 0.224 ± 0.107 | 0.019 ± 0.017 | 1.000 ± 0.000
Breast | Stratified-sampling | 109.210 ± 14.732 | 0.215 ± 0.007 | 0.000 ± 0.000 | 1.000 ± 0.000
Breast | Balanced-sampling | 109.210 ± 14.732 | 0.003 ± 0.003 | 0.066 ± 0.003 | 1.000 ± 0.000
Diabetes | Windowing | 436.260 ± 27.768 | 0.087 ± 0.022 | 0.025 ± 0.009 | 0.751 ± 0.028
Diabetes | Full-Dataset | 691.200 ± 0.402 | 0.213 ± 0.001 | 0.000 ± 0.000 | 0.954 ± 0.004
Diabetes | Random-sampling | 436.260 ± 27.768 | 0.214 ± 0.021 | 0.001 ± 0.001 | 0.763 ± 0.028
Diabetes | Stratified-sampling | 436.260 ± 27.768 | 0.215 ± 0.002 | 0.000 ± 0.000 | 0.766 ± 0.028
Diabetes | Balanced-sampling | 436.260 ± 27.768 | 0.001 ± 0.001 | 0.067 ± 0.001 | 0.770 ± 0.028
Ecoli | Windowing | 126.640 ± 8.579 | 0.109 ± 0.005 | 0.182 ± 0.055 | 0.761 ± 0.026
Ecoli | Full-Dataset | 302.400 ± 0.492 | 0.145 ± 0.000 | 0.001 ± 0.001 | 0.979 ± 0.006
Ecoli | Random-sampling | 126.640 ± 8.579 | 0.147 ± 0.010 | 0.007 ± 0.010 | 0.763 ± 0.025
Ecoli | Stratified-sampling | 126.640 ± 8.579 | 0.154 ± 0.004 | 0.013 ± 0.003 | 0.758 ± 0.027
Ecoli | Balanced-sampling | 126.640 ± 8.579 | 0.099 ± 0.004 | 0.113 ± 0.028 | 0.781 ± 0.028
German | Windowing | 584.750 ± 25.308 | 0.119 ± 0.012 | 0.041 ± 0.006 | 1.000 ± 0.000
German | Full-Dataset | 900.000 ± 0.000 | 0.283 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000
German | Random-sampling | 584.750 ± 25.308 | 0.284 ± 0.022 | 0.001 ± 0.001 | 1.000 ± 0.000
German | Stratified-sampling | 584.750 ± 25.308 | 0.283 ± 0.001 | 0.000 ± 0.000 | 1.000 ± 0.000
German | Balanced-sampling | 584.750 ± 25.308 | 0.055 ± 0.022 | 0.079 ± 0.015 | 1.000 ± 0.000
Hypothyroid | Windowing | 151.680 ± 9.619 | 0.293 ± 0.017 | 0.262 ± 0.047 | 0.428 ± 0.017
Hypothyroid | Full-Dataset | 3394.800 ± 0.402 | 0.449 ± 0.000 | 0.000 ± 0.000 | 0.979 ± 0.005
Hypothyroid | Random-sampling | 151.680 ± 9.619 | 0.580 ± 0.149 | 0.212 ± 0.103 | 0.387 ± 0.020
Hypothyroid | Stratified-sampling | 151.680 ± 9.619 | 0.516 ± 0.007 | 0.000 ± 0.001 | 0.387 ± 0.013
Hypothyroid | Balanced-sampling | 151.680 ± 9.619 | 0.191 ± 0.004 | 0.668 ± 0.023 | 0.435 ± 0.016
Kr-vs-kp | Windowing | 242.550 ± 18.425 | 0.050 ± 0.036 | 0.010 ± 0.012 | 0.998 ± 0.004
Kr-vs-kp | Full-Dataset | 2876.400 ± 0.492 | 0.031 ± 0.000 | 0.000 ± 0.000 | 0.999 ± 0.004
Kr-vs-kp | Random-sampling | 242.550 ± 18.425 | 0.221 ± 0.130 | 0.106 ± 0.099 | 0.975 ± 0.013
Kr-vs-kp | Stratified-sampling | 242.550 ± 18.425 | 0.032 ± 0.003 | 0.000 ± 0.000 | 0.977 ± 0.009
Kr-vs-kp | Balanced-sampling | 242.550 ± 18.425 | 0.001 ± 0.001 | 0.001 ± 0.000 | 0.977 ± 0.008
Letter | Windowing | 7390.450 ± 491.435 | 0.008 ± 0.000 | 0.037 ± 0.002 | 0.989 ± 0.006
Letter | Full-Dataset | 18000.000 ± 0.000 | 0.001 ± 0.000 | 0.000 ± 0.000 | 0.999 ± 0.002
Letter | Random-sampling | 7390.450 ± 491.435 | 0.007 ± 0.001 | 0.022 ± 0.009 | 0.983 ± 0.008
Letter | Stratified-sampling | 7390.450 ± 491.435 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.985 ± 0.007
Letter | Balanced-sampling | 7390.450 ± 491.435 | 0.001 ± 0.000 | 0.001 ± 0.000 | 0.984 ± 0.006
Mushroom | Windowing | 219.490 ± 16.871 | 0.043 ± 0.033 | 0.004 ± 0.005 | 0.968 ± 0.021
Mushroom | Full-Dataset | 7311.600 ± 0.492 | 0.025 ± 0.000 | 0.000 ± 0.000 | 1.000 ± 0.000
Mushroom | Random-sampling | 219.490 ± 16.871 | 0.504 ± 0.244 | 2.083 ± 1.852 | 0.833 ± 0.072
Mushroom | Stratified-sampling | 219.490 ± 16.871 | 0.026 ± 0.004 | 0.000 ± 0.000 | 0.903 ± 0.032
Mushroom | Balanced-sampling | 219.490 ± 16.871 | 0.002 ± 0.002 | 0.001 ± 0.000 | 0.902 ± 0.033
Segment | Windowing | 371.280 ± 27.458 | 0.104 ± 0.008 | 0.390 ± 0.076 | 0.279 ± 0.015
Segment | Full-Dataset | 2079.000 ± 0.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.938 ± 0.003
Segment | Random-sampling | 371.280 ± 27.458 | 0.050 ± 0.007 | 0.105 ± 0.144 | 0.310 ± 0.019
Segment | Stratified-sampling | 371.280 ± 27.458 | 0.002 ± 0.001 | 0.000 ± 0.000 | 0.315 ± 0.018
Segment | Balanced-sampling | 371.280 ± 27.458 | 0.002 ± 0.001 | 0.000 ± 0.000 | 0.315 ± 0.018
Sick | Windowing | 264.600 ± 17.420 | 0.305 ± 0.028 | 0.233 ± 0.032 | 0.565 ± 0.019
Sick | Full-Dataset | 3394.800 ± 0.402 | 0.621 ± 0.000 | 0.000 ± 0.000 | 0.979 ± 0.005
Sick | Random-sampling | 264.600 ± 17.420 | 0.623 ± 0.066 | 0.015 ± 0.014 | 0.483 ± 0.018
Sick | Stratified-sampling | 264.600 ± 17.420 | 0.623 ± 0.002 | 0.000 ± 0.000 | 0.483 ± 0.014
Sick | Balanced-sampling | 264.600 ± 17.420 | 0.002 ± 0.001 | 0.665 ± 0.002 | 0.495 ± 0.014
Splice | Windowing | 835.300 ± 29.689 | 0.072 ± 0.011 | 0.036 ± 0.009 | 0.969 ± 0.043
Splice | Full-Dataset | 2871.000 ± 0.000 | 0.169 ± 0.047 | 0.000 ± 0.000 | 0.987 ± 0.034
Splice | Random-sampling | 835.300 ± 29.689 | 0.161 ± 0.000 | 0.014 ± 0.013 | 0.890 ± 0.060
Splice | Stratified-sampling | 835.300 ± 29.689 | 0.161 ± 0.001 | 0.000 ± 0.000 | 0.862 ± 0.036
Splice | Balanced-sampling | 835.300 ± 29.689 | 0.001 ± 0.001 | 0.104 ± 0.001 | 0.871 ± 0.046
Waveform-5000 | Windowing | 3263.590 ± 330.000 | 0.006 ± 0.004 | 0.000 ± 0.000 | 0.940 ± 0.018
Waveform-5000 | Full-Dataset | 4500.000 ± 0.000 | 0.004 ± 0.000 | 0.000 ± 0.000 | 0.983 ± 0.001
Waveform-5000 | Random-sampling | 3263.590 ± 330.000 | 0.018 ± 0.010 | 0.002 ± 0.002 | 0.932 ± 0.019
Waveform-5000 | Stratified-sampling | 3263.590 ± 330.000 | 0.004 ± 0.000 | 0.000 ± 0.000 | 0.932 ± 0.019
Waveform-5000 | Balanced-sampling | 3263.590 ± 330.000 | 0.000 ± 0.000 | 0.000 ± 0.000 | 0.932 ± 0.019
Table 6. Model complexity and test data compression.

Dataset | Method | L(H) | L(D|H) | MDL
Adult | Windowing | 1361.599 ± 465.850 | 2366.019 ± 59.709 | 3727.618 ± 483.653
Adult | Full-Dataset | 2077.010 ± 282.565 | 2374.002 ± 49.985 | 4451.012 ± 270.561
Adult | Random-sampling | 1009.386 ± 276.429 | 2420.278 ± 56.458 | 3429.664 ± 264.703
Adult | Stratified-sampling | 1031.172 ± 181.155 | 2410.870 ± 49.932 | 3442.042 ± 186.437
Adult | Balanced-sampling | 1351.736 ± 265.668 | 2423.024 ± 44.271 | 3774.759 ± 274.906
Australian | Windowing | 77.299 ± 29.067 | 41.284 ± 6.849 | 118.582 ± 30.088
Australian | Full-Dataset | 66.820 ± 16.934 | 41.044 ± 6.711 | 107.864 ± 17.430
Australian | Random-sampling | 45.151 ± 18.592 | 41.820 ± 6.916 | 86.971 ± 19.120
Australian | Stratified-sampling | 50.313 ± 22.016 | 41.836 ± 6.776 | 92.149 ± 21.220
Australian | Balanced-sampling | 44.603 ± 22.878 | 42.327 ± 6.764 | 86.929 ± 22.830
Breast | Windowing | 46.541 ± 13.199 | 25.904 ± 4.584 | 72.445 ± 12.435
Breast | Full-Dataset | 58.757 ± 7.942 | 25.338 ± 5.280 | 84.095 ± 8.195
Breast | Random-sampling | 22.301 ± 6.555 | 29.008 ± 7.229 | 51.309 ± 7.316
Breast | Stratified-sampling | 23.991 ± 6.915 | 28.631 ± 6.720 | 52.622 ± 8.350
Breast | Balanced-sampling | 22.767 ± 7.801 | 28.191 ± 5.710 | 50.959 ± 8.137
Diabetes | Windowing | 59.000 ± 37.207 | 65.437 ± 5.227 | 124.437 ± 37.477
Diabetes | Full-Dataset | 126.620 ± 46.019 | 64.383 ± 5.161 | 191.003 ± 45.988
Diabetes | Random-sampling | 95.960 ± 38.989 | 65.674 ± 4.884 | 161.634 ± 39.119
Diabetes | Stratified-sampling | 94.940 ± 39.261 | 64.354 ± 5.965 | 159.294 ± 39.505
Diabetes | Balanced-sampling | 104.840 ± 36.621 | 65.263 ± 5.003 | 170.103 ± 36.829
Ecoli | Windowing | 99.328 ± 23.152 | 29.959 ± 7.767 | 129.287 ± 23.257
Ecoli | Full-Dataset | 144.454 ± 19.804 | 27.648 ± 6.460 | 172.102 ± 18.623
Ecoli | Random-sampling | 69.348 ± 16.853 | 33.969 ± 9.853 | 103.317 ± 15.614
Ecoli | Stratified-sampling | 65.678 ± 16.214 | 34.174 ± 10.710 | 99.852 ± 16.457
Ecoli | Balanced-sampling | 83.869 ± 20.904 | 30.357 ± 7.087 | 114.226 ± 20.376
German | Windowing | 315.252 ± 60.182 | 82.866 ± 5.220 | 398.118 ± 60.077
German | Full-Dataset | 287.566 ± 54.049 | 83.857 ± 5.339 | 371.423 ± 53.413
German | Random-sampling | 211.627 ± 51.692 | 83.245 ± 5.156 | 294.871 ± 51.783
German | Stratified-sampling | 212.684 ± 54.545 | 83.006 ± 5.125 | 295.689 ± 53.830
German | Balanced-sampling | 238.184 ± 51.813 | 84.412 ± 5.352 | 322.596 ± 51.356
Hypothyroid | Windowing | 84.812 ± 19.108 | 28.291 ± 6.449 | 113.102 ± 20.727
Hypothyroid | Full-Dataset | 122.317 ± 10.791 | 27.105 ± 6.877 | 149.422 ± 10.562
Hypothyroid | Random-sampling | 15.667 ± 15.278 | 189.232 ± 110.454 | 204.899 ± 96.402
Hypothyroid | Stratified-sampling | 30.645 ± 6.465 | 67.493 ± 22.683 | 98.138 ± 22.336
Hypothyroid | Balanced-sampling | 45.353 ± 10.448 | 61.502 ± 18.798 | 106.854 ± 18.199
Kr-vs-kp | Windowing | 198.034 ± 14.570 | 69.919 ± 4.871 | 267.953 ± 14.944
Kr-vs-kp | Full-Dataset | 219.807 ± 16.870 | 69.345 ± 4.277 | 289.152 ± 17.014
Kr-vs-kp | Random-sampling | 64.438 ± 18.816 | 98.961 ± 21.032 | 163.399 ± 21.636
Kr-vs-kp | Stratified-sampling | 72.664 ± 18.341 | 92.724 ± 15.119 | 165.388 ± 15.947
Kr-vs-kp | Balanced-sampling | 73.848 ± 18.721 | 91.842 ± 14.262 | 165.690 ± 15.840
Letter | Windowing | 11862.644 ± 473.112 | 1248.697 ± 64.017 | 13111.341 ± 453.031
Letter | Full-Dataset | 12431.372 ± 180.896 | 1165.793 ± 38.869 | 13597.165 ± 182.617
Letter | Random-sampling | 7020.909 ± 385.222 | 1473.635 ± 81.356 | 8494.544 ± 358.576
Letter | Stratified-sampling | 7102.767 ± 358.000 | 1461.702 ± 80.161 | 8564.469 ± 328.131
Letter | Balanced-sampling | 7126.843 ± 381.507 | 1449.106 ± 76.567 | 8575.949 ± 354.232
Mushroom | Windowing | 79.249 ± 7.033 | 76.881 ± 4.163 | 156.130 ± 7.189
Mushroom | Full-Dataset | 77.237 ± 0.600 | 79.510 ± 1.744 | 156.747 ± 1.810
Mushroom | Random-sampling | 18.228 ± 19.552 | 461.838 ± 353.124 | 480.066 ± 337.153
Mushroom | Stratified-sampling | 31.126 ± 14.101 | 114.606 ± 23.525 | 145.732 ± 20.201
Mushroom | Balanced-sampling | 31.879 ± 15.063 | 113.501 ± 22.427 | 145.380 ± 17.422
Segment | Windowing | 348.723 ± 34.369 | 81.656 ± 10.719 | 430.379 ± 33.528
Segment | Full-Dataset | 365.928 ± 22.569 | 79.045 ± 9.609 | 444.973 ± 22.295
Segment | Random-sampling | 142.987 ± 22.538 | 135.754 ± 31.843 | 278.741 ± 31.578
Segment | Stratified-sampling | 142.715 ± 18.438 | 126.640 ± 24.516 | 269.356 ± 26.762
Segment | Balanced-sampling | 141.267 ± 17.852 | 127.325 ± 23.254 | 268.591 ± 26.010
Sick | Windowing | 170.530 ± 26.600 | 50.476 ± 8.212 | 221.005 ± 26.977
Sick | Full-Dataset | 182.701 ± 22.491 | 42.346 ± 7.910 | 225.047 ± 20.038
Sick | Random-sampling | 21.786 ± 16.605 | 80.715 ± 38.277 | 102.501 ± 24.810
Sick | Stratified-sampling | 31.126 ± 6.768 | 55.199 ± 13.736 | 86.325 ± 15.387
Sick | Balanced-sampling | 57.996 ± 17.446 | 60.045 ± 9.531 | 118.040 ± 18.444
Splice | Windowing | 725.951 ± 53.364 | 181.187 ± 11.871 | 907.139 ± 53.195
Splice | Full-Dataset | 745.146 ± 51.142 | 179.689 ± 11.014 | 924.834 ± 52.532
Splice | Random-sampling | 425.144 ± 52.153 | 187.097 ± 21.631 | 612.240 ± 47.209
Splice | Stratified-sampling | 443.339 ± 51.337 | 188.061 ± 19.286 | 631.400 ± 48.312
Splice | Balanced-sampling | 419.763 ± 41.676 | 188.473 ± 20.593 | 608.236 ± 40.687
Waveform-5000 | Windowing | 2418.668 ± 215.760 | 363.799 ± 56.499 | 2782.467 ± 224.433
Waveform-5000 | Full-Dataset | 2615.956 ± 94.305 | 415.810 ± 20.601 | 3031.766 ± 92.381
Waveform-5000 | Random-sampling | 1957.647 ± 203.398 | 413.447 ± 24.548 | 2371.094 ± 202.636
Waveform-5000 | Stratified-sampling | 1957.202 ± 199.174 | 417.104 ± 26.348 | 2374.306 ± 196.151
Waveform-5000 | Balanced-sampling | 1966.554 ± 193.650 | 417.152 ± 28.133 | 2383.706 ± 190.987
Table 7. Predictive performance.

Dataset | Method | Test Acc | Test AUC
Adult | Windowing | 86.355 ± 0.889 | 78.227 ± 1.161
Adult | Full-Dataset | 86.074 ± 0.390 | 77.080 ± 0.823
Adult | Random-sampling | 85.516 ± 0.423 | 76.131 ± 2.021
Adult | Stratified-sampling | 85.677 ± 0.401 | 76.680 ± 0.885
Adult | Balanced-sampling | 80.489 ± 0.722 | 81.956 ± 0.580
Australian | Windowing | 85.710 ± 4.355 | 85.471 ± 4.411
Australian | Full-Dataset | 86.536 ± 3.969 | 86.239 ± 4.041
Australian | Random-sampling | 85.101 ± 4.375 | 84.849 ± 4.517
Australian | Stratified-sampling | 85.391 ± 4.164 | 85.142 ± 4.266
Australian | Balanced-sampling | 85.536 ± 3.925 | 85.584 ± 3.854
Breast | Windowing | 94.829 ± 2.804 | 94.368 ± 3.117
Breast | Full-Dataset | 95.533 ± 2.674 | 95.058 ± 2.830
Breast | Random-sampling | 92.696 ± 3.821 | 91.687 ± 4.739
Breast | Stratified-sampling | 92.783 ± 3.485 | 91.956 ± 3.982
Breast | Balanced-sampling | 92.433 ± 3.558 | 92.301 ± 3.627
Diabetes | Windowing | 74.161 ± 4.864 | 70.041 ± 5.654
Diabetes | Full-Dataset | 74.756 ± 4.661 | 71.211 ± 5.027
Diabetes | Random-sampling | 72.280 ± 4.520 | 68.602 ± 5.403
Diabetes | Stratified-sampling | 73.222 ± 5.113 | 70.254 ± 5.721
Diabetes | Balanced-sampling | 71.018 ± 5.222 | 71.726 ± 4.937
Ecoli | Windowing | 82.777 ± 6.353 | 88.848 ± 4.134
Ecoli | Full-Dataset | 82.822 ± 5.467 | 88.873 ± 3.567
Ecoli | Random-sampling | 80.059 ± 6.268 | 86.924 ± 4.218
Ecoli | Stratified-sampling | 79.586 ± 6.227 | 86.721 ± 4.113
Ecoli | Balanced-sampling | 79.405 ± 6.360 | 86.981 ± 4.034
German | Windowing | 71.660 ± 4.608 | 63.119 ± 5.518
German | Full-Dataset | 71.300 ± 3.765 | 62.605 ± 4.388
German | Random-sampling | 71.800 ± 3.782 | 62.867 ± 4.408
German | Stratified-sampling | 71.640 ± 3.799 | 62.857 ± 4.546
German | Balanced-sampling | 67.820 ± 4.448 | 66.833 ± 4.014
Hypothyroid | Windowing | 99.483 ± 0.346 | 98.880 ± 1.204
Hypothyroid | Full-Dataset | 99.528 ± 0.353 | 98.871 ± 1.259
Hypothyroid | Random-sampling | 94.340 ± 2.524 | 70.634 ± 23.378
Hypothyroid | Stratified-sampling | 96.877 ± 1.652 | 94.594 ± 4.769
Hypothyroid | Balanced-sampling | 96.236 ± 1.831 | 97.598 ± 1.421
Kr-vs-kp | Windowing | 99.302 ± 0.583 | 99.294 ± 0.594
Kr-vs-kp | Full-Dataset | 99.415 ± 0.433 | 99.412 ± 0.433
Kr-vs-kp | Random-sampling | 94.171 ± 2.959 | 94.139 ± 3.061
Kr-vs-kp | Stratified-sampling | 94.956 ± 1.766 | 94.956 ± 1.802
Kr-vs-kp | Balanced-sampling | 94.984 ± 1.727 | 94.996 ± 1.756
Letter | Windowing | 87.161 ± 2.074 | 93.324 ± 1.078
Letter | Full-Dataset | 87.943 ± 0.720 | 93.731 ± 0.375
Letter | Random-sampling | 82.216 ± 1.006 | 90.753 ± 0.523
Letter | Stratified-sampling | 82.376 ± 1.148 | 90.836 ± 0.597
Letter | Balanced-sampling | 82.430 ± 1.160 | 90.864 ± 0.603
Mushroom | Windowing | 100.000 ± 0.000 | 100.000 ± 0.000
Mushroom | Full-Dataset | 100.000 ± 0.000 | 100.000 ± 0.000
Mushroom | Random-sampling | 73.746 ± 23.610 | 73.625 ± 23.684
Mushroom | Stratified-sampling | 98.367 ± 0.813 | 98.312 ± 0.831
Mushroom | Balanced-sampling | 98.424 ± 0.819 | 98.376 ± 0.831
Segment | Windowing | 96.329 ± 1.655 | 97.859 ± 0.965
Segment | Full-Dataset | 96.710 ± 1.335 | 98.081 ± 0.779
Segment | Random-sampling | 90.719 ± 3.181 | 94.586 ± 1.855
Segment | Stratified-sampling | 91.515 ± 2.074 | 95.051 ± 1.210
Segment | Balanced-sampling | 91.455 ± 1.984 | 95.015 ± 1.157
Sick | Windowing | 98.688 ± 0.640 | 93.667 ± 3.370
Sick | Full-Dataset | 98.741 ± 0.523 | 93.662 ± 3.323
Sick | Random-sampling | 96.193 ± 1.887 | 75.662 ± 19.843
Sick | Stratified-sampling | 97.301 ± 1.051 | 86.908 ± 6.166
Sick | Balanced-sampling | 94.785 ± 1.855 | 94.812 ± 2.641
Splice | Windowing | 94.132 ± 1.682 | 95.626 ± 1.344
Splice | Full-Dataset | 94.216 ± 1.474 | 95.723 ± 1.125
Splice | Random-sampling | 89.997 ± 2.226 | 92.370 ± 1.951
Splice | Stratified-sampling | 90.339 ± 1.973 | 92.757 ± 1.572
Splice | Balanced-sampling | 89.846 ± 2.199 | 92.902 ± 1.570
Waveform-5000 | Windowing | 83.802 ± 9.864 | 87.848 ± 7.402
Waveform-5000 | Full-Dataset | 75.202 ± 1.989 | 81.396 ± 1.493
Waveform-5000 | Random-sampling | 75.046 ± 2.159 | 81.279 ± 1.619
Waveform-5000 | Stratified-sampling | 75.252 ± 1.981 | 81.431 ± 1.487
Waveform-5000 | Balanced-sampling | 75.514 ± 2.143 | 81.628 ± 1.609
