Data Preprocessing in Pattern Recognition: Recent Progress, Trends and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (30 June 2020) | Viewed by 34117

Special Issue Editors


Guest Editor
Department of Computer Languages and Systems, Universitat Jaume I, 12071 Castelló de la Plana, Spain
Interests: pattern recognition; machine learning; data mining; data science

Guest Editor
División Multidisciplinaria en Ciudad Universitaria, Universidad Autónoma de Ciudad Juárez, Av. José de Jesús Delgado 18100, Ciudad Juárez 32310, Chihuahua, Mexico
Interests: big data classification; meta-learning; class imbalance; time series; ensembles; neural networks

Special Issue Information

Dear Colleagues,

The current availability of rich data sets from several sources creates new opportunities to develop pattern recognition systems across a diverse array of industry, government, health, and academic areas. To build accurate pattern recognizers for a given task, it is crucial to prepare the raw data set properly, converting inconsistent data into trustworthy data. In a pattern recognition project, roughly 80% of the effort goes into preparing data sets. The data preprocessing step is therefore vital to produce high-quality data and to build models with excellent generalization performance. Even though data preparation and data preprocessing techniques have been widely studied, they are frequently explored in isolation. However, several studies have shown that data sets may exhibit a mixture of data complexities, such as class imbalance, data set shift, class overlapping, and high feature dimensionality, among others.

This Special Issue aims to collect high-quality papers on recent advances and reviews that address the challenges of data transformation, integration, cleaning, normalization, feature selection, instance selection, and discretization. Furthermore, applications in which some of these intrinsic data characteristics appear are welcome.

Prof. J. Salvador Sánchez
Prof. Vicente García
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Class imbalance
  • Class overlapping
  • High dimensionality
  • Missing data
  • Dataset shift
  • Small size problem
  • Outlier and noisy data

Published Papers (11 papers)


Editorial

Jump to: Research

2 pages, 164 KiB  
Editorial
Special Issue on Data Preprocessing in Pattern Recognition: Recent Progress, Trends and Applications
by José Salvador Sánchez and Vicente García
Appl. Sci. 2022, 12(17), 8709; https://doi.org/10.3390/app12178709 - 30 Aug 2022
Viewed by 968
Abstract
The availability of rich data sets from several sources poses new opportunities to develop pattern recognition systems in a diverse array of industry, government, health, and academic areas [...]

Research

Jump to: Editorial

23 pages, 8842 KiB  
Article
News Classification for Identifying Traffic Incident Points in a Spanish-Speaking Country: A Real-World Case Study of Class Imbalance Learning
by Gilberto Rivera, Rogelio Florencia, Vicente García, Alejandro Ruiz and J. Patricia Sánchez-Solís
Appl. Sci. 2020, 10(18), 6253; https://doi.org/10.3390/app10186253 - 09 Sep 2020
Cited by 12 | Viewed by 2462
Abstract
‘El Diario de Juárez’ is a local newspaper in a city of 1.5 million Spanish-speaking inhabitants whose texts citizens read on both a website and an RSS (Really Simple Syndication) service. This research applies natural-language-processing and machine-learning algorithms to the news provided by the RSS service in order to classify each item by whether or not it reports a traffic incident, with the final intention of notifying citizens where such accidents occur. The classification process explores the bag-of-words technique with five learners (Classification and Regression Tree (CART), Naïve Bayes, kNN, Random Forest, and Support Vector Machine (SVM)) on a class-imbalanced benchmark; this challenging issue is dealt with via five sampling algorithms: synthetic minority oversampling technique (SMOTE), borderline SMOTE, adaptive synthetic sampling, random oversampling, and random undersampling. Consequently, our final classifier reaches a sensitivity of 0.86 and an area under the precision-recall curve of 0.86, which is an acceptable performance considering the complexity of analyzing unstructured texts in Spanish.
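As a minimal illustration of the oversampling idea this study relies on, the sketch below implements a bare-bones SMOTE in pure Python. It is only a sketch under simplifying assumptions: the paper works on bag-of-words text features and evaluates five sampling algorithms, while here the data are tiny made-up 2D points and only plain SMOTE interpolation is shown.

```python
import random
import math

def smote(minority, n_new, k=3, seed=42):
    """Generate n_new synthetic minority samples by interpolating
    between a random minority sample and one of its k nearest
    minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class (excluding x)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=4)
print(len(new_points))  # 4 synthetic samples
```

Because each synthetic point is a convex combination of two existing minority samples, it always lies on the segment between them, which is what distinguishes SMOTE from simple random oversampling (duplication).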

22 pages, 680 KiB  
Article
A New Under-Sampling Method to Face Class Overlap and Imbalance
by Angélica Guzmán-Ponce, Rosa María Valdovinos, José Salvador Sánchez and José Raymundo Marcial-Romero
Appl. Sci. 2020, 10(15), 5164; https://doi.org/10.3390/app10155164 - 27 Jul 2020
Cited by 25 | Viewed by 3156
Abstract
Class overlap and class imbalance are two data complexities that challenge the design of effective classifiers in Pattern Recognition and Data Mining, as they may cause a significant loss in performance. Several solutions have been proposed to face both difficulties, but most of these approaches tackle each problem separately. In this paper, we propose a two-stage under-sampling technique that combines the DBSCAN clustering algorithm, to remove noisy samples and clean the decision boundary, with a minimum spanning tree algorithm, to face the class imbalance, thus handling class overlap and imbalance simultaneously with the aim of improving classifier performance. An extensive experimental study shows significantly better behavior of the new algorithm compared to 12 state-of-the-art under-sampling methods, using three standard classification models (nearest neighbor rule, J48 decision tree, and support vector machine with a linear kernel) on both real-life and synthetic databases.
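The two-stage idea (clean noise first, then balance) can be sketched as follows. This is not the paper's algorithm: in place of DBSCAN and the minimum-spanning-tree step, the sketch uses a simplified DBSCAN-style density test and plain random undersampling, purely to illustrate the pipeline shape.

```python
import random
import math

def two_stage_undersample(majority, minority, eps=0.5, min_pts=3, seed=0):
    """Simplified two-stage undersampling: (1) drop majority samples that
    look like noise (fewer than min_pts majority neighbours within eps,
    a DBSCAN-style density criterion); (2) randomly subsample the cleaned
    majority class down to the minority-class size."""
    cleaned = [
        x for x in majority
        if sum(1 for y in majority if y is not x and math.dist(x, y) <= eps) >= min_pts
    ]
    rng = random.Random(seed)
    target = min(len(cleaned), len(minority))
    return rng.sample(cleaned, target)

majority = [(i * 0.1, 0.0) for i in range(20)] + [(5.0, 5.0)]  # last point is isolated noise
minority = [(0.0, 1.0), (0.2, 1.1), (0.1, 0.9)]
balanced_majority = two_stage_undersample(majority, minority)
print(len(balanced_majority))  # 3
```

The isolated majority point never survives stage one, and stage two leaves exactly as many majority samples as there are minority samples, so a classifier trained afterwards sees a balanced, less noisy training set.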

12 pages, 317 KiB  
Article
Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning
by Jose J. Valero-Mas and Francisco J. Castellanos
Appl. Sci. 2020, 10(10), 3356; https://doi.org/10.3390/app10103356 - 12 May 2020
Cited by 9 | Viewed by 1782
Abstract
Within the Pattern Recognition field, two representations are generally considered for encoding the data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees, or graphs. While the vast majority of classifiers are capable of addressing statistical spaces, only some particular methods are suitable for structural representations. The kNN classifier constitutes one of the scarce examples of algorithms capable of tackling both statistical and structural spaces. This method is based on computing the dissimilarity between all the samples of the set, which is the main reason for its high versatility but, in turn, for its low efficiency as well. Prototype Generation is one possibility for palliating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes are quite dependent on the data representation considered and are generally not well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters and then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results prove the relevance of our approach by showing a competitive compromise between classification rate and data reduction.
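The key primitive the abstract points at, retrieving the median element of a set of strings, can be sketched in a few lines: the set median is the member string minimising the summed edit distance to all others. This is a simplified stand-in (the exact cluster data and distance settings of the paper are not reproduced here).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def set_median(strings):
    """String in the set minimising the summed edit distance to all others."""
    return min(strings, key=lambda s: sum(levenshtein(s, t) for t in strings))

cluster = ["kitten", "sitten", "mitten", "bitten", "fitting"]
print(set_median(cluster))  # kitten
```

Computing the true (generalised) median string, which need not belong to the set, is NP-hard in general; restricting the search to set members, as above, is the usual practical compromise when a representative prototype per cluster is needed.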

21 pages, 1620 KiB  
Article
Detection of Non-Technical Losses Using SOSTLink and Bidirectional Gated Recurrent Unit to Secure Smart Meters
by Hira Gul, Nadeem Javaid, Ibrar Ullah, Ali Mustafa Qamar, Muhammad Khalil Afzal and Gyanendra Prasad Joshi
Appl. Sci. 2020, 10(9), 3151; https://doi.org/10.3390/app10093151 - 30 Apr 2020
Cited by 47 | Viewed by 3822
Abstract
Energy consumption is increasing exponentially with the increase in electronic gadgets, and losses occur during generation, transmission, and distribution. The growing energy demand leads to an increase in electricity theft (ET) on the distribution side. Data analysis is the process of assessing data using different analytical and statistical tools to extract useful information, and fluctuations in energy consumption patterns can indicate electricity theft. Utilities bear losses of millions of dollars every year. Hardware-based solutions are considered the best; however, their deployment cost is high. Software-based solutions are data-driven and cost-effective, although they require big data together with artificial intelligence and machine learning techniques. Several solutions have been proposed in existing studies; however, low detection performance and a high false positive rate are the major issues. In this paper, we employ, for the first time, a bidirectional Gated Recurrent Unit for ET detection as a classification task on real time-series data. We also propose a new scheme that combines the oversampling technique Synthetic Minority Oversampling TEchnique (SMOTE) with the undersampling technique Tomek Link: the “Smote Over Sampling Tomik Link (SOSTLink) sampling technique”. Kernel Principal Component Analysis is used for feature extraction. To evaluate the proposed model’s performance, five metrics are used: precision, recall, F1-score, Root Mean Square Error (RMSE), and the receiver operating characteristic curve. Experiments show that our proposed model outperforms state-of-the-art techniques: logistic regression, decision tree, random forest, support vector machine, convolutional neural network, long short-term memory, and a hybrid of multilayer perceptron and convolutional neural network.
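The Tomek-link half of the proposed sampling scheme is easy to illustrate: a Tomek link is a pair of opposite-class samples that are each other's nearest neighbour, and removing the majority member of each pair cleans the class boundary. The sketch below shows only this detection step on toy 2D points, not the full SOSTLink pipeline from the paper.

```python
import math

def tomek_links(X, y):
    """Find Tomek links: index pairs (i, j) of opposite-class samples
    that are mutual nearest neighbours."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: math.dist(X[i], X[j]))
    links = []
    for i in range(len(X)):
        j = nearest(i)
        # keep each mutual pair once (i < j)
        if y[i] != y[j] and nearest(j) == i and i < j:
            links.append((i, j))
    return links

X = [(0.0, 0.0), (1.0, 0.0), (1.1, 0.0), (5.0, 5.0)]
y = [0, 0, 1, 1]
print(tomek_links(X, y))  # [(1, 2)] — the boundary pair
```

In an undersampling setting one would then drop the majority-class member of each returned pair; combined with SMOTE-style oversampling this gives the SMOTE+Tomek family of hybrids the abstract builds on.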

16 pages, 4592 KiB  
Article
ProLSFEO-LDL: Prototype Selection and Label-Specific Feature Evolutionary Optimization for Label Distribution Learning
by Manuel González, José-Ramón Cano and Salvador García
Appl. Sci. 2020, 10(9), 3089; https://doi.org/10.3390/app10093089 - 29 Apr 2020
Cited by 8 | Viewed by 1971
Abstract
Label Distribution Learning (LDL) is a general learning framework that assigns an instance to a distribution over a set of labels rather than to a single label or multiple labels. Current LDL methods have proven their effectiveness in many real-life machine learning applications. In LDL problems, instance-based algorithms, and in particular the adapted version of the k-nearest neighbors method for LDL (AA-kNN), have proven to be very competitive, achieving acceptable results and allowing an explainable model. However, AA-kNN suffers from several handicaps: it has large storage requirements, it is inefficient at prediction, and it has a low tolerance to noise. The purpose of this paper is to mitigate these effects by adding a data reduction stage. The technique devised, called Prototype selection and Label-Specific Feature Evolutionary Optimization for LDL (ProLSFEO-LDL), is a novel method that simultaneously addresses prototype selection and label-specific feature selection as pre-processing techniques. Both techniques pose a complex optimization problem with a huge search space. Therefore, we propose a search method based on evolutionary algorithms that allows us to obtain a solution to both problems in a reasonable time. The effectiveness of the proposed ProLSFEO-LDL method is verified on several real-world LDL datasets, showing significant improvements in comparison with using raw datasets.

28 pages, 1941 KiB  
Article
Impact of Imbalanced Datasets Preprocessing in the Performance of Associative Classifiers
by Adolfo Rangel-Díaz-de-la-Vega, Yenny Villuendas-Rey, Cornelio Yáñez-Márquez, Oscar Camacho-Nieto and Itzamá López-Yáñez
Appl. Sci. 2020, 10(8), 2779; https://doi.org/10.3390/app10082779 - 16 Apr 2020
Cited by 4 | Viewed by 2058
Abstract
In this paper, an experimental study was carried out to determine the influence of imbalanced dataset preprocessing on the performance of associative classifiers, in order to find the best computational solutions to the problem of credit scoring. To do this, six undersampling algorithms, six oversampling algorithms, and four hybrid algorithms were evaluated on 13 imbalanced datasets related to credit scoring. Then, the performance of four associative classifiers was analyzed. The experiments carried out allowed us to determine which sampling algorithms had the best results, as well as their impact on the associative classifiers evaluated. Accordingly, we found that the Hybrid Associative Classifier with Translation, the Extended Gamma Associative Classifier, and the Naïve Associative Classifier do not improve their performance by using sampling algorithms for credit data balancing. On the other hand, the Smallest Normalized Difference Associative Memory classifier benefited from using oversampling and hybrid algorithms.

32 pages, 1600 KiB  
Article
Exploring the Patterns of Job Satisfaction for Individuals Aged 50 and over from Three Historical Regions of Romania. An Inductive Approach with Respect to Triangulation, Cross-Validation and Support for Replication of Results
by Daniel Homocianu, Aurelian-Petruș Plopeanu, Nelu Florea and Alin Marius Andrieș
Appl. Sci. 2020, 10(7), 2573; https://doi.org/10.3390/app10072573 - 09 Apr 2020
Cited by 6 | Viewed by 4299
Abstract
In this paper, we explore the determinants of being satisfied with a job, starting from a SHARE-ERIC dataset (Wave 7) that includes responses collected from Romania. To explore and discover reliable predictors in this large amount of data, notable above all for its staggeringly high number of dimensions, we followed the triangulation principle in science by using many different approaches, techniques, and applications to study such a complex phenomenon. For merging the data, cleaning it, and performing further derivations, we comparatively used many methods based on spreadsheets and their easy-to-use functions, custom filters and auto-fill options, DAX and OpenRefine expressions, traditional SQL queries, and powerful 1:1 merge statements in Stata. For data mining, we used three consecutive rounds: Microsoft SQL Server Analysis Services and SQL DMX queries on models built with both decision trees and naive Bayes algorithms applied to raw, memory-consuming text data; three LASSO variable selection techniques in Stata on recoded variables, followed by logistic and Poisson regressions with average marginal effects and generation of corresponding prediction nomograms operating directly in probabilistic terms; and finally the WEKA tool for additional validation. We obtained three Romanian regional models with excellent classification accuracy (AUROC > 0.9) and found several peculiarities in them. Moreover, we discovered that a good atmosphere in the workplace and receiving deserved recognition for work done are the top two most reliable predictors (dual-core) of career satisfaction, confirmed in this order of importance by many robustness checks. This type of meritocratic recognition has a more powerful influence on job satisfaction for male respondents than for female ones, and for married individuals rather than unmarried ones. When testing the dual-core on respondents aged 50 and over from most European countries (more than 75,000 observations), the positive surprise was that it undoubtedly held up, confirming most of our hypotheses along with the working principles of support for replication of results, triangulation, and the golden rule of robustness using cross-validation.

20 pages, 1821 KiB  
Article
An Ensemble of Locally Reliable Cluster Solutions
by Huan Niu, Nasim Khozouie, Hamid Parvin, Hamid Alinejad-Rokny, Amin Beheshti and Mohammad Reza Mahmoudi
Appl. Sci. 2020, 10(5), 1891; https://doi.org/10.3390/app10051891 - 10 Mar 2020
Cited by 36 | Viewed by 3116
Abstract
Clustering ensemble refers to an approach in which a number of (usually weak) base clusterings are performed and their consensus clustering is used as the final clustering. Since democratic decisions are generally better than dictatorial ones, it may seem obvious that ensemble decisions (here, clustering ensembles) are better than single-model decisions (here, single clusterings). But it is not guaranteed that every ensemble is better than a simple model. An ensemble is considered better if its members are valid or of high quality and if they participate in constructing the consensus clustering according to their qualities. In this paper, we propose a clustering ensemble framework that uses a simple clustering algorithm based on the k-medoids clustering algorithm. Our simple clustering algorithm guarantees that the discovered clusters are valid. It is also guaranteed that our clustering ensemble framework uses a mechanism to weight each discovered cluster according to its quality. To implement this mechanism, an auxiliary ensemble named the reference set is created by running the k-means clustering algorithm several times.
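The generic consensus step behind clustering ensembles can be sketched with a co-association matrix: count how often each pair of samples is clustered together across the base partitions, then link pairs that co-occur often. This sketch illustrates only that common building block, not the paper's specific k-medoids members, cluster-quality weighting, or reference set.

```python
def co_association(partitions, n):
    """Co-association matrix: fraction of base clusterings in which each
    pair of samples lands in the same cluster."""
    M = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    M[i][j] += 1 / len(partitions)
    return M

def consensus_clusters(M, threshold=0.5):
    """Greedy consensus: link samples whose co-association exceeds the
    threshold and return connected-component labels."""
    n = len(M)
    label = [-1] * n
    current = 0
    for i in range(n):
        if label[i] == -1:
            stack, label[i] = [i], current
            while stack:
                u = stack.pop()
                for v in range(n):
                    if label[v] == -1 and M[u][v] > threshold:
                        label[v] = current
                        stack.append(v)
            current += 1
    return label

# three base clusterings of six samples (e.g. from repeated k-means runs;
# note cluster IDs need not agree across runs)
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 1],
]
M = co_association(partitions, 6)
print(consensus_clusters(M))  # [0, 0, 0, 1, 1, 1]
```

The co-association view is label-permutation invariant (the second partition uses swapped cluster IDs) and tolerates one noisy member, which is exactly why consensus functions of this kind underpin most clustering ensemble frameworks.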

16 pages, 4115 KiB  
Article
A Novel Ensemble Framework Based on K-Means and Resampling for Imbalanced Data
by Huajuan Duan, Yongqing Wei, Peiyu Liu and Hongxia Yin
Appl. Sci. 2020, 10(5), 1684; https://doi.org/10.3390/app10051684 - 02 Mar 2020
Cited by 6 | Viewed by 2594
Abstract
Imbalanced classification is one of the most important problems of machine learning and data mining, present in many real datasets. In the past, basic classifiers such as SVM, KNN, and others have been applied to imbalanced datasets, in which one class has many more samples than another, but the classification results are not ideal. Some data preprocessing methods have been proposed to reduce the imbalance ratio of data sets and are combined with basic classifiers to get better performance. In order to improve the overall classification accuracy, we propose a novel classifier ensemble framework based on K-means and resampling techniques (EKR). First, we divide the majority-class samples into several sub-clusters using K-means, where the value of k is determined by the average silhouette coefficient. We then adjust the number of samples in each sub-cluster to match the minority class through resampling, so that each adjusted sub-cluster combined with the minority class forms a balanced subset. A base classifier is trained on each balanced subset separately, and the base classifiers are finally integrated into a strong ensemble classifier. Extensive experimental results on 16 imbalanced datasets demonstrate the effectiveness and feasibility of the proposed algorithm in terms of multiple evaluation criteria, and EKR achieves better performance when compared with several classical imbalanced classification algorithms using different data preprocessing methods.
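The subset-building-and-voting structure described above can be sketched as follows. This is a deliberately simplified stand-in: where the paper uses K-means (with k chosen by the average silhouette coefficient) and real base classifiers, the sketch chunks the shuffled majority class and uses a trivial nearest-centroid learner, just to show how balanced subsets feed an ensemble vote.

```python
import math
import random
from collections import Counter

def balanced_subsets(majority, minority, n_subsets=3, seed=1):
    """Split the majority class into sub-groups and pair each with the
    full minority class, resampling each group to the minority size.
    (The paper clusters with K-means; plain chunking is used here for brevity.)"""
    rng = random.Random(seed)
    maj = majority[:]
    rng.shuffle(maj)
    size = len(minority)
    chunks = [maj[i::n_subsets] for i in range(n_subsets)]
    subsets = []
    for chunk in chunks:
        resampled = chunk if len(chunk) == size else [rng.choice(chunk) for _ in range(size)]
        subsets.append((resampled + minority, [0] * size + [1] * size))
    return subsets

def nearest_centroid_predict(X, y, q):
    """Trivial base learner: predict the class whose centroid is closest to q."""
    centroids = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
    return min(centroids, key=lambda l: math.dist(q, centroids[l]))

def ensemble_predict(subsets, q):
    """Majority vote over the base learners, one per balanced subset."""
    votes = Counter(nearest_centroid_predict(X, y, q) for X, y in subsets)
    return votes.most_common(1)[0][0]

majority = [(v, 0.0) for v in [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]]
minority = [(0.5, 3.0), (0.7, 3.2), (0.6, 2.8)]
subsets = balanced_subsets(majority, minority)
print(ensemble_predict(subsets, (0.6, 3.0)))  # 1 (minority region)
```

Because every base learner sees a balanced training set, none of them is biased toward the majority class, which is the core advantage this family of ensemble frameworks claims over training a single classifier on the raw imbalanced data.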

15 pages, 548 KiB  
Article
Data Sampling Methods to Deal With the Big Data Multi-Class Imbalance Problem
by Eréndira Rendón, Roberto Alejo, Carlos Castorena, Frank J. Isidro-Ortega and Everardo E. Granda-Gutiérrez
Appl. Sci. 2020, 10(4), 1276; https://doi.org/10.3390/app10041276 - 14 Feb 2020
Cited by 65 | Viewed by 6730
Abstract
The class imbalance problem has been a hot topic in the machine learning community in recent years, and in the era of big data and deep learning it remains in force. Much work has been done to deal with the class imbalance problem, with random sampling methods (over- and under-sampling) being the most widely employed approaches. Moreover, sophisticated sampling methods have been developed, including the Synthetic Minority Over-sampling Technique (SMOTE), and these have been combined with cleaning techniques such as Edited Nearest Neighbor or Tomek’s Links (SMOTE+ENN and SMOTE+TL, respectively). In the big data context, it is noticeable that the class imbalance problem has been addressed by adapting traditional techniques, while relatively ignoring intelligent approaches. Thus, this work analyzes the capabilities and possibilities of heuristic sampling methods for deep learning neural networks in the big data domain, with particular attention to the cleaning strategies. The study is developed on big, multi-class imbalanced datasets obtained from hyper-spectral remote sensing images. The effectiveness of a hybrid approach is analyzed in which the dataset is balanced with SMOTE, an Artificial Neural Network (ANN) is trained with those data, the network’s output noise is processed with ENN, and the ANN is then trained again with the resultant dataset. The obtained results suggest that the best classification outcome is achieved when the cleaning strategies are applied to the ANN output rather than to the input feature space only. Consequently, the need to consider the classifier’s nature when classical class imbalance approaches are adapted to deep learning and big data scenarios is clear.
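The ENN cleaning rule the hybrid relies on is simple to state: discard every sample whose label disagrees with the majority label of its k nearest neighbours. The sketch below shows the classic input-space version on toy 2D points; note the paper's contribution is applying this kind of cleaning to the ANN's outputs rather than to the inputs, which is not reproduced here.

```python
import math
from collections import Counter

def enn_clean(X, y, k=3):
    """Edited Nearest Neighbour rule: drop every sample whose label
    disagrees with the majority label of its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted((j for j in range(len(X)) if j != i),
                            key=lambda j: math.dist(X[i], X[j]))[:k]
        majority_label = Counter(y[j] for j in neighbours).most_common(1)[0][0]
        if y[i] == majority_label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

X = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.15, 0.05),  # class 0 cluster
     (3.0, 3.0), (3.1, 3.1), (3.2, 3.0),                 # class 1 cluster
     (0.05, 0.05)]                                       # mislabeled as class 1
y = [0, 0, 0, 0, 1, 1, 1, 1]
Xc, yc = enn_clean(X, y)
print(len(Xc))  # 7: the mislabeled point is removed
```

Unlike SMOTE, which adds samples, ENN only removes them; combining the two (oversample, then clean) is what gives the SMOTE+ENN hybrid discussed in the abstract.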
