HyMOTree: Automatic Hyperparameters Tuning for Non-Technical Loss Detection Based on Multi-Objective and Tree-Based Algorithms

Coelho, Francisco Jonatas Siqueira; Feitosa, Allan Rivalles Souza; Alcântara, André Luís Michels; Li, Kaifeng; Lima, Ronaldo Ferreira; Silva, Victor Rios; da Silva-Filho, Abel Guilhermino

doi:10.3390/en16134971



by Francisco Jonatas Siqueira Coelho ¹,*, Allan Rivalles Souza Feitosa ¹, André Luís Michels Alcântara ², Kaifeng Li ³, Ronaldo Ferreira Lima ³, Victor Rios Silva ³ and Abel Guilhermino da Silva-Filho ¹

¹ Informatics Center (CIn), Federal University of Pernambuco, Recife 50670-901, PE, Brazil

² Eldorado Research Institute, Campinas 13083-898, SP, Brazil

³ Paulista Power and Light Company, Campinas 13070-740, SP, Brazil

* Author to whom correspondence should be addressed.

Energies 2023, 16(13), 4971; https://doi.org/10.3390/en16134971

Submission received: 24 April 2023 / Revised: 9 June 2023 / Accepted: 13 June 2023 / Published: 27 June 2023

(This article belongs to the Special Issue Artificial Intelligence and Machine Learning Applied to Energy Systems)


Abstract:

The most common methods to detect non-technical losses involve Deep Learning-based classifiers and samples of consumption remotely collected several times a day through Smart Meters (SMs) and Advanced Metering Infrastructure (AMI). This approach requires a huge amount of data, and training is computationally expensive. However, most energy meters in emerging countries such as Brazil are technologically limited. These devices can measure only the accumulated energy consumption monthly. This work focuses on detecting energy theft in scenarios without AMI and SM. We propose a strategy called HyMOTree intended for the hyperparameter tuning of tree-based algorithms using different multiobjective optimization strategies. Our main contributions are associating different multiobjective optimization strategies to improve the classifier performance and analyzing the model’s performance given different probability cutoff operations. HyMOTree combines NSGA-II and GDE-3 with Decision Tree, Random Forest, and XGBoost. A dataset provided by the Brazilian power distribution company CPFL ENERGIA™ was used, and the SMOTE technique was applied to balance the data. The results show that HyMOTree performed better than the random search method and that the combination of Random Forest and NSGA-II achieved 0.95 and 0.93 for Precision and F1-Score, respectively. Field studies showed that inspections guided by HyMOTree achieved an accuracy of 76%.

Keywords:

energy theft; non-technical losses; hyperparameter tuning; machine learning; multiobjective algorithm

1. Introduction

The electrical industry is divided into generation, transmission, and distribution. The generation is responsible for converting sources of mechanical energy or raw material into electricity. This stage usually occurs in hydroelectric, thermoelectric, wind, and solar power plants. After generation, the electrical voltage is raised to a few thousand volts and injected into the transmission network. It extends and branches out for thousands of kilometers until it reaches the power distribution company, which is responsible for distributing electricity that supplies the final consumer.

During this process, a portion of the generated energy is lost due to natural phenomena or intentional acts mainly in low-voltage networks, which are closer to the final consumer. According to Mendes [1], electrical losses in distribution systems correspond to 70% of total losses in the electrical industry. From the point of view of power distribution companies, losses can be defined as the difference between the electricity purchased and billed to their consumers [2], and they can be divided into Technical Losses (TL) and Non-Technical Losses (NTL).

The TL occurs due to the physical effects of the energy transformation and transmission processes, such as the joule effect, dielectric losses, and magnetic forces. Generally, adopting modern and efficient equipment in electrical systems, such as low-resistance cables or high-efficiency transformers, tends to reduce TLs.

On the other hand, the NTL is linked to faulty energy meters, measurement errors, or energy theft [2], and it impacts the distribution companies’ economic and financial balance and service quality. According to Medeiros [3], most of the NTL occurs in the low-voltage segment and costs up to US$96 billion per year [4]. Energy theft is mainly associated with socioeconomic characteristics and is more prevalent in countries with a lower development rate or poorer regions, such as peripheries and slums.

In Brazil, the NTL is determined monthly by the Electric Energy Trading Chamber, being apportioned at 50% for generation and 50% for consumers [2]. Thus, electricity theft also increases the energy bills of honest consumers.

Data from [5] in Figure 1 show the NTL percentage in countries that are part of the BRICS and the USA. In Brazil, in 2021, energy theft generated costs of approximately R$6.5 billion [6]. In addition to economic losses, energy theft reduces the quality of supply, representing a risk to public safety, as it alters the characteristics of the electrical network and is one of the major causes of fires and deaths from electrical discharges.

From an environmental point of view, energy theft delays global efforts to reduce emissions of harmful gases into the environment. Fraudulent customers require more of the world’s energy matrix, which is predominantly based on fossil fuels [7] such as coal, oil and derivatives, and natural gas [8]. Even Brazil, whose energy matrix is composed of 44.7% renewable sources, meets the demand peaks with non-renewable and polluting sources, such as coal, oil, and natural gas [9].

Power distribution companies have resorted to modern techniques based on machine learning to combat energy theft. To train these classifiers, consumption data are collected from customers several times a day by Smart Meters (SMs) and sent to power distribution companies through the Advanced Metering Infrastructure (AMI). AMI aggregates the monitoring, collection, and storage functionalities of consumption data, allowing consumers to manage their electricity costs. Furthermore, the AMI provides details about the operational state of energy networks [10].

The fundamental components of the AMI are the SMs, which are the evolution of traditional energy meters. Those devices can measure active and reactive power, voltage levels, current, power factor, and consumption interruption, among other important variables for setting up consumption patterns [11], while traditional meters only measure accumulated consumption in kilowatt hours (kWh). However, SMs are not available in some parts of the world. In countries such as Brazil, even in large centers, most electricity consumption meters are electromechanical or technologically limited. In this scenario, the most frequent technique to identify fraudulent customers is the on-site inspection of energy meters. However, this approach is very costly in both financial and human resources, given the need to send professionals to the suspected customer. In addition, it is a technique with low assertiveness.

Another popular strategy to combat energy theft, in a scenario without AMI or SM, is inspection motivated by complaints. This strategy is generally more accurate. However, complaints are sparse and do not cover all the fraud cases. Another well-known method is identifying a pattern in consumption characterized by a partial or total reduction in the consumption of a consumer unit. The main advantage of this strategy is that identifying the abnormal consumption pattern does not require sending field inspection teams beforehand.

Methods such as the ones mentioned above are still prevalent, but machine learning-based approaches have gained ground, as they only require scanning databases with historical consumption to infer suspicious consumption profiles. This approach has shown good assertiveness and lower cost when compared to the exhaustive search strategy or incursion motivated by suspicions.

The state-of-the-art shows that works focused on NTL detection commonly use public datasets and algorithms based on Deep Learning (DL). Furthermore, public datasets linked to NTL generally comprise data collected via AMI or SM. Their main limitations are unlabeled data or few samples about fraudulent consumer profiles. Another critical point is that these datasets may only represent the reality of some regions. Among the most popular datasets, we can mention the Irish Smart Energy Trial (ISET) [12] and the dataset from the State Grid Corporation of China (SGCC) [13].

ISET comprises consumption data from 5000 Electric Ireland residential and business customers. Data were collected via SM, with a sampling rate of 30 min, and corresponded to the period from 2009 to 2010. This dataset does not bring any information about energy theft. Its purpose is to depict information about energy consumption habits regarding electric energy. The SGCC dataset is very popular among scientific publications approaching the topic addressed here. It brings data on the energy consumption of more than 42,000 Chinese state-owned energy company customers collected by SMs, in which 8.55% of instances are labeled as fraudulent customers [14]. The data in the dataset correspond to the daily consumption of each customer, collected from 2014 to 2016, totaling 1035 days.

ISET is used in the work developed by Blazakis et al. [15]. The authors proposed an approach based on an Adaptive Neuro-Fuzzy Inference system called ANFIS. This proposal merges Artificial Neural Networks (ANNs) learning capacity and the inference from Fuzzy Logic. The ANFIS architecture comprises four layers: the product layer, the normalized layer, the defuzzification layer, and the summation layer. The ANFIS training process takes place in two phases: forward and backpropagation. As ISET does not bring energy theft cases, the authors generated synthetic data to simulate fraudster profiles.

In Hu et al. [11], the authors used three data sources: ISET, residential and industrial consumption data from a Chinese province, and the Yahoo S5 dataset [16], which is composed of real and synthetic time series. The Chinese province dataset comprises consumption samples of fraudsters and non-fraudsters, which are collected five times daily via AMI. Due to the few samples labeled as fraudster profiles, the authors resorted to data augmentation to generate synthetic data and balance the dataset. The model proposed by the authors, named MFEFD, is based on DL with semi-supervised learning. The MFEFD has two stages. In the first one, features are extracted from the customer consumption curve, and in the second, the classification takes place. In this approach, the classifier determines whether a consumer is a fraudster taking the seven-day consumption curve as input data.

Massaferro et al. [17] address the energy theft problem, considering the costs of sending inspection teams and the financial return for the utility. The work focused on a scenario without SM or AMI. The authors use two datasets with data on Uruguayan consumers. The first one has unlabeled samples and was used to generate synthetic data to simulate cases of fraudsters. This work compares the performance of the Support Vector Machine, Random Forest (RF), and ANN algorithms. The RF model showed the most suitable performance. ANN and Support Vector Machine-based regressors were used to estimate the amount of energy consumed by a fraudulent customer.

Ramos et al. [18] compare the performance of different algorithms for selecting the most relevant features in detecting customers that steal energy. The dataset used in the research is labeled and gathers information about the energy consumption history, contracted demand and measured demand, reactive power, power factor, transformer power, and load factor of 4952 commercial and 3182 industrial consumers. The Optimum-Path Forest algorithm was used as a reference classifier for testing the approaches. The authors applied the Wilcoxon test to compare the performances and verify whether their results differed significantly.

Yan et al. [19] also rely on ISET to train an XGBoost-based classifier. The authors consider a scenario with AMI. Classifiers trained with the SVM, Logistic Regression, RF, K-Nearest Neighbors Algorithm, Naive Bayes classifier, and AdaBoost algorithms are used for comparison purposes. Precision, False Positive Rate, Recall, and Area Under the Curve (AUC) are the evaluation metrics used. In this approach, the authors used the dataset to generate synthetic data related to six different types of theft situations.

In Ullah et al. [14], an approach that integrates the benefits of AlexNet and AdaBoost is proposed. The AlexNet network extracts features from the consumption curves while dealing with the dimensionality curse problem. AdaBoost is used in the classification step. It was observed that the performance of AdaBoost increases with a more significant number of trees. The dataset used was the SGCC dataset. To balance the dataset, an undersampling method was applied.

Hassan et al. [20] treat the consumption history as a time series and combine a Convolutional Neural Network (CNN) with a Long Short-Term Memory Neural Network (LSTM). The SGCC dataset was used in model training. The missing values were imputed based on historical consumption averages. To balance the dataset, the authors generated synthetic data using the Synthetic Minority Over-sampling Technique (SMOTE) [21]. This work trained the model with weekly, biweekly, and monthly data. During the performance tests, the authors used data from only 10,000 consumers. They compared the performance of the proposed model with SVM and Logistic Regression, concluding that the CNN-LSTM approach presents a more suitable accuracy. The results also show that the insertion of synthetic data improved the classifier’s performance in detecting the minority class (cases of non-technical losses) while slightly reducing the model’s accuracy in detecting instances of the majority class.

In the problem addressed, tabular data and categorical variables are common in datasets. CNNs, on the other hand, perform better with grid-like data such as images. LSTMs are commonly used with time series or sequential data. However, historical energy consumption may have different sampling rates, which can hinder the performance of models trained by LSTM.

DL-based models require a more significant amount of data to achieve satisfactory performance and generate a classifier whose decision process is not transparent. On the other hand, tree-based classification models require a lower computational cost in the training process and deal very well with missing or unbalanced data. In addition, the decision-making process in these inference models is usually interpretable.

This work proposes a methodology based on multiobjective evolutionary algorithms (MOEAs) and Decision Trees for NTL detection in a scenario without AMI or SM, in which the consumption samples are monthly accumulated in kWh.

According to [22], population-based algorithms, such as the bat algorithm or artificial bee colony algorithms, perform less efficiently than evolutionary algorithms. The authors classify the MOEAs mechanism as based on dominance, based on indicator, and based on decomposition. The first group selects and updates individuals based on the dominant relationship among the solutions, the second group guides the search through a metric to find the real solutions to a problem, and the last group converts a problem into several sub-problems. Unlike others, the last group of MOEAs has a lower computational cost because all the sub-problems are solved simultaneously.

Our proposal uses two MOEAs to search for combinations of hyperparameters that maximize the performance of the classifiers in detecting NTL. The classifiers’ performance was evaluated using F1-Score and Precision metrics. The MOEAs’ performances are evaluated using convergence metrics. Our main contributions are the use of different multiobjective optimization strategies to improve the classifier performance, the evaluation of the model under different probability cutoff operating points, and an approach that allows choosing among different Pareto non-dominated solutions to adjust the model behavior with respect to the performance metrics and constraints of the addressed problem.

The data used in this work were provided by Paulista Power and Light Company (CPFL ENERGIA), which is a power distribution company that operates in 687 Brazilian cities and has around 9.6 million customers.

This work is organized as follows: Section 2 describes our methodology, dataset preprocessing, and evaluation metrics. The results obtained are discussed in Section 3. Finally, we present the conclusions of our work.

2. Materials and Methods

In this section, we present how the data provided by CPFL ENERGIA™ were preprocessed, which algorithms were used in this work, and the metrics used to evaluate the models, and finally, we show a performance comparison between the versions of the algorithm with and without the hyper-tuning process. The following sections describe the steps in Figure 2.

2.1. Preprocessing

The preprocessing of the dataset consists of cleaning, correcting inaccuracies, and transforming the raw data into information suitable for machine learning algorithms. The preprocessing steps applied in the data are detailed below:

2.1.1. Dataset Description

The dataset provided by CPFL ENERGIA™ comprises samples of the history of 1.3 million consumer units, containing monthly samples of electricity consumption in kWh, consumer class, tariff group, contractual information, and notes about inspections carried out on the energy meters of these consumers, corresponding to a period of 5 years. This information forms a set of 16 features: two are numeric, and the others are categorical.

2.1.2. Missing Data

Missing data means that some dataset samples are incomplete. Equipment failures or errors when entering data are factors that may cause this problem. In our approach, records with missing data on location, electricity consumption, or inspection notes were considered unusable and removed from the dataset. These three blocks of information are essential to geographically identify the consumer, extract features from the consumer profile, and label fraudsters and non-fraudsters. Records whose inspection notes point to something outside the scope of this work were also removed. A total of 507,383 records with missing data were removed, which is equivalent to 38.5% of the dataset.

2.1.3. Labeling Process

Labeling is the process of identifying raw data and assigning them context information. The dataset samples were labeled using as reference the inspection notes. CPFL ENERGIA™ uses codes to describe the problem found in the energy meter inspections. These codes describe operational errors in the installation of the meter, different NTL, and violations in the energy meter, among others. The consumers in the dataset whose inspection note code indicated energy theft were labeled as 1. Otherwise, if the inspection note code indicated regular operation of the energy meter, the consumer was labeled as 0. In summary, 110,632 samples were labeled as positive cases for NTL, which was equivalent to 13.6% of the dataset samples. The dataset is imbalanced concerning the target variable, which is a phenomenon commonly observed in datasets linked to NTL detection.

2.1.4. Dataset Balancing

Dataset balancing is a technique used to equalize the distribution of classes in the training dataset. It is essential to avoid a bias toward majority classes, resulting in poor predictability for minority classes.

In our approach, we apply the oversampling technique based on SMOTE, which consists of creating synthetic samples of the minority class. SMOTE randomly selects an instance of the minority class and then determines its nearest neighbor samples. It then generates synthetic instances between the selected samples [21]. This process is repeated until the dataset is balanced or until a stopping criterion is met. The great advantage of SMOTE is its ability to preserve the characteristics and patterns present in the original dataset and deal with non-complex and non-linear minority classes.
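The interpolation step described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `smote_sketch`, the parameter `k`, and the toy data are assumptions made for illustration, and only numeric features are handled.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Hypothetical helper for illustration; numeric features only.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two numeric features (assumed values)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
print(len(new_points))
```

Because each synthetic point lies on a segment between two real minority samples, it stays inside the region already occupied by the minority class.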

2.2. Features Extraction

Feature extraction involves creating new features based on the raw data, making the dataset more informative. Our feature extraction approach uses numerical values containing information about the consumption history and categorical values referring to the type of tariff, contracted voltage, and others.

In the scenario addressed, the consumption measurement is taken monthly, so it is necessary to aggregate these data to be used as input to the model. Univariate statistical indicators of the entire customer consumption time series were used for this purpose. This approach was inspired by [15], where the consumption curve samples were used to calculate the mean, median, entropy, skewness, standard deviation, kurtosis, variance, energy, and load factor. The univariate statistical indicators used for feature extraction are shown in Table 1. After this step, the dataset has 38 features.
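As a rough illustration of this aggregation, the sketch below computes the listed indicators for one customer's monthly series. The exact definitions in Table 1 are not reproduced here, so the formulas are assumptions: population moments for variance, skewness, and kurtosis, energy as the sum of squares, load factor as mean over peak, and Shannon entropy of the normalized series. The function name `consumption_features` is hypothetical.

```python
import math
import statistics

def consumption_features(kwh):
    """Univariate statistics of one customer's monthly kWh series (assumed definitions)."""
    n = len(kwh)
    mean = statistics.fmean(kwh)
    var = statistics.pvariance(kwh, mu=mean)
    std = math.sqrt(var)
    # Standardised third and fourth central moments (0 for a constant series)
    m3 = sum((x - mean) ** 3 for x in kwh) / n
    m4 = sum((x - mean) ** 4 for x in kwh) / n
    skew = m3 / std ** 3 if std > 0 else 0.0
    kurt = m4 / std ** 4 if std > 0 else 0.0
    total = sum(kwh)
    # Shannon entropy of the series normalised to sum to one
    probs = [x / total for x in kwh if x > 0] if total > 0 else []
    entropy = -sum(p * math.log(p) for p in probs)
    peak = max(kwh)
    return {
        "mean": mean,
        "median": statistics.median(kwh),
        "std": std,
        "variance": var,
        "skewness": skew,
        "kurtosis": kurt,
        "energy": sum(x * x for x in kwh),
        "load_factor": mean / peak if peak > 0 else 0.0,
        "entropy": entropy,
    }

# Toy series with a sharp drop after the third month (assumed values)
features = consumption_features([120, 130, 125, 40, 35, 30, 32, 31, 29, 33, 30, 28])
print(sorted(features))
```

A sudden, sustained drop such as the one in the toy series lowers the load factor and shifts the skewness, which is the kind of signal these indicators are meant to expose to a tree-based classifier.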

2.3. Models Training

DL-based models generally require a higher computational cost and a huge number of training samples to perform satisfactorily, which are characteristics that may limit their application to problems where observation samples are rare. Furthermore, this technique only performs satisfactorily when the dataset is balanced. Another disadvantage is that the trained model is a black box, where the decision-making process is difficult to explain and understand [23].

The main tree-based algorithms are Decision Tree (DT), RF, and XGBoost. The learning process of these algorithms is supervised, and their operation is based on the branching through nodes, where each node represents a decision criterion based on a variable [24], and each branch denotes a value of the node [25]. The leaves represent the model’s outputs, that is, the prediction. A significant advantage of using tree-inspired models is that depending on the complexity of the data, the decision-making process can be transparent and interpretable.

The DT algorithm generates simple and intuitive classifiers since its entire structure is based on a single tree, as shown in Figure 3a. In this way, the computational cost demanded in model training is very low [26]. The DT algorithm’s great advantage is its lower propensity for overfitting, which is a condition in which the model learns the training dataset so well that it performs poorly on unseen datasets [27]. On the other hand, its performance may not be adequate when the data are more complex, causing underfitting patterns [28]. This phenomenon occurs when few predictors are included in the model, representing poorly the data pattern and affecting its performance and its generalization ability negatively [27].

RF combines several Decision Trees into a single model, as shown in Figure 3b. In the training process, each tree is trained with a subset of the training data. This helps reduce the tendency to overfitting while increasing the accuracy of the model [28]. The final prediction of the model is determined from the output of all the Decision Trees [25], considering the simple average or majority vote. The RF algorithm also performs well with large datasets or many features.

XGBoost is a gradient-boosting algorithm in which the Decision Trees’ adjustment occurs sequentially. In this process, each Decision Tree is trained to correct the errors of the previous tree [26]. This technique ensures better model accuracy. XGBoost also performs well with missing data [29]. The model’s output follows the same logic as the RF, where voting or averaging combines the Decision Tree into a single response. The most significant resemblance between XGBoost and RF is how they combine several Decision Trees’ responses to arrive at a final prediction. This method is called ensemble [26]. Another great advantage of using ensembles is the lower risk of obtaining a local minimum during the training process [26].

Proposed Approach

The proposed approach focuses on maximizing the model performance by searching for and adjusting the hyperparameters (HPs). The hypertuning relies on two MOEAs, on multiobjective evaluation metrics, and on tree-based algorithms. The HPs are values that cannot be estimated directly by the model and are usually adjusted before training. The search for HPs aims to adapt the classifiers to the dataset used in the training process, improving the model’s performance. Grid Search and Random Search are the most used algorithms to search for the best combination of HPs. However, both algorithms have a high time cost because they rely on exhaustive searches.

The MOEAs provide an alternative method to exhaustive searches. These algorithms are inspired by the Darwinian theory of natural evolution. They can find near-optimal solutions in the search space based on objective functions. The MOEAs do not require prior knowledge of the problem, and their search strategy uses a population of points in contrast to the single-point approach of classical optimization methods.

During the search process, the MOEAs employ the concept of fittest individuals (solutions) to form a Pareto Front (PF) with solutions that represent the balance between the objective functions that the MOEAs must maximize or minimize. A significant advantage of MOEAs is the possibility of finding optimal solutions even for problems with concurrent objective functions. Figure 4 summarizes all the steps of the MOEAs.

The MOEAs start searching for the fittest individuals (set of solutions), generating a random population. These individuals are then evaluated based on one or more fitness functions. After these steps, the individuals considered most suitable are sorted and selected for the reproduction process, which is divided into crossover and mutation. In the reproduction process, the individuals combine their genes to generate a new individual and then undergo changes in their genes to create variations in the same individual. The selection process of individuals for crossover and mutation is probabilistic. In the next step, the fitness of the new individuals is evaluated, and the less fit ones are removed from the population of possible solutions. At the end of the iteration, it is checked whether the stopping criterion has been reached; if not, the whole process is repeated. Some MOEAs with recognized effectiveness in optimizing engineering, finance, and other problems are the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) [30] and the Generalized Differential Evolution 3 (GDE-3) [31].

NSGA-II uses non-dominated sorting to evaluate and select the best individuals in a population. Individuals that are not dominated by others are used to generate new individuals. NSGA-II applies the Crowding Distance concept to ensure good spacing between PF solutions, so that solutions in less crowded regions are preferred during selection. The Crowding Distance also ensures population diversity and prevents premature convergence [30].
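The two ingredients named above, Pareto dominance and the crowding distance, can be sketched as follows. This is a simplified illustration under the assumption that both objectives (here F1-Score and Precision) are maximized; it is not a full NSGA-II, and the function names and sample scores are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b in every objective and better in one (maximization)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def crowding_distance(front):
    """Crowding distance of each point in a front; boundary points get infinity."""
    n = len(front)
    dist = [0.0] * n
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        lo, hi = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if hi == lo:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            dist[order[k]] += gap / (hi - lo)
    return dist

# (F1-Score, Precision) pairs of five hypothetical hyperparameter sets
scores = [(0.90, 0.80), (0.85, 0.90), (0.80, 0.95), (0.70, 0.70), (0.84, 0.89)]
front = non_dominated(scores)
print(front)
print(crowding_distance(front))
```

When two candidates belong to the same front, the one with the larger crowding distance is kept, which is what spreads the surviving solutions along the PF.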

GDE-3 uses a differential mutation strategy to generate new solutions from existing solutions. The algorithm randomly selects three individuals from the population to generate a new candidate solution. The differences between the two selected solutions are linearly combined in differential mutation. Then, the result is multiplied by a scale factor and added to the third selected solution [31]. If this solution is better than the original one, it will be added to the population. To assess the diversity of the population, the GDE-3 calculates the Euclidean distance between individuals. Solutions that are too close to each other are removed.
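A minimal sketch of the differential mutation step is given below, assuming real-valued hyperparameter vectors and the common form v = x3 + F (x1 - x2). The function name, the scale factor value, the bounds, and the sample population are assumptions for illustration; integer or categorical HPs would need an extra rounding or decoding step that is not shown.

```python
import random

def differential_mutation(population, scale=0.5, bounds=None, rng=random):
    """Build one mutant vector from three distinct, randomly chosen individuals."""
    r1, r2, r3 = rng.sample(population, 3)
    # Scaled difference of two individuals added to the third one
    mutant = [c + scale * (a - b) for a, b, c in zip(r1, r2, r3)]
    if bounds is not None:
        # Clip each gene back into its allowed range
        mutant = [min(max(v, lo), hi) for v, (lo, hi) in zip(mutant, bounds)]
    return mutant

# Hypothetical encoding: [tree depth, fraction of features per split]
population = [[3.0, 0.1], [10.0, 0.5], [6.0, 0.3], [8.0, 0.9]]
bounds = [(2.0, 12.0), (0.05, 1.0)]
print(differential_mutation(population, scale=0.5, bounds=bounds, rng=random.Random(1)))
```

The scale factor controls the step size: small values keep the mutant close to the third individual, while larger values explore further away from the current population.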

Our approach combines DT, RF, and XGBoost algorithms with NSGA-II and GDE-3 for automatic HPs search. The MOEAs are responsible for finding the combination of HPs that results in the best performance of the classifiers. This strategy, whose diagram is shown in Figure 5, was named HyMOTree.

The input parameters of HyMOTree are the dataset and the list of HPs for each algorithm. HPs can be binary, integer, categorical, or floating point. In addition to these parameters, the user can set the population size and the number of iterations for the search process. If they are not provided, default values are used.

Once the dataset and HPs are configured, instances of each tree-based algorithm are created and paired with each of the MOEAs. Each MOEA creates its initial population of solutions (hyperparameters), which will be analyzed following the fitness criterion based on the F1-Score and Precision metrics. At each iteration, individuals are evaluated and undergo crossover and mutation processes according to the strategy of each MOEA. To minimize time-related costs, each MOEA was modified to parallelize the evaluation of the individuals in the current population. Instead of evaluating one individual at a time, each individual in the population is evaluated in its own thread. This strategy increases the demand for memory. On the other hand, it minimizes the search time for HPs.
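The parallel evaluation of a population can be sketched with the standard library as shown below. The fitness function `evaluate` is a placeholder invented for this example: in a real run it would train a tree-based classifier with the given HPs and return the F1-Score and Precision measured on validation data. The HP names and the worker count are likewise assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(hp):
    """Placeholder fitness (assumption): stands in for training and scoring a model."""
    depth, n_trees = hp["max_depth"], hp["n_estimators"]
    f1 = 1.0 - abs(depth - 8) / 20.0
    precision = 1.0 - abs(n_trees - 200) / 400.0
    return (f1, precision)

def evaluate_population(population, max_workers=4):
    """Evaluate every individual concurrently; results keep the population order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, population))

population = [
    {"max_depth": 8, "n_estimators": 200},
    {"max_depth": 18, "n_estimators": 100},
]
print(evaluate_population(population))
```

Threads help most when the training routine releases the interpreter lock, as native libraries typically do. For a fitness function written in pure Python, a process pool may be the more suitable choice.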

When the stopping criterion is reached, the performance of each model generated by the arrangement of algorithms is evaluated. The F1-Score metric, the number of iterations, and the population size are used for this purpose. The convergence metrics Hypervolume and Spacing are used in case of a tie. Finally, the model with the best performance and the HPs used in this process are presented. The list of HPs used in our approach is shown in Table 2.

2.4. MOEAs Evaluation

The quantitative evaluation of a Pareto Front can be completed using metrics that evaluate the convergence and the PF coverage in objective space and how the solutions are distributed. In this analysis, the PF of solutions of NSGA-II and GDE-3 algorithms was evaluated by the metrics Hypervolume (HV) [32] and Spacing (SP) [33]. These metrics are effective even for problems whose real PF of solutions is unknown.

The HV of non-dominated solutions measures the size of the portion of objective space dominated by solutions as a group [34]. This metric can measure the PF convergence of different generations of individuals. The higher value of HV indicates that the set of solutions dominates a larger area in the objective space. Formally, the HV indicator is defined as follows:

H(S) = \Lambda\left( \bigcup_{p \in S,\, p \leq r} [p, r] \right)

(1)

According to [35], \Lambda denotes the Lebesgue measure. Alternatively, it is interpreted as the measure of the union of boxes, where [p, r] = \{ q \in \mathbb{R}^{d} \mid p \leq q \text{ and } q \leq r \} denotes the box delimited below by p \in S and above by the reference point r.
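For two objectives, the union of boxes in Equation (1) can be measured with a simple sweep. The sketch below assumes the minimization convention implied by p ⩽ r, so metrics that are maximized (such as F1-Score and Precision) would first be converted, for example to 1 - metric. The function name and the sample points are assumptions for illustration.

```python
def hypervolume_2d(points, ref):
    """Area of the union of boxes [p, ref] for 2-D points with p <= ref (minimization)."""
    pts = sorted(p for p in points if p[0] <= ref[0] and p[1] <= ref[1])
    area, best_y = 0.0, ref[1]
    for x, y in pts:          # sweep by increasing first objective
        if y < best_y:        # the point contributes a new horizontal strip
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area

front = [(1, 3), (2, 2), (3, 1)]
print(hypervolume_2d(front, ref=(4, 4)))  # 6.0
```

Adding a dominated point such as (3, 3) leaves the value unchanged, since its box is already covered by the boxes of the non-dominated points.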

The SP metric measures the diversity of solutions in a PF. This metric gauges how evenly the non-dominated solutions are spread along the PF. It is calculated from the deviation of the distances between each solution and its nearest neighbor in the PF. A lower value of SP indicates a more uniform distribution in the PF [36]. This indicator is computed with:

$$SP(S) = \sqrt{\frac{1}{|S| - 1} \sum_{i = 1}^{|S|} {(\bar{d} - d_{i})}^{2}} \qquad (2)$$

where $d_{i} = \min_{s_{j} \in S,\, s_{j} \neq s_{i}} {\|F(s_{i}) - F(s_{j})\|}_{1}$ is the $l_1$ distance between a point $s_{i} \in S$ and the closest point of the Pareto front approximation produced by the same algorithm, and $\bar{d}$ is the mean of the $d_{i}$ [37].
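Equation (2) can be sketched in a few lines of Python. This is a minimal illustration that assumes the front is given as a list of objective vectors $F(s_i)$; the function name `spacing` and the sample fronts are hypothetical.

```python
import math


def spacing(front):
    """Spacing (SP) of a list of objective vectors, following Equation (2)."""
    n = len(front)
    if n < 2:
        return 0.0
    d = []
    for i, fi in enumerate(front):
        # l1 distance from F(s_i) to its nearest neighbour in the same front
        d.append(min(
            sum(abs(a - b) for a, b in zip(fi, fj))
            for j, fj in enumerate(front) if j != i
        ))
    d_bar = sum(d) / n
    return math.sqrt(sum((d_bar - di) ** 2 for di in d) / (n - 1))


even = [(0, 3), (1, 2), (2, 1), (3, 0)]   # evenly spread, illustrative
uneven = [(0, 3), (1, 2), (3, 0)]         # one larger gap, illustrative
print(spacing(even), spacing(uneven))
```

The evenly spread front yields zero, while the front with a gap yields a positive value, which matches the reading that lower SP means a more uniform distribution.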

2.5. Evaluation

Evaluation metrics are essential for measuring the effectiveness of a model. However, when a problem imposes restrictions, only an appropriate metric can accurately indicate how reliable the classifier is.

In our approach, the cost of sending inspection teams to non-fraud customers can make the search for NTL economically unfeasible. Therefore, we should focus on reducing false positives. The Precision metric gauges the proportion of correct classifications among the samples classified as positive [38]. Prioritizing this metric is critical when a false positive is costly. Maximizing this score translates to higher revenue recovery with minimized inspection-related costs [39]. The overall performance of the model can be assessed using the F1-Score metric, which is the harmonic mean of Recall and Precision [11].

According to [20], the F1-Score works well in cases where the dataset is unbalanced. The Recall metric is the proportion of actual positive samples that are classified as positive [38]. From an electric utility perspective, the Recall score represents the portion of the revenue lost due to illegal actions that is recoverable through onsite inspections [39]. The Precision, Recall, and F1-Score metrics are defined as:

$$Precision = \frac{TP}{TP + FP} \qquad (3)$$

$$Recall = \frac{TP}{TP + FN} \qquad (4)$$

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \qquad (5)$$

where TP, FP, and FN denote the numbers of True Positives, False Positives, and False Negatives, respectively.
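As a quick numerical check of these definitions, the sketch below computes the three metrics from confusion-matrix counts using the standard formulas (Recall uses false negatives in its denominator). The function name and the counts are hypothetical and are not results from this work.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1-Score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# Illustrative counts: 80 confirmed frauds flagged, 20 false alarms, 40 missed.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))
```

In this toy case Precision is higher than Recall: most flagged customers are fraudsters, but a third of the fraudsters go unflagged, and the F1-Score lies between the two values.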

In addition to these metrics, the receiver operating characteristic (ROC) curve can be used to assess the sensitivity and specificity of the model at different classification thresholds. The Area Under the Curve (AUC) can also be used to compare the performance of different models.
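The area under the ROC curve has a useful equivalent reading: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes the AUC directly from that pairwise definition. It is an illustration with hypothetical names and toy data, suitable for small samples only, and it is not how the curves in this work were produced.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels and predicted fraud probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Because the value does not depend on any single cutoff, it complements Precision and F1-Score, which are computed at one chosen threshold.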

3. Results

The experiments were carried out on virtual machines in the Azure™ environment and implemented in the Python programming language with the Scikit-learn [40], Pandas, and Matplotlib libraries. The cluster used to collect the results had six virtual machine instances with 384 GB of memory and 34 cores. In all simulations, the training set contained 70% of the dataset samples.

3.1. Performance Evaluation

The high number of features in a dataset implies a more significant computational effort in the training process of classifiers and a greater probability of the classifier not being able to generalize the data. This phenomenon is known as the curse of dimensionality [26].

Another problem related to a large number of features is that models can become more complex and tend to overfit their training data, making the classifier’s accuracy poor when dealing with data not seen in the training. Overfitting is caused by three main reasons: noisy instances that are not labeled correctly, too few training examples, and over-learning [41].

Feature selection can also remove noise and undesirable characteristics that lead to poor performance of the algorithm, minimizing the risk of overfitting, as shown in [42] where the authors evaluate different classification algorithms trained with different datasets and in [43] where the authors try to avoid overfitting the model by removing features based on Pearson’s correlation.

Pruning the Decision Tree can also be used to avoid overfitting. There are two ways to achieve this: preventing the generation of non-significant branches (pre-pruning), or generating the full Decision Tree and then removing non-significant branches (post-pruning) [44]. However, we focus on a more general solution based on feature selection.

The filter method is one of the main techniques for removing features. In it, a correlation matrix, usually calculated using Pearson’s correlation, defines the level of correlation between features. A high correlation between two features means they behave similarly, so one or both can be tentatively removed and the effect on the model’s learning process evaluated. Generally, the acceptable limit for the correlation between two variables is determined heuristically.

To reduce the computational cost of training the classifiers and the risk of overfitting, we apply a search process (Algorithm 1) to determine the correlation value that maximizes the number of features removed while minimizing the loss in classifier performance. The pseudocode of this strategy is shown below:

Algorithm 1: Exhaustive dimensionality reduction method based on Pearson correlation

1: Input: $Dataset$, $algorithm_{instances}$, $run_{repetitions}$
2: Output: $F1_{hist}$
3: Initialize:
4:   $Dataset_{copy} \leftarrow []$
5:   $F1_{hist} \leftarrow []$
6: for $i = 1$ to $run_{repetitions}$ do
7:   for $p = 0.05, 0.10, \dots, 0.95$ do
8:     $Dataset_{copy} \leftarrow Dataset$
9:     Compute the Pearson correlation for every pair of features in $Dataset_{copy}$
10:     Drop correlated features from $Dataset_{copy}$ based on the value of $p$
11:     Train models for all $algorithm_{instances}$ with $Dataset_{copy}$
12:     Evaluate all the models in $algorithm_{instances}$ in terms of the F1-Score metric
13:     Append $(i, p,$ F1-Score$)$ to $F1_{hist}$
14:   end for
15: end for

Initially, the algorithm receives as input parameters the dataset, the instances of the tree-based algorithms, and the number of times the experiment must be repeated (line 1). In lines 3–5, the working variables are initialized. The outer loop, in line 6, repeats the experiment $run_{repetitions}$ times, and the inner loop, in line 7, is controlled by the variable p, which is incremented from 0.05 to 0.95 in steps of 0.05. In line 8, the dataset is copied to a temporary variable. The following line calculates the Pearson correlation for all pairs of features in the dataset. In line 10, features with a Pearson correlation equal to or greater than the value of p are removed from the temporary dataset. Then, the reduced dataset is used to train the classifiers (line 11), which are evaluated in terms of the F1-Score metric in line 12. Finally, the data corresponding to the current iteration, the value of p, and the performance score of the classifiers are stored in $F1_{hist}$, which serves as a data log for the entire process. After all iterations are finished, the program shows all data stored in the $F1_{hist}$ variable.

The graph in Figure 6, built from data saved in $F1_{hist}$, shows how the correlation used in the filter method influenced the dataset’s number of features. The graph shows that even for values of p above 0.9, there is a considerable reduction in the number of features.
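The search in Algorithm 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions rather than the code used in the experiments: it compares the absolute Pearson correlation against p, keeps the first feature of each correlated group, and takes the training and scoring step as user-supplied callbacks. The names `drop_correlated`, `correlation_sweep`, and `score_fns` are hypothetical.

```python
import numpy as np


def drop_correlated(X, p):
    """Indices of the columns kept after the correlation filter.

    Assumption: a feature is dropped when its absolute Pearson correlation
    with an already kept feature is >= p (first feature of a group wins).
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < p for k in keep):
            keep.append(j)
    return keep


def correlation_sweep(X, y, score_fns, run_repetitions=1):
    """Log (repetition, p, n_features, model name, score) for each threshold.

    score_fns maps a model name to a callable (X_reduced, y) -> F1-Score;
    each callable is expected to train and evaluate one classifier.
    """
    f1_hist = []
    thresholds = [k / 100 for k in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    for i in range(run_repetitions):
        for p in thresholds:
            keep = drop_correlated(X, p)
            X_copy = X[:, keep]  # reduced copy; X itself is left untouched
            for name, score_fn in score_fns.items():
                f1_hist.append((i, p, len(keep), name, score_fn(X_copy, y)))
    return f1_hist
```

Each entry of the returned log carries the threshold, the number of surviving features, and the score, which is the information needed to draw curves such as those in Figures 6 and 7.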

The graph in Figure 7 shows how the number of features in the dataset influences the performance of the classifiers trained by DT, RF, and XGBoost. This analysis takes into account the F1-Score metric and Pearson’s correlation. The results show that DT and RF have similar performances, with RF slightly superior to DT. For both, a dataset with few features tends to degrade performance. On the other hand, as the number of features in the dataset increases, the performance of the two classifiers improves. When the dataset was reduced to 21 features, the average value of the F1-Score metric for the DT model stabilized at 0.67.

RF tends to perform better when the dataset has more features. The model trained with a dataset reduced to 27 features reached an average value of 0.69 for the F1-Score metric. XGBoost had the worst performance in this experiment: the average value achieved for the F1-Score was 0.61 for the dataset with 19 features.

The data from this experiment show that the Pearson correlation value that minimizes the number of features in the dataset without causing significant deterioration to the performance of the models is 0.70. At this point, circled in red in Figure 7, the dataset now has 19 features, which is equivalent to a reduction of 50% in the number of features.

For comparison purposes, Table 3 shows the values of the F1-Score and Precision metrics for the three classifiers trained with the original dataset and with the dataset reduced to 19 features. The table shows that the performance of the classifiers in the two experiments is practically the same: the difference between the performance metrics for the two situations does not reach 1%. The standard deviations, shown in parentheses in the table, are to be multiplied by $10^{-3}$.

In order to improve the performance of the models, the NSGA-II and GDE-3 algorithms were each combined with each of the three classification algorithms to provide HPs capable of adjusting the classification model to the dataset. The objective functions used in both optimization algorithms were based on the F1-Score and Precision metrics. The experiments used different population sizes (10, 20, 30, 40, and 50), and the stopping criterion of each experiment was set to 50 generations. Table 4 shows the average HV and SP metrics for each population size. The results of this experiment are also shown in Figure 8, Figure 9 and Figure 10.
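To make the role of the individuals more concrete, the sketch below shows one possible way to decode a real-valued individual into the mixed-type Random Forest HPs listed in Table 2 (integer, categorical, and binary). The encoding, the bounds, and the candidate values are assumptions made for illustration only; this work does not state the search ranges, and the names `RF_SPACE` and `decode` are hypothetical.

```python
# Hypothetical search space: one gene in [0, 1] per hyperparameter.
RF_SPACE = {
    "n_estimators": ("int", 10, 500),                  # assumed bounds
    "max_depth": ("int", 1, 400),                      # assumed bounds
    "criterion": ("cat", ["gini", "entropy"]),
    "oob_score": ("bin",),
    "max_features": ("cat", ["sqrt", "log2", None]),
}


def decode(genes, space=RF_SPACE):
    """Map a vector of genes in [0, 1] to a dictionary of hyperparameters."""
    hps = {}
    for g, (name, spec) in zip(genes, space.items()):
        kind = spec[0]
        if kind == "int":
            lo, hi = spec[1], spec[2]
            # Scale to the integer range and clamp the g == 1.0 edge case.
            hps[name] = lo + min(int(g * (hi - lo + 1)), hi - lo)
        elif kind == "cat":
            choices = spec[1]
            hps[name] = choices[min(int(g * len(choices)), len(choices) - 1)]
        else:  # binary gene
            hps[name] = g >= 0.5
    return hps


print(decode([0.85, 0.99, 0.7, 0.9, 0.95]))
```

In such a setup, each decoded dictionary would be used to train one classifier, and the resulting F1-Score and Precision would be returned to the MOEA as the two objective values of that individual.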

Figure 8 shows that the HP tuning process significantly improved the performance of the models trained by DT. The maximum F1-Score value obtained for the DT NSGA-II and DT GDE-3 combinations was 0.85 and 0.86, respectively. NSGA-II reached its maximum F1-Score value with a population size of 40 after 40 iterations. GDE-3 reached its maximum value with two configurations: the first with a population size of 50 individuals after 20 iterations, and the second with a population size of 40 individuals after 30 iterations.

The average value of the HV and SP metrics shown in Table 4 indicates that the GDE-3 converges to a solution set with greater diversity and covers a larger area in the objective space with a population size of 40 individuals after 30 iterations.

The graph in Figure 9 shows that the set of HPs obtained by the RF NSGA-II and RF GDE-3 combinations improved the classifier’s performance. In this experiment, both combinations reached a value of 0.93 for the F1-Score metric, which represents a gain of 24 percentage points over the results obtained without HP tuning. GDE-3, with a population size of 50 individuals, converged to the maximum F1-Score value after 50 iterations. NSGA-II achieved the same result with a population size of 40 individuals after 40 iterations. Thus, NSGA-II can find a suitable combination of HPs faster than GDE-3.

Figure 10 shows that the XGBoost model trained after tuning the HPs reached an average value of 0.81 for the F1-Score metric, representing a gain of 20 percentage points. The XGBoost NSGA-II and XGBoost GDE-3 combinations achieved the same result with a population size of 50 individuals, but NSGA-II converged after only 30 iterations, while GDE-3 required 50. Therefore, the XGBoost NSGA-II combination can provide a suitable combination of HPs faster.

This experiment shows that tuning the HPs through MOEAs improves the performance of the DT, RF, and XGBoost models in terms of the F1-Score and Precision metrics. The results of this experiment and the performance ranking of the algorithms are shown in Table 5 and Table 6. The model trained by the RF and NSGA-II combination obtained the best performance on the analyzed metrics. Thus, within our approach, the NSGA-II RF combination is the most suitable for searching for HPs that maximize the RF performance in NTL detection. Table 7 shows the set of HPs found by the RF NSGA-II combination.

To test the effectiveness of our approach, the performance of HyMOTree was compared with the random search method. In this process, the search algorithm was used to find the best combination of HPs for DT, RF, and XGBoost. Due to its random nature, there is a high probability that the random search will find a suitable combination of HPs. However, this approach is not controllable, and the search can find different results in each experiment. This search method was implemented through the Scikit-learn library [40]. The results are shown in Table 5. The standard deviations, shown in parentheses in the table, are to be multiplied by $10^{-2}$.
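The idea behind the random search baseline can be sketched in plain Python. This is a simplified stand-in, not the Scikit-learn implementation used in the experiments: it samples each HP independently from a list of candidate values and keeps the best-scoring configuration. The names `random_search`, `space`, and `evaluate`, as well as the toy objective, are hypothetical.

```python
import random


def random_search(space, evaluate, n_iter=20, seed=0):
    """Sample n_iter random configurations and return the best one found."""
    rng = random.Random(seed)  # fixing the seed makes a run repeatable
    best_hps, best_score = None, float("-inf")
    for _ in range(n_iter):
        hps = {name: rng.choice(values) for name, values in space.items()}
        score = evaluate(hps)
        if score > best_score:
            best_hps, best_score = hps, score
    return best_hps, best_score


# Toy objective standing in for "train a model and return its F1-Score".
space = {"max_depth": [2, 4, 8], "criterion": ["gini", "entropy"]}
toy_score = lambda hps: hps["max_depth"] + (hps["criterion"] == "entropy")
print(random_search(space, toy_score, n_iter=50, seed=1))
```

Changing the seed or lowering `n_iter` can change the configuration that is returned, which illustrates why results of this baseline vary from one experiment to the next.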

A Pareto front resulting from the combination of NSGA-II RF algorithms is shown in Figure 11. The solution marked with a red cross would be the one that maximizes Precision to favor true positives and avoid sending teams for inspections that are more likely to be false positives. The solution still keeps an F1-Score level of around 0.93.
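The selection described above amounts to two steps: extracting the non-dominated solutions and then choosing the one with the highest Precision. The sketch below illustrates both steps for pairs of (F1-Score, Precision), both maximized. The function names and the numeric values are hypothetical and are not taken from Figure 11.

```python
def pareto_front(solutions):
    """Non-dominated subset of (f1, precision) pairs; both are maximized."""
    front = []
    for i, a in enumerate(solutions):
        dominated = any(
            b[0] >= a[0] and b[1] >= a[1] and (b[0] > a[0] or b[1] > a[1])
            for j, b in enumerate(solutions) if j != i
        )
        if not dominated:
            front.append(a)
    return front


def pick_max_precision(front):
    """Solution with the highest Precision; ties go to the higher F1-Score."""
    return max(front, key=lambda s: (s[1], s[0]))


# Illustrative (F1-Score, Precision) pairs only.
candidates = [(0.80, 0.90), (0.85, 0.82), (0.70, 0.75), (0.88, 0.78)]
front = pareto_front(candidates)
print(front, pick_max_precision(front))
```

Any point on the front is a defensible choice; picking the highest Precision reflects the priority of avoiding inspections of non-fraud customers.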

Figure 12 presents the ROC curve for the combinations of algorithms with the best performance. It is observed that the curve corresponding to the approach that combines the RF algorithm with NSGA-II stands out from the others, which suggests that the model has superior sensitivity and specificity at different cutoff points, demonstrating greater efficiency in the classification.

Requirements related to the minimum sample size needed for the classifier to identify NTL can make detection economically unfeasible. In the scenario on which this work focuses, consumption profiles are composed of monthly accumulated consumption samples in kWh, so if the customer constantly changes address, the number of consumption samples may never reach the minimum quantity required by the model. Another critical point is that a long sample window implies more stolen energy.

Our approach has no requirements regarding the size of the consumer history. The model trained with the proposed methodology is able to identify fraud in both new and old customers, but with different levels of accuracy. The graph in Figure 13 shows the F1-Score metric as a function of the size of the consumers’ history (number of months).

The performance of the model improves as the number of historical samples increases. The model reaches an F1-Score of 93% when the customer has at least nine consumption samples. However, if we consider the point where the model’s F1-Score exceeds 80%, we can say that the model trained with our approach can identify NTL in consumers with a minimum history of 7 months.

The MOEAs may take a long time to find good combinations of HPs. Therefore, to reduce the time cost, NSGA-II and GDE-3 were modified to parallelize the fitness evaluation of the individuals in the solution set. Table 8 shows the average time, in hours, required to search for HPs with the canonical and parallelized algorithms. The Decision Tree algorithm had the most significant reduction in time cost. For Random Forest and XGBoost, high values for HPs such as the number of estimators or the learning rate tend to increase the time needed for the hyper-tuning process.

Although the speedup for RF and XGBoost does not scale linearly, it is possible to improve the results by making internal optimizations to the execution threads. However, this aspect was not the direct focus of this research, and we plan to address it in future work.
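The parallelization of the fitness evaluation can be sketched with a thread pool from the Python standard library. This is an illustration of the general idea and not the modification actually made to NSGA-II and GDE-3. The names `evaluate_population` and `fitness` are hypothetical, and the toy fitness function stands in for training a classifier and returning its two objectives.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, max_workers=4):
    """Evaluate all individuals concurrently, preserving population order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input individuals.
        return list(pool.map(fitness, population))


# Toy fitness returning two objective values per individual.
toy_fitness = lambda individual: (sum(individual), max(individual))
print(evaluate_population([[1, 2], [3, 4], [5, 6]], toy_fitness))
```

Threads only reduce wall-clock time when the fitness function spends most of its time in code that releases the interpreter lock, such as native training routines. A model that already uses several cores internally leaves less room for this outer level of parallelism, which is one plausible reason for a sublinear speedup.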

3.2. Field Validation

Our approach was validated by CPFL ENERGIA™ using real consumers. The model’s predictions were used to flag consumer units suspected of committing fraud in different randomly chosen cities. From the consumer units predicted by the model to be fraudulent, 808 were randomly chosen, and inspection teams were assigned to carry out field investigations. Of these, only 522 inspections were successfully carried out, and among them, 397 consumers were caught committing energy theft. That is, our approach achieved an accuracy of 76% in the field test.
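The field figures can be checked with a short calculation. Because every inspected unit had been predicted as fraudulent, the reported accuracy is the share of confirmed cases among completed inspections, which corresponds to Precision as defined in Equation (3). The variable names below are hypothetical.

```python
selected = 808    # predicted-fraud units randomly chosen for inspection
inspected = 522   # inspections actually carried out
confirmed = 397   # inspections that confirmed energy theft

hit_rate = confirmed / inspected     # share of inspections that found fraud
completion = inspected / selected    # share of selected units inspected

print(f"hit rate: {hit_rate:.1%}, completed inspections: {completion:.1%}")
```

The ratio says nothing about fraudulent units that the model did not flag, so it cannot be read as Recall.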

The results of this experiment reveal the model’s accuracy as a function of the predicted probability that a customer is committing fraud. Figure 14 shows that the data have been divided into four probability ranges. It can be observed that the model presents high precision when a record is classified with up to 80% probability of fraud. However, the accuracy drops significantly for records classified with up to 60% probability.
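An analysis of this kind can be sketched by grouping inspected records by predicted probability and computing the share of confirmed frauds in each group. The range boundaries below are assumptions, since the exact ranges of Figure 14 are not stated in the text, and the function name and sample data are hypothetical.

```python
def hit_rate_by_range(probs, confirmed, edges=(0.5, 0.6, 0.7, 0.8, 1.0)):
    """Share of confirmed frauds per predicted-probability range.

    Ranges are [lo, hi), except the last one, which also includes hi.
    Records below the first edge are ignored.
    """
    bins = {(lo, hi): [0, 0] for lo, hi in zip(edges[:-1], edges[1:])}
    for p, c in zip(probs, confirmed):
        for (lo, hi), counts in bins.items():
            is_last = hi == edges[-1]
            if lo <= p < hi or (is_last and p == hi):
                counts[0] += 1        # inspected records in this range
                counts[1] += int(c)   # confirmed frauds in this range
                break
    return {rng: (hits / n if n else None) for rng, (n, hits) in bins.items()}


# Toy data: predicted probabilities and inspection outcomes (1 = fraud found).
rates = hit_rate_by_range([0.55, 0.58, 0.65, 0.85, 0.9, 1.0], [0, 1, 1, 1, 1, 0])
print(rates)
```

A table like this can support the choice of a probability cutoff: inspections can be restricted to the ranges whose hit rate justifies the cost of sending a team.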

4. Conclusions

This work investigated a multiobjective approach, called HyMOTree, to improve the performance of tree-based classification models in detecting non-technical losses. Our approach was based on a scenario without Advanced Metering Infrastructure, where consumption data comprise monthly samples from technologically limited energy meters. Based on this scenario, a dataset was created with billing data from CPFL ENERGIA™ consumers. To balance the dataset, the SMOTE technique was applied. The NSGA-II and GDE-3 algorithms were used to provide a combination of hyperparameters for the DT, RF, and XGBoost algorithms. To minimize the time cost of the hyper-tuning process, the MOEAs were modified to parallelize the evaluation of the solution set. A performance comparison between HyMOTree and the random search method was performed. This experiment showed that our approach was able to find the best combination of HPs for all algorithm combinations analyzed.

Our results showed that for the proposed scenario, the combination of RF and NSGA-II algorithms performed better, reaching 0.95 and 0.93 for Precision and F1-Score, respectively. Field inspection data showed that inspections guided by the proposed solution achieved an accuracy of 76%.

Author Contributions

Conceptualization, F.J.S.C. and A.G.d.S.-F.; Methodology, F.J.S.C., A.R.S.F., A.L.M.A. and A.G.d.S.-F.; Validation, R.F.L. and V.R.S.; Investigation, F.J.S.C., A.L.M.A. and A.G.d.S.-F.; Data curation, K.L. and R.F.L.; Writing—original draft, F.J.S.C.; Writing—review & editing, A.R.S.F. and A.G.d.S.-F.; Project administration, A.G.d.S.-F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received financial support of CPFL Energia™, by means of the Research and Development project PD-00063-3080/2021, with resources from ANEEL’s R&D program in cooperation with Eldorado Research Institute. The APC was funded by Eldorado Research Institute.

Acknowledgments

The authors thank CPFL Energia™ and Eldorado Research Institute for collaborating and supporting this work. They also thank CNPQ and FADE-UFPE.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

TL	Technical Losses
NTL	Non-Technical Losses
AMI	Advanced Metering Infrastructure
SM	Smart Meter
kWh	kilowatt-hours
DL	Deep Learning
ISET	Irish Smart Energy Trial
SGCC	State Grid Corporation of China
ANN	Artificial Neural Networks
RF	Random Forest
AUC	Area Under Curve
CNN	Convolutional Neural Network
LSTM	Long Short-Term Memory Neural Network
SMOTE	Synthetic Minority Over-sampling Technique
MOEA	Multiobjective Evolutionary Algorithms
DT	Decision Tree
HPs	Hyperparameters
PF	Pareto Front
NSGA-II	Non-Dominated Sorting Genetic Algorithm II
GDE-3	Generalized Differential Evolution 3
HV	Hypervolume
SP	Spacing

References

Mendes, A.; Franca, P.; Lyra, C.; Pissarra, C.; Cavellucci, C. Capacitor placement in large-sized radial distribution networks. IEE Proc.-Gener. Transm. Distrib. 2005, 152, 496–502.
Brazilian Electricity Regulatory Agency, Loss Report. Available online: https://antigo.aneel.gov.br/documents/654800/18766993/Relat%C3%B3rio+Perdas+de+Energia_+Edi%C3%A7%C3%A3o+1-2021.pdf/143904c4-3e1d-a4d6-c6f0-94af77bac02a (accessed on 15 November 2022).
Henriques, H.; Corrêa, R.; Fortes, M.; Borba, B.; Ferreira, V. Monitoring technical losses to improve non-technical losses estimation and detection in LV distribution systems. Measurement 2020, 161, 107840.
Khan, I.; Javaid, N.; Taylor, C.; Ma, X. Robust data driven analysis for electricity theft attack-resilient power grid. IEEE Trans. Power Syst. 2023, 38, 537–548.
Strategy& Brazil. Propositions for the Problems of Non-Technical Losses in the Distribution of Electrical Energy. Available online: https://www.strategyand.pwc.com/br/pt/no-que-pensamos/Proposicoes_para_os_Problemas_das_Perdas_Nao_Tecnicas_na_Distribuicao_de_Energia_Eletrica_A4_07Dez2020_VF.pdf (accessed on 10 February 2023).
Campos, A. Agência Brasil Losses from Fraud in Brazil Added up to BRL 336 bi in 2021. Available online: https://agenciabrasil.ebc.com.br/en/geral/noticia/2022-08/losses-fraud-brazil-added-brl-336-bi-2021 (accessed on 2 February 2023).
Hasanuzzaman, M.; Zubir, U.; Ilham, N.; Che, H. Global electricity demand, generation, grid system, and renewable energy polices: A review. WIREs Energy Environ. 2016, 6, e222.
International Energy Agency Energy, Statistics Data Browser. Available online: https://www.iea.org/data-and-statistics/data-tools/energy-statistics-data-browser?country=WORLD&fuel=Energy%20supply&indicator=TESbySource (accessed on 30 January 2022).
EPE Empresa de Pesquisa Energética, Matriz Energética e Elétrica. Available online: https://www.epe.gov.br/pt/abcdenergia/matriz-energetica-e-eletrica (accessed on 20 January 2023).
Huang, C.; Sun, C.; Duan, N.; Jiang, Y.; Applegate, C.; Barnes, P.; Stewart, E. Smart Meter Pinging and Reading Through AMI Two-Way Communication Networks to Monitor Grid Edge Devices and DERs. IEEE Trans. Smart Grid 2022, 13, 4144–4153.
Hu, T.; Guo, Q.; Shen, X.; Sun, H.; Wu, R.; Xi, H. Utilizing Unlabeled Data to Detect Electricity Fraud in AMI: A Semisupervised Deep Learning Approach. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3287–3299.
Irish Social Science Data Archive. Available online: https://www.ucd.ie/issda/data/commissionforenergyregulationcer/ (accessed on 3 February 2023).
Zheng, Z.; Yang, Y.; Niu, X.; Dai, H.; Zhou, Y. Wide and Deep Convolutional Neural Networks for Electricity-Theft Detection to Secure Smart Grids. IEEE Trans. Ind. Inform. 2018, 14, 1606–1615.
Ullah, A.; Javaid, N.; Asif, M.; Javed, M.; Yahaya, A. AlexNet, AdaBoost and Artificial Bee Colony Based Hybrid Model for Electricity Theft Detection in Smart Grids. IEEE Access 2022, 10, 18681–18694.
Blazakis, K.; Kapetanakis, T.; Stavrakakis, G. Effective Electricity Theft Detection in Power Distribution Grids Using an Adaptive Neuro Fuzzy Inference System. Energies 2020, 13, 3110.
Yahoo!Webscope. S5—A Labeled Anomaly Detection Dataset, Version 1.0. Available online: https://webscope.sandbox.yahoo.com/ (accessed on 5 February 2022).
Massaferro, P.; Martino, J.; Fernandez, A. Fraud Detection in Electric Power Distribution: An Approach That Maximizes the Economic Return. IEEE Trans. Power Syst. 2020, 35, 703–710.
Ramos, C.; Rodrigues, D.; Souza, A.; Papa, J. On the Study of Commercial Losses in Brazil: A Binary Black Hole Algorithm for Theft Characterization. IEEE Trans. Smart Grid 2018, 9, 676–683.
Yan, Z.; Wen, H. Electricity Theft Detection Base on Extreme Gradient Boosting in AMI. IEEE Trans. Instrum. Meas. 2021, 70, 1–9.
Hasan, M.; Toma, R.; Nahid, A.; Islam, M.; Kim, J. Electricity Theft Detection in Smart Grid Systems: A CNN-LSTM Based Approach. Energies 2019, 12, 3310.
Chawla, N.; Bowyer, K.; Hall, L.; Kegelmeyer, W. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Xu, M.; Zhang, M.; Cai, X.; Zhang, G. Adaptive neighbourhood size adjustment in MOEA/D-DRA. Int. J. Bio-Inspired Comput. 2021, 17, 14.
Loyola-Gonzalez, O. Black-Box vs. White-Box: Understanding Their Advantages and Weaknesses From a Practical Point of View. IEEE Access 2019, 7, 154096–154113.
Rokach, L.; Maimon, O. Top-Down Induction of Decision Trees Classifiers - A Survey. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2005, 35, 476–487.
Wilson, B.; Dhas, J.; Sreedharan, R.; Krish, R. Ensemble learning-based classification on local patches from magnetic resonance images to detect iron depositions in the brain. Int. J. Bio-Inspired Comput. 2021, 17, 260–266.
Sagi, O.; Rokach, L. Ensemble learning: A survey. WIREs Data Min. Knowl. Discov. 2018, 8, e1249.
López, O.; López, A.; Crossa, J. Overfitting, Model Tuning, and Evaluation of Prediction Performance. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Cham, Switzerland, 2022; pp. 109–139.
Galar, M.; Fernandez, A.; Barrenechea, E.; Bustince, H.; Herrera, F. A Review on Ensembles for the Class Imbalance Problem: Bagging-, Boosting-, and Hybrid-Based Approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2012, 42, 463–484.
Aydin, Z.; Ozturk, Z. Performance Analysis of XGBoost Classifier with Missing Data. In Proceedings of the 1st International Conference on Computing and Machine Intelligence, Online, 19–20 February 2021.
Deb, K.; Pratap, A.; Agarwal, S.; Meyarivan, T. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 2002, 6, 182–197.
Kukkonen, S.; Lampinen, J. GDE3: The third Evolution Step of Generalized Differential Evolution. In Proceedings of the 2005 IEEE Congress on Evolutionary Computation, Edinburgh, UK, 2–5 September 2005.
Zitzler, E. Evolutionary Algorithms for Multiobjective Optimization: Methods and Applications. Ph.D. Thesis, Swiss Federal Institute of Technology, Zurich, Switzerland, 1999.
Schott, J. Fault Tolerant Design Using Single and Multicriteria Genetic Algorithm Optimization; Technical Report; Air Force Institute of Technology: Wright-Patterson AFB, OH, USA, 1995.
Santos, T.; Xavier, S. A Convergence Indicator for Multi-Objective Optimisation Algorithms. Trends Appl. Comput. Math. 2018, 19, 437–448.
Guerreiro, A.; Fonseca, C.; Paquete, L. The Hypervolume Indicator. ACM Comput. Surv. 2021, 54, 1–42.
Sharifi, M.; Akbarifard, S.; Qaderi, K.; Madadi, M. A new optimization algorithm to solve multi-objective problems. Sci. Rep. 2021, 11, 20326.
Audet, C.; Bigeon, J.; Cartier, D.; Digabel, S.; Salomon, L. Performance indicators in multiobjective optimization. Eur. J. Oper. Res. 2021, 292, 397–422.
Feng, X.; Hui, H.; Liang, Z.; Guo, W.; Que, H.; Feng, H.; Yao, Y.; Ye, C.; Ding, Y. A Novel Electricity Theft Detection Scheme Based on Text Convolutional Neural Networks. Energies 2020, 13, 5758.
Bhat, R.; Trevizan, R.; Sengupta, R.; Li, X.; Bretas, A. Identifying Nontechnical Power Loss via Spatial and Temporal Deep Learning. In Proceedings of the 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), Anaheim, CA, USA, 18–20 December 2016.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
Amro, A.; Al-Akhras, M.; Hindi, K.; Habib, M.; Shawar, B. Instance Reduction for Avoiding Overfitting in Decision Trees. J. Intell. Syst. 2021, 30, 438–459.
Liu, Y.; Mu, Y.; Chen, K.; Li, Y.; Guo, J. Daily Activity Feature Selection in Smart Homes Based on Pearson Correlation Coefficient. Neural Process. Lett. 2020, 51, 1771–1787.
Salam, M.; Taher, A.; Samy, M.; Mohamed, K. The Effect of Different Dimensionality Reduction Techniques on Machine Learning Overfitting Problem. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 15.
Bramer, M. Avoiding Overfitting of Decision Trees. In Principles of Data Mining; Springer: London, UK, 2013; pp. 121–136.

Figure 1. Technical losses percentage in countries that are part of the BRICS and the USA [5].

Figure 2. Research development methodology.

Figure 3. Representation of Decision Tree (a) and Random Forest (b) classifiers.

Figure 4. Flowchart of MOEAs.

Figure 5. Proposed methodology.

Figure 6. Number of features vs. Pearson’s correlation.

Figure 7. Performance analyses of the classifiers for different Pearson correlation values, and the point (red circle) that minimizes the number of features in the dataset.

Figure 8. Evolution of the F1-Score for the Decision Tree algorithm as a function of the number of generations for different population sizes using the NSGA-II (a) and GDE-3 (b) algorithms.

Figure 9. Evolution of the F1-Score for Random Forest as a function of the number of generations for different population sizes using the NSGA-II (a) and GDE-3 (b) algorithms.

Figure 10. Evolution of the F1-Score for XGBoost as a function of the number of generations for different population sizes using the NSGA-II (a) and GDE-3 (b) algorithms.

Figure 11. NSGA-II RF Pareto front of F1-Score vs. Precision metrics, considering 40 generations and a population size equal to 50 and the point (red cross) that maximizes the metric Precision.

Figure 12. Receiver operating characteristic curve of three algorithms.

Figure 13. F1-Score vs. number of samples in consumer history.

Figure 14. Model accuracy for different ranges of predicted probability.

Table 1. Univariate statistical indicators used for feature extraction.

Statistical Indicators	Description
Mean	Measure of average values of samples
Median	Represents the middle value of samples
Variance	Indicates the spread of data around the mean
Standard deviation	Represents the distance of each sample from the mean
Coefficient of variation	Compares the standard deviation of samples with its mean
Min	The smallest value of samples
Max	The largest value of samples
Kurtosis	Describes the shape of the distribution of samples
MAD	Distance between each sample and the mean of the samples
Mode	Indicates the value that occurs most frequently in samples
Count	Total number of samples

Table 2. Set of hyperparameters used in tuning.

Decision Tree		Random Forest		XGBoost
Variable	Type	Variable	Type	Variable	Type
splitter	Binary	n_estimators	Integer	n_estimators	Integer
max_depth	Integer	max_depth	Integer	max_depth	Integer
criterion	Categorical	criterion	Categorical	criterion	Categorical
min_samples_leaf	Integer	oob_score	Binary	learning_rate	Float
max_features	Categorical	max_features	Categorical	max_features	Categorical

Table 3. Average of metrics F1-Score and Precision.

	Entire Dataset		Reduced Dataset
Algorithm	F1-Score	Precision	F1-Score	Precision
Decision Tree	0.67 (1.22)	0.61 (1.70)	0.67 (1.42)	0.62 (0.74)
Random Forest	0.69 (1.45)	0.71 (2.03)	0.69 (1.23)	0.71 (0.44)
XGBoost	0.61 (1.73)	0.52 (2.11)	0.60 (1.67)	0.52 (1.81)

Table 4. Averages of convergence metrics.

	Decision Tree				Random Forest				XGBoost
	NSGA-II		GDE-3		NSGA-II		GDE-3		NSGA-II		GDE-3
$pop_{size}$	HV	SP	HV	SP	HV	SP	HV	SP	HV	SP	HV	SP
10	0.96	1.55	0.98	2.09	0.95	2.05	1.05	1.26	0.65	2.09	0.69	2.05
20	0.96	1.12	1.12	1.15	1.09	1.95	1.09	1.10	0.77	2.01	0.71	1.56
30	1.01	2.09	1.11	2.05	1.12	1.02	1.29	1.05	0.80	2.05	0.82	2.01
40	1.10	1.08	1.13	1.03	1.32	1.16	1.30	1.08	0.93	1.30	0.93	1.33
50	1.12	1.01	1.12	1.16	1.37	0.98	1.35	1.04	0.96	1.12	0.95	1.25

Table 5. Average F1-Score and Precision metrics after the hyperparameter tuning process.

	NSGA-II		GDE-3		Random Search
Algorithm	F1-Score	Precision	F1-Score	Precision	F1-Score	Precision
Decision Tree	0.86 (1.80)	0.88 (1.72)	0.86 (1.25)	0.88 (1.33)	0.69 (9.25)	0.55 (8.37)
Random Forest	0.93 (2.15)	0.95 (1.21)	0.92 (2.36)	0.95 (1.98)	0.75 (5.22)	0.70 (6.43)
XGBoost	0.81 (1.20)	0.86 (2.49)	0.81 (2.45)	0.86 (2.51)	0.70 (4.89)	0.62 (6.12)

Table 6. Performance ranking of the best combinations of algorithms.

Algorithms	Rank
Decision Tree GDE-3	3
Random Forest NSGA-II	1
XGBoost NSGA-II	2

Table 7. Optimized set of hyperparameters found by RF NSGA-II combination.

Hyperparameters	Value
n_estimators	424
max_depth	395
criterion	entropy
oob_score	True
max_features	None

Table 8. Average time in hours required to search for hyperparameters, before and after thread-based parallelism, and the resulting speedup (standard deviation in parentheses).

Algorithm	Canonical	Parallelized	Speedup
Decision Tree	18.3 (0.40)	1.6 (0.15)	11.4×
Random Forest	39.2 (1.12)	18.1 (1.43)	2.1×
XGBoost	75.1 (1.78)	40.5 (1.90)	1.8×
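The speedup column follows from dividing the canonical search time by the parallelized one. The short check below uses the averages from Table 8. Truncating the ratio to one decimal place reproduces the reported values; whether the authors truncated rather than rounded is an assumption, and the variable names are illustrative.

```python
import math

# Average search times in hours (canonical, parallelized), copied from Table 8.
times = {
    "Decision Tree": (18.3, 1.6),
    "Random Forest": (39.2, 18.1),
    "XGBoost": (75.1, 40.5),
}

speedups = {}
for algorithm, (canonical, parallelized) in times.items():
    ratio = canonical / parallelized
    # Assumption: truncate (not round) to one decimal place.
    speedups[algorithm] = math.floor(ratio * 10) / 10
```

The Decision Tree search benefits far more from parallelism than the ensemble methods, whose ratios stay close to 2.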


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

