Electricity Theft Detection using Machine Learning

Abstract: This research addresses the widespread theft of electric power, reported as a non-technical loss, which affects electric distribution companies and customers and can trigger serious consequences, including fires and blackouts. The research focused on recommending the best Machine Learning prediction model for detecting electrical energy theft. The data source was a dataset, published by the State Grid Corporation of China, containing the electricity consumption of 42,372 consumers. The method comprised data imputation, data balancing (oversampling and undersampling), and feature extraction to improve energy theft detection. Five Machine Learning models were tested. The resulting accuracy was 81% for the SVM model, 79% for K-Nearest Neighbors, 80% for Random Forest, 69% for Logistic Regression, and 68% for Naive Bayes. It is concluded that the best performance, an accuracy of 81%, is obtained with the SVM model.


I. INTRODUCTION
According to the Inter-American Development Bank [1], 70% of electricity consumption is lost worldwide and 30% in the Caribbean and South America, with Peru standing out at 7%. Electricity losses fall into two categories: non-technical losses, that is, energy delivered to customers but not paid for, and technical losses generated in transmission and distribution lines, which are inherent to electricity transmission. Non-technical losses comprise the majority of losses in electricity networks and can account for more than 40% of the total electricity produced [2]. These losses are attributed to different sources, the most important and common being the alteration of metering equipment, illegal connections to the electrical grid, and energy theft [3]. Regarding distribution in Peru, electricity theft annually generates losses of 103 million soles, equivalent to 207 GWh, for the companies providing the service [4]. However, this type of loss affects not only these companies but also the offenders themselves and people in the surrounding area, causing accidents such as electric shocks, fires, and power outages.
According to [5], as presented in Table I, the effects or consequences of non-technical electric power losses are divided into three categories: countries, utilities, and society.
The background of this research is based on multiple studies, conducted in many countries, that design intelligent systems to deal with this problem, mainly using Machine Learning techniques. In 2018, wide convolutional networks (CNNs) were used for one-dimensional data, and deep convolutional networks (CNNs) were used for two-dimensional data. The one-dimensional data were converted into two-dimensional electricity consumption data [6]. In another study, SVM was applied to customer consumption data and the total energy distributed by the supplier, which allowed the errors produced by electricity meters to be calculated [7].
In 2019, neural networks, employed to convert a one-dimensional dataset into a two-dimensional one, were combined with random forests to classify customers [8]. In 2020, the k-nearest neighbors algorithm and empirical mode decomposition were used to extract the most important attributes from the dataset and obtain good accuracy in detecting energy theft [9]. Another study used a text convolutional neural network (Text-CNN) to effectively extract periodic features of energy consumption and detect electricity theft [10].
In 2021, several classification algorithms were compared, the main one being LightGBM, a fast algorithm based on decision trees, which achieved an accuracy of 84% [11]. The other algorithms compared were logistic regression, with an accuracy of 71%, stochastic gradient descent, with an accuracy of 65%, and decision tree, with 86%. In [12], a deep convolutional neural network was proposed for feature learning (classification into theft and non-theft). Data provided by smart meters at different epochs were used for SVM training. The smart meters recorded data at 15-minute intervals, and the data collection covered 26,530 consumers from the residential and industrial sectors.
In [13], the authors evaluated 23 classifiers using the F1 score as the performance parameter. They based their work on the data of a Brazilian company in the electric power industry, with 261,489 consumers and approximately 1,400 fields. From the results obtained, they concluded that ensemble classifiers are the most appropriate for identifying cases of non-technical electric power loss. The gradient boosted tree obtained an F1 score of 0.45, and the rotation forest obtained an accuracy of 66.50% in actual field inspections.

II. MATERIALS AND METHODS
The procedure applied to detect electricity theft follows a workflow made up of five parts: dataset acquisition, preprocessing, data balancing, feature extraction, and classification, shown in Fig. 1.
The parts of the workflow in Fig. 1 are detailed as follows:

A. Dataset Acquisition
The data were collected through smart meters. They consist of the daily electricity consumption of customers of the State Grid Corporation of China (http://www.sgcc.com.cn/), which was founded on December 29, 2002, and supplies more than 1.1 billion inhabitants, covering 88% of the national territory. The dataset is described in Table II. The data span January 1, 2014 to October 31, 2016 (approximately 147 weeks). The file is 167 MB (175,194,613 bytes) in CSV format. The dataset contains 42,372 records (total customers), divided into 3,615 customers who steal electricity (about 8.5%) and 38,757 normal customers (about 91.5%).
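The figures quoted for the dataset can be cross-checked with a few lines of arithmetic. The sketch below uses only the customer counts and the date range stated in the text; the variable names are illustrative and are not taken from the original study.

```python
from datetime import date

# Customer counts quoted in the dataset description
theft_customers = 3615
normal_customers = 38757
total_customers = theft_customers + normal_customers

# Class shares in percent, which show how imbalanced the dataset is
theft_share = 100 * theft_customers / total_customers
normal_share = 100 * normal_customers / total_customers

# Number of daily readings per customer, counting both end dates
n_days = (date(2016, 10, 31) - date(2014, 1, 1)).days + 1
full_weeks = n_days // 7

print(total_customers)
print(round(theft_share, 2), round(normal_share, 2))
print(n_days, full_weeks)
```

The two shares sum to 100% and confirm that roughly one customer in twelve is labeled as stealing electricity, which motivates the data balancing step described later.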

B. Preprocessing
The electricity consumption recorded in the dataset sometimes contains erroneous and missing values, caused by smart meter faults, storage problems, unreliable transmission of metering data, and other factors [20]. Missing values in the dataset were recovered with the interpolation method [21], represented by formula (1), whose variable is an attribute of the electricity consumption data and in which NaN denotes a non-numeric value. Next, the missing data were recovered by substituting each customer's missing values with that customer's average electricity consumption. In addition, outliers, values very different from the rest, were found and restored using the three-sigma rule, shown in formula (2), which is expressed in terms of the standard deviation, std, of the consumption values.
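The two preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: missing readings are replaced by the customer's mean, and values above mean + k * std are clipped to that threshold. The function names, the factor k = 3, the upper-side-only clipping, and the zero fill for customers without any valid reading are all assumptions, since the exact formulas (1) and (2) are not reproduced in the text.

```python
import math
from statistics import mean, pstdev

def fill_missing_with_mean(readings):
    """Replace NaN readings with the mean of the customer's observed readings."""
    observed = [v for v in readings if not math.isnan(v)]
    # Assumption: a customer with no valid reading at all is filled with 0.0
    fill = mean(observed) if observed else 0.0
    return [fill if math.isnan(v) else v for v in readings]

def cap_outliers(readings, k=3.0):
    """Clip readings above mean + k * std down to that threshold (assumed form)."""
    threshold = mean(readings) + k * pstdev(readings)
    return [min(v, threshold) for v in readings]

# Hypothetical daily readings of one customer: one gap and one spike
raw = [1.0] * 5 + [float("nan")] + [1.0] * 5 + [100.0]
filled = fill_missing_with_mean(raw)
cleaned = cap_outliers(filled)
print(filled[5], cleaned[-1])
```

Filling before clipping matters in this sketch, because the mean and standard deviation used for the threshold cannot be computed while NaN values are present.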

C. Feature Extraction
In order to classify the consumers, features were extracted from their electricity consumption records. The features used were the following: mean, standard deviation, peak to peak, skewness, median absolute deviation, entropy, and kurtosis.
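The seven statistics listed above can be computed per customer as in the sketch below. Several details are assumptions made for illustration, because the text does not specify them: the population standard deviation is used, kurtosis is reported as excess kurtosis, the median absolute deviation is left unscaled, and entropy is the Shannon entropy of an equal-width histogram with a hypothetical default of 10 bins.

```python
import math
from statistics import mean, median, pstdev

def extract_features(series, bins=10):
    """Summarize one customer's consumption series with seven statistics."""
    mu = mean(series)
    sigma = pstdev(series)
    low, high = min(series), max(series)
    ptp = high - low  # peak to peak

    if sigma > 0:
        z = [(v - mu) / sigma for v in series]
        skewness = mean([t ** 3 for t in z])
        kurtosis = mean([t ** 4 for t in z]) - 3.0  # excess kurtosis (assumed)
    else:
        skewness = kurtosis = 0.0

    med = median(series)
    mad = median([abs(v - med) for v in series])  # unscaled (assumed)

    if ptp > 0:
        counts = [0] * bins
        for v in series:
            # Map the value to a bin; the maximum falls into the last bin
            idx = min(int((v - low) / ptp * bins), bins - 1)
            counts[idx] += 1
        probs = [c / len(series) for c in counts if c]
        entropy = -sum(p * math.log(p) for p in probs)
    else:
        entropy = 0.0  # a constant series carries no spread

    return {
        "mean": mu, "std": sigma, "peak_to_peak": ptp, "skewness": skewness,
        "mad": mad, "entropy": entropy, "kurtosis": kurtosis,
    }

features = extract_features([1.0, 2.0, 3.0, 4.0, 5.0])
print(features)
```

Each customer's long daily series is thereby reduced to a fixed-length vector, which is the form the classifiers in the following subsections expect.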

D. Data Balancing
The dataset was found to be imbalanced: far more records represent customers who do not steal electricity than customers who do, which complicates the classification process. To balance the data, techniques such as oversampling and undersampling can be applied. For oversampling, a technique called SMOTE is often used, in which new instances are synthesized from existing instances using the k-Nearest Neighbors technique [14]. However, it is suggested to combine SMOTE with an undersampling technique [15]. In this research, the random subsampling technique was applied to the dataset, dividing the data into disjoint training and test sets that are randomly partitioned several times [22].
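The combination of undersampling and SMOTE-style oversampling can be illustrated as follows. This is a simplified sketch, not the authors' pipeline and not a library implementation: the majority class is randomly reduced, and synthetic minority points are created by interpolating between a minority point and one of its k nearest minority neighbors. The function names, the Euclidean neighbor search, and the toy data are assumptions.

```python
import random

def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class without replacement."""
    return rng.sample(majority, n_keep)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points between minority points and their neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: squared_distance(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic

rng = random.Random(42)
# Toy data: 20 majority points and 4 minority points in two dimensions
majority = [[rng.uniform(5.0, 9.0), rng.uniform(5.0, 9.0)] for _ in range(20)]
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

kept = random_undersample(majority, 8, rng)
synthetic = smote_like(minority, 4, k=2, rng=rng)
balanced_minority = minority + synthetic
print(len(kept), len(balanced_minority))
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the region already occupied by the minority class instead of duplicating records exactly.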

E. Classification
According to [23], classification assigns outputs to different classes on the basis of one or more input variables. In this research, a set of algorithms was applied for this purpose, each of which is detailed below. The SVM algorithm is designed to find the optimal separating hyperplane between classes based on support vectors (extremes of the class distributions). The training data are separated into classes by boundaries chosen to maximize the distance between the data sets and the boundary [16].
The training dataset, consisting of n cases represented by $\{x_i, y_i\}$, $i = 1, \dots, n$, where $y_i \in \{1, -1\}$, is used to form a classifier for accurate generalization. A hyperplane is defined as $w \cdot x + b = 0$, where w is the vector normal to the hyperplane, x is a point on the hyperplane, and b is the bias. Each point in the sample must satisfy $y_i (w \cdot x_i + b) \geq 1$.
The k-Nearest Neighbors algorithm is a supervised classification algorithm that classifies or predicts based on proximity, which is calculated using various distance metrics [17]. In this study, the Manhattan distance is used, defined for two points x and y with m attributes as $d(x, y) = \sum_{j=1}^{m} |x_j - y_j|$.
Random forest is a supervised classification and regression algorithm that performs well on classification problems. It builds a set of decision trees and, in classification problems, bases the final output on majority voting. Each decision tree, for regression or classification, is constructed by evaluating questions and node splits, choosing those that most reduce the Gini impurity [18].
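Two of the quantities described above, the Manhattan distance used by k-Nearest Neighbors and the Gini impurity used when splitting decision tree nodes, are simple enough to sketch directly. The code below is an illustration on toy data, not the implementation used in the study; the function names and the sample points are assumptions.

```python
from collections import Counter

def manhattan(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k=5):
    """Majority vote among the k training points closest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: manhattan(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# Toy training set: label 0 = normal customer, label 1 = theft (hypothetical)
train_X = [[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
           [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

print(knn_predict(train_X, train_y, [0.2, 0.3]))
print(knn_predict(train_X, train_y, [5.8, 5.9]))
print(gini_impurity([0, 0, 1, 1]), gini_impurity([1, 1, 1]))
```

A split that separates the two classes perfectly takes the Gini impurity of each child node to zero, which is why tree construction favors splits with the largest impurity reduction.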
Logistic regression is a classification algorithm that aims to predict or explain the values of a qualitative target variable as a function of a set of qualitative or quantitative explanatory variables. It is an extension of linear regression that uses the logit function for qualitative classification [19]. The logit function is defined as $\mathrm{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$, where p is the probability of the event.
Naive Bayes is a classification algorithm based on Bayes' theorem, which gives the conditional probability that one event occurs given another and is defined by the following formula: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$.
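The logit function and Bayes' theorem can both be evaluated in a few lines. The sketch below is illustrative only; the probabilities in the example are hypothetical values chosen for the demonstration and do not come from the dataset.

```python
import math

def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit: maps a real-valued score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def bayes_posterior(prior, likelihood, evidence):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence

# Hypothetical numbers: P(theft), P(flag | theft), P(flag | normal)
p_theft = 0.1
p_flag_given_theft = 0.8
p_flag_given_normal = 0.2

# Total probability of observing the flag
p_flag = p_flag_given_theft * p_theft + p_flag_given_normal * (1.0 - p_theft)
posterior = bayes_posterior(p_theft, p_flag_given_theft, p_flag)

print(logit(0.5), sigmoid(logit(0.8)))
print(posterior)
```

Logistic regression fits a linear model to the log-odds and applies the sigmoid to obtain a class probability, whereas naive Bayes applies Bayes' theorem under the assumption that the features are conditionally independent given the class.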

III. RESULTS
After preprocessing, the dataset totaled 33,009 instances, of which 80% were used for training and 20% for testing each of the models to be compared. The processed dataset is shown in Table III.
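An 80/20 partition of 33,009 instances can be sketched as a shuffled split of row indices. The function name, the seed, and the rounding of the test size are assumptions; the text does not state whether the split was stratified, and this sketch is not.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and cut them into disjoint train and test parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(33009)
print(len(train_idx), len(test_idx))
```

With this rounding, the test set holds 6,602 instances and the training set the remaining 26,407.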

A. Support Vector Machine Algorithm
Different kernels were tested to determine the optimal kernel for classification with the Support Vector Machine; the results of this experiment are shown in Table IV. The RBF (radial basis function) kernel was chosen, with the parameters "gamma": 0.5 and "C": 100, obtaining an accuracy of 81%. Fig. 2 and Table V show the results through the classification report and the confusion matrix.
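The role of the "gamma" parameter can be seen in the RBF kernel itself, which scores the similarity of two feature vectors as exp(-gamma * squared distance). The sketch below evaluates the kernel with gamma = 0.5 and a kernel decision function. The support vectors, dual coefficients, and bias are hypothetical values chosen for illustration, not the trained model; in a trained SVM the magnitude of each dual coefficient is bounded by "C".

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

def decision_value(x, support_vectors, dual_coefs, bias, gamma=0.5):
    """Weighted sum of kernel similarities to the support vectors, plus the bias."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias

# Hypothetical model: one support vector per class
support_vectors = [[0.0, 0.0], [2.0, 2.0]]
dual_coefs = [-1.0, 1.0]  # alpha_i * y_i, illustrative values only
bias = 0.0

score_near_positive = decision_value([1.9, 2.1], support_vectors, dual_coefs, bias)
score_near_negative = decision_value([0.1, -0.1], support_vectors, dual_coefs, bias)
print(score_near_positive > 0, score_near_negative > 0)
```

A larger gamma makes the similarity decay faster with distance, so each support vector influences a smaller neighborhood, while a larger C penalizes training errors more heavily.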

B. K-Nearest Neighbors Algorithm
Using the Manhattan metric, the best number of neighbors for this classification was 5, as shown in Fig. 3, obtaining an accuracy of 79%. The confusion matrix and classification report obtained with these two parameters are shown in Fig. 4 and Table VI. 423 | P a g e www.ijacsa.thesai.org

D. Logistic Regression Algorithm
This model was trained using 1000 iterations and an inverse regularization strength 'C' of 10, obtaining an accuracy of 69%. The classification report and the confusion matrix are shown in Fig. 6 and Table VIII.

E. Naive Bayes Algorithm
The default parameters for classification were used for this algorithm, obtaining the following results (see Fig. 7 and Table IX).
As shown in Table X, the SVM model has the highest accuracy score, 81%.
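The accuracy values reported in this section, and the other entries of a classification report, all derive from the four cells of a binary confusion matrix. The sketch below shows the relationship. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the values of the tables or figures of this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive (theft) class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical confusion matrix: 100 theft cases and 100 normal cases
metrics = classification_metrics(tp=75, fp=15, fn=25, tn=85)
print(metrics)
```

Accuracy alone can hide how many theft cases are missed, which is why recall for the theft class is worth reading alongside it when the classes are imbalanced.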

IV. DISCUSSION
The research in [6] compared CNN, SVM, LR, and RUSBoost models to determine which predicts best. Its SVM model reached an accuracy of 0.772, whereas our research, which compared SVM, RF, KNN, LR, and NB, obtained its best accuracies of 0.81 for SVM and 0.80 for RF. Comparing the SVM results of the two studies, the present research improves by 0.038 (3.8 percentage points). Likewise, the research in [8] compared CNN-RF, CNN-GBDT, CNN-SVM, CNN, SVM, RF, LR, and GBDT; its SVM model has an accuracy of 0.77, so the present research achieves an improvement of 0.04 (4 percentage points). The research in [10] proposes a new model (Text-CNN) for electricity theft detection and compares it with traditional machine learning models (LR, SVM); its SVM model has an accuracy of 0.70, so the present research achieves an improvement of 0.11 (11 percentage points). The SVM model was thus compared across all of these studies. However, the CNN model in [6] reaches an accuracy of 0.92 (92%) and the Text-CNN model in [10] reaches 0.90 (90%). Although both perform better than our SVM model, they need more computing power to identify consumers who steal electricity.

V. CONCLUSIONS
This research proposed an electricity theft detection model based on Support Vector Machine using electricity consumption information obtained from the State Grid Corporation of China, achieving a maximum detection accuracy of 81%.
The models have limitations because about 25% of the electricity theft cases could not be classified correctly, which may be due to the lack of data on electricity thieves compared to those who did not steal electricity. However, we attempted to solve this problem using data balancing techniques (oversampling and undersampling).