This paper presents a hybrid deep neural network that combines a convolutional neural network, particle swarm optimization, and a gated recurrent unit, referred to as the convolutional neural network-particle swarm optimization-gated recurrent unit model. The major aims of the model are to perform accurate electricity theft detection and to overcome the issues in the existing models, namely overfitting and the inability to handle imbalanced data. For this purpose, the electricity consumption data of smart meters is taken from the State Grid Corporation of China. An electric utility company gathers the data from the intelligent antenna-based smart meters installed at the consumers’ end. The dataset contains real-time data with missing values and outliers. Therefore, it is first preprocessed to obtain refined data, followed by feature engineering, in which a convolutional neural network selects and extracts the most relevant features from the dataset. The electricity consumers are then classified as honest or fraudulent using the proposed particle swarm optimization-gated recurrent unit model. The proposed model is evaluated through simulations in terms of several performance measures: accuracy, area under the curve, F1-score, recall, and precision. The proposed hybrid deep neural network is also compared with benchmark models, which include gated recurrent unit, long short-term memory, logistic regression, support vector machine, and genetic algorithm-based gated recurrent unit. The results indicate that the proposed hybrid deep neural network model is more efficient in handling class imbalance and performing electricity theft detection. The robustness, accuracy, and generalization of the model are also analyzed in the proposed work.

1. Introduction

The rapid growth of energy consumers has increased the energy demand, which requires efficient generation and distribution of energy at the grid level. In this regard, smart grid [1], with the incorporation of advanced metering infrastructure (AMI), monitors the energy consumption patterns of consumers. AMI establishes a bidirectional communication between consumers and grid to balance supply and demand of energy [2]. Besides the demand management challenge, smart grid faces two types of losses during energy transmission. The first type is technical losses (TLs), whereas the second type is nontechnical losses (NTLs). The former occurs due to energy loss in power distribution lines and transformers. The latter is also known as commercial loss and it occurs due to unregistered connections [3], unpaid bills [4], tampering with the antenna-based smart meters [5], etc. Moreover, the major reason for NTLs is electricity theft, caused by fraudulent consumers.

NTLs further lead to revenue loss in the economy of countries [6]. For instance, the electric utilities of Brazil and India face a loss of about 4.5 billion dollars annually [7]. The utilities of the USA face a loss of around 6 billion dollars annually [8]. It is necessary for the power utilities to overcome the revenue loss by detecting NTLs. Therefore, several strategies are being used by the power utilities. The integration of AMI in power grids provides many advanced and automatic electricity theft detection (ETD) methods. However, calculating such losses and figuring out their exact locations are considered the most crucial tasks [9]. The inefficiency and reduced profitability of power utilities are other major concerns. These issues cause an extra burden for honest consumers by adding extra charges to their actual utility bills. The aforementioned losses lead to other issues as well, such as disruption of industrial routines, inflation, and load shedding [10].

Several strategies and methods are presented in literature to handle issues related to ETD. These methods are commonly hardware-based, game theory-based, or data-driven [1]. The hardware-based methods, also termed the state-based methods [11], perform their operations with the utilization of physical devices, such as transformers, radio-frequency identification tags, sensors, and other electrical equipment. The state-based methods calculate the difference between energy generation at power utility side and energy consumption at consumers’ side. These methods achieve high efficiency in theft detection; however, the maintenance and safety of physical devices are major concerns [12]. On the other hand, in the game theory-based methods [13], a game is created between the participants (utilities and energy consumers). The participants compete with each other in order to increase their utility and to get more benefits [14]. Although the game theory-based methods are more efficient than the state-based methods, they are based on assumptions and are not able to perform efficient ETD.

In literature, data-driven methods are presented to perform efficient ETD. These methods utilize machine learning (ML) techniques and models, such as decision tree (DT) [3], artificial neural network (ANN) [15], support vector machine (SVM) [16], etc. These techniques use both labeled and unlabeled data to give optimal ETD results, as they have efficient learning capabilities. However, handling the imbalanced class data is challenging for these methods [17]. The traditional classifiers become biased towards the majority class if the data is imbalanced. Therefore, data balancing is required before classification in order to avoid the bias of a classifier and to achieve optimal ETD results. Data balancing is performed using resampling techniques, which balance the data in different classes present in a dataset. The abbreviations used in this work are presented in Table 1.

1.1. Problem Definition and Statement

The increased electricity demand has led to several issues, not just in underdeveloped countries but also in developed countries. The increase in poverty rate is one of the issues, which forces people to commit electricity theft, i.e., to adopt illegal means of using electricity to fulfill their demands. Therefore, ETD is important and needs immediate attention to curb the ever-increasing electricity theft rate. Keeping this in mind, many data analysis techniques, such as SVM, logistic regression (LR), and gated recurrent unit (GRU), have been presented in the literature. However, efficient results have not been achieved yet because of the limitations in these techniques [15]. Some of these limitations are poor learning rate, limited generalization capability, etc. However, the biggest issue being faced is the class imbalance issue. There exists a huge difference in the number of instances in the two classes, i.e., honest and theft consumers’ classes. Electricity theft leads to a revenue loss of billions of dollars annually for electric utilities [1] and poses serious threats to the country’s economy. In addition, electric utilities face electricity losses, which are further classified into TLs and NTLs, the latter being the most difficult to tackle. NTLs are caused by meddling either with the smart antennas or the smart meters installed at the consumers’ end. Therefore, to deal with the class imbalance issue and to avoid NTLs, an efficient model is presented in this work, termed particle swarm optimization GRU (PSO-GRU).

1.2. Contributions

This work presents a new variant of neural networks (NNs), named hybrid deep neural network (HDNN), in order to address the class imbalance and overfitting problems in ETD. This work extends the idea presented in [18]. The following are the primary contributions of this work:
(i) A metaheuristic technique, known as particle swarm optimization (PSO), is used with conventional GRU and convolutional neural network (CNN) to fine-tune the parameters and to improve the learning rate, which makes the proposed PSO-GRU model generalize better across training and testing and thereby addresses the overfitting issue
(ii) A real-world electricity consumption (EC) dataset provided by State Grid Corporation of China (SGCC) [19] is used
(iii) Missing values are recovered using the local average method, and the data is normalized using the min-max normalization technique
(iv) A comparison is made between the proposed and existing models, which demonstrates the model’s efficiency in terms of ETD

The rest of the paper is organized as follows. Section 2 gives an overview of the related work done for ETD. The proposed HDNN model is described in Section 3. The simulations performed in the proposed work are discussed in Section 4. In the end, Section 5 concludes the paper and presents the future work.

2. Related Work

In literature, many systems and approaches are presented for ETD. Most of them are based on hardware and game theory. However, maintenance and data diversity problems are still faced by these approaches. It is observed that the ML techniques are better than the hardware-based and game theory-based ETD methods because they have no maintenance requirements and are able to handle data diversity. However, various existing machine and deep learning techniques proposed in the literature face the overfitting problem [17].

The authors in [3] proposed a classification technique based on ensemble bagged tree (EBT) for detecting NTLs in power grids. The proposed technique handles the electricity loss issue in Multan Electric Power Company (MEPCO), Pakistan. In the proposed work, the technique is validated in terms of various performance metrics and is found more efficient than the existing techniques. In [20], the authors proposed a hybrid model for ETD that is based on long short-term memory (LSTM) and CNN. The model is compared with other models, and the results show that it beats the existing models and achieves high accuracy. The proposed model used in [21] is based on the relative entropy (RE) along with principal component analysis (PCA). This work is aimed at detecting the electricity losses, which occur in the vicinity of AMI, using the reconstructed data. The model is evaluated for sensitivity and specificity, and results indicate the good performance of the model for ETD.

In [22], the authors used a fuzzy logic technique for the detection of suspicious electricity consumers. The selected time series data is linked with consumers, and fuzzy sets of suspicion are created. Based on these fuzzy sets, a threshold value is decided, which helps in the detection of suspicious consumers. The proposed technique’s performance is examined in terms of curve membership function, and the results show that it performs better than benchmark techniques. Similarly, the authors in [23] presented a fuzzy logic technique to detect electricity theft and to increase the reliability of the power grid. The proposed technique is evaluated by presenting sixteen real-world scenarios. Efficiency of the technique is evaluated in terms of classifying honest and fraudulent consumers. However, integrating the renewable sources with power grids is not handled well. The authors in [24] extracted the EC behavior patterns of the users and detected the abnormal consumption behavior. The authors in [25] proposed a deep learning model to overcome the issues related to NTLs in smart grids. This model uses unlabeled data and an adversarial model to mitigate the overfitting issue. It performs better than the existing models. In [26], a hybrid approach is presented for ETD, which is based upon Gaussian mixture model (GMM) and LSTM. In this work, actual time series data is considered and some improvements are made in the LSTM structure. The simulations are carried out to show the performance of the proposed approach in terms of ETD.

In [27], the authors proposed a maximal overlap discrete wavelet packet transform (MODWPT) based model for feature extraction and random undersampling boosting (RUSBoost) technique to detect NTLs in the power grids. The model is compared with the benchmark techniques, and the results show that it outperforms them in terms of NTL detection. The authors in [28] established a relationship between commercial losses and characterization of irregular consumers using black hole algorithm. Two different datasets provided by a Brazilian electric utility are used in this work, and theft categorization is performed.

The authors in [29] presented a clustering-based approach for the detection of electricity thefts. The approach is based on maximum information coefficient (MIC) and clustering technique by fast search and find of density peaks (CFSFDP). Irish smart meter dataset is used in this work for carrying out the simulations. It is observed that the proposed model performs efficient theft detection. Similarly, a clustering approach is used by the authors in [30] to divide the consumers into clusters on the basis of load consumption and perform efficient short-term load forecasting. The authors in [31] adopted the LSTM method for forecasting EC of the consumers based on the recent past consumption profiles. The continuous monitoring of the profiles helps in efficient ETD. The simulation results prove the model’s efficiency. The authors in [32] used big data analytics for forecasting the EC along with the corresponding price. The simulations are performed to prove the model’s efficiency in terms of price forecasting. The authors in [33] proposed a fault-tolerant model to preserve the privacy of users and perform data aggregation at the smart grids. Table 2 summarizes the related work in a tabular form for better understanding.

3. Proposed System Model

In this section, the proposed system model is described along with the dataset used in this work. Furthermore, different techniques used in this work are discussed. The proposed system model is shown in Figure 1.

3.1. Description of the Proposed Model

The proposed HDNN-based model consists of several steps, discussed as follows. Initially, the data is gathered from the intelligent antenna-based smart meters installed at the consumers’ end, shown in the lower box of the proposed system model. Each smart home has a smart meter and an intelligent antenna, which help in recording the EC data. The data is saved and made publicly available by SGCC. Afterward, the dataset is preprocessed in order to normalize the data and to remove redundant and irrelevant data. The outliers are also removed to obtain more refined data. The most important features are then obtained through the feature engineering process, in which feature selection and extraction are done. Finally, the consumers are classified as normal or fraudulent. Figure 2 shows the flow of data acquired from the smart homes having intelligent antenna-based smart meters. In Table 3, the identified limitations are mapped to their respective solutions and validations.

(1) Description of the Dataset. In this work, real-time EC data of the users is used, provided by SGCC [19]. The data is gathered from the consumers’ side using intelligent antenna-based smart meters. The dataset consists of 1,035 features. A subset containing data of 3000 consumers is selected from the whole dataset, in which 2480 are normal consumers, while the remaining 520 consumers are fraudulent. It can be clearly observed that the dataset is imbalanced, due to which ETD is highly affected. In this work, the data is balanced using SMOTE [20], which balances the number of fraudulent and normal consumers. In addition, the dataset is divided into 75% and 25%, respectively, for training and testing purposes. Table 4 gives a detailed description of the dataset used in the proposed work.

(2) Synthetic Minority Oversampling Technique. SMOTE is an oversampling technique that increases the data points in the minority class (fraudsters) in order to handle the imbalanced data problem. In this work, the highly imbalanced data is balanced by generating synthesized data points in the minority class. In SMOTE, if $x$ depicts a sample of the minority class, then $x_k$ is selected from among its nearest neighbors in the same class. The synthesized data points are generated by the following equation:

$$x_{\mathrm{new}} = x + \delta \, (x_k - x),$$

where $\delta$ is a number chosen between 0 and 1 and $x_k - x$ is the difference between the minority class sample and its neighboring sample, whose magnitude is the Euclidean distance between them. This distance is calculated as

$$d(x, x_k) = \sqrt{\sum_{j} (x_{k,j} - x_j)^2}.$$
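As a minimal sketch of the interpolation step described above, the following Python code generates synthetic minority samples. The function names `smote_sample` and `smote`, the neighbor count `k`, and the seed are illustrative assumptions and are not taken from the proposed work; only the interpolation rule follows the description of SMOTE.

```python
import math
import random


def smote_sample(x, minority, k=5, rng=random):
    """Create one synthetic point between x and one of its k nearest minority neighbors."""
    # Rank the other minority samples by Euclidean distance to x.
    others = [m for m in minority if m is not x]
    neighbors = sorted(others, key=lambda m: math.dist(x, m))[:k]
    neighbor = rng.choice(neighbors)
    delta = rng.random()  # random number in [0, 1)
    # Move a random fraction of the way from x toward the chosen neighbor.
    return [xi + delta * (ni - xi) for xi, ni in zip(x, neighbor)]


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples (needs at least two minority samples)."""
    rng = random.Random(seed)
    return [smote_sample(rng.choice(minority), minority, k, rng) for _ in range(n_new)]


if __name__ == "__main__":
    fraud = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(smote(fraud, n_new=3, k=2))
```

For the subset described above, equalizing 520 fraudulent consumers with 2480 normal consumers would require 2480 - 520 = 1960 synthetic samples if SMOTE were applied to the whole subset; whether balancing is applied before or after the train/test split is not stated here.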

3.2. Data Preprocessing

The major aim of preprocessing step is to get the most refined data from the whole dataset. In this step, the missing or Not a Number (NaN) values are recovered by local average method, which is given in the following equation [7]. when the value of NAN is continuous, then 0.10 will be the value of . represents the EC of a user at specific time interval. has a binary value, i.e., either 0 or 1, which is based on threshold and is calculated using the following equation. where is computed in the following equation

The min-max normalization technique is used to transform the dataset into the range [0, 1]. The minimum value is mapped to 0, the maximum value is mapped to 1, and all other values fall between 0 and 1. The min-max technique is calculated by the equation given below:

$$X' = \frac{X - \min(X)}{\max(X) - \min(X)},$$

where $X$ refers to the actual value of the features and $X'$ indicates the value after normalization, while $\max(X)$ is the maximum value of $X$ and $\min(X)$ is the minimum value of $X$. The equations used above are adopted from [20].
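The two preprocessing steps of this subsection can be sketched in Python as follows. The imputation rule used here (mean of the two adjacent readings, with a `fallback` value when a neighbor is missing) is an assumed common variant of the local average method and may differ from the exact equation adopted from [7]; the function names are likewise hypothetical.

```python
import math


def local_average_impute(series, fallback=0.0):
    """Replace NaN readings with the mean of the two adjacent readings.

    Assumed rule: if either neighbor is absent or is itself NaN,
    `fallback` is used instead.
    """
    filled = list(series)
    for i, value in enumerate(series):
        if not math.isnan(value):
            continue
        has_neighbors = 0 < i < len(series) - 1
        if has_neighbors and not math.isnan(series[i - 1]) and not math.isnan(series[i + 1]):
            filled[i] = (series[i - 1] + series[i + 1]) / 2
        else:
            filled[i] = fallback
    return filled


def min_max_normalize(values):
    """Scale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no range information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


if __name__ == "__main__":
    readings = [2.0, float("nan"), 6.0, 4.0]
    print(min_max_normalize(local_average_impute(readings)))
```

Imputation is applied before normalization in this sketch so that the minimum and maximum are computed over complete data.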

3.3. Feature Engineering

Once the data is preprocessed and normalized, the feature engineering process is performed. This process includes two steps: one is feature selection and another is feature extraction. The former selects the most relevant data features from the whole dataset to reduce both overfitting and training time and to improve accuracy; whereas the latter extracts the selected features for data dimensionality reduction and removal of data redundancy. With the feature engineering process, the performance of the model is enhanced.

In this work, feature engineering is done using a DNN, namely a CNN. The idea of CNN was primarily presented in [20]. The typical CNN architecture has various layers, which include convolution, pooling, and fully connected layers. The first layer, the convolution layer, contains many convolution filters, termed kernels, which perform the mapping operation. The convolution layer is mathematically given in the following equation [34]:

$$h_k = f(W_k * x + b_k),$$

where $f$ refers to the activation function, $*$ represents the convolution operation, and $x$ is the input. $W_k$ and $b_k$ represent the learnable parameters in the $k$th feature filter. The next layer in CNN is the pooling layer, which comes after the convolution layer. The main objectives of this layer are to extract the meaningful features and to perform the downsampling of each feature map to achieve dimensionality reduction. It also reduces the execution time of the network. The pooling layer commonly uses one of two functions, which are as follows.
(i) Average Pooling. This function calculates the average of the feature values within each pooling window of the feature map
(ii) Maximum Pooling. This function takes the maximum feature value within each pooling window of the feature map
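To make the convolution and pooling operations concrete, the following is a minimal one-dimensional sketch in plain Python. It assumes a ReLU activation, a "valid" sliding window implemented as cross-correlation (as is common in deep learning libraries), and non-overlapping pooling windows; these choices and the function names are illustrative and are not taken from the architecture or hyperparameters of this work.

```python
def conv1d(x, kernel, bias=0.0):
    """Slide the kernel over x (no padding), add the bias, and apply ReLU."""
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        s = sum(x[i + j] * kernel[j] for j in range(k)) + bias
        out.append(max(0.0, s))  # ReLU activation (assumed)
    return out


def pool1d(feature_map, size, mode="max"):
    """Downsample a feature map with non-overlapping windows of the given size."""
    windows = [feature_map[i:i + size]
               for i in range(0, len(feature_map) - size + 1, size)]
    if mode == "max":
        return [max(w) for w in windows]
    return [sum(w) / len(w) for w in windows]  # average pooling


if __name__ == "__main__":
    consumption = [1.0, 2.0, 3.0, 4.0, 5.0]
    fmap = conv1d(consumption, kernel=[1.0, 1.0])
    print(fmap, pool1d(fmap, 2, "max"), pool1d(fmap, 2, "avg"))
```

With a pooling size of 2, the feature map is halved in length, which illustrates the dimensionality reduction mentioned above.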

In CNN, the third layer, i.e., the fully connected layer, performs the final classification. In ETD, the data is classified into honest and fraudulent consumers’ classes. The mathematical representation of the fully connected layer is given in Equation (8), taken from [34], where $W$ represents the weight and $b$ represents the bias.

The function used in CNN to predict the final output is known as the Softmax function. The output is given in binary form, either 0 or 1 [20]. Equation (9) provides a complete mathematical form of CNN, as given in [34]. Figure 3 gives an overview of the architecture of CNN.

The parameters involved in Equation (9) are initialized with random numbers drawn from the normal distribution. $m$ denotes the total number of training examples. Initially, the weight is assigned randomly; later, it is updated using the gradient descent method. Equations (10) and (11) correspond to the updates of $W_{ij}^{(l)}$ and $b_i^{(l)}$:

$$W_{ij}^{(l)} = W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W, b), \tag{10}$$

$$b_i^{(l)} = b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W, b), \tag{11}$$

where $J(W, b)$ is the objective function, $\alpha$ represents the learning rate, $\partial$ denotes the partial derivative, $W_{ij}^{(l)}$ is the connection weight between the $j$th neuron in the $l$th layer and the $i$th neuron in the $(l+1)$th layer, and $b_i^{(l)}$ is the bias of the $i$th neuron in the $(l+1)$th layer. Equations (10) and (11) are repeated until the optimal value of the objective function is achieved. The above mathematical representations are motivated from [34]. The hyperparameters of CNN used in this work and their values are given in Table 5.
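The gradient descent rule described above can be illustrated on a deliberately small problem. The sketch below fits a one-weight linear model with a mean squared error objective; this toy objective, the data, and the function name are assumptions made only for illustration and do not correspond to the CNN objective of Equation (9).

```python
def gradient_descent_step(w, b, xs, ys, alpha):
    """One update of w and b for J = (1 / 2m) * sum((w * x + b - y) ** 2)."""
    m = len(xs)
    grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / m
    grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / m
    # Move each parameter against its gradient, scaled by the learning rate.
    return w - alpha * grad_w, b - alpha * grad_b


if __name__ == "__main__":
    xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
    w, b = 0.0, 0.0
    for _ in range(5000):
        w, b = gradient_descent_step(w, b, xs, ys, alpha=0.1)
    print(round(w, 3), round(b, 3))
```

The update is repeated until the parameters stop changing appreciably, which mirrors the repetition of the weight and bias updates until the objective reaches its optimum.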

3.4. Gated Recurrent Unit

GRU is considered a variation of LSTM and recurrent neural network (RNN) and is a subclass of DNN. It resolves the vanishing gradient problem of RNN by using two gates: the update gate and the reset gate. These gates determine how much information is passed to the future. The update gate determines how much of the past information needs to be passed on. Equation (12) gives the formula to calculate the output of the update gate for time series data, taken from [35]:

$$z_t = \sigma\left(W^{(z)} x_t + U^{(z)} h_{t-1}\right), \tag{12}$$

where $x_t$ is the input given to the network unit and is multiplied by its weight $W^{(z)}$. $h_{t-1}$ holds the previous information and is multiplied by its weight $U^{(z)}$ as well. These two products are summed, and the result is squashed between 0 and 1 by applying the sigmoid function $\sigma$. The reset gate decides how much of the previous information is to be neglected. Equation (13) gives the mathematical form of the reset gate, taken from [35]:

$$r_t = \sigma\left(W^{(r)} x_t + U^{(r)} h_{t-1}\right), \tag{13}$$

where $x_t$ is multiplied by its weight $W^{(r)}$, and $h_{t-1}$ is multiplied by its weight $U^{(r)}$. Figure 4 shows the architectural view of GRU. In Table 6, the hyperparameters of GRU used in this work along with their values are presented.
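Since the update and reset gates share the same functional form and differ only in their weights, a single helper suffices to illustrate both. The sketch below uses plain Python lists for the weight matrices; the names `gru_gate` and `matvec` and the example weights are hypothetical, and the remaining parts of a GRU cell (candidate state and new hidden state) are intentionally left out because they are not described in this subsection.

```python
import math


def sigmoid(v):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def gru_gate(W, U, x_t, h_prev):
    """Gate activation sigmoid(W x_t + U h_prev), used for both update and reset gates."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [sigmoid(a + b) for a, b in zip(wx, uh)]


if __name__ == "__main__":
    x_t, h_prev = [1.0, 2.0], [0.5, -0.5]
    W_z = [[0.1, 0.2], [0.3, 0.4]]
    U_z = [[0.5, 0.0], [0.0, 0.5]]
    print(gru_gate(W_z, U_z, x_t, h_prev))  # update gate with its own weights
```

Calling the helper with a second pair of weight matrices gives the reset gate.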

3.5. Particle Swarm Optimization

PSO is a population-based stochastic technique that handles the local optima issue by exploring the search space for the global optimum solution. The traditional ML techniques, such as GRU, LR, and SVM, often get stuck in local optima, which leads to poor ETD performance. In this work, a hybrid technique is made by integrating PSO with GRU to perform efficient and accurate ETD. The proposed technique overcomes the local optima issue efficiently. PSO performs the search via swarm particles, which are updated in every iteration. The best solution is achieved by moving each particle in the direction of its previous best ($\mathrm{pbest}$) and the global best ($\mathrm{gbest}$) solutions in the swarm [18]. Equations (14) and (15) give the mathematical form of calculating $\mathrm{pbest}$ and $\mathrm{gbest}$, respectively:

$$\mathrm{pbest}(i, t) = \underset{k = 1, \ldots, t}{\arg\min}\, f\big(p_i(k)\big), \quad i \in \{1, \ldots, N_p\}, \tag{14}$$

$$\mathrm{gbest}(t) = \underset{i = 1, \ldots, N_p;\; k = 1, \ldots, t}{\arg\min}\, f\big(p_i(k)\big), \tag{15}$$

where $i$ indicates the particle index, $t$ gives the current iteration number, $N_p$ gives the total number of particles, $f$ represents the fitness function, and $p$ tells the position. The velocity $V$ of a particle is updated using the following equation:

$$V_i(t+1) = \omega V_i(t) + c_1 r_1 \big(\mathrm{pbest}(i, t) - p_i(t)\big) + c_2 r_2 \big(\mathrm{gbest}(t) - p_i(t)\big), \tag{16}$$

where $\omega$ is the inertia weight that is used to balance global exploration and local exploitation, $r_1$ and $r_2$ indicate uniformly distributed random variables in the range [0, 1], and $c_1$ and $c_2$ represent positive constant parameters, also known as acceleration coefficients. The hyperparameters of PSO used in this work and their values are given in Table 7. The above given mathematical formulations of PSO are taken from [18]. Algorithm 1 gives the pseudocode of PSO.

The efficiency of the presented hybrid model is optimized by passing three parameters of GRU to PSO. Based on these parameters, the training and testing processes of the model are optimized for the given dataset. As a result, the model becomes more accurate and more robust. These parameters are discussed below.
(i) Hidden Layer. It is considered the most important layer of GRU. It is positioned between the input layer and the output layer. This layer primarily performs the computational operations. Moreover, the weights are given to input values by this layer. After successfully accessing optimal input sets, the results are passed to the output layer for final predictions
(ii) Batches. They determine the number of training samples required to compute training and testing loss. Generally, the loss is calculated by the predefined loss function
(iii) Learning Rate. It establishes the adjustment of the network weights, which are needed regarding the loss gradient

Algorithm 1: Pseudocode of PSO.
Initialization
for each particle i = 1, …, Np do
  (a) Initialize the particle’s position using a uniform distribution: pi(0) ~ U(LB, UB), where LB and UB represent the lower bound and upper bound of the search space, respectively.
  (b) Initialize pbest to its initial position: pbest(i, 0) = pi(0).
  (c) Initialize gbest to the minimal value of the swarm: gbest(0) = argmin f[pi(0)].
  (d) Initialize the velocity: Vi ~ U(−|UB − LB|, |UB − LB|).
end for
Repeat until the termination criterion is met
  for each particle i = 1, …, Np do
    (a) Pick random numbers: r1, r2 ~ U(0, 1).
    (b) Update the particle’s velocity. See Equation (16).
    (c) Update the particle’s position: pi(t + 1) = pi(t) + Vi(t + 1).
    if f[pi(t)] < f[pbest(i, t)] then
      Update the best known position of particle i: pbest(i, t) = pi(t)
      if f[pi(t)] < f[gbest(t)] then
        Update the swarm’s best known position: gbest(t) = pi(t)
      end if
    end if
  end for
Output gbest(t), which holds the best solution found.
End
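The PSO pseudocode can be turned into a short runnable sketch as follows. It is a generic global-best PSO written under stated assumptions: the default values of `w`, `c1`, `c2`, the swarm size, and the iteration count are illustrative rather than the values of Table 7, positions are clamped to the search bounds, and the sphere function in the usage example merely stands in for a fitness function.

```python
import random


def pso(f, dim, lb, ub, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over the box [lb, ub]^dim with a basic global-best PSO."""
    rng = random.Random(seed)
    span = abs(ub - lb)
    # Initialization: positions ~ U(LB, UB), velocities ~ U(-|UB - LB|, |UB - LB|).
    pos = [[rng.uniform(lb, ub) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[rng.uniform(-span, span) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()  # one pair of random numbers per particle
            for d in range(dim):
                # Velocity update: inertia + pull toward pbest + pull toward gbest.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clamped to the search bounds.
                pos[i][d] = min(ub, max(lb, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    sphere = lambda p: sum(v * v for v in p)
    print(pso(sphere, dim=2, lb=-5.0, ub=5.0, n_iter=200))
```

In a hyperparameter-tuning setting, each particle position would encode the tuned parameters (for example, hidden layer size, batch size, and learning rate), and `f` would return a validation loss of the trained network; that fitness function is hypothetical and is not implemented here.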
3.6. Particle Swarm Optimization-Gated Recurrent Unit Binary Classification

A hybrid deep model based on PSO-GRU is presented in this work. In the existing models, overfitting and parameter tuning are the main problems. These problems further lead to an increase in the false-negative rate and false-positive rate (which are given in Equations (19) and (20)). Therefore, a hybrid PSO-GRU model is presented to overcome the aforementioned issues. The pseudocode of the model is given in Algorithm 2. The main purpose of PSO is to improve the learning of the GRU network and to solve the overfitting problem. In the whole process, the data is initially preprocessed: the local average method is used for recovering missing values, whereas min-max normalization is applied to scale the data. Afterward, data balancing is performed using the SMOTE oversampling technique, which generates synthetic samples of the minority class, because a classifier trained on imbalanced data is biased towards the majority class. Useful features are then extracted from the dataset by CNN and passed to GRU for training. The parameters of GRU are tuned using PSO. Finally, the fine-tuned model is used to perform classification.

Algorithm 2: Pseudocode of the PSO-GRU model.
1: Initialization
2: Load the dataset Ds from the smart meter Sm
3: Perform pre-processing steps
4: Split Ds into Dst (train dataset) and Dste (test dataset)
5: Adjust the parameters of the classifier, i.e., b, h, n, and e, where b is the number of batches, h is the number of hidden layers, n is the number of neurons in the hidden layer, and e is the number of epochs set for the training and testing phases
6: Pass the h, b, and lr parameters to the PSO, where lr is the learning rate
7: Steps in training phase:
8: Build the PSO-GRU model gruPSO with the network parameters b, h, and lr for the Ds
9: Repeat
10: At the nth epoch do
11: Train the gruPSO to fetch b from Ds
12: Proposed model’s performance is measured in terms of AUC, accuracy, precision, recall, and F1-score
13: Update the performance of the queue ∈ Q. where Q
14: End
15: The loss function is calculated until loss function ls ≤ Cn, where Cn is the convergence threshold
16: If early stopping is performed, then
17: Proceed
18: Else, go to step 8
19: Steps in testing phase:
20: Fetch Dste from the Ds
21: Test is performed by gruPSO on the Dste to generate the final results through plots
22: Compare the proposed and benchmark models on the basis of Dste in terms of performance measures for final prediction
23: End
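The stopping logic of the training phase (steps 9 to 18) can be sketched independently of any particular network. In the following Python sketch, `train_epoch` is a hypothetical callback that trains for one epoch and returns the loss, `threshold` plays the role of the convergence threshold Cn, and `max_epochs` plays the role of e; none of these names come from the proposed work.

```python
def train_until_converged(train_epoch, threshold, max_epochs):
    """Run epochs until the loss drops to the threshold or the epoch budget is used up."""
    history = []
    for epoch in range(1, max_epochs + 1):
        loss = train_epoch(epoch)  # hypothetical: one pass over the training batches
        history.append(loss)
        if loss <= threshold:
            break  # convergence threshold reached, stop early
    return history


if __name__ == "__main__":
    # Stand-in for a real training step: the loss simply decays as 1 / epoch.
    losses = train_until_converged(lambda epoch: 1.0 / epoch, threshold=0.25, max_epochs=10)
    print(len(losses), losses[-1])
```

Returning the full loss history makes it straightforward to plot training curves afterwards.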

4. Simulation Results and Discussion

Several simulations are conducted to assess the performance of the proposed hybrid model. The simulation results, performance metrics, and benchmark models are discussed in this section.

4.1. Performance Metrics

The performance of the proposed model is examined by considering several performance measures, which include AUC, accuracy, F1-score, recall, and precision. The training and testing loss and accuracy are also calculated to assess the performance of the proposed and the benchmark models. A detailed description of the performance metrics is given below.

(1) Area Under Curve. This performance metric is used for the validation of the model by considering the area under the curve between two integration limits. Moreover, it provides the cumulative performance over the binary classes. Generally, the value of AUC lies between 0 and 1, where a value close to 0 means that the performance of the model is poor, whereas 1 indicates the best performance. Equation (17) is used to compute the value of AUC [20], as given below:

$$\mathrm{AUC} = \frac{\sum_{i \in \text{positive class}} \mathrm{Rank}_i - \frac{M(1 + M)}{2}}{M \times N}, \tag{17}$$

where the rank values of samples are indicated by $\mathrm{Rank}_i$, $M$ represents the total positive samples, and $N$ represents the negative ones.

(2) F1-Score. It is also referred to as the F-measure. It assesses the testing performance of the model by combining recall and precision. The F1-score is calculated using the following equation [20]:

$$F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$

(3) Recall and Precision. The recall and precision are expressed using the following equations:

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP},$$

where recall is the fraction of actual energy thieves that the classifier correctly identifies. True positive (TP) means correctly classified energy thieves. False positive (FP) means honest electricity consumers misclassified as thieves. False negative (FN) means energy thieves misclassified as honest consumers, and true negative (TN) means correctly classified honest consumers. Precision is the measure of relevance of the classifier’s results, i.e., the fraction of consumers flagged as thieves that are actually thieves. If precision is high, it means that the classifier’s relevant results outnumber the irrelevant results.
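The metrics above can be computed directly from labels, predictions, and scores. The following Python sketch is illustrative: label 1 is assumed to denote a fraudulent consumer and 0 an honest one, the function names are hypothetical, and the AUC routine uses the rank-based formulation with average ranks for tied scores.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc_from_ranks(y_true, scores):
    """Rank-based AUC: (sum of positive ranks - M(M + 1)/2) / (M * N)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank shared by tied scores
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    m = sum(y_true)            # number of positive samples
    n = len(y_true) - m        # number of negative samples
    pos_rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (pos_rank_sum - m * (m + 1) / 2) / (m * n)


if __name__ == "__main__":
    y_true = [1, 0, 1, 1, 0, 0]
    y_pred = [1, 0, 0, 1, 1, 0]
    scores = [0.9, 0.1, 0.4, 0.8, 0.5, 0.2]
    print(precision_recall_f1(y_true, y_pred), auc_from_ranks(y_true, scores))
```

In this example, 8 of the 9 positive-negative pairs are ordered correctly by the scores, so the rank-based AUC equals 8/9.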

4.1.1. Case Study A

In this case study, the SGCC dataset is considered, which is publicly available on the internet. A brief description of the dataset is given in Section 3.1.

4.2. Description of Existing Models with their Performance

This section presents the existing models along with their performance based on the abovementioned performance metrics.

(1) Support Vector Machine Model. SVM is widely used in ETD for binary classification [36]. The performance metrics used for SVM and their results are given in Table 8. It performs better than the LR model; however, its performance is worse than that of LSTM, GRU, and PSO-GRU. It means that SVM is less accurate in handling the imbalanced data.

(2) Logistic Regression Model. LR is a popular classifier that is widely used for classification. In literature, it is also used for ETD [37]. In this work, the performance of LR is examined in terms of the aforementioned performance metrics. The results of LR are given in Table 9, which show that it performs worse than all of the other benchmark techniques. The reasons behind this are the overfitting issue and the inability of LR to handle imbalanced ETD data.

(3) Long Short Term Memory Model. LSTM is a DNN model and is widely used for feature extraction and classification in ETD [38, 39]. The performance of LSTM is checked in terms of the abovementioned metrics. The results are presented in Table 10, which show that LSTM performs better than SVM and LR and worse than the proposed model. The proposed model performs better than LSTM because it does not face the overfitting problem, which degrades the performance of LSTM.

(4) Gated Recurrent Unit Model. GRU is also a DNN model [40]. It is a variant of LSTM. Its results are shown in Table 11, which are better than those of the benchmarks discussed above. It means that GRU is capable of handling imbalanced data while avoiding overfitting.

(5) Genetic Algorithm Model. Genetic algorithm (GA) is a metaheuristic technique. Its performance is assessed based on various performance metrics, as given in Table 12. The results show that GA performs better than the benchmark models: SVM, LR, LSTM, and GRU. GA is also found to be more accurate and robust than the benchmarks because of its better learning capability.

4.3. Results

The simulations are performed to evaluate the proposed and benchmark models by considering the aforementioned performance measures along with loss and accuracy. Moreover, the models are integrated for the performance evaluation. SVM and LR obtain AUC scores of 0.68 and 0.63, respectively. SVM has a higher AUC score than LR because it uses the kernel trick to cope with nonlinear data. In contrast, LR has the lowest AUC score of 0.63 because it has only a single layer, which does not handle high-dimensional data effectively and gets stuck in local minima. The performance results of SVM and LR are given in Tables 8 and 9, respectively.

Figures 5 and 6 show the accuracy and loss values of the combined CNN-LSTM model. The CNN is utilized to extract abstract and latent features from EC data with the help of convolutional and pooling layers, whereas LSTM extracts temporal patterns and classifies consumers' records into normal and abnormal data patterns. From Figure 5, the training accuracy is observed as 81%, whereas the testing accuracy of the model is 75.5%. Both accuracies increase with the number of epochs, which suggests that the model would give better results with a larger number of epochs. However, the model is trained for only four epochs due to limited resources. There is a gap of roughly 5.5 percentage points between the training and testing accuracy curves, which indicates that the model is overfitting. On the other hand, Figure 6 shows that the training and testing losses of the model decrease as the number of epochs increases. After the 4th epoch, the training and testing losses are 19% and 24.5%, respectively.

This implies that the number of epochs is the main controlling parameter that decides the optimal point where the model achieves higher performance. However, CNN-LSTM obtains only 75.5% test accuracy, which is not acceptable for ETD. This happens due to the inappropriate selection of hyperparameters. For deep learning models, the suitable selection of hyperparameters has a great influence on the performance results.

Figures 7 and 8 show the accuracy and loss values of the combined CNN-GRU model on the training and testing datasets. The CNN is used to extract optimal features, while the classification task is performed by the GRU model. The GRU model has reset and update gates that extract the most relevant information from the high-variance features extracted by the CNN and remove the noisy and redundant features. This process makes the performance of CNN-GRU better than that of CNN-LSTM. There is a 4% difference between the accuracy curves and a 6.97% difference between the loss curves on the training and testing datasets, which indicates that the CNN-GRU model is overfitting. The inappropriate tuning of hyperparameters leads to an overfitting problem, where the model gives better results on seen data than on unseen data.
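To make the role of the reset and update gates concrete, the following is a minimal sketch of one GRU time step in plain Python. It assumes the standard GRU formulation, in which the update gate blends the previous state with a candidate state; it is not the configuration used in the experiments. The function and parameter names (`gru_step`, `zero_params`, `Wz`, `Uz`, and so on) are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, x, U, h, b):
    """Return W @ x + U @ h + b for plain Python lists."""
    return [sum(w * xi for w, xi in zip(W[j], x))
            + sum(u * hi for u, hi in zip(U[j], h))
            + b[j]
            for j in range(len(b))]


def gru_step(x, h_prev, p):
    """One GRU time step; `p` holds hypothetical weight names (assumption)."""
    # Update gate z: how much of the previous state is kept.
    z = [sigmoid(a) for a in affine(p["Wz"], x, p["Uz"], h_prev, p["bz"])]
    # Reset gate r: how much of the previous state feeds the candidate.
    r = [sigmoid(a) for a in affine(p["Wr"], x, p["Ur"], h_prev, p["br"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    # Candidate state built from the input and the reset-scaled state.
    h_cand = [math.tanh(a) for a in affine(p["Wh"], x, p["Uh"], gated, p["bh"])]
    # New state: elementwise blend of the old state and the candidate.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


def zero_params(n_in, n_hid):
    """Hypothetical helper: all-zero weights, handy for a deterministic check."""
    p = {}
    for gate in ("z", "r", "h"):
        p["W" + gate] = [[0.0] * n_in for _ in range(n_hid)]
        p["U" + gate] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b" + gate] = [0.0] * n_hid
    return p


params = zero_params(n_in=2, n_hid=2)
h_next = gru_step([0.3, -0.1], [1.0, -2.0], params)
```

With all-zero weights both gates equal 0.5 and the candidate state is zero, so the new state is simply half of the previous one. A large positive update-gate bias would instead copy the previous state almost unchanged, which is how the gate preserves long-term context.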

In the literature, there are different techniques to tune the hyperparameters of ML and deep learning models, such as random search, grid search, gradient-based optimization, and evolutionary algorithms. Each of these methods has its pros and cons. In this study, we utilize the evolutionary algorithms PSO and GA to find the optimal hyperparameters of the CNN-GRU model. These algorithms explore a search space of hyperparameters and try to find the combination at which the model gives high values of the performance indicators.
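The search procedure can be illustrated with a minimal PSO sketch in plain Python. It is a generic textbook PSO under stated assumptions, not the authors' implementation: the inertia and acceleration coefficients are common default choices, and `toy_validation_loss` is a hypothetical stand-in for training CNN-GRU and returning its validation loss. The location of the toy optimum is arbitrary.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iters=50,
                 inertia=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `objective` over the box `bounds` with a basic PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Pull towards the particle's own best and the swarm's best.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Clamp the new position to the search box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def toy_validation_loss(params):
    """Hypothetical objective; a real one would train and validate CNN-GRU."""
    log_lr, epochs = params
    return (log_lr + 3.0) ** 2 + ((epochs - 25.0) / 10.0) ** 2


best, best_loss = pso_minimize(toy_validation_loss,
                               bounds=[(-5.0, -1.0), (1.0, 50.0)])
```

In a real setting each objective evaluation is a full training run, so the number of particles and iterations directly multiplies the training cost.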

PSO is merged with CNN-GRU for hyperparameter tuning to enhance the performance of the proposed model. The results are shown in Figures 9 and 10. The former shows that both the training and testing accuracy of CNN-GRU-PSO increase with the number of epochs; the training accuracy is 89%, while the testing accuracy is 87.3%. The latter shows the training and testing loss, which decrease with the number of epochs and reach 11% and 12.7%, respectively. Furthermore, CNN-GRU is integrated with GA to find the optimal combination of hyperparameters. The performance of the combined CNN-GRU-GA is presented in Figures 11 and 12. Figure 11 shows that the training and testing accuracy are approximately 87% and 86.3%, respectively. Figure 12 presents the training and testing loss of CNN-GRU-GA, which keep decreasing with the increasing number of epochs. The training loss and the testing loss are 13% and 13.7%, respectively, which are approximately the same. Here, both PSO and GA are used for tuning the hyperparameters of the CNN-GRU model. The results show that PSO finds a better set of hyperparameters than GA because PSO requires fewer control parameters and less execution time. In PSO, each solution keeps its own local best, which leads it towards the global best after each iteration. GA, in contrast, has crossover and mutation steps that create diversity in the newly generated offspring and prevent the search from falling into local optima. However, in this case, PSO performs better than GA and gives the combination of hyperparameters at which CNN-GRU gives good results. Overall, the training and testing accuracy and loss of the combined models are depicted in Figures 5-10.

It is observed that CNN-GRU-PSO has the maximum accuracy as compared to the other models. Similarly, the CNN-GRU-PSO model has the minimum loss, which shows the model's generalization. Figures 13 and 14 show the combined performance of the proposed and existing models in terms of AUC. In the former, AUC is calculated using both the false positive rate and the true positive rate. The results indicate that the proposed CNN-GRU-PSO achieves a high AUC score, whereas the other models have low AUC scores. Moreover, it is shown that the proposed model beats the existing ones regarding AUC in the presence of an imbalanced dataset. The GRU module in the proposed model has a strong ability to learn temporal correlation from the long-term electricity load profile of consumers. It also maintains the context of previous EC information, which helps to handle any nonmalicious change (weather conditions, family structure, etc.) in the EC profile. Moreover, the integration of PSO for parameter tuning further enhances the performance of the proposed model towards efficient ETD. Furthermore, the proposed model is compared with the benchmark models in terms of the mentioned performance metrics, and the result is shown in Figure 15. The result indicates that the hybrid CNN-GRU-PSO model is more robust, accurate, efficient, and generalized than the benchmarks, as given in Table 13.
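As an illustration of how an AUC score follows from the false positive rate and the true positive rate, the sketch below sweeps a decision threshold over the classifier scores and integrates the resulting ROC curve with the trapezoidal rule. The function name `roc_auc` and the toy labels and scores are assumptions made for this example only; they are not taken from the experiments.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the trapezoidal rule over (FPR, TPR)."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required")
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Samples with equal scores share one threshold, hence one ROC point.
        while i < len(order) and scores[order[i]] == current:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc


# Toy example: 1 marks a fraudulent consumer, 0 an honest one.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because AUC depends only on how the scores rank positives against negatives, it is not inflated by a large majority class, which is why it is a common choice for imbalanced problems such as ETD.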

Table 14 describes the running time of the proposed and baseline models. The SVM model takes 220 s during the training phase, which is higher than all other schemes. The selected SGCC dataset is high dimensional and not linearly separable. SVM considers candidate hyperplanes in the $n$-dimensional feature space and then picks the optimal hyperplane with the largest margin for distinguishing the two classes ($n$ represents the number of dimensions). This is why SVM takes more time than the other models.

LR takes the lowest execution time because of its simple layer structure. It is equivalent to a single-layer neural network (NN), so it has fewer weights to learn and consumes less time than the deep learning models. CNN-GRU takes 56 seconds of running time in the training phase, which is 10 seconds less than CNN-LSTM because GRU has fewer gates than LSTM. The proposed CNN-GRU-PSO has a higher execution time than the other models because it uses PSO for tuning the hyperparameters of both the CNN and GRU models concurrently.

4.3.1. Case Study B

In this case study, the PRECON dataset is used, which is publicly available on the internet. The dataset is collected by Pakistan Residential EC company. This dataset contains the EC history of 42 residential houses for 365 days. In the dataset, the EC of each user is recorded at one-minute intervals. However, in this work, the data granularity is reduced to half an hour for ease: the EC of every 30 minutes is aggregated into a single value across the whole dataset. All the consumers who participated in the experiment are assumed to be honest consumers, so the dataset contains the EC of only normal consumers. For ETD, malicious samples need to be generated as well. For this, six theft attacks are applied to the benign samples in order to generate attack samples. These attacks are introduced by Jokar et al. in [1]. After applying the theft attacks, the generated theft samples greatly outnumber the benign class samples, so the dataset becomes imbalanced. For data balancing, SMOTE is used to synthesize samples of the minority benign class.
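The two preprocessing steps described above, reducing the granularity and oversampling the minority class, can be sketched as follows. This is a minimal illustration under stated assumptions: summation is assumed as the aggregation rule (the text does not say whether a sum or a mean is used), and `smote_like` is a simplified, hypothetical rendition of the SMOTE idea (interpolating between a minority sample and one of its nearest minority neighbours), not the library routine used in the experiments.

```python
import random


def aggregate_half_hourly(minute_readings, window=30):
    """Sum consecutive 1-minute readings into `window`-minute totals."""
    if len(minute_readings) % window != 0:
        raise ValueError("length must be a multiple of the window")
    return [sum(minute_readings[i:i + window])
            for i in range(0, len(minute_readings), window)]


def smote_like(minority, n_new, k=3, seed=0):
    """Create `n_new` synthetic samples by interpolating towards neighbours."""
    if len(minority) < 2:
        raise ValueError("at least two minority samples are required")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared distance to `base`.
        others = [m for m in minority if m is not base]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# One day of 1-minute readings (1440 values) becomes 48 half-hour features.
day = [1.0] * 1440
half_hourly = aggregate_half_hourly(day)

benign = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_benign = smote_like(benign, n_new=10)
```

Since every synthetic point lies on a segment between two real minority samples, the oversampled class stays within the region already covered by the real data. In practice, oversampling is usually applied to the training split only, so that no synthetic sample leaks into the test data.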

4.4. Results and Discussion

The simulations are performed on the PRECON dataset, and the performance is measured through different performance metrics. Figure 16 depicts the loss of the proposed hybrid model. A decline in loss is noted at every epoch. The number of epochs is the main controlling parameter during the training phase. The optimal epoch value is 25, which is found by PSO during parameter tuning. It is observed that the proposed model efficiently learns the EC patterns in each iteration by using the strong feature learning and temporal correlation abilities of the CNN and GRU models, respectively. Similarly, Figure 17 shows the accuracy of the proposed model. The accuracy indicates how accurately the data samples are classified; a higher accuracy means more correct predictions. The accuracy increases gradually on the training and testing data after each epoch. Figure 18 illustrates the F1-score, which is the harmonic mean of precision and recall. It reflects how accurately the model identifies the energy thieves. A higher F1-score is beneficial for power utilities to recover maximum revenue. The AUC scores of the proposed and baseline models are presented in Figure 19. The AUC measures the separability between the positive and negative classes. The proposed model obtains an AUC score of 0.95, which is higher than all benchmark models. This implies that the proposed model efficiently distinguishes the two classes and reduces the misclassification rate to a minimal level. Furthermore, Table 15 describes the performance results of the baseline models and the proposed model on the PRECON dataset. It is seen from the results that SVM and LR also perform well because the dataset has only 48 features after reducing the data granularity to half an hour. SVM and LR are capable of dealing with this data dimensionality effectively and learn the EC patterns for efficient ETD. The regular LSTM captures temporal correlation from historical EC data and reduces the high FPR. The hybrid CNN-LSTM and CNN-GRU models achieve satisfactory performance towards efficient ETD by using the feature extraction ability of the CNN and the ability of the LSTM and GRU models to store temporal correlations. However, the proposed CNN-GRU-PSO outperforms the other models because it utilizes PSO for hyperparameter tuning. The best selection of hyperparameters boosts the performance of the classification model, and PSO finds the set of parameters at which the models obtain the highest performance. It is concluded that the proposed model performs efficiently on both case studies. The performance results indicate that our method is a suitable solution for efficient ETD.
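For reference, the threshold-based metrics discussed in this section can all be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function name `classification_metrics` and the example counts are hypothetical and are not results from Table 15.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary metrics from confusion-matrix counts.

    tp: thieves flagged as thieves, fp: honest consumers flagged as thieves,
    fn: thieves missed, tn: honest consumers correctly cleared.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1-score is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2.0 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # False positive rate: share of honest consumers wrongly flagged.
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}


# Hypothetical counts, used only to exercise the formulas.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

The harmonic mean penalizes a large gap between precision and recall, so a model cannot reach a high F1-score by flagging nearly everyone or nearly no one as a thief.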

4.5. Statistical Tests for Evaluating the Significance of the Results

Statistical tests are performed to check the significance and robustness of different learning algorithms. These tests determine whether one learning algorithm outperforms another by chance (due to randomness) or in reality. The core principle of these tests is the null hypothesis, which is constituted according to the problem. For instance, for a classification task, the null hypothesis would be that there is no real difference between the mean performance of two algorithms. Numerous statistical tests are available to judge the significance of different case studies. However, in this work, our intention is to compare and judge the performance of different classification algorithms. A good statistical test is capable of accounting for the randomness in different classifiers' results, the random variation in classification error, and the randomness in the selection of test data. In this regard, three suitable statistical tests are chosen that are closely related to the classification task. The detailed description of these tests is given as follows.

(1) 5x2cv Paired $t$ Test. This is a well-known statistical test for evaluating the performance of classification and regression models. It is introduced by Dietterich in 1998 [41]. In this study, this test is conducted to judge the performance of different classifiers. It consists of twofold cross-validation with five repeats. For each fold, the classifier is trained and the results are recorded. Afterward, the 5x2cv paired $t$ test is applied to the final result to accept or reject the null hypothesis. In this case, the null hypothesis states that there is no real difference between the mean performance of the two algorithms. In this test, the $p$ value is calculated against each $t$ value. Here, $p$ denotes the probability value, which decides whether the result on the sample data occurred by chance or not. The $p$ value ranges from 0 to 1, and smaller values are preferable. In general, the $p$ value should be less than 0.05 (5%). However, this test is suitable where the sample size is small and the mean is known.

(2) 5x2cv Combined $F$ Test. The basic principle of this test is similar to that of the 5x2cv paired $t$ test, with some improvements. This test is introduced by Alpaydin [42]. It is suitable for large sample sizes and is capable of comparing more than two populations at the same time. This test is conducted to compare the variance of two populations, while the 5x2cv paired $t$ test judges the mean of two populations. The working mechanism of this test is similar to that of the 5x2cv paired $t$ test. The twofold cross-validation is conducted with five repeats. The results of each fold are recorded, and then the test is applied to evaluate the null hypothesis. The decision on the null hypothesis is based on the $p$ value: the null hypothesis is rejected if the $p$ value is smaller than 0.05 (5%); otherwise, it is accepted.

(3) McNemar's Test. This is a nonparametric and distribution-free statistical test. It is conducted to check the disagreement of the classifiers on the performance results. In this test, a contingency table is constructed to judge its homogeneity. In our scenario, this test judges how two or more classifiers agree or disagree on the same sample. For this, the contingency table is designed according to the confusion matrices of the two classifiers. The structure of the contingency table is shown in Table 16. The table is finalized by filling in the required values, which are derived from the confusion matrix. Afterward, McNemar's test statistic is calculated from the contingency table by using the following formula:

$$\chi^2 = \frac{(b - c)^2}{b + c},$$

where $\chi^2$ denotes McNemar's test statistic, and $b$ and $c$ are the two off-diagonal (disagreement) counts of the contingency table. Similar to the abovementioned tests, a null hypothesis is formulated. The null hypothesis ($H_0$) states that the classifiers have a similar proportion of errors on the test set. The $p$ value is calculated to accept or reject $H_0$. $H_0$ is rejected if the $p$ value is less than 0.05 and accepted otherwise.
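McNemar's test can be sketched in a few lines of plain Python. The sketch below assumes the standard uncorrected statistic, computed from the two disagreement counts, and obtains the $p$ value from the chi-square distribution with one degree of freedom through the complementary error function. The function name `mcnemar_test` and the toy predictions are hypothetical; whether the experiments use a continuity correction is not stated in the text.

```python
import math


def mcnemar_test(y_true, pred_a, pred_b):
    """Return (statistic, p_value) of the uncorrected McNemar test."""
    # b: classifier A correct and B wrong; c: A wrong and B correct.
    b = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a == t and o != t)
    c = sum(1 for t, a, o in zip(y_true, pred_a, pred_b) if a != t and o == t)
    if b + c == 0:
        # The classifiers never disagree, so there is no evidence against H0.
        return 0.0, 1.0
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Toy data: A alone is right on 9 samples, B alone is right on 1 sample.
y_true = [1] * 20
pred_a = [1] * 9 + [0] * 1 + [1] * 10
pred_b = [0] * 9 + [1] * 1 + [1] * 10
stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
```

Only the samples on which the two classifiers disagree enter the statistic, so the test stays informative even when both classifiers share a high overall accuracy. For small disagreement counts, an exact binomial variant is often preferred over the chi-square approximation.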

Table 17 describes the results of the different statistical tests. It is seen that the results of the 5x2cv paired test outperform the other tests because it is suitable for a large population size. The probability value ($p$) is almost always less than 0.05 (5%). The smallest value of $p$ indicates that the models' results did not occur by chance. The test does not perform well because the available dataset is large in both population and sample size. McNemar's test also yields a better value of $p$ and confirms that the models' results did not occur by chance. All the results are real and do not depend on any noise factor.
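The two 5x2cv statistics compared in Table 17 can be computed from the ten per-fold performance differences between two classifiers. The sketch below follows the usual definitions attributed to Dietterich [41] and Alpaydin [42]: the $t$ statistic is referred to a $t$ distribution with 5 degrees of freedom and the $F$ statistic to an $F$ distribution with 10 and 5 degrees of freedom. The function name, the tabulated critical values, and the example differences are assumptions for illustration, not values from the experiments.

```python
import math

T_CRITICAL_5DF = 2.571     # two-sided 5% critical value, t with 5 d.o.f.
F_CRITICAL_10_5 = 4.74     # 5% critical value, F with (10, 5) d.o.f.


def five_by_two_cv_stats(diffs):
    """Return (t_stat, f_stat) from five (fold-1, fold-2) difference pairs.

    Each pair holds the performance difference between the two classifiers
    on the two folds of one repetition of 2-fold cross-validation.
    """
    if len(diffs) != 5:
        raise ValueError("exactly five repetitions are expected")
    variances = []
    for d1, d2 in diffs:
        mean = (d1 + d2) / 2.0
        variances.append((d1 - mean) ** 2 + (d2 - mean) ** 2)
    total_var = sum(variances)
    if total_var == 0.0:
        raise ValueError("differences have zero variance")
    # Paired t statistic: first difference over the pooled standard deviation.
    t_stat = diffs[0][0] / math.sqrt(total_var / 5.0)
    # Combined F statistic: uses all ten squared differences.
    f_stat = sum(d1 ** 2 + d2 ** 2 for d1, d2 in diffs) / (2.0 * total_var)
    return t_stat, f_stat


# Hypothetical accuracy differences between two classifiers.
example = [(0.02, 0.04)] * 5
t_stat, f_stat = five_by_two_cv_stats(example)
reject_by_t = abs(t_stat) > T_CRITICAL_5DF
reject_by_f = f_stat > F_CRITICAL_10_5
```

The $t$ statistic depends on which single difference is placed in the numerator, whereas the $F$ statistic pools all ten differences, which is the motivation usually given for the combined test.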

5. Conclusion and Future Work

This work presents an HDNN-based model to detect electricity theft in the smart grid. For this, the dataset is taken from SGCC, which provides real EC data gathered using intelligent antenna-based smart meters installed at the consumers' end. The proposed model works in several steps. In the preprocessing step, the raw data is normalized, and the outliers and missing values are handled. The preprocessing is done by the local average method and the min-max normalization technique. Then, the feature engineering step is performed using CNN. Once the most relevant and normalized data is obtained, the classification process is done using the PSO-GRU integrated CNN. In this step, the consumers are classified as normal or fraudulent. The proposed model is validated in terms of several performance metrics, namely accuracy, recall, precision, AUC, and F1-score. Moreover, a comparison of the proposed and existing hybrid models is carried out. The models include CNN-GRU, CNN-LSTM, and CNN-GRU-GA.

The comparison results show the efficiency, accuracy, robustness, and generalization of the proposed hybrid model in handling the imbalanced class issue in ETD. Overall, the proposed method is a suitable solution for efficient ETD. However, it incurs a somewhat higher computational cost because the modules of the proposed model are integrated in a sequential manner (CNN-GRU-PSO). First, CNN takes time while capturing potential features from the high-dimensional EC data. Second, GRU processes the feature maps extracted by CNN for the final classification. Meanwhile, PSO tunes the hyperparameters of both the CNN and GRU models. This workflow makes the execution time of the proposed model somewhat higher than that of the existing methods. Moreover, the proposed method may fall short of accurately identifying electricity thieves in some complex real-world scenarios because of the addition of simulated theft data (in minority class using SMOTE). In the future, more robust techniques will be utilized to efficiently handle the overfitting issue.

Data Availability

The datasets used in this study are openly available in the henryRD-lab/ElectricityTheftDetection repository [23] (see Section III-A1).

Conflicts of Interest

The authors declare no conflict of interest.