Histogram of missing values present in SGCC dataset.

Source publication
Article
Full-text available
This study presents a novel feature-engineered–natural gradient descent ensemble-boosting (NGBoost) machine-learning framework for detecting fraud in power consumption data. The proposed framework was sequentially executed in three stages: data pre-processing, feature engineering, and model evaluation. It utilized the random forest algorithm-based...

Contexts in source publication

Context 1
... the labeled dataset was explored, the next task for developing an efficient supervised ML classification framework was to accurately impute the missing entries in the acquired data. Figure 5 shows the histogram for missing values present in the accumulated dataset. ...
Context 2
... can be observed in Figure 5 that 60.11% of the total consumers each contained fewer than 200 missing (NaN) entries, while this number ranged between 300 and 600 for 16.89% of consumers and between 695 and 705 for 23% of consumers. Since it was quite difficult to accurately impute such a large number of missing entries, only those consumers with fewer than 200 NaN entries were shortlisted for further processing. ...
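The shortlisting step described above (counting each consumer's NaN entries and keeping only those below the 200-entry threshold) can be sketched in pandas. This is an illustrative reconstruction on synthetic data, not the paper's code; the consumer counts and missingness levels here are made up to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical consumption matrix: rows = consumers, columns = daily kWh readings.
# The 200-entry threshold follows the paper; the data itself is synthetic.
rng = np.random.default_rng(0)
readings = pd.DataFrame(rng.random((5, 1035)))

# Inject a different amount of missingness into each consumer's record.
for row, n_missing in zip(readings.index, [50, 150, 400, 700, 10]):
    cols = rng.choice(readings.columns, size=n_missing, replace=False)
    readings.loc[row, cols] = np.nan

# Count NaN entries per consumer and keep only those below the threshold.
nan_counts = readings.isna().sum(axis=1)
shortlisted = readings[nan_counts < 200]
print(shortlisted.shape[0])  # consumers surviving the <200-NaN filter
```

In this toy example three of the five consumers fall below the threshold; on the SGCC dataset the same filter retains the 60.11% of consumers described above.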
Context 3
... 6 depicts the kWh consumption patterns of four random consumers before and after the proposed imputation process. It can be observed in Figure 5 that 60.11% of the total consumers each contained fewer than 200 missing (NaN) entries, while this number ranged between 300 and 600 for 16.89% of consumers and between 695 and 705 for 23% of consumers. Since it was quite difficult to accurately impute such a large number of missing entries, only those consumers with fewer than 200 NaN entries were shortlisted for further processing. ...

Similar publications

Article
Full-text available
Financial statement fraud is a global problem for investors, audit firms, regulators, and other stakeholders. Fraud detection can be regarded as a binary classification problem with a false negative being more expensive than a false positive. Although existing studies have made great efforts to detect fraud using various data‐mining techniques, the...

Citations

... Hussain et al. have also introduced a sophisticated machine-learning framework dubbed NGBoost for NTL detection in [27]. A time-series feature-capturing tool known as Time Series Feature Extraction Library and the whale optimisation technique were used to produce statistical, temporal, and spectral attributes. ...
Article
Full-text available
Numerous strategies have been proposed for the detection and prevention of non‐technical electricity losses due to fraudulent activities. Among these, machine learning algorithms and data‐driven techniques have gained prominence over traditional methodologies due to their superior performance, leading to a trend of increasing adoption in recent years. A novel two‐step process is presented for detecting fraudulent Non‐technical losses (NTLs) in smart grids. The first step involves transforming the time‐series data with additional extracted features derived from the publicly available State Grid Corporation of China (SGCC) dataset. The features are extracted after identifying abrupt changes in electricity consumption patterns using the sum of finite differences, the Auto‐Regressive Integrated Moving Average model, and the Holt‐Winters model. Following this, five distinct classification models are used to train and evaluate a fraud detection model using the SGCC dataset. The evaluation results indicate that the most effective model among the five is the Gradient Boosting Machine. This two‐step approach enables the classification models to surpass previously reported high‐performing methods in terms of accuracy, F1‐score, and other relevant metrics for non‐technical loss detection.
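The abstract above mentions detecting abrupt changes in consumption patterns via the sum of finite differences. A minimal sketch of that idea is a rolling sum of absolute first differences, which spikes where the series jumps; the window length and the synthetic theft-like level drop below are assumptions for illustration, not the paper's exact feature definition.

```python
import numpy as np

def abrupt_change_score(series, window=7):
    """Score abrupt changes via a rolling sum of absolute finite
    differences (a simplified reading of the 'sum of finite differences'
    feature; the 7-day window is an assumption)."""
    diffs = np.abs(np.diff(series))
    kernel = np.ones(window)
    return np.convolve(diffs, kernel, mode="valid")

# Synthetic daily consumption with a sudden drop, a common theft signature.
rng = np.random.default_rng(1)
usage = np.concatenate([10 + rng.normal(0, 0.2, 50),
                        2 + rng.normal(0, 0.2, 50)])
scores = abrupt_change_score(usage)
print(int(np.argmax(scores)))  # windows overlapping the level shift score highest
```

Windows that contain the day-49-to-50 level shift dominate the score, so the feature localizes the abrupt change even against noisy baseline consumption.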
... The model is then optimized using the scoring rule S(P_θ, y) with a maximum likelihood estimation function that yields calibrated uncertainty and point predictions [34]. The probability distribution function P_θ(y|x) of each predicted outcome ranges from 0 to 1. As the value of the function increases, the probability of predicting the data accurately also increases. ...
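The scoring-rule optimization described in this context can be illustrated with the log score (negative log-likelihood), the proper scoring rule NGBoost uses by default; minimizing it over predicted probabilities is equivalent to maximum likelihood estimation. This is a minimal binary-outcome sketch, not the paper's implementation.

```python
import math

def log_score(p, y):
    """Negative log-likelihood scoring rule S(P, y) for a predicted
    positive-class probability p and an observed label y in {0, 1}.
    Lower is better; minimizing it performs maximum likelihood estimation."""
    p_y = p if y == 1 else 1.0 - p
    return -math.log(p_y)

# A confident correct prediction earns a low (good) score...
print(log_score(0.95, 1))
# ...while the same confidence on a wrong prediction is heavily penalized,
# which is what makes the resulting probabilities calibrated.
print(log_score(0.95, 0))
```

Because the penalty for confident misclassification grows without bound, a model trained on this rule cannot cheaply inflate its confidence, which is the source of the calibrated uncertainty mentioned above.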
Article
Dry reforming of methane (DRM) is a promising technology for syngas production from CH₄ and CO₂. However, discovering feasible and efficient catalysts remains challenging despite recent advancements in machine learning. Herein, we present a novel probabilistic prediction-based, high-throughput screening methodology that demonstrates outstanding performance, with a coefficient of determination (R²) of 0.936 and root-mean-square error (RMSE) of 6.66. Additionally, experimental validation was performed using 20 distinct catalysts to ensure the accurate verification of the model, 17 of which were previously unreported combinations. Our model accurately predicts CH₄ conversion rates and probability values by considering catalyst design, pretreatment, and operating variables, providing reliable insights into catalyst performance. The proposed probabilistic prediction-based screening methodology, which we introduce for the first time in the field of catalysis, holds significant potential for accelerating the discovery of catalysts for DRM reactions and expanding their application scope in other crucial industrial processes. Thus, the methodology effectively addresses a key challenge in the development of active catalysts for energy and environmental research.
... Improved results are obtained for XGBoost, with an FPR value of 0.005. A similar machine learning (ML) ensemble technique is applied by Zhongzong and He in [13]. Their work uses the Irish dataset, with preprocessing steps applied beforehand. ...
Article
Full-text available
... Ransomware uses the operating system's own cryptographic libraries to encrypt the files on a victim's device [1]. It can target several environments and platforms, including cloud-based systems, the Internet of Things, wireless sensor networks, power grid SCADA (Supervisory Control and Data Acquisition) systems, and intelligent transportation systems [2][3][4][5][6]. Although the nature of the ransomware infection process is similar to other malware categories, the use of cryptographic means makes the effect of an attack irreversible if the decryption key is not available [7]. ...
Article
Full-text available
Ransomware is a type of malware that employs encryption to target user files, rendering them inaccessible without a decryption key. To combat ransomware, researchers have developed early detection models that seek to identify threats before encryption takes place, often by monitoring the initial calls to cryptographic APIs. However, because encryption is a standard computational activity involved in processes, such as packing, unpacking, and polymorphism, the presence of cryptographic APIs does not necessarily indicate an imminent ransomware attack. Hence, relying solely on cryptographic APIs is insufficient for accurately determining a ransomware pre-encryption boundary. To this end, this paper is devoted to addressing this issue by proposing a Temporal Data Correlation method that associates cryptographic APIs with the I/O Request Packets (IRPs) based on the timestamp for pre-encryption boundary delineation. The process extracts the various features from the pre-encryption dataset for use in early detection model training. Several machine and deep learning classifiers are used to evaluate the accuracy of the proposed solution. Preliminary results show that this newly proposed approach can achieve higher detection accuracy compared to those reported elsewhere.
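The Temporal Data Correlation method described above associates cryptographic API calls with I/O Request Packets by timestamp. A minimal sketch of timestamp-based correlation follows; the `(timestamp, name)` event format, the IRP major-function names, and the 0.5-second gap are assumptions for illustration, not the paper's exact parameters.

```python
from bisect import bisect_right

def correlate_by_timestamp(crypto_calls, irps, max_gap=0.5):
    """Associate each cryptographic API call with the nearest IRP at or
    before it, provided the gap is within `max_gap` seconds. Events are
    (timestamp, name) tuples sorted by timestamp (an assumed format)."""
    irp_times = [t for t, _ in irps]
    pairs = []
    for t, api in crypto_calls:
        i = bisect_right(irp_times, t) - 1   # nearest IRP at or before t
        if i >= 0 and t - irp_times[i] <= max_gap:
            pairs.append((api, irps[i][1]))
    return pairs

irps = [(0.10, "IRP_MJ_CREATE"), (0.90, "IRP_MJ_READ"), (2.00, "IRP_MJ_WRITE")]
calls = [(1.05, "CryptEncrypt"), (5.00, "CryptGenKey")]
print(correlate_by_timestamp(calls, irps))
# The 1.05 s call pairs with IRP_MJ_READ; the 5.00 s call has no IRP within the gap.
```

Correlating a cryptographic call with file-system activity in this way is what lets the method distinguish encryption aimed at user files from benign cryptographic activity such as unpacking.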
... It removes the mean and scales each variable to unit variance. The gradient boosting decision trees (GBDT), including XGBoost [53,54], LightGBM [55], CatBoost [56], NGBoost [57], and RF [58], can handle missing values natively, while the SVM cannot handle missing values [59]. The GBDT algorithms handle the missing values when evaluating the ML model. ...
Article
Global greenhouse gas emissions from the construction concrete industry are 50% higher than those from all other industries combined. Concrete incorporating waste and recycled materials could help lessen the negative effects of environmental problems. Agricultural waste is increasingly being used to substitute cement in environmentally friendly concrete production. Rice husk ash (RHA) is a workable alternative that merits further investigation. Since evaluating the properties of concrete containing RHA requires extensive and time-consuming experimentation, machine learning (ML) can accurately predict its properties. Consequently, this study aims to anticipate and develop an empirical formula for RHA concrete's compressive strength (CS) using ML algorithms. This study employs several ML methods such as random forest, support vector machine, light gradient boosting machine (LightGBM), extreme gradient boosting (XGBoost), and SHAP. A total of 192 data points are used in this study to assess the CS of RHA-blended concrete. The input parameters are age, amount of cement, rice husk ash, superplasticizer, water, and aggregates. Across all ML models, the XGBoost method is used to build a highly accurate predictive model. Predicting RHA concrete's CS using the XGBoost model is consistently accurate, with an R² of 0.99 during training and 0.94 during testing. Model characteristics and complex correlations are explained using the SHAP algorithm. The proposed model's prediction outcomes are compared to prior research, and the best ML algorithm is selected.
... However, the system is difficult to operate. In addition, the volume of data collected by power metering devices is huge and the data diverse, which creates challenges for electricity inspection [8][9][10]. To solve this problem, it is argued here that introducing data analysis and data mining into the application of electricity metering techniques, together with intelligent analysis and monitoring of users' electricity consumption data, can reduce operational difficulties and improve detection accuracy. ...
Article
Full-text available
Electricity inspection is important to support sustainable development and is core to the marketing of electric power. In addition, it contributes to the effective management of power companies and to their financial performance. Continuous improvement in the penetration rate of new energy generation can improve environmental standards and promote sustainable development, but creates challenges for electricity inspection. Traditional electricity inspection methods are time-consuming and quite inefficient, which hinders the sustainable development of power firms. In this paper, a load-forecasting model based on an improved moth-flame-algorithm-optimized extreme learning machine (IMFO-ELM) is proposed for use in electricity inspection. A chaotic map and improved linear decreasing weight are introduced to improve the convergence ability of the traditional moth-flame algorithm to obtain optimal parameters for the ELM. Abnormal data points are screened out to determine the causes of abnormal occurrences by analyzing the model prediction results and the user’s actual power consumption. The results show that, compared with existing PSO-ELM and MFO-ELM models, the root mean square error of the proposed model is reduced by at least 1.92% under the same conditions, which supports application of the IMFO-ELM model in electricity inspection. The proposed power-load-forecasting-based abnormal data detection method can improve the efficiency of electricity inspection, enhance user experience, contribute to the intelligence level of power firms and promote their sustainable development.
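The abstract above screens abnormal data points by comparing the model's load forecast against the user's actual consumption. A generic sketch of that screening step is a residual threshold: flag readings whose forecast error is far outside the typical error distribution. The three-sigma rule and the synthetic data below are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np

def flag_abnormal(actual, forecast, k=3.0):
    """Flag consumption readings whose absolute residual (actual minus
    forecast) exceeds the mean residual by more than k standard
    deviations. A generic residual-threshold sketch of the screening
    idea, not the IMFO-ELM paper's exact rule."""
    residual = np.abs(np.asarray(actual) - np.asarray(forecast))
    threshold = residual.mean() + k * residual.std()
    return np.flatnonzero(residual > threshold)

# Synthetic example: a flat 20 kWh forecast, noisy actual readings,
# and one injected abnormal reading at day 37.
forecast = np.full(100, 20.0)
actual = forecast + np.random.default_rng(2).normal(0, 0.5, 100)
actual[37] = 5.0
print(flag_abnormal(actual, forecast))  # → [37]
```

Flagged indices are then candidates for inspection; analyzing why a reading deviates from the forecast (meter fault, tampering, genuine demand change) is the manual step the paper aims to make more efficient.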
... Their use of data class balancing in the first stage has added more credibility to their results, but the average 10-fold precision and recall were 93% and 92%. In the same series of studies, Hussain S. et al., in [26], presented a natural gradient descent ensemble-boosting (NGBoost) machine-learning framework for NTLs. The most relevant statistical, temporal, and spectral domain-based features were extracted using the time-series feature-extraction library (TSFEL) and whale optimization algorithm. ...
... From Table 6 and Figure 8, the study's proposed method stands out compared to the state-of-the-art data-based methods considering the six performance measures combined. Its results are uniformly high: the reported Accuracy, Recall, Specificity, Precision, F1-score, and AUC of the proposed method are higher than those of the other ensemble methods, namely, FA-XGBoost [21], DERUSBOOST [30], CatBoost [25], NGBoost [26], Decision tree [28], AdaBoost [28], CNN-GRU-PSO [31], CNN [32], WADCNN [1], BSVM [22], ANN [28], and Deep ANN [28]. The FA-XGBoost [21] method achieved the second-best performance after the proposed method when comparing the accuracy, recall, F1-score, and AUC metrics. ...
... The FA-XGBoost [21] method achieved the second-best performance after the proposed method when comparing the accuracy, recall, F1-score, and AUC value metrics. As for the CatBoost [25], NGBoost [26], and BSVM [22] methods that were used in very recent research works, they all showed similar performances with an accuracy ranging from 0.93 to 0.94, recall ranging from 0.91 to 0.92, and precision ranging from 0.95 to 0.96. Thus, all three methods reflect a slightly lower performance when detecting fraud compared to the proposed method. ...
Article
Full-text available
Several approaches have been proposed to detect malicious manipulation by electricity fraudsters. Among the most significant are machine learning algorithms and data-based methods, which have shown advantages over traditional methods and have become predominant in recent years. In this study, a novel method is introduced to detect fraudulent NTL losses in smart grids via a two-stage detection process. In the first stage, the time-series readings are enriched with a new set of features extracted from the detection of sudden jump patterns in electricity consumption and from the Auto-Regressive Integrated Moving Average (ARIMA) model. In the second stage, the distributed random forest (DRF) generates the learned model. The proposed model is applied to the public SGCC dataset, and the approach achieved 98% accuracy and F1-score. Such results outperform the other recently reported state-of-the-art methods for NTL detection applied to the same SGCC dataset.
... However, it is important to note how the authors dealt with the records' continuity, because accurate forecasting relies on continuous time-series records. Furthermore, Hussain et al. [78] reported that the large number of missing data entries made it challenging to impute the electric power consumption data accurately. Only the 60.11% of total consumers with fewer than 200 null entries were considered for MVI, whereas the remaining 39.89% of customer records were removed from the experiment. ...
Article
Full-text available
Missing values are highly undesirable in real-world datasets and should be estimated and treated during the preprocessing stage. With the expansion of nature-inspired metaheuristic techniques, interest in missing value imputation (MVI) has increased. The main goal of this review is to identify and examine the existing research on MVI in terms of nature-inspired metaheuristic approaches, dataset designs, missingness mechanisms, and missing rates, as well as the most used evaluation metrics, between 2011 and 2021. This study ultimately gives insight into how an MVI plan can be incorporated into experimental design. Using the systematic literature review (SLR) guidelines designed by Kitchenham, this study utilizes renowned scientific databases to retrieve and analyze all relevant articles during the search process. A total of 48 related articles from 2011 to 2021 were selected to assess the review questions. This review indicated that the synthetic missing dataset is the most popular baseline test dataset for evaluating the effectiveness of an imputation strategy. The study revealed that missing at random (MAR) is the most commonly proposed missing mechanism in the datasets. This review also indicated that hybridizations of metaheuristics with clustering or neural networks are popular among researchers; the superior performance of the hybrid approaches is significantly attributed to the power of optimized learning in MVI models. In addition, perspectives, challenges, and opportunities in MVI are also addressed. The outcome of this review serves as a toolkit for researchers developing effective MVI models.