Fig 5 - available from: Journal of Big Data
Random Forest initial implementation

Source publication
Article
Full-text available
Abstract In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning and a...

Contexts in source publication

Context 1
... and it returns the final RF model as output. Moreover, the final RF model is an ensemble of DTs, and thus the RF learning process is also based upon the iterative split/partition process (as mentioned in the "Recursion as iteration" section), with two modifications to handle both Bagging and Random Feature Selection (as shown in Fig. ...
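
The two modifications map directly onto standard Random Forest hyperparameters. As a minimal sketch (using scikit-learn rather than the paper's ECL/HPCC implementation), bootstrap=True enables Bagging and max_features="sqrt" enables per-split Random Feature Selection; the dataset here is synthetic and purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data; the paper works on HPCC Systems datasets instead.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # bootstrap=True      -> Bagging: each tree is grown on a sample drawn with replacement
    # max_features="sqrt" -> Random Feature Selection: each split considers a random feature subset
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0).fit(X, y)
    print(rf.score(X, y))
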
Context 2
... bootstrap of the training data within the RF learning process is performed at the beginning of the process just before growing the ensemble of DTs, as shown in Fig. 5. For each tree in the ensemble, one new dataset is generated through sampling with replacement from the original dataset. Each new sampled dataset must have the same size as the original, be properly identified by group_id, be assigned to a root node identified by node_id = group_id, and finally be stored in the Training Data into Root ...
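
The bootstrap step described above can be sketched outside ECL as well. The following hypothetical pandas snippet mirrors the description: one resampled copy of the training data per tree, each the same size as the original, tagged with a group_id and a root node_id equal to group_id, and concatenated into a single dataset analogous to the "Training Data into Root" structure of Fig. 5 (function and variable names are illustrative, not taken from the paper's code):

    import pandas as pd

    def bootstrap_ensemble(training_df: pd.DataFrame, num_trees: int) -> pd.DataFrame:
        """One bootstrap sample per tree, each the same size as the original data."""
        samples = []
        for group_id in range(1, num_trees + 1):
            boot = training_df.sample(n=len(training_df), replace=True, random_state=group_id)
            boot = boot.assign(group_id=group_id, node_id=group_id)  # root node of tree group_id
            samples.append(boot)
        return pd.concat(samples, ignore_index=True)  # analogous to "Training Data into Root"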

Similar publications

Article
Full-text available
Our society is generating an increasing amount of data at an unprecedented scale, variety, and speed. This also applies to numerous research areas, such as genomics, high energy physics, and astronomy, for which large-scale data processing has become crucial. However, there is still a gap between the traditional scientific computing ecosystem and b...

Citations

... Previously, a single prediction strategy has not consistently delivered a worthwhile gain in precision across all cybercrime areas, because performance often depends on the conditions under which it is used. Various classification schemes such as K-Means [8] and optimization strategies [9][10][11][12] such as Ant Colony Optimization (ACO) [13], Particle Swarm Optimization (PSO), and Ant Lion Optimization (ALO) have been proposed and applied to large volumes of cybercrime data to classify attacks into groups. While some classification strategies achieve superior results on a particular dataset [14,15], the performance of other classification schemes may only be comparable on additional datasets. ...
Article
Full-text available
Internet utilization has been developing quickly, especially in recent decades. However, as the internet becomes an important part of daily life, cybercrime is also on the rise. The cost of cybercrime is estimated to reach approximately five lakh crore dollars per year by 2022, according to cyber security reports from 2021. Cyber attackers exploit internet resources as a principal route into a victim's system, generating financial, promotional, and other gains by exploiting vulnerabilities in devices. Computing cybercrime threats and providing security procedures for physical systems using earlier methodological techniques and examinations have repeatedly failed to control cybercrime threats. The previous literature in the field of cybercrime threats suffers from an absence of evaluation schemes for estimating cybercrime, mainly on unstructured information. Hence, an Improved Grey-wolf Optimization based Classification Model (IGO-CM) is developed with the help of a chaos system and information entropy, utilizing machine learning schemes for cybercrime data analysis, to compute the rate of cybercrime by classifying the cybercrime data. The security examinations, in combination with data analysis methodologies, serve to examine and classify unstructured crime data taken from India. IGO-CM is implemented in MATLAB 2021a on unstructured cybercrime data, and the outcomes show the superior performance of IGO-CM in terms of accuracy, F-Measure, standard deviation, purity index, intra-cluster distance, root mean square error, and time complexity against the popular K-Means classification scheme and optimization schemes such as ACO, ALO, PSO, and GO.
... While the implementation of IT software or tools can help manage incoming tickets, there are still bottlenecks to be fixed (Qamili et al., 2018). The correct implementation of algorithms like Random Forest in high-performance computing platforms can also further enhance these processes (Herrera et al., 2019). Moreover, the potential of AI-enabled service chains, which requires the alignment of Service Level Agreements (SLAs) with AI systems, signifies the next frontier in optimizing operational efficiency (Engel et al., 2022). ...
Article
Full-text available
This research project aims to improve IT support efficiency at Indonesian company XYZ by integrating AI-based IT support ticket classification. The method involved collecting over 1,000 support tickets from the company's IT ticketing system, GLPI, and pre-processing the data to ensure its quality and relevance for analysis. The ticket data is enriched with relevant features, including textual information and categorical attributes such as urgency, impact, and required expertise. To improve the ticket preference matrix, AI-based language models, especially OpenAI's GPT-3, are used. These models help to reclassify tickets and improve the work of IT support teams. In addition, the ticket data is used to train a Random Forest classifier, allowing automatic classification of tickets based on their specific characteristics. The performance of the ticket classification system is evaluated using a variety of metrics, and the results are compared with alternative methods to assess the effectiveness of the Random Forest algorithm. This evaluation demonstrates the system's ability to correctly classify and prioritize incoming tickets. The successful implementation of this project at Company XYZ is a model for other organizations looking to optimize their IT support through AI-driven approaches. By providing streamlined ticket classification and reclassification based on AI algorithms, this research helps leverage AI technologies to improve IT support processes. Ultimately, the proposed solution benefits both support providers and users by improving efficiency, response times, and overall customer satisfaction.
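
A ticket-classification pipeline of the kind described, combining ticket text with categorical attributes before a Random Forest classifier, could look roughly like the following scikit-learn sketch. The column names and the CSV export are hypothetical, not the GLPI schema used in the study:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical ticket export with columns: description, urgency, impact, category.
    tickets = pd.read_csv("tickets.csv")

    features = ColumnTransformer([
        ("text", TfidfVectorizer(max_features=5000), "description"),
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["urgency", "impact"]),
    ])
    model = Pipeline([("features", features),
                      ("rf", RandomForestClassifier(n_estimators=200, random_state=0))])

    X_train, X_test, y_train, y_test = train_test_split(
        tickets.drop(columns="category"), tickets["category"], test_size=0.2, random_state=0)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
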
... During the training of a decision tree model, the predictor variables or features are used to recursively partition the data into smaller and smaller subsets until a final decision is made at a leaf node. At the root node, the model splits the data into two subsets based on the sender of the email, placing emails from known spammers in one subset and emails from non-spammers in the other [35]. At each subsequent node, the model further splits the data based on the value of the subject line of the email, identifying subsets of emails with suspicious or benign subject lines. ...
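
The splits described above can be pictured as nested conditions. The hand-written toy sketch below only illustrates how a path of tests on the sender and subject line ends at a leaf; a real decision tree learns these splits from data, and the spammer list and labels here are invented for illustration:

    # Hypothetical known-spammer list; a learned tree would derive such splits from data.
    KNOWN_SPAMMERS = {"offers@deals.example"}

    def classify_email(email: dict) -> str:
        """Follow one root-to-leaf path: test the sender, then the subject line."""
        if email["sender"] in KNOWN_SPAMMERS:          # root-node split on the sender feature
            if "winner" in email["subject"].lower():   # child-node split on the subject line
                return "spam"                          # leaf node
            return "suspicious"                        # leaf node
        return "ham"                                   # leaf node

    print(classify_email({"sender": "offers@deals.example", "subject": "You are a WINNER"}))
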
Article
Full-text available
Spam emails pose a substantial cybersecurity danger, necessitating accurate classification to reduce unwanted messages and mitigate risks. This study focuses on enhancing spam email classification accuracy using stacking ensemble machine learning techniques. We trained and tested five classifiers: logistic regression, decision tree, K-nearest neighbors (KNN), Gaussian naive Bayes and AdaBoost. To address overfitting, two distinct datasets of spam emails were aggregated and balanced. Evaluating individual classifiers based on recall, precision and F1 score metrics revealed AdaBoost as the top performer. Considering evolving spam technology and new message types challenging traditional approaches, we propose a stacking method. By combining predictions from multiple base models, the stacking method aims to improve classification accuracy. The results demonstrate superior performance of the stacking method with the highest accuracy (98.8%), recall (98.8%) and F1 score (98.9%) among tested methods. Additional experiments validated our approach by varying dataset sizes and testing different classifier combinations. Our study presents an innovative combination of classifiers that significantly improves accuracy, contributing to the growing body of research on stacking techniques. Moreover, we compare classifier performances using a unique combination of two datasets, highlighting the potential of ensemble techniques, specifically stacking, in enhancing spam email classification accuracy. The implications extend beyond spam classification systems, offering insights applicable to other classification tasks. Continued research on emerging spam techniques is vital to ensure long-term effectiveness.
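
A stacking ensemble along these lines, combining the five base learners named in the abstract under a meta-learner, can be sketched with scikit-learn. The synthetic data and the logistic-regression meta-learner are assumptions for illustration; the abstract does not specify the meta-learner:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the aggregated, balanced spam-email feature matrix.
    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    stack = StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0)),
                    ("knn", KNeighborsClassifier()),
                    ("gnb", GaussianNB()),
                    ("ada", AdaBoostClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000))  # assumed meta-learner
    print(stack.fit(X_tr, y_tr).score(X_te, y_te))
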
... Herrera et al. [19] published a related work that explores the impact of maximum tree depth using a classifier that we also use. This is another study in which Random Forest is employed to classify so-called Big Data. ...
Article
Full-text available
We present findings from experiments in Medicare fraud detection that are the result of research on two new, publicly available datasets. In this research, we employ popular, open-source Machine Learning algorithms to identify fraudulent healthcare providers in Medicare insurance claims data. As far as we know, we are the first to publish a study that includes datasets compiled from the latest Medicare Part B and Medicare Part D data. The datasets became available in 2021, and are the largest such datasets that we know of. We report details on two important findings. The first finding is that increased maximum tree depth is associated with the best performance in terms of area under the receiver-operating characteristic curve (AUC) for both datasets. The second finding, which is an important counterbalance to the first finding, is that one may utilize random undersampling (RUS) to reduce the size of the training data and simultaneously achieve similar or better AUC scores. To the best of our knowledge, our study is novel in reporting the importance of maximum tree depth for classifying imbalanced Big Data. Moreover, this work is unique in demonstrating that one may employ RUS to mitigate the increased resource consumption of higher maximum tree depth.
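
The combination of random undersampling and varying maximum tree depth can be sketched as follows, using imbalanced-learn's RandomUnderSampler and a Random Forest with different max_depth settings. The synthetic, heavily imbalanced data stands in for the Medicare claims features, and the specific settings are assumptions, not those of the study:

    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data (about 1% positives) standing in for the claims data.
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    # RUS shrinks the majority class in the training split only; the test split stays intact.
    X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

    for depth in (4, 8, None):  # None = unlimited depth; the study examines increasing depth
        rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
        rf.fit(X_rus, y_rus)
        print(depth, round(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]), 3))
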
... BUDIHARTO W has been ranked seventh; his top works include Prediction and analysis of Indonesia Presidential election from Twitter using sentiment analysis [49], Data science approach to stock prices forecasting in Indonesia during Covid-19 using Long Short-Term Memory (LSTM) [50], and GNSS-based navigation systems of autonomous drone for delivering items [51]. DOUZI K has been ranked eighth with 6 articles; his top works include GNSS-based navigation systems of autonomous drone for delivering items [52], An LSTM and GRU based trading strategy adapted to the Moroccan market [53], and IDS-attention: an efficient algorithm for intrusion detection systems using attention mechanism [54]. FURHT B has been ranked ninth, also with 6 articles in total; his top works include Deep Learning applications for COVID-19 [55], Text Data Augmentation for Deep Learning [56], and Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform [57]. VILLANUSTRE F has been ranked tenth with 6 articles, including Deep learning applications and challenges in big data analytics [58], Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform [59], and Large-scale distributed L-BFGS [60]. ...
Preprint
Full-text available
The Journal of Big Data has been a leading international journal in Decision Sciences and Computer Sciences. The goal of this research is to look into the Journal of Big Data's bibliometric attributes through scientific activity based on journal article citation links. Using a bibliometric approach, this study examines all of the journal's publications since its inception. The goal is to provide a comprehensive overview of the major factors influencing the journal. This analysis covers key issues such as the journal's publication and citation structure, the most cited articles, and the journal's leading authors, institutions, and countries. A network analysis was performed to look into author keywords across Journal of Big Data publications. The software utilized to create this analysis of present patterns and potential future developments was Scopus, RStudio, and VOSviewer. A bibliometric study was utilized to examine articles and review papers, and only works published in the Scopus database between 2014 and 2021 were considered. The bibliometric information on each manuscript was acquired, and the findings were clearly explained. The initial findings included 632 articles.
... From Table 2, the performance of each machine learning model was evaluated to determine the baseline performance before hyperparameter tuning and without using the feature selection technique. According to [11], the random forest algorithm had a clear advantage since the dataset size is large. The performance of the random forest was the highest among the models because of its outstanding features, such as its variable importance measure, OOB error estimation, proximity between features, and handling of imbalanced datasets. ...
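
Two of the features credited to the random forest above, the out-of-bag (OOB) error estimate and the variable importance measure, are exposed directly by common implementations. A brief scikit-learn sketch on synthetic data (not the dataset used in the chapter):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

    # oob_score=True estimates generalization accuracy from the out-of-bag samples,
    # and feature_importances_ holds the variable-importance measure.
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    print("OOB score:", rf.oob_score_)
    print("Most important features:", rf.feature_importances_.argsort()[::-1][:5])
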
Chapter
Full-text available
Malware is one of the major issues in cybersecurity and among computer users. It has caused severe losses to businesses, organizations, and people. Therefore, this research aims to detect malware using the convolutional neural network (CNN) algorithm. Chi-Square has been used to select twenty important features as input for the machine learning models. In addition, random and grid searches have been used to optimize the performance of the CNN and determine the best hyperparameters for the algorithm. The experiments show that the CNN has outperformed the Neural Network's performance in terms of accuracy and precision. Moreover, the CNN achieved good accuracy and precision with random search; thus, we conclude that the randomized search algorithm produces a good prediction result with CNN for large datasets. Keywords: Malware detection, Machine learning, Hyperparameter tuning, Convolutional neural network
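
The feature-selection and tuning steps mentioned above can be sketched as chi-square selection of the top twenty features followed by a randomized hyperparameter search. The snippet below uses scikit-learn on synthetic data and searches over a Random Forest merely to keep the example self-contained; the chapter itself tunes a CNN:

    import numpy as np
    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=1000, n_features=60, random_state=0)
    X = np.abs(X)  # chi2 scoring requires non-negative feature values

    X20 = SelectKBest(chi2, k=20).fit_transform(X, y)  # keep the 20 most relevant features

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": randint(50, 300), "max_depth": randint(3, 20)},
        n_iter=10, cv=3, random_state=0).fit(X20, y)
    print(search.best_params_)
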
... The random forest is a supervised learning algorithm often considered one of the best off-the-shelf machine learning algorithms for classification and regression. It is robust to outliers and known to be free from overfitting problems (Herrera et al., 2019). ...
... Five MLAs were used in the current study, including random forest (RF), multivariate discriminant analysis (MDA), and the generalized linear model (GLM). ... Random Forest (RF) is an ensemble classification technique (Breiman, 2001). It involves several steps: training datasets, bootstrapping, an ensemble of trees, and aggregation (the classification phase) (Herrera et al., 2019; Sarker et al., 2019). The RF algorithm tends to outperform most other classification approaches in terms of accuracy, with no problems of overfitting (Pedregosa et al., 2011). ...
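
Of the steps listed for RF (training datasets, bootstrapping, an ensemble of trees, and aggregation), the bootstrapping step is sketched earlier on this page; the aggregation (classification) phase reduces to a per-record majority vote over the trees. A minimal sketch, assuming trees is any collection of fitted classifiers whose predict method returns integer class labels:

    import numpy as np

    def rf_classify(trees, X):
        """Aggregation phase: every fitted tree votes and the majority class wins."""
        votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
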
Article
The current study aimed at producing a multi-hazard susceptibility map for the Hasher-Fayfa Basin. The basin is part of the Jazan region in southwestern Saudi Arabia and is characterized by mountainous terrain. Recently, this area has experienced many extreme natural processes, which become natural hazard events when they intersect with human activities (urban areas and infrastructures). In this work, the probabilities of the three main hazards (landslides, floods, and gully erosion) are mapped using machine learning algorithms such as boosted regression tree (BRT), generalized linear model (GLM), flexible discriminant analysis (FDA), random forest (RF), and multivariate discriminant analysis (MDA). Several factors from various sources, including topographic, geologic, meteorological, hydrologic, and human activities, were incorporated to produce the final multi-hazard susceptibility model. The area under the curve (AUC) was used to determine the best predictive model for each type of natural hazard. AUC values between 80 and 90% indicated that the model was very good, and values above 90% indicated that the model had excellent predictive capability. Based on the accuracy evaluation, the FDA model was found to be the most accurate for landslide prediction, with an AUC value of 92.7% (excellent performance). The RF model was found to be the most accurate in predicting floods and erosion, with AUC values of 97.2% (excellent) and 83.3% (very good performance), respectively. Finally, a map of multi-hazard susceptibility was created by coupling the three hazards mentioned. The results showed that 33.5% of the total area is safe (no-hazard), while 66.5% is characterized by at least a single hazard or a combination of two or three hazards. Machine learning approaches are useful tools as a basis for management and mitigation processes, based on multi-hazard modeling.
... According to Kumanov and others, HPC has various functions, such as job management, where HPC submits, terminates, resumes, and modifies jobs depending on the current condition; data management, where HPC uploads and downloads files, copies, moves, renames files, makes archives, previews files, and monitors and manages files; and multitenancy, where HPC helps to isolate, aggregate, and manage customers and accounts [21]. On the other hand, Herrera and co-workers identified that HPC supports multiple schedulers, multiple clusters, multiple directory services, and multi-tenancy [22]. Therefore, HPC helps to schedule jobs and batches, operate different HPC clusters through a portal, identify users via directory services, and more. ...
Article
Recent technologies are focusing on high-performance computing (HPC) to speed up complex calculations and market procedures in industry. Therefore, analysis of big data and cloud computing is necessary to implement HPC in practice. Against this background, the research aimed to identify the advantages and disadvantages of cloud computing as well as HPC. The research used an SEM approach to justify the procedure for implementing HPC effectively. In this research, SPSS software was used to conduct multiple regression analysis. A significance value of p < 0.05 is considered 'significant', indicating a strong relationship between the dependent and independent variables. The independent variables selected are HPC in industry and year, and the dependent variable is 'Profit Gain'. Findings showed an insignificant relationship between the dependent and independent variables. According to the SPSS results, the selected variables are not significant and the regression does not fit the data as required.
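
The described analysis is a multiple regression with a p < 0.05 significance threshold. Outside SPSS, the same check can be sketched with statsmodels; the data frame below is invented purely to show how the coefficient p-values would be read, and does not reproduce the study's data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data for the variables named in the abstract.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"hpc_in_industry": rng.random(40),
                       "year": np.arange(1982, 2022),
                       "profit_gain": rng.random(40)})

    X = sm.add_constant(df[["hpc_in_industry", "year"]])
    model = sm.OLS(df["profit_gain"], X).fit()
    print(model.pvalues)  # coefficients with p < 0.05 would be read as significant
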
... including more observations from other satellite sensors). A well-organized implementation in a High Performance Computing environment can resolve the problem (Herrera et al., 2019). ...
Article
Full-text available
Preprocessing of Landsat images is a double-edged sword, transforming the raw data into a useful format but potentially introducing unwanted values with unnecessary steps. Through recovering missing data of satellite images in time series analysis, gap-filling is an important, highly developed preprocessing procedure, but its necessity and effects in numerous Landsat applications, such as tree canopy cover (TCC) modelling, are rarely examined. We address this barrier by providing a quantitative comparison of TCC modelling using predictor variables derived from Landsat time series that included gap-filling versus those that did not include gap-filling and evaluating the effects that gap-filling has on modelling TCC. With 1-year Landsat time series from a tropical region located in Taita Hills, Kenya, and a reference TCC map on a 0–100 scale derived from airborne laser scanning data, we designed comparable random forest modelling experiments to address the following questions: 1) Does gap-filling improve TCC modelling based on time series predictor variables including the seasonal composites (SC), spectral-temporal metrics (STMs), and harmonic regression (HR) coefficients? 2) What is the difference in TCC modelling between using gap-filled pixels and using valid (actual or cloud-free) pixels? Two gap-filling methods, one temporal-based method (Steffen spline interpolation) and one hybrid method (MOPSTM), have been examined. We show that gap-filled predictors derived from the Landsat time series delivered better performance on average than non-gap-filled predictors, with the average of median RMSE values for Steffen-filled and MOPSTM-filled SCs being 17.09 and 16.57 respectively, while for non-gap-filled predictors it was 17.21. MOPSTM-filled SC is 3.7% better than non-gap-filled SC on RMSE, and Steffen-filled SC is 0.7% better than non-gap-filled SC on RMSE. The positive effects of gap-filling may be reduced when there are sufficient high-quality valid observations to generate a seasonal composite. The single-date experiment suggests that gap-filled data (e.g. RMSE of 16.99, 17.71, 16.24, and 17.85 with 100% gap-filled pixels as training and test datasets for four seasons) may deliver no worse performance than valid data (e.g. RMSE of 15.46, 17.07, 16.31, and 18.14 with 100% valid pixels as training and test datasets for four seasons). Thus, we conclude that gap-filling has a positive effect on the accuracy of TCC modelling, which justifies its inclusion in image preprocessing workflows.
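
The experimental design, training the same random forest regressor on gap-filled versus non-gap-filled predictors and comparing RMSE, could be sketched as follows. The per-pixel table, the simple interpolation used as a stand-in for the Steffen-spline or MOPSTM gap-filling, and the column layout are all assumptions for illustration:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hypothetical per-pixel table: time-series predictor columns (with gaps) plus reference TCC (0-100).
    pixels = pd.read_csv("pixels.csv")
    predictors = pixels.drop(columns="tcc")

    # Simple interpolation across the time-ordered columns as a stand-in for the paper's gap-filling.
    filled = predictors.interpolate(axis=1, limit_direction="both")
    unfilled = predictors.fillna(predictors.mean())  # RF cannot take NaNs, so impute crudely

    for name, Xp in (("non-gap-filled", unfilled), ("gap-filled", filled)):
        X_tr, X_te, y_tr, y_te = train_test_split(Xp, pixels["tcc"], test_size=0.3, random_state=0)
        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        rmse = mean_squared_error(y_te, rf.predict(X_te)) ** 0.5
        print(name, round(rmse, 2))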