Fig 5 - available from: Journal of Big Data
Random Forest initial implementation

Source publication
Article
Full-text available
Abstract In this paper, we comprehensively explain how we built a novel implementation of the Random Forest algorithm on the High Performance Computing Cluster (HPCC) Systems Platform from LexisNexis. The algorithm was previously unavailable on that platform. Random Forest’s learning process is based on the principle of recursive partitioning and a...

Contexts in source publication

Context 1
... and it returns the final RF model as output. Moreover, the final RF model is an ensemble of DTs, and thus the RF learning process is also based upon the iterative split/partition process (as mentioned in the "Recursion as iteration" section), with two modifications to handle both Bagging and Random Feature Selection (as shown in Fig. ...
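
The two modifications map directly onto standard Random Forest hyperparameters. As a minimal sketch (using scikit-learn rather than the paper's ECL/HPCC implementation), bootstrap=True enables Bagging and max_features="sqrt" enables per-split Random Feature Selection; the dataset here is synthetic and purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-in data; the paper works on HPCC Systems datasets instead.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # bootstrap=True      -> Bagging: each tree is grown on a sample drawn with replacement
    # max_features="sqrt" -> Random Feature Selection: each split considers a random feature subset
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                max_features="sqrt", random_state=0).fit(X, y)
    print(rf.score(X, y))
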
Context 2
... bootstrap of the training data within the RF learning process is performed at the beginning of the process just before growing the ensemble of DTs, as shown in Fig. 5. For each tree in the ensemble, one new dataset is generated through sampling with replacement from the original dataset. Each new sampled dataset must have the same size as the original, be properly identified by group_id, be assigned to a root node identified by node_id = group_id, and finally be stored in the Training Data into Root ...
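
The bootstrap step described above can be sketched outside ECL as well. The following hypothetical pandas snippet mirrors the description: one resampled copy of the training data per tree, each the same size as the original, tagged with a group_id and a root node_id equal to group_id, and concatenated into a single dataset analogous to the "Training Data into Root" structure of Fig. 5 (function and variable names are illustrative, not taken from the paper's code):

    import pandas as pd

    def bootstrap_ensemble(training_df: pd.DataFrame, num_trees: int) -> pd.DataFrame:
        """One bootstrap sample per tree, each the same size as the original data."""
        samples = []
        for group_id in range(1, num_trees + 1):
            boot = training_df.sample(n=len(training_df), replace=True, random_state=group_id)
            boot = boot.assign(group_id=group_id, node_id=group_id)  # root node of tree group_id
            samples.append(boot)
        return pd.concat(samples, ignore_index=True)  # analogous to "Training Data into Root"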

Similar publications

Article
Full-text available
Our society is generating an increasing amount of data at an unprecedented scale, variety, and speed. This also applies to numerous research areas, such as genomics, high energy physics, and astronomy, for which large-scale data processing has become crucial. However, there is still a gap between the traditional scientific computing ecosystem and b...

Citations

... Previously, a single prediction strategy has not consistently delivered a worthwhile gain in precision across all cybercrime areas, because performance often depends on the conditions under which it is used. Various classification schemes such as K-Means [8] and optimization strategies [9][10][11][12] such as Ant Colony Optimization (ACO) [13], Particle Swarm Optimization (PSO), and Ant Lion Optimization (ALO) have been proposed and applied to large volumes of cybercrime data to classify attacks into groups. While some classification strategies achieve superior results on a particular dataset [14,15], the performance of other classification schemes may only be comparable on additional datasets. ...
Article
Full-text available
Internet utilization has been developing quickly, especially in recent decades. However, as the internet becomes an important part of daily life, cybercrime is also on the rise. The cost of cybercrime is estimated to reach approximately five lakh crore dollars per year by 2022, according to cyber security reports from 2021. Cyber attackers exploit internet resources as a principal route into a victim's system, generating financial, promotional, and other gains by exploiting vulnerabilities in devices. Computing cybercrime threats and providing security procedures for physical systems using earlier methodological techniques and examinations have repeatedly failed to control cybercrime threats. The previous literature in the field of cybercrime threats suffers from an absence of evaluation schemes for estimating cybercrime, mainly on unstructured information. Hence, an Improved Grey-wolf Optimization based Classification Model (IGO-CM) is developed with the help of a chaos system and information entropy, utilizing machine learning schemes for cybercrime data analysis, to compute the rate of cybercrime by classifying the cybercrime data. The security examinations, in combination with data analysis methodologies, serve to examine and classify unstructured crime data taken from India. IGO-CM is implemented in MATLAB 2021a on unstructured cybercrime data, and the outcomes show the superior performance of IGO-CM in terms of accuracy, F-Measure, standard deviation, purity index, intra-cluster distance, root mean square error, and time complexity against the popular K-Means classification scheme and optimization schemes such as ACO, ALO, PSO, and GO.
... While the implementation of IT software or tools can help manage incoming tickets, there are still bottlenecks to be fixed (Qamili et al., 2018). The correct implementation of algorithms like Random Forest in high-performance computing platforms can also further enhance these processes (Herrera et al., 2019). Moreover, the potential of AI-enabled service chains, which requires the alignment of Service Level Agreements (SLAs) with AI systems, signifies the next frontier in optimizing operational efficiency (Engel et al., 2022). ...
Article
Full-text available
This research project aims to improve IT support efficiency at Indonesian company XYZ by integrating AI-based IT support ticket classification. The method involved collecting over 1,000 support tickets from the company's IT ticketing system, GLPI, and pre-processing the data to ensure its quality and relevance for analysis. The ticket data is enriched with relevant features, including textual information and categorical attributes such as urgency, impact, and required expertise. To improve the ticket preference matrix, AI-based language models, especially OpenAI's GPT-3, are used. These models help to reclassify tickets and improve the work of IT support teams. In addition, the ticket data is used to train a Random Forest classifier, allowing automatic classification of tickets based on their specific characteristics. The performance of the ticket classification system is evaluated using a variety of metrics, and the results are compared with alternative methods to assess the effectiveness of the Random Forest algorithm. This evaluation demonstrates the system's ability to correctly classify and prioritize incoming tickets. The successful implementation of this project at Company XYZ is a model for other organizations looking to optimize their IT support through AI-driven approaches. By providing streamlined ticket classification and reclassification based on AI algorithms, this research helps leverage AI technologies to improve IT support processes. Ultimately, the proposed solution benefits both support providers and users by improving efficiency, response times, and overall customer satisfaction.
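
A ticket-classification pipeline of the kind described, combining ticket text with categorical attributes before a Random Forest classifier, could look roughly like the following scikit-learn sketch. The column names and the CSV export are hypothetical, not the GLPI schema used in the study:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder

    # Hypothetical ticket export with columns: description, urgency, impact, category.
    tickets = pd.read_csv("tickets.csv")

    features = ColumnTransformer([
        ("text", TfidfVectorizer(max_features=5000), "description"),
        ("cats", OneHotEncoder(handle_unknown="ignore"), ["urgency", "impact"]),
    ])
    model = Pipeline([("features", features),
                      ("rf", RandomForestClassifier(n_estimators=200, random_state=0))])

    X_train, X_test, y_train, y_test = train_test_split(
        tickets.drop(columns="category"), tickets["category"], test_size=0.2, random_state=0)
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))
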
... During the training of a decision tree model, the predictor variables or features are used to recursively partition the data into smaller and smaller subsets until a final decision is made at a leaf node. At the root node, the model splits the data into two subsets based on the sender of the email, placing emails from known spammers in one subset and emails from non-spammers in the other [35]. At each subsequent node, the model further splits the data based on the value of the subject line of the email, identifying subsets of emails with suspicious or benign subject lines. ...
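
The splits described above can be pictured as nested conditions. The hand-written toy sketch below only illustrates how a path of tests on the sender and subject line ends at a leaf; a real decision tree learns these splits from data, and the spammer list and labels here are invented for illustration:

    # Hypothetical known-spammer list; a learned tree would derive such splits from data.
    KNOWN_SPAMMERS = {"offers@deals.example"}

    def classify_email(email: dict) -> str:
        """Follow one root-to-leaf path: test the sender, then the subject line."""
        if email["sender"] in KNOWN_SPAMMERS:          # root-node split on the sender feature
            if "winner" in email["subject"].lower():   # child-node split on the subject line
                return "spam"                          # leaf node
            return "suspicious"                        # leaf node
        return "ham"                                   # leaf node

    print(classify_email({"sender": "offers@deals.example", "subject": "You are a WINNER"}))
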
Article
Full-text available
Spam emails pose a substantial cybersecurity danger, necessitating accurate classification to reduce unwanted messages and mitigate risks. This study focuses on enhancing spam email classification accuracy using stacking ensemble machine learning techniques. We trained and tested five classifiers: logistic regression, decision tree, K-nearest neighbors (KNN), Gaussian naive Bayes and AdaBoost. To address overfitting, two distinct datasets of spam emails were aggregated and balanced. Evaluating individual classifiers based on recall, precision and F1 score metrics revealed AdaBoost as the top performer. Considering evolving spam technology and new message types challenging traditional approaches, we propose a stacking method. By combining predictions from multiple base models, the stacking method aims to improve classification accuracy. The results demonstrate superior performance of the stacking method with the highest accuracy (98.8%), recall (98.8%) and F1 score (98.9%) among tested methods. Additional experiments validated our approach by varying dataset sizes and testing different classifier combinations. Our study presents an innovative combination of classifiers that significantly improves accuracy, contributing to the growing body of research on stacking techniques. Moreover, we compare classifier performances using a unique combination of two datasets, highlighting the potential of ensemble techniques, specifically stacking, in enhancing spam email classification accuracy. The implications extend beyond spam classification systems, offering insights applicable to other classification tasks. Continued research on emerging spam techniques is vital to ensure long-term effectiveness.
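
A stacking ensemble along these lines, combining the five base learners named in the abstract under a meta-learner, can be sketched with scikit-learn. The synthetic data and the logistic-regression meta-learner are assumptions for illustration; the abstract does not specify the meta-learner:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic stand-in for the aggregated, balanced spam-email feature matrix.
    X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    stack = StackingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("dt", DecisionTreeClassifier(random_state=0)),
                    ("knn", KNeighborsClassifier()),
                    ("gnb", GaussianNB()),
                    ("ada", AdaBoostClassifier(random_state=0))],
        final_estimator=LogisticRegression(max_iter=1000))  # assumed meta-learner
    print(stack.fit(X_tr, y_tr).score(X_te, y_te))
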
... Herrera et al. [19] published a related work that explores the impact of maximum tree depth using a classifier that we also use. This is another study in which Random Forest is employed to classify so-called Big Data. ...
Article
Full-text available
We present findings from experiments in Medicare fraud detection that are the result of research on two new, publicly available datasets. In this research, we employ popular, open-source Machine Learning algorithms to identify fraudulent healthcare providers in Medicare insurance claims data. As far as we know, we are the first to publish a study that includes datasets compiled from the latest Medicare Part B and Medicare Part D data. The datasets became available in 2021, and are the largest such datasets that we know of. We report details on two important findings. The first finding is that increased maximum tree depth is associated with the best performance in terms of area under the receiver-operating characteristic curve (AUC) for both datasets. The second finding, which is an important counterbalance to the first finding, is that one may utilize random undersampling (RUS) to reduce the size of the training data and simultaneously achieve similar or better AUC scores. To the best of our knowledge, our study is novel in reporting the importance of maximum tree depth for classifying imbalanced Big Data. Moreover, this work is unique in demonstrating that one may employ RUS to mitigate the increased resource consumption of higher maximum tree depth.
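
The combination of random undersampling and varying maximum tree depth can be sketched as follows, using imbalanced-learn's RandomUnderSampler and a Random Forest with different max_depth settings. The synthetic, heavily imbalanced data stands in for the Medicare claims features, and the specific settings are assumptions, not those of the study:

    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic imbalanced data (about 1% positives) standing in for the claims data.
    X, y = make_classification(n_samples=20000, n_features=20,
                               weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    # RUS shrinks the majority class in the training split only; the test split stays intact.
    X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X_tr, y_tr)

    for depth in (4, 8, None):  # None = unlimited depth; the study examines increasing depth
        rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
        rf.fit(X_rus, y_rus)
        print(depth, round(roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]), 3))
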
... BUDIHARTO W has been ranked seventh; his top works include Prediction and analysis of Indonesia Presidential election from Twitter using sentiment analysis [49], Data science approach to stock prices forecasting in Indonesia during Covid-19 using Long Short-Term Memory (LSTM) [50], and GNSS-based navigation systems of autonomous drone for delivering items [51]. DOUZI K has been ranked eighth with 6 articles; his top works include GNSS-based navigation systems of autonomous drone for delivering items [52], An LSTM and GRU based trading strategy adapted to the Moroccan market [53], and IDS-attention: an efficient algorithm for intrusion detection systems using attention mechanism [54]. FURHT B has been ranked ninth, also with 6 articles in total; his top works include Deep Learning applications for COVID-19 [55], Text Data Augmentation for Deep Learning [56], and Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform [57]. VILLANUSTRE F has been ranked tenth with 6 articles, including Deep learning applications and challenges in big data analytics [58], Random forest implementation and optimization for Big Data analytics on LexisNexis's high performance computing cluster platform [59], and Large-scale distributed L-BFGS [60]. ...
Preprint
Full-text available
The Journal of Big Data has been a leading international journal in Decision Sciences and Computer Sciences. The goal of this research is to look into the Journal of Big Data's bibliometric attributes through scientific activity based on journal article citation links. Using a bibliometric approach, this study examines all of the journal's publications since its inception. The goal is to provide a comprehensive overview of the major factors influencing the journal. This analysis covers key issues such as the journal's publication and citation structure, the most cited articles, and the journal's leading authors, institutions, and countries. A network analysis was performed to look into author keywords across Journal of Big Data publications. The software utilized to create this analysis of present patterns and potential future developments was Scopus, RStudio, and VOSviewer. A bibliometric study was utilized to examine articles and review papers, and only works published in the Scopus database between 2014 and 2021 were considered. The bibliometric information on each manuscript was acquired, and the findings were clearly explained. The initial findings included 632 articles.
... From Table 2, the performance of each machine learning model was evaluated to determine the baseline performance before hyperparameter tuning and without using the feature selection technique. According to [11], the random forest algorithm had a clear advantage since the dataset size is large. The performance of the random forest was the highest among the models because of its outstanding features, such as its variable importance measure, OOB error estimation, proximity between features, and handling of imbalanced datasets. ...
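
Two of the features credited to the random forest above, the out-of-bag (OOB) error estimate and the variable importance measure, are exposed directly by common implementations. A brief scikit-learn sketch on synthetic data (not the dataset used in the chapter):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=1000, n_features=15, random_state=0)

    # oob_score=True estimates generalization accuracy from the out-of-bag samples,
    # and feature_importances_ holds the variable-importance measure.
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    print("OOB score:", rf.oob_score_)
    print("Most important features:", rf.feature_importances_.argsort()[::-1][:5])
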
Chapter
Full-text available
Malware is one of the major issues in cybersecurity and among computer users. It has caused severe losses to businesses, organizations, and people. Therefore, this research aims to detect malware using the convolutional neural network (CNN) algorithm. Chi-Square has been used to select twenty important features as input for the machine learning models. In addition, random and grid searches have been used to optimize the performance of the CNN and determine the best hyperparameters for the algorithm. The experiments show that the CNN has outperformed the Neural Network's performance in terms of accuracy and precision. Moreover, the CNN achieved good accuracy and precision with random search; thus, we conclude that the randomized search algorithm produces a good prediction result with CNN for large datasets. Keywords: Malware detection, Machine learning, Hyperparameter tuning, Convolutional neural network
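
The feature-selection and tuning steps mentioned above can be sketched as chi-square selection of the top twenty features followed by a randomized hyperparameter search. The snippet below uses scikit-learn on synthetic data and searches over a Random Forest merely to keep the example self-contained; the chapter itself tunes a CNN:

    import numpy as np
    from scipy.stats import randint
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=1000, n_features=60, random_state=0)
    X = np.abs(X)  # chi2 scoring requires non-negative feature values

    X20 = SelectKBest(chi2, k=20).fit_transform(X, y)  # keep the 20 most relevant features

    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        {"n_estimators": randint(50, 300), "max_depth": randint(3, 20)},
        n_iter=10, cv=3, random_state=0).fit(X20, y)
    print(search.best_params_)
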
... The random forest is a supervised learning algorithm often considered one of the best off-the-shelf machine learning algorithms for classification and regression. It is robust to outliers and known to be free from overfitting problems (Herrera et al., 2019). ...
... Five MLAs were used in the current study, including random forest (RF), multivariate discriminant analysis (MDA), and the generalized linear model (GLM). ... Random Forest (RF) is an ensemble classification technique (Breiman, 2001). It involves several steps: training datasets, bootstrapping, an ensemble of trees, and aggregation (the classification phase) (Herrera et al., 2019; Sarker et al., 2019). The RF algorithm tends to outperform most other classification approaches in terms of accuracy, with no problems of overfitting (Pedregosa et al., 2011). ...
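
Of the steps listed for RF (training datasets, bootstrapping, an ensemble of trees, and aggregation), the bootstrapping step is sketched earlier on this page; the aggregation (classification) phase reduces to a per-record majority vote over the trees. A minimal sketch, assuming trees is any collection of fitted classifiers whose predict method returns integer class labels:

    import numpy as np

    def rf_classify(trees, X):
        """Aggregation phase: every fitted tree votes and the majority class wins."""
        votes = np.stack([tree.predict(X) for tree in trees])  # shape: (n_trees, n_samples)
        return np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, votes)
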
Article
The current study aimed at producing a multi-hazard susceptibility map for the Hasher-Fayfa Basin. The basin is part of the Jazan region in southwestern Saudi Arabia and is characterized by mountainous terrain. Recently, this area has experienced many extreme natural processes, which become natural hazard events when they intersect with human activities (urban areas and infrastructures). In this work, the probabilities of the three main hazards (landslides, floods, and gully erosion) are mapped using machine learning algorithms such as boosted regression tree (BRT), generalized linear model (GLM), flexible discriminant analysis (FDA), random forest (RF), and multivariate discriminant analysis (MDA). Several factors from various sources, including topographic, geologic, meteorological, hydrologic, and human activities, were incorporated to produce the final multi-hazard susceptibility model. The area under the curve (AUC) was used to determine the best predictive model for each type of natural hazard. AUC values between 80 and 90% indicated that the model was very good, and values above 90% indicated that the model had excellent predictive capability. Based on the accuracy evaluation, the FDA model was found to be the most accurate for landslide prediction, with an AUC value of 92.7% (excellent performance). The RF model was found to be the most accurate in predicting floods and erosion, with AUC values of 97.2% (excellent) and 83.3% (very good performance), respectively. Finally, a map of multi-hazard susceptibility was created by coupling the three hazards mentioned. The results showed that 33.5% of the total area is safe (no-hazard), while 66.5% is characterized by at least a single hazard or a combination of two or three hazards. Machine learning approaches are useful tools as a basis for management and mitigation processes, based on multi-hazard modeling.
... According to Kumanov and others, HPC has various functions, such as job management, where HPC submits, terminates, resumes, and modifies jobs depending on the current condition; data management, where HPC uploads and downloads files, copies, moves, renames files, makes archives, previews files, and monitors and manages files; and multitenancy, where HPC helps to isolate, aggregate, and manage customers and accounts [21]. On the other hand, Herrera and co-workers identified that HPC supports multiple schedulers, multiple clusters, multiple directory services, and multi-tenancy [22]. Therefore, HPC helps to schedule jobs and batches, operate different HPC clusters through a portal, identify users via directory services, and more. ...
Article
Recent technologies are focusing on high-performance computing (HPC) to speed up complex calculations and market procedures in industry. Therefore, analysis of big data and cloud computing is necessary to implement HPC in practice. Against this background, the research aimed to identify the advantages and disadvantages of cloud computing as well as HPC. The research used an SEM approach to justify the procedure for implementing HPC effectively. In this research, SPSS software was used to conduct multiple regression analysis. A significance value of p < 0.05 is considered 'significant', indicating a strong relationship between the dependent and independent variables. The independent variables selected are HPC in industry and year, and the dependent variable is 'Profit Gain'. Findings showed an insignificant relationship between the dependent and independent variables. According to the SPSS results, the selected variables are not significant and the regression does not fit the data as required.
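
The described analysis is a multiple regression with a p < 0.05 significance threshold. Outside SPSS, the same check can be sketched with statsmodels; the data frame below is invented purely to show how the coefficient p-values would be read, and does not reproduce the study's data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data for the variables named in the abstract.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({"hpc_in_industry": rng.random(40),
                       "year": np.arange(1982, 2022),
                       "profit_gain": rng.random(40)})

    X = sm.add_constant(df[["hpc_in_industry", "year"]])
    model = sm.OLS(df["profit_gain"], X).fit()
    print(model.pvalues)  # coefficients with p < 0.05 would be read as significant
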
... including more observations from other satellite sensors). A well-organized implementation in a High Performance Computing environment can resolve the problem (Herrera et al., 2019). ...
Article
Full-text available
Preprocessing of Landsat images is a double-edged sword, transforming the raw data into a useful format but potentially introducing unwanted values with unnecessary steps. Through recovering missing data of satellite images in time series analysis, gap-filling is an important, highly developed preprocessing procedure, but its necessity and effects in numerous Landsat applications, such as tree canopy cover (TCC) modelling, are rarely examined. We address this barrier by providing a quantitative comparison of TCC modelling using predictor variables derived from Landsat time series that included gap-filling versus those that did not include gap-filling and evaluating the effects that gap-filling has on modelling TCC. With 1-year Landsat time series from a tropical region located in Taita Hills, Kenya, and a reference TCC map on a 0–100 scale derived from airborne laser scanning data, we designed comparable random forest modelling experiments to address the following questions: 1) Does gap-filling improve TCC modelling based on time series predictor variables including the seasonal composites (SC), spectral-temporal metrics (STMs), and harmonic regression (HR) coefficients? 2) What is the difference in TCC modelling between using gap-filled pixels and using valid (actual or cloud-free) pixels? Two gap-filling methods, one temporal-based method (Steffen spline interpolation) and one hybrid method (MOPSTM), have been examined. We show that gap-filled predictors derived from the Landsat time series delivered better performance on average than non-gap-filled predictors, with the average of median RMSE values for Steffen-filled and MOPSTM-filled SCs being 17.09 and 16.57 respectively, while for non-gap-filled predictors it was 17.21. MOPSTM-filled SC is 3.7% better than non-gap-filled SC on RMSE, and Steffen-filled SC is 0.7% better than non-gap-filled SC on RMSE. The positive effects of gap-filling may be reduced when there are sufficient high-quality valid observations to generate a seasonal composite. The single-date experiment suggests that gap-filled data (e.g. RMSE of 16.99, 17.71, 16.24, and 17.85 with 100% gap-filled pixels as training and test datasets for four seasons) may deliver no worse performance than valid data (e.g. RMSE of 15.46, 17.07, 16.31, and 18.14 with 100% valid pixels as training and test datasets for four seasons). Thus, we conclude that gap-filling has a positive effect on the accuracy of TCC modelling, which justifies its inclusion in image preprocessing workflows.
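
The experimental design, training the same random forest regressor on gap-filled versus non-gap-filled predictors and comparing RMSE, could be sketched as follows. The per-pixel table, the simple interpolation used as a stand-in for the Steffen-spline or MOPSTM gap-filling, and the column layout are all assumptions for illustration:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hypothetical per-pixel table: time-series predictor columns (with gaps) plus reference TCC (0-100).
    pixels = pd.read_csv("pixels.csv")
    predictors = pixels.drop(columns="tcc")

    # Simple interpolation across the time-ordered columns as a stand-in for the paper's gap-filling.
    filled = predictors.interpolate(axis=1, limit_direction="both")
    unfilled = predictors.fillna(predictors.mean())  # RF cannot take NaNs, so impute crudely

    for name, Xp in (("non-gap-filled", unfilled), ("gap-filled", filled)):
        X_tr, X_te, y_tr, y_te = train_test_split(Xp, pixels["tcc"], test_size=0.3, random_state=0)
        rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
        rmse = mean_squared_error(y_te, rf.predict(X_te)) ** 0.5
        print(name, round(rmse, 2))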