
Assessing the effectiveness of data mining tools in classifying and predicting road traffic congestion

May 2024
Indonesian Journal of Electrical Engineering and Computer Science 34(2):1295-1303


DOI:10.11591/ijeecs.v34.i2.pp1295-1303

License
CC BY-NC-SA 4.0

Authors:

Areen Mohammad Arabiat

Al-Ahliyya Amman University

Muneera Altayeb

Al-Ahliyya Amman University

Traffic congestion is a significant issue in cities, impacting the environment, commuters, and the economy. Predicting congestion is crucial for efficient network operation, but high-quality data and computational techniques are challenging for scientists and engineers. The revolution of data mining and machine learning has enabled the development of effective prediction methods. Machine learning (ML) approaches have shown potential in predicting traffic congestion, with classification being a key area of study. Open-source software tools WEKA and Orange are used to predict and classify traffic congestion. However, there is no single best strategy for every situation. This study compared the effectiveness of both data mining tools for predicting congestion in one of the areas of the capital of the Hashemite Kingdom of Jordan, Amman, by testing several classifiers including support vector machine (SVM), K-nearest neighbors (KNN), logistic regression (LR), and random forest (RF) classifications. The results showed that the Orange mining tool was superior in predicting traffic congestion, with a prediction accuracy of 100% for Random forest, logistic regression, and 99.8% for KNN. On the other hand, results were better in WEKA for the SVM classifier with an accuracy of 99.7%.


Indonesian Journal of Electrical Engineering and Computer Science

Vol. 34, No. 2, May 2024, pp. 1295~1303

ISSN: 2502-4752, DOI: 10.11591/ijeecs.v34.i2.pp1295-1303  1295

Journal homepage: http://ijeecs.iaescore.com

Assessing the effectiveness of data mining tools in classifying

and predicting road traffic congestion

Areen Arabiat, Muneera Altayeb

Department of Communications and Computer Engineering, Faculty of Engineering, Al-Ahliyya Amman University,

Amman, Jordan

Article history:

Received Jan 29, 2024

Revised Feb 9, 2024

Accepted Feb 16, 2024

Keywords:

K-nearest neighbor

Logistic regression

Machine learning

Orange data mining tool

Random forest

Support vector machine

WEKA data mining tool

This is an open access article under the CC BY-SA license.

Corresponding Author:

Areen Arabiat

Department of Communications and Computer Engineering, Faculty of Engineering

Al-Ahliyya Amman University

Al-Saro, Al-Salt, Amman, Jordan

Email: a.arabiat@ammanu.edu.jo

1. INTRODUCTION

Traffic congestion is one of the most important problems that residents of capital cities around the

world suffer from. It can lead to increased stress, delayed delivery, fuel waste, and financial losses. From this

standpoint, studies that contribute to reducing this traffic phenomenon are extremely important [1]. Most

modern studies of congestion forecasting are based on analyzing peak traffic periods, where forecasting is

classified into three types according to traffic flow: short-term, medium-term, and long-term forecasting.

Short-term forecasts which last between five and fifteen minutes on average have a lot of random volatility,

high complexity, and poor data stability. Given the complexity of traffic conditions, reliable short-term forecasting is imperative for determining real-time information. On the other hand, medium- and long-term forecasts often extend to days, weeks, months, or years, and because of the long time horizon, the stability of the data is very high. Therefore, this type of forecasting is often used to estimate long-term traffic flow with high accuracy through time series that rely on past data and expected future data [2]. Recent advances in traffic congestion prediction have made it an important topic of study, particularly in AI and ML. The vast availability of data from navigation systems and fixed sensors has contributed to a significant expansion of this subject of study over the past several decades, as ML techniques can be used to analyze traffic data, recognize patterns, and derive traffic flow insights [3], [4]. Accurate forecasting contributes to

traffic flow volume control, traffic management, and optimization, as predicting traffic congestion using ML

is considered more accurate than traditional methods, which contributes significantly to improving traffic

flow, especially at peak times. However, to fully utilize the potential of machine learning in traffic

management, issues such as data reliability and model interpretability must be resolved [5], [6].

The researchers in [7] used monitoring data from IoT sensors embedded in smart cities to develop traffic control systems that operate autonomously and reduce traffic jams. To obtain high accuracy and low error rates, a neuro-fuzzy algorithm was used. In validation testing, the model outperformed previous techniques with an accuracy rate of 98%. Liu and Wu [8] presented an automated system to predict expected traffic using an RF classifier. The RF algorithm features great accessibility, high stability, and outstanding reliability. The proposed traffic forecast model is created using input data including weather, time of day, season, unusual road conditions, traffic conditions, and holidays. The results showed that the traffic prediction model created using the RF classification approach predicts effectively, has a modest generalization error, and has an accuracy rate of 87.5%. In another study, Li et al. [9] presented a model using an RF classification technique to predict the traffic congestion state. The random forest method has a reputation for being practical, flexible, and highly effective. Weather, time of day, unique road conditions, road quality, and holidays were used as model input factors to build road traffic forecasting models. The results show that the traffic prediction model developed using the random forest classification method predicts successfully, with an accuracy of 87.5% and a low generalization error. It is also more adept at predicting crowded scenarios due to its fast calculation speed.

On the other hand, researchers in [10] developed a long short-term memory (LSTM) network as a

means of predicting congestion propagation across road networks, where the model predicts congestion

propagation over 5 minutes in Buxton, UK, which is a congested city. The study used both univariate and

multivariate LSTM models, with the former relying entirely on the speed recorded over the past five minutes

while the latter takes traffic flow rate and vehicle progress into account. The accuracy of the models ranged

between 84-95% depending on the route configuration and these results revealed that both models may

produce adequate prediction of congestion propagation over short periods, with accuracy mostly determined

by the topology of the local road network. The researchers in [11] developed a model for accurate prediction of short-term traffic conditions in intelligent transportation systems (ITS) using machine learning classifiers including LD-SVM, decision forests, MLP, and CN2 rule induction. According to the results, decision forests outperformed the other methods with an average improvement of 0.982 and 0.975, respectively. This method addresses the problem of overfitting in existing modelling methodologies. Ratra and Gulia [12] presented a comparison of different techniques using empirical and parameter analysis to evaluate two open-source data mining tools, WEKA and Orange. The results reveal that WEKA outperforms Orange in terms of the qualities required for a fully functional and easy-to-use rating platform, and that WEKA is suitable for data mining classification challenges. Additionally, the study analysed precision and recall across datasets and found that Orange had a precision of 82.4% and a recall of 80.6%, whereas WEKA had greater precision (83.7%) and recall (83.7%). This comparison includes Naïve Bayes, random forest, and nearest neighbor classifiers. For the k-nearest neighbor algorithm, the precision value is larger, with WEKA having a precision of 75.3% and a recall of 75.2%.

This work presents a comparison between two distinct data mining tools, in which a large data set containing approximately 8,671 records was tested and the accuracy of the evaluations was approximately 100%. The experiment aims to determine the volume of traffic flow through one of the most traffic-congested districts in the Jordanian capital, Amman.

2. METHOD

By testing the efficiency of several classifiers, the model proposed in this work provides

a comparison between the WEKA and Orange data mining tools, to measure the effectiveness of both tools

for predicting traffic congestion in the Jordanian capital, Amman, as the Greater Amman Municipality

provided the traffic data that was used in the classification process. The system architecture for predicting traffic congestion is shown in Figure 1, where the basic steps are stated. The first and most important stage in developing a machine learning classifier is pre-processing, which improves the quality of the dataset and makes it ready to feed a machine learning model. To predict traffic congestion in this work, the classifiers RF, SVM, KNN, and LR were employed to find the optimum data mining tool and classifier for traffic congestion prediction. The results are evaluated using metrics derived from the confusion matrix, namely accuracy, sensitivity, precision, and F-measure.

Figure 1. System architecture for predicting traffic congestion

2.1. Dataset

The Greater Amman Municipality provided the data set for King Abdullah Street in Amman, in

2018. This data included time, date, traffic flow, capacity, number of lanes, road width, and traffic volume.

This data was collected with high accuracy using detectors and sensors that can count the number of passing

vehicles on a lane and calculate the traffic volume for each lane approach every hour of every day, every

month, for the whole year [13].

2.2. Data preprocessing

Data preprocessing is the process of removing or correcting erroneous, incomplete, or incorrect data from a dataset. Excel's Remove Duplicates tool was used to get rid of redundant records, and conditional formatting was then used to find and fix structural issues. To prepare the traffic dataset for building ML classifiers with the WEKA and Orange data mining tools, it was first saved as a CSV file.
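The preprocessing in this work was done in Excel. As a rough scripted equivalent, the following minimal sketch removes exact duplicate rows and serializes the result as CSV text. The column names and sample values are hypothetical and are not taken from the Greater Amman Municipality dataset.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping the first occurrence."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def to_csv_text(header, rows):
    """Serialize a header and rows as CSV text, one record per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical columns and records, for illustration only.
header = ["date", "hour", "traffic_volume", "lanes"]
rows = [
    ["2018-01-01", "08", "1200", "3"],
    ["2018-01-01", "08", "1200", "3"],  # exact duplicate of the previous row
    ["2018-01-01", "09", "1350", "3"],
]

clean_rows = drop_duplicate_rows(rows)
csv_text = to_csv_text(header, clean_rows)
print(csv_text)
```

Both WEKA and Orange can read CSV files, so text produced this way could be saved to disk and loaded by either tool.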

2.3. Classification

Data classification is done in two steps: i) training, sometimes called learning, and ii) testing, or evaluation, in which an instance's predicted class is compared to its actual class. If the hit rate is considered acceptable by the analyst, the classifier is considered capable of classifying future instances of unknown class. The data in this investigation are divided into two categories: normal and congested. The model is validated using 10-fold cross-validation, with 30% of the data used for testing and 70% of the data used for training, through the RF, SVM, KNN, and LR classifiers.

2.3.1. Random forest

The RF supervised learning algorithm can be used effectively with high-dimensional datasets. A forest is created by combining M different decision trees [14]. Random forests build decision trees on random subsets of the data, create a forecast from each tree, and determine the best solution through voting. This method also provides useful insight into feature importance [15]. As a composite classifier, it generates multiple decision trees and integrates them for efficient outcomes. The random forest model uses decision trees trained on random features but typically ignores the diverse contributions of trees in different test instances; the outputs that the individual trees provide are averaged to make predictions. Random forest is also reported to be highly adaptable [16], [17]. In classification problems, the twoing rule, deviance, and Gini index are the primary rules used to split data in a binary fashion. Of these rules, the most

often applied is the Gini index in (1), which quantifies the node impurity:

$$\mu = \sum_{a=1}^{A} p_a (1 - p_a) \quad (1)$$

where A is the number of target classes and $p_a$ is the sample fraction of class a. A node with a small value of μ is considered to be pure, meaning that it has good class separation and mostly comprises observations from a single class [18]. Figure 2 depicts the RF structure, where X = {X1, X2, X3, ..., Xn} is an example of an input data set, n is the number of data dimensions or predictive variables, and T1(X), T2(X), T3(X), ..., Tn(X) are the trees that form the RF model [18].
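To make the node impurity measure concrete, the following minimal sketch computes the Gini index of a node from its class labels, assuming the standard form in which the impurity is the sum over classes of p(1 - p). The function name and the sample labels are illustrative assumptions, not part of the WEKA or Orange implementations.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: sum over classes of p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return sum((c / n) * (1 - c / n) for c in counts.values())


# A node holding a single class is pure (impurity 0).
pure = gini_impurity(["normal", "normal", "normal", "normal"])

# A node split evenly between two classes has the maximum binary impurity (0.5).
mixed = gini_impurity(["normal", "normal", "congested", "congested"])

print(pure, mixed)
```

A tree-growing procedure would evaluate candidate splits and prefer the one whose child nodes have the lowest weighted impurity.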

Figure 2. Random forest (RF) structure [19]

2.3.2. Logistic regression

Logistic regression is the specific case of regression modeling in which the outcome is binary or dichotomous (yes/no). It is used for binary classification, typically with Gaussian-distributed numerical input variables: each input value is weighted by a coefficient, and the weighted sum is then transformed via a logistic function. This fast method works well for a variety of classification problems [20]. By contrast, linear regression, which assumes a straight-line relationship between the independent variables and the outcome, is a frequently used kind of regression analysis for continuous outcomes. It is useful for determining how one independent variable affects a continuous result, whereas multivariate linear regression is preferable for finding distinct contributions while concurrently assessing the effects of several factors [21]. The logistic regression model has a particular

form that is described in (2):

$$\frac{P_i}{1 - P_i} = e^{(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki})} \quad (2)$$

where (1 − Pi) is the probability that Y takes a value of 0, Pi is the probability that Y takes a value of 1, and e is the exponential constant [22].
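The relationship in (2) can be illustrated numerically. The sketch below assumes that (2) is read in its odds form, where the odds Pi / (1 − Pi) equal e raised to the linear predictor. The coefficient values and the single input feature are hypothetical and are not fitted to the study's data.

```python
import math


def linear_predictor(beta, x):
    """beta[0] is the intercept; beta[1:] are the coefficients for the inputs x."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def odds(beta, x):
    """Odds P / (1 - P) implied by the logistic model."""
    return math.exp(linear_predictor(beta, x))


def probability(beta, x):
    """Probability that Y = 1, recovered from the odds."""
    o = odds(beta, x)
    return o / (1.0 + o)


# Hypothetical coefficients: intercept and one weight for hourly traffic volume.
beta = [-1.0, 0.002]
x = [800.0]

p = probability(beta, x)
print(round(p, 4))
```

A classifier would then assign the positive class (for example, congested) when the probability exceeds a chosen threshold such as 0.5.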

2.3.3. Support vector machine

SVM is a machine learning technique grounded in statistical learning theory; its algorithm determines the best classification hyperplane by maximizing the margin, as illustrated in Figure 3 for a dataset with two features (x1 and x2) and two classes (0 and 1) [23], [24]. SVM classifies data points by finding a hyperplane in an N-dimensional space. Many different hyperplanes can separate any two classes of data points; the goal is to find the one with the largest margin. Maximizing the margin distance offers some reinforcement, so that future data points can be classified with more confidence. The main flaw of support vector machines is that they are natively limited to binary classification problems [25].
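The role of the hyperplane and the margin can be sketched for the two-feature case. The example below assumes an already-trained linear SVM in canonical form, so that the support vectors satisfy w·x + b = ±1 and the margin width is 2/||w||. The weight vector, bias, and query points are hypothetical, and no training is performed here.

```python
import math


def decision_value(w, b, x):
    """Signed score w·x + b; its sign tells which side of the hyperplane x lies on."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class 1 on the non-negative side of the hyperplane, class 0 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else 0


def margin_width(w):
    """Distance between the two margin boundaries of a canonical hyperplane: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane x1 + x2 - 3 = 0 in the (x1, x2) feature plane.
w = [1.0, 1.0]
b = -3.0

print(classify(w, b, (1.0, 1.0)))  # point below the line
print(classify(w, b, (3.0, 2.0)))  # point above the line
print(margin_width(w))
```

Training an SVM amounts to choosing w and b so that this width is as large as possible while the training points stay on the correct sides.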


Figure 3. Description of SVM [26]

2.3.4. K-nearest neighbor

KNN classification is a simple supervised machine learning technique that predicts the class of a sample from its K closest neighbors in the training set, where closeness is measured by the distance between data points. It does not require an explicit training phase, but it faces two primary obstacles: i) selecting the value of K and ii) neighbor search and selection, which encompasses the neighbor search itself and the distance calculation [27], [28].
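The neighbor search and voting steps can be sketched in a few lines. The example below uses Euclidean distance and a brute-force search over all training points. The normalized feature values and labels are hypothetical, and WEKA and Orange may use different distance measures or search structures by default.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of query by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training set with two normalized features per sample.
train_points = [
    (0.10, 0.20), (0.20, 0.10), (0.15, 0.15),
    (0.80, 0.90), (0.90, 0.80), (0.85, 0.95),
]
train_labels = ["normal", "normal", "normal", "congested", "congested", "congested"]

print(knn_predict(train_points, train_labels, (0.82, 0.88), k=3))
print(knn_predict(train_points, train_labels, (0.12, 0.18), k=3))
```

An odd value of K avoids tied votes in the two-class case, which is one reason the choice of K matters.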

2.4. Data mining tools

Data mining software tools are necessary for both the development and the application of data mining techniques. The process of selecting the best tool gets harder as more and more options become available [29]. Data mining, or knowledge mining, is the technique of finding important information in large amounts of data. It entails several steps to ensure that a huge amount of data is converted into meaningful information, including data transformation, cleansing, integration, pattern analysis, and presentation [30].

2.4.1. WEKA tool

WEKA (Waikato Environment for Knowledge Analysis) is a machine learning program developed at the University of Waikato in New Zealand. It is a Java-based tool that provides visualization tools and algorithms for predictive modeling and data analysis. It operates on all major computing platforms and supports data mining tasks such as clustering, classification, association, visualization, and feature selection. The program's user-friendly interface and straightforward settings make it accessible to inexperienced users [31], [32]. Precision, recall, accuracy, F-measure, MCC, the confusion matrix, and other statistics can be derived from a WEKA machine learning model to evaluate the result [33]. Figure 4 depicts the WEKA data mining tool's model.

Figure 4. WEKA data mining tool's model

2.4.2. Orange tool

Orange is a set of machine learning, data mining, and Python scripting tools developed for interactive data analysis and component-based construction of data mining methods [34]. The visual data mining program Orange was developed by the Bioinformatics Laboratory at the University of Ljubljana and is available as a free, non-commercial download. Although primarily designed for instructional purposes, Orange can be beneficial for data processing and experimental data analysis, offering a platform for experiment selection [12], [35]. Figure 5 depicts the Orange data mining tool's model.

Figure 5. Orange data mining tool's model

3. PERFORMANCE EVALUATION AND CLASSIFICATION

Evaluation metrics are essential for machine learning tasks such as regression and classification. Proper model evaluation using several measures ensures accurate assessment, can increase predictive power, and helps avoid bad predictions when the model is applied to unseen data. The objective of this procedure is to carry out a comparative analysis of the classifiers and choose the most accurate one according to the obtained results [36], [37].

3.1. Cross-validation

Because it is easy to use and has a broad range of applications, cross-validation is a common technique for model selection and parameter tuning in statistics and machine learning. Ensuring accurate assessment requires fitting and assessing each candidate model on different data sets. Conventional techniques such as V-fold and leave-one-out cross-validation often select overfitted models. According to early theoretical research, cross-validation requires the training-to-testing split ratio to approach zero in order to reliably choose the right model under low-dimensional linear models. Most statistical software programs nevertheless employ standard split ratios, such as four-to-one or nine-to-one, because smaller split ratios require larger training samples and lead to less precise model fitting [38]–[40]. In this research, 10-fold cross-validation is used.
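As an illustration of how 10-fold cross-validation partitions a dataset, the following minimal sketch generates the train and test index sets for each fold. It uses the approximate record count reported in the introduction (8,671) and splits the indices in order; in practice the records would usually be shuffled or stratified first, and the exact partitioning used internally by WEKA and Orange is not reproduced here.

```python
def k_fold_indices(n_samples, k=10):
    """Split indices 0..n_samples-1 into k folds; return (train, test) index lists per fold."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds


folds = k_fold_indices(8671, k=10)
for fold_number, (train_idx, test_idx) in enumerate(folds, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each record appears in exactly one test fold, so every fold trains on roughly nine tenths of the data and tests on the remaining tenth, which corresponds to the nine-to-one split ratio mentioned above.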

3.2. Confusion matrix

The confusion matrix for binary classification is written as a 2 × 2 matrix. Four measurements are defined for a confusion matrix: "true positive" (TP), "true negative" (TN), "false positive" (FP), and "false negative" (FN). The confusion matrix is also used to assess classifier performance on datasets in multiclass problems. In related work, for example, the components of Java applications were classified into faulty and non-faulty classes, and the four confusion matrix measures TP, FP, TN, and FN were used to differentiate between the actual and predicted values of the model [41]–[44]. Figure 6 shows the confusion matrix of binary classification [45].

Figure 6. Confusion matrix of binary classification [45]


3.3. Classifiers performance

In this study, datasets and classification algorithms are used to compare the WEKA and Orange data mining tools. The evaluation criteria are accuracy, sensitivity, precision, and F-measure. The accuracy measure evaluates performance by dividing the number of correctly classified instances by the total number of instances. Results are assessed utilizing datasets, tools, methods, separation, algorithms, and an overall total. All tests yielded a 100% classification accuracy for the research [46]. Table 1 lists the performance metrics of the classifiers.

Table 1. Classifier’s performance [47]

| Performance metric | Equation |
|---|---|
| Accuracy | (TP + TN)/(TP + FP + TN + FN) |
| Sensitivity | TP/(TP + FN) |
| Precision | TP/(TP + FP) |
| F-measure | 2(Sen × Pre)/(Sen + Pre) |
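The four metrics in Table 1 follow directly from the confusion matrix counts. The sketch below computes them for a hypothetical set of counts; the numbers are illustrative only and are not results from this study, and the function does not guard against empty classes (zero denominators).

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, precision, and F-measure from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * (sensitivity * precision) / (sensitivity + precision)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts for a binary normal/congested classifier.
metrics = classification_metrics(tp=90, tn=85, fp=10, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because the F-measure is the harmonic mean of sensitivity and precision, it always lies between the two and stays low whenever either of them is low.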

4. RESULTS AND DISCUSSION

The dataset was classified using a variety of techniques, including SVM, KNN, LR, and RF. Based on this analysis, the Orange tool provided superior results, with a classification accuracy (CA) of 100% for LR and RF; for KNN and SVM, the tool achieved CA values of 99.8% and 99.1%, respectively. On the other hand, the results using the WEKA tool were also satisfactory, as SVM obtained a CA of 99.7%, while KNN, LR, and RF obtained CA values of 98.7%, 97.6%, and 96.2%, respectively. According to these results, the Orange data mining tool is the most effective for predicting traffic congestion on this data, except for SVM, where WEKA is slightly more accurate. Table 2 lists the classifiers' performance using the WEKA and Orange tools, while Figure 7 shows a comparative analysis of the different classifiers' performance. The Orange3 model proposed in this work for predicting traffic congestion has also outperformed previous studies reported in the literature, such as the study presented by Liu and Wu [8], where the accuracy reached 87.5%, the study by Li et al. [9], where it was also 87.5%, and the work presented by Majumdar et al. [10], where the accuracy ranged between 84% and 95%.

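Table 2. Classifier's performance using WEKA vs. Orange tool

| Metric | RF (WEKA) | RF (Orange) | LR (WEKA) | LR (Orange) | SVM (WEKA) | SVM (Orange) | KNN (WEKA) | KNN (Orange) |
|---|---|---|---|---|---|---|---|---|
| Accuracy | 0.962 | 1.000 | 0.976 | 1.000 | 0.997 | 0.991 | 0.987 | 0.998 |
| Sensitivity | 0.954 | 1.000 | 0.975 | 1.000 | 0.995 | 0.991 | 0.985 | 0.998 |
| Precision | 0.970 | 1.000 | 0.977 | 1.000 | 0.998 | 0.991 | 0.990 | 0.998 |
| F-measure | 0.962 | 1.000 | 0.976 | 1.000 | 0.997 | 0.991 | 0.988 | 0.998 |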

Figure 7. Comparative analysis of different classifiers' performances

5. CONCLUSION

The purpose of this study was to determine which data mining tool could provide the most accurate prediction of traffic congestion. Comparative studies of the tools were conducted to see how successful different data mining techniques are and how different features affect traffic congestion prediction. The data was obtained from the Greater Amman Municipality and mainly contained the traffic volume for the study area in Amman, the capital of the Kingdom of Jordan, for the year 2018. Various classification methods were used on the dataset, including RF, LR, KNN, and SVM, and 10-fold cross-validation was used to evaluate the performance of the algorithms. Using the Orange tool, RF and LR achieved the highest accuracy of 100%, while KNN and SVM achieved 99.8% and 99.1%, respectively. In contrast, using the WEKA tool, SVM achieved the highest accuracy of 99.7%, while KNN, LR, and RF achieved 98.7%, 97.6%, and 96.2%, respectively. In the end, the results of this comparison indicate that the model built using the Orange3 data mining tool outperformed the model built using WEKA, as the accuracy of the former reached 100%.

REFERENCES

[1] T. S. Tamir et al., “Traffic Congestion Prediction using Decision Tree, Logistic Regression and Neural Networks,” IFAC-

PapersOnLine, vol. 53, no. 5, pp. 512–517, 2020, doi: 10.1016/j.ifacol.2021.04.138.

[2] W. Zhuang and Y. Cao, “Short-Term Traffic Flow Prediction Based on a K-Nearest Neighbor and Bidirectional Long Short-Term

Memory Model,” Applied Sciences (Switzerland), vol. 13, no. 4, 2023, doi: 10.3390/app13042681.

[3] Y. Xing, X. Ban, X. Liu, and Q. Shen, “Large-Scale Traffic Congestion Prediction Based on the Symmetric Extreme Learning

Machine Cluster Fast Learning Method,” Symmetry, vol. 11, no. 6, p. 730, May 2019, doi: 10.3390/sym11060730.

[4] T. Adetiloye and A. Awasthi, “Multimodal Big Data Fusion for Traffic Congestion Prediction,” in Multimodal Analytics for Next-

Generation Big Data Technologies and Applications, Cham: Springer International Publishing, 2019, pp. 319–335.

[5] M. W. Ei Leen, N. H. A. Jafry, N. M. Salleh, H. J. Hwang, and N. A. Jalil, “Mitigating Traffic Congestion in Smart and

Sustainable Cities Using Machine Learning: A Review,” Lecture Notes in Computer Science (including subseries Lecture Notes

in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13957 LNCS, pp. 321–331, 2023, doi: 10.1007/978-3-031-

36808-0_21.

[6] T. Saranya, S. Sridevi, C. Deisy, T. D. Chung, and M. K. A. A. Khan, “Performance Analysis of Machine Learning Algorithms in

Intrusion Detection System: A Review,” Procedia Computer Science, vol. 171, pp. 1251–1260, 2020, doi:

10.1016/j.procs.2020.04.133.

[7] S. M. Abdullah et al., " Optimizing Traffic Flow in Smart Cities: Soft GRU-Based Recurrent Neural Networks for Enhanced

Congestion Prediction Using Deep Learning," Sustainability, vol. 15, no. 7, p. 5949, doi: 10.3390/su15075949

[8] Y. Liu and H. Wu, “Prediction of road traffic congestion based on random forest,” in Proceedings - 2017 10th International

Symposium on Computational Intelligence and Design, ISCID 2017, 2018, vol. 2, pp. 361–364, doi: 10.1109/ISCID.2017.216.

[9] Z. Li, P. Liu, C. Xu, H. Duan, and W. Wang, “Reinforcement Learning-Based Variable Speed Limit Control Strategy to Reduce

Traffic Congestion at Freeway Recurrent Bottlenecks,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 11,

pp. 3204–3217, 2017, doi: 10.1109/TITS.2017.2687620.

[10] S. Majumdar, M. M. Subhani, B. Roullier, A. Anjum, and R. Zhu, “Congestion prediction for smart sustainable cities using IoT

and machine learning approaches,” Sustainable Cities and Society, vol. 64, 2021, doi: 10.1016/j.scs.2020.102500.

[11] M. Zahid, Y. Chen, A. Jamal, and M. Q. Memon, “Short term traffic state prediction via hyperparameter optimization based

classifiers,” Sensors (Switzerland), vol. 20, no. 3, 2020, doi: 10.3390/s20030685.

[12] R. Ratra and P. Gulia, “Experimental Evaluation of Open Source Data Mining Tools (WEKA and Orange),” International Journal

of Engineering Trends and Technology, vol. 68, no. 8, pp. 30–35, Aug. 2020, doi: 10.14445/22315381/IJETT-V68I8P206S.

[13] “Greater Amman Municipality.” Jordan, [Online]. Available: http://www.ammancity.gov.jo/en/gam/index.asp (Accessed Sep. 6,

2022).

[14] E. Scornet, G. Biau, and J. P. Vert, “Consistency of random forests,” Annals of Statistics, vol. 43, no. 4, pp. 1716–1741, 2015,

doi: 10.1214/15-AOS1321.

[15] N. Absar et al., “The Efficacy of Machine-Learning-Supported Smart System for Heart Disease Prediction,” Healthcare

(Switzerland), vol. 10, no. 6, 2022, doi: 10.3390/healthcare10061137.

[16] M. Z. Islam, J. Liu, J. Li, L. Liu, and W. Kang, “A semantics aware random forest for text classification,” in International

Conference on Information and Knowledge Management, Proceedings, 2019, pp. 1061–1070, doi: 10.1145/3357384.3357891.

[17] A. Chahal, P. Gulia, N. S. Gill, and J. M. Chatterjee, “Performance Analysis of an Optimized ANN Model to Predict the Stability

of Smart Grid,” Complexity, vol. 2022, 2022, doi: 10.1155/2022/7319010.

[18] F. B. de Santana, W. Borges Neto, and R. J. Poppi, “Random forest as one-class classifier and infrared spectroscopy for food

adulteration detection,” Food Chemistry, vol. 293, pp. 323–332, 2019, doi: 10.1016/j.foodchem.2019.04.073.

[19] H. B. Ly, T. A. Nguyen, and B. T. Pham, “Estimation of Soil Cohesion Using Machine Learning Method: A Random Forest

Approach,” Advances in Civil Engineering, vol. 2021, 2021, doi: 10.1155/2021/8873993.

[20] A. Das, “Logistic Regression,” in Encyclopedia of Quality of Life and Well-Being Research, Cham: Springer International

Publishing, 2021, pp. 1–2.

[21] J. C. Stoltzfus, “Logistic regression: A brief primer,” Academic Emergency Medicine, vol. 18, no. 10, pp. 1099–1104, 2011, doi:

10.1111/j.1553-2712.2011.01185.x.

[22] E. Y. Boateng and D. A. Abaye, “A Review of the Logistic Regression Model with Emphasis on Medical Research,” Journal of

Data Analysis and Information Processing, vol. 07, no. 04, pp. 190–207, 2019, doi: 10.4236/jdaip.2019.74012.

[23] X. Zhang, C. Li, X. Wang, and H. Wu, “A novel fault diagnosis procedure based on improved symplectic geometry mode

decomposition and optimized SVM,” Measurement: Journal of the International Measurement Confederation, vol. 173, 2021,

doi: 10.1016/j.measurement.2020.108644.

[24] B. M. Asl, S. K. Setarehdan, and M. Mohebbi, “Support vector machine-based arrhythmia classification using reduced features of

heart rate variability signal,” Artificial Intelligence in Medicine, vol. 44, no. 1, pp. 51–64, 2008, doi:

10.1016/j.artmed.2008.04.007.

[25] J. Zhou, M. Xiao, Y. Niu, and G. Ji, “Rolling Bearing Fault Diagnosis Based on WGWOA-VMD-SVM,” Sensors, vol. 22, no. 16,

2022, doi: 10.3390/s22166281.

[26] A. Rani, N. Kumar, J. Kumar, and N. K. Sinha, “Machine learning for soil moisture assessment,” in Deep Learning for Sustainable

Agriculture, 2022, pp. 143–168, doi: 10.1016/B978-0-323-85214-2.00001-X.

[27] S. Zhang and J. Li, “KNN Classification With One-Step Computation,” IEEE Transactions on Knowledge and Data Engineering,

vol. 35, no. 3, pp. 2711–2723, 2023, doi: 10.1109/TKDE.2021.3119140.

[28] J. Li, J. Zhang, J. Zhang, and S. Zhang, “Quantum KNN Classification With K Value Selection and Neighbor Selection,” IEEE

Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, doi: 10.1109/TCAD.2023.3345251.


[29] R. Mikut and M. Reischl, “Data mining tools,” WIREs Data Mining and Knowledge Discovery, vol. 1, no. 5, pp. 431–443, Sep.

2011, doi: 10.1002/widm.24.

[30] S. Verma and P. Rattan, “Introduction To Data Mining Tools and Techniques & Applications: A Review,” in Business, no. July,

2021, [Online]. Available: https://www.researchgate.net/profile/Rashmi-Gujrati-2/publication/355170587_Role_of_Technology_in_New_Decades/links/6163de531eb5da761e794894/Role-of-Technology-in-New-Decades.pdf#page=57.

[31] S. B. Aher and L. M. R. J. Lobo, “Data Mining in Educational System using WEKA,” 2011, [Online]. Available:

http://www.ijcaonline.org/icett/number3/icett021.pdf.

[32] E. G. Kulkarni and R. B. Kulkarni, “WEKA Powerful Tool in Data Mining,” International Journal of Computer Applications

National Seminar on Recent Trends in Data Mining, vol. 5, no. Rtdm, pp. 975–8887, 2016, [Online]. Available:

http://research.ijcaonline.org/rtdm2016/number2/rtdm2575.pdf.

[33] P. Debnath et al., “Analysis of Earthquake Forecasting in India Using Supervised Machine Learning Classifiers,” Sustainability,

vol. 13, no. 2, p. 971, Jan. 2021, doi: 10.3390/su13020971.

[34] J. Demšar et al., “Orange: Data mining toolbox in Python,” Journal of Machine Learning Research, vol. 14, pp. 2349–2353, 2013.

[35] Z. Dobesova, “Experiment in Finding Look-Alike European Cities Using Urban Atlas Data,” ISPRS International Journal of Geo-

Information, vol. 9, no. 6, p. 406, Jun. 2020, doi: 10.3390/ijgi9060406.

[36] Ž. Vujović, “Classification Model Evaluation Metrics,” International Journal of Advanced Computer Science and Applications,

vol. 12, no. 6, pp. 599–606, 2021, doi: 10.14569/IJACSA.2021.0120670.

[37] M. R. Mahmood, M. B. Abdulrazzaq, S. R. M. Zeebaree, A. K. Ibrahim, R. R. Zebari, and H. I. Dino, “Classification techniques’

performance evaluation for facial expression recognition,” Indonesian Journal of Electrical Engineering and Computer Science,

vol. 21, no. 2, pp. 1176–1184, 2020, doi: 10.11591/ijeecs.v21.i2.pp1176-1184.

[38] P. Zhang, “Model Selection Via Multifold Cross Validation,” The Annals of Statistics, vol. 21, no. 1, 1993, doi:

10.1214/aos/1176349027.

[39] J. Lei, “Cross-Validation With Confidence,” Journal of the American Statistical Association, vol. 115, no. 532, pp. 1978–1997,

2020, doi: 10.1080/01621459.2019.1672556.

[40] J. Shao, “Linear Model Selection by Cross-validation,” Journal of the American Statistical Association, vol. 88, no. 422, pp. 486–

494, Jun. 1993, doi: 10.1080/01621459.1993.10476299.

[41] R. Rajalakshmi and C. Aravindan, “A Naive Bayes approach for URL classification with supervised feature selection and

rejection framework,” Computational Intelligence, vol. 34, no. 1, pp. 363–396, 2018, doi: 10.1111/coin.12158.

[42] J. Font, L. Arcega, Ø. Haugen, and C. Cetina, “Achieving feature location in families of models through the use of search-based

software engineering,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 3, pp. 363–377, 2018, doi:

10.1109/TEVC.2017.2751100.

[43] Y. S. Taspinar, M. Koklu, and M. Altin, “Classification of flame extinction based on acoustic oscillations using artificial

intelligence methods,” Case Studies in Thermal Engineering, vol. 28, 2021, doi: 10.1016/j.csite.2021.101561.

[44] A. Feyzioglu and Y. S. Taspinar, “Beef Quality Classification with Reduced E-Nose Data Features According to Beef Cut

Types,” Sensors, vol. 23, no. 4, 2023, doi: 10.3390/s23042222.

[45] I. Markoulidakis, G. Kopsiaftis, I. Rallis, and I. Georgoulas, “Multi-Class Confusion Matrix Reduction method and its application

on Net Promoter Score classification problem,” ACM International Conference Proceeding Series, pp. 412–419, 2021, doi:

10.1145/3453892.3461323.

[46] R. Panigrahi et al., “Performance assessment of supervised classifiers for designing intrusion detection systems: A comprehensive

review and recommendations for future research,” Mathematics, vol. 9, no. 6, 2021, doi: 10.3390/math9060690.

[47] F. Sajid et al., “Secure and Efficient Data Storage Operations by Using Intelligent Classification Technique and RSA Algorithm

in IoT-Based Cloud Computing,” Scientific Programming, vol. 2022, 2022, doi: 10.1155/2022/2195646.

BIOGRAPHIES OF AUTHORS

Areen Arabiat obtained her B.Sc. in computer engineering in 2005 from Al-Balqa Applied University (BAU), and her M.Sc. in Intelligent Transportation Systems (ITS) from Al-Ahliyya Amman University (AAU) in 2022. She has been a computer lab supervisor at the Faculty of Engineering, Al-Ahliyya Amman University (AAU), since 2013. Her research interests include machine learning, data mining, artificial intelligence, and image processing. She can be contacted at email: a.arabiat@ammanu.edu.jo.

Muneera Altayeb obtained a bachelor’s degree in computer engineering in 2007 and a master’s degree in communications engineering from the University of Jordan in 2010. She has been working as a lecturer in the Department of Communications and Computer Engineering at Al-Ahliyya Amman University since 2015, and served as assistant dean of the Faculty of Engineering from 2020 to 2023. Her research interests include digital signal and image processing, machine learning, robotics, and artificial intelligence. She can be contacted at email: m.altayeb@ammanu.edu.jo.
