
Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria


Abstract

The proliferation of maliciously coded documents as file transfers increase has led to a rise in sophisticated attacks. Portable Document Format (PDF) files have emerged as a major attack vector for malware due to their adaptability and wide usage. Detecting malware in PDF files is challenging because they can include various harmful elements such as embedded scripts, exploits, and malicious URLs. This paper presents a comparative analysis of machine learning (ML) techniques, including Naive Bayes (NB), K-Nearest Neighbor (KNN), Average One Dependency Estimator (A1DE), Random Forest (RF), and Support Vector Machine (SVM), for PDF malware detection. The study utilizes a dataset obtained from the Canadian Institute for Cyber-security and employs two testing criteria, namely percentage splitting and 10-fold cross-validation. The performance of the techniques is evaluated using F1-score, precision, recall, and accuracy. The results indicate that KNN outperforms the other models, achieving an accuracy of 99.8599% using 10-fold cross-validation. The findings highlight the effectiveness of ML models in accurately detecting PDF malware and provide insights for developing robust systems to protect against malicious activities.
This work is licensed under a Creative Commons Attribution 4.0 International License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the
original work is properly cited.
Tech Science Press
DOI: 10.32604/jcs.2023.042501
ARTICLE
Bilal Khan1, Muhammad Arshad2 and Sarwar Shah Khan3,4,*
1 Department of Computer Science, City University of Science and Information Technology, Peshawar, Pakistan
2 Department of Computer Software Engineering, University of Engineering and Technology, Mardan, Pakistan
3 Department of Computer and Software Technology, University of Swat, Swat, Pakistan
4 Department of Computer Science, IQRA National University, Swat, Pakistan
*Corresponding Author: Sarwar Shah Khan. Email: sskhan0092@gmail.com
Received: 01 June 2023 Accepted: 03 August 2023 Published: 21 August 2023
KEYWORDS
Cyber-security; PDF malware; model training; testing
1 Introduction
Recent years have seen a sharp rise in sophisticated attacks that use maliciously coded documents
as file transfers increase. Most Internet users are aware that executable files attached to emails or
webpages can be dangerous. Documents, however, remain a useful tool for distributing malware
because users are less aware of the risk they pose. The major attack vector among detected malware
is the PDF, which is much more adaptable than other document formats. Malicious PDF files frequently
contain JavaScript or binary scripts that exploit security weaknesses to perform damaging
actions [1]. Countless PDF files are available online, and some are not as innocuous as one may think.
In reality, PDF files may contain a wide range of items, such as JavaScript or binary code. These
2JCS, 2023, vol.5
things can occasionally be dangerous. Because PDF files can include a variety of harmful material,
including embedded scripts, exploits, and malicious URLs, detecting malware in them can be
difficult. Malware may exploit a flaw in the reader software to try to infect a machine [2]. A large
number of vulnerabilities were discovered in Adobe Acrobat Reader in 2017. Every reader
has particular flaws, and a malicious PDF file may be able to exploit them [3]. Offices frequently
use the PDF file format because of its efficiency, reliability, and interactivity. The emergence of
more advanced, non-executable file-based attack technologies and techniques has made PDF security
more challenging, because malicious PDF files are among the most commonly exploited infection
vectors in hostile environments [4,5]. PDF malware detection is important for several reasons, including:
Protection against Malicious Content: PDF files are often utilized for document sharing and can
include a variety of embedded content types, including JavaScript, links, and multimedia components.
These characteristics can be used by malicious actors to embed malware into PDF files, potentially
making them a vehicle for virus delivery. Finding PDF malware helps users avoid unintentionally
accessing or running dangerous files [6].
Preventing Exploits: PDF files can be used to exploit vulnerabilities in PDF reader software and
other applications that work with PDF files. Malicious PDFs may include exploits that use security
flaws to access systems without authorization or to run malicious code. For computer systems and
networks to remain secure, these attacks must be detected and stopped [7].
Protecting Sensitive Information: PDF files are frequently used to store and distribute sensitive
information, such as financial information, intellectual property, or personal particulars. This sensitive
information may be stolen or leaked by malware that is included in PDF files, which might result in
monetary loss, data breaches, or identity theft. Protecting the security and integrity of sensitive data
is made easier by finding and eliminating malware from PDF files [8].
Attacks Using Social Engineering: To deceive users into opening infected PDF files, malicious
actors frequently use social engineering tactics. These files may carry enticing subject lines or
messages, or they may be presented as legitimate documents. Detecting PDF malware shields users from
these social engineering scams and guards against the potential loss of money, reputation, or
operational efficiency [9].
System Security Overall: Malware attacks can have serious effects on the safety and functionality
of computer systems. System crashes, data damage, unauthorized access, and the installation of new
malware are all possible consequences of malware. Maintaining the overall security and stability of
computer systems and networks involves finding and eliminating PDF malware [10].
The motivation for this research stems from the need to develop effective methods for protecting
against sophisticated attacks using PDF files. The authors highlight the importance of PDF malware
detection for several reasons. Firstly, detecting malware in PDF files helps protect users from
unintentionally accessing or running dangerous files, safeguarding them against potential harm. Secondly,
vulnerabilities in PDF reader software and other applications can be exploited through malicious PDF
files, making it crucial to identify and prevent such attacks. Thirdly, PDF files often contain sensitive
information that can be stolen or leaked by malware, leading to financial loss, data breaches, or identity
theft. Detecting and eliminating malware from PDF files helps protect the security and integrity of
sensitive data. Lastly, malicious actors often use social engineering tactics to trick users into opening
infected PDF files, and detecting PDF malware can mitigate the risks associated with such attacks.
With these considerations in mind, and as a result of the growth of ML technology in recent years,
researchers have proposed a variety of models to detect the numerous attacks connected to PDF files
[11,12]. This study presents an analysis of several ML models: Average One
Dependency Estimator (A1DE), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) [13],
Naive Bayes (NB), and Random Forest (RF) [14]. These models are compared on the basis of F1-score,
precision, recall, and accuracy. The primary objective of this study is to develop a malware detection
model capable of safeguarding systems against harmful actions caused by PDF malware.
The remaining sections of this study are organized as follows: the literature is reviewed
in Section 2, the methodology and results are covered in Sections 3 and 4, and the study is concluded in Section 5.
2 Literature Review
Numerous studies have addressed the identification of PDF malware using a range of ML and deep
learning (DL) models. Kang et al. described the use of the PDF in 2019 [15]. They gave
a thorough analysis of the JavaScript structure and content in PDFs with embedded XML.
They then built a variety of features, such as encoding methods for content and
variables like file size, category, versions, and content properties, as well as object names,
keywords, and JavaScript readable strings. Chen et al. described approaches to training robust PDF
malware classifiers using verifiable robustness properties in 2019 [16]. For instance, no matter how
many pages of innocuous content are inserted into the document, the classifier must still identify
PDF malware as malicious. They demonstrated how to accurately evaluate the worst-case behavior of a
malware classifier with respect to particular robustness properties.
In several studies, ML approaches have been used to develop classifiers for PDF malware. Two
earlier efforts that focused on the malicious JavaScript embedded in PDF malware
were Wepawet [17] and Laskov et al. [18].
Khitan et al. [19] proposed attributes based on the lexical features of JavaScript scripts as well
as functions, constants, objects, methods, and keywords. Zhang et al. [20] merged the JavaScript
object count, page count, and stream filtering data with the PDF structure, entity characteristics,
metadata information, and content statistics. Following the observation that malicious JavaScript
behaves differently from legitimate JavaScript code, Liu et al. [21] suggested a context-aware
approach. This approach involves utilizing the original JavaScript code as input to the “eval”
function to open the PDF file, while closely monitoring for any unusual behavior based on the given instructions.
According to Herrera-Silva et al. [22], cyberattacks using ransomware have increased over the
past ten years, causing great concern among organizations, so it is critical to develop novel and
enhanced techniques for detecting this type of malware. Their work employs machine learning and
dynamic analysis to identify constantly evolving ransomware signatures using a few dynamic features.
It can be used to identify current and even novel variants of the threat because the majority of
the characteristics are shared by a variety of ransomware-affected samples.
Dhalaria et al. present a hybrid method for detecting and classifying Android malware [23]. The
proposed method combines static and dynamic analysis techniques to effectively identify malicious
applications and classify them into different malware families. The authors train machine learning
models for malware detection and family classification using features taken from both the static and
dynamic behaviours of Android apps. Experimental results demonstrate the effectiveness of the hybrid
approach in accurately detecting and classifying Android malware, thereby contributing to the field
of mobile security and aiding in the prevention of malicious activities on Android devices.
Deore et al. presented a novel approach for detecting malware using a Faster Region Proposals
Convolutional Neural Network (FRCNN) [24]. The proposed MDFRCNN model aims to increase
the accuracy and efficiency of malware detection by effectively identifying and classifying malicious
regions within digital content. The authors conduct experiments and evaluate the performance of
their model using various datasets, demonstrating its effectiveness in detecting malware in real-world
scenarios.
3 Methodology
This study focuses on the comparison of various ML models and model training criteria to find a
better solution for PDF malware detection. ML models include A1DE, NB, KNN, RF, and SVM while
training criteria include the percentage splitting with 70% and 30% for training and testing respectively,
and 10-fold cross-validation. The overall methodology is presented in Fig. 1.
Figure 1: Methodology flow chart
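The two testing criteria described above can be sketched in plain Python as follows. This is an illustrative sketch only: the function names, the fixed random seed, and the unstratified shuffling are assumptions, since the paper does not state how the tool it used partitions the records.

```python
import random

def percentage_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle the sample indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices); every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 10,019 is the record count reported for the dataset in Section 3.1.
train, test = percentage_split(10019)
print(len(train), len(test))  # 7013 3006

for fold_number, (train_idx, test_idx) in enumerate(k_fold_splits(10019, k=10), start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

With percentage splitting, each model is scored once on a single held-out 30%; with 10-fold cross-validation, every record serves as test data in exactly one fold and the scores are aggregated over the ten folds.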
3.1 Dataset Explanation and Preprocessing
We collected the PDF malware detection dataset from the Canadian Institute for Cyber-security:
https://www.unb.ca/cic/datasets/pdfmal-2022.html. The dataset has 33 features, 32 of which are
independent and 1 of which is dependent. The first 11 features were eliminated since they had no
effect during the analysis stage. These features are collectively referred to as general features, and
they include encryption, metadata size, page number, header, picture
number, text, object number, font objects, number of embedded files, and average size of all embedded
media. The data is already clean and needs no further preprocessing.
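As a minimal sketch of this elimination step, the snippet below drops the leading general-feature columns from tabular rows. It assumes that the general features occupy the first 11 columns of the file, as the text suggests; the column names in the toy sample are hypothetical placeholders and not the real dataset header.

```python
import csv
import io

N_GENERAL = 11  # number of leading "general" columns removed, as stated in the text

def drop_general_features(rows, n_general=N_GENERAL):
    """Remove the first n_general columns from every row, header included."""
    return [row[n_general:] for row in rows]

# Toy stand-in for the real file: 11 placeholder general columns,
# two structural columns, and the class label (all names hypothetical).
header = [f"general_{i}" for i in range(1, 12)] + ["xref", "JS", "Class"]
sample = ",".join(header) + "\n" + ",".join(["0"] * 11 + ["2", "1", "Malicious"]) + "\n"

rows = list(csv.reader(io.StringIO(sample)))
reduced = drop_general_features(rows)
print(reduced[0])  # ['xref', 'JS', 'Class']
print(reduced[1])  # ['2', '1', 'Malicious']
```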
For further analysis, the features best suited to the task must be selected.
To this end, we select features from the structural features, which describe the PDF file in terms of
its structure, require more thorough processing, and reveal information about the PDF’s
general framework.
We employed the Classifier Attribute Evaluator technique with the ZeroR classifier and
the Ranker search method to rank these features. For accuracy estimation, the number of
folds used is 5. The selected features are ranked as:
Selected attributes: “21,7,8,10,6,5,4,3,2,9,11,20,18,19,12,17,16,15,14,13,1: 21”.
The attributes, in that order, are: Colours, encrypt, JS, XFA, startxref, trailer,
xref, endstream, stream, endobj, launch, OpenAction, AA, EmbeddedFile, JBIG2Decode, Acroform,
pageno, ObjStm, Javascript, RichMedia, and obj. Table 1 presents the selected features with their
descriptions.
Table 1: Selected features and description
S. No. | Feature | Description
1 | Xref | The stream’s size, as harmful code may be concealed within streams.
2 | Trailer | Number of trailers inside the PDF.
3 | Pageno | Malicious PDF files often include fewer pages, often just one blank page, because their authors do not care how the material is presented.
4 | Stream | The number of binary data sequences in the PDF.
5 | Encrypt | Indicates whether or not the PDF file is password-protected.
6 | Objstm | Streams that contain additional objects.
7 | Endstream | Keywords that signify the streams’ termination.
8 | JS | Number of objects containing JavaScript code.
9 | Obj | This might be a sign of an attempt to obfuscate.
10 | Javascript | The number of objects that include JavaScript code, the most common type of feature.
11 | AA | Specifies a specific action upon an event.
12 | OpenAction | Specifies what action should be performed when a PDF file is opened. The majority of commonly encountered malicious PDF files employ this feature in combination with JavaScript.
13 | endobj | PDFs enable a wide range of obfuscation methods, including string obfuscations in hex, octal, etc., that are frequently employed in evasion strategies.
14 | Acroform | Form fields in PDF files created with Acrobat contain scripting that attackers might exploit.
15 | Startxref | Number of keywords with “startxref”, which indicate the location of the Xref table’s start.
16 | JBIG2Decode | JBIG2Decode is a popular filter for encoding malicious data. Which objects have the most nested filters? Nested filters can impede decoding and suggest evasion.
17 | RichMedia | The number of rich media keywords, which shows the amount of Flash and embedded media.
18 | Launch | The “launch” keyword refers to executing a command or program.
19 | EmbeddedFile | PDF files can attach or embed other files, such as Word documents, images, and more, which can be used maliciously.
20 | XFA | XFA, an XML forms architecture that permits scripting technologies that attackers may exploit, is found in some PDF files.
21 | Color | Number of color schemes used in the PDF.
22 | Class | Label: benign or malicious.
Due to its portability, PDF has become the most widely used document format over time.
Unfortunately, the ubiquity of PDFs and their sophisticated capabilities have made it possible for
attackers to abuse them in a variety of ways. An attacker can take advantage of several crucial PDF
properties to deliver a malicious payload. This dataset collects such malicious samples: it comprises
10,019 records, of which 5551 are malicious and 4468 are benign, that tend to evade the common
significant features found in each class.
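The class counts above can be turned into a quick sanity check. The short sketch below recomputes the total and the class shares, and derives the accuracy of a trivial classifier that always predicts the majority class; this baseline is an illustrative addition and is not reported in the paper.

```python
malicious = 5551
benign = 4468
total = malicious + benign

# Share of each class in the full dataset.
malicious_share = malicious / total
benign_share = benign / total

# A classifier that always answers with the larger class is right this often.
majority_baseline = max(malicious, benign) / total

print(total)  # 10019
print(f"malicious: {malicious_share:.2%}, benign: {benign_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline:.2%}")
```

The classes are therefore roughly balanced, so accuracy is a reasonable headline metric here, although precision and recall remain useful for seeing which kind of error a model makes.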
3.2 Model Training and Performance Evaluation
This study uses two testing criteria: percentage splitting of the dataset, where 70% is used
for training and the remaining 30% for testing, and K-fold cross-validation, where K is set
to 10. This study presents
a comparison of these testing methods. The ML techniques are compared using standard evaluation
metrics such as F1-score, precision, recall, and accuracy. These measures are calculated as:
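$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$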
Here, TP denotes the number of true positives and FP the number of false positives,
while TN and FN denote the numbers of true negatives and false negatives, respectively.
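Eqs. (1) to (4) translate directly into code. The sketch below is a minimal illustration; the function name, the zero-division guards, and the example counts are assumptions made for the example and are not values from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1-score, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0              # Eq. (1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0                 # Eq. (2)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0         # Eq. (3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                    # Eq. (4)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical confusion counts, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the malicious class is treated as the positive class, which is the usual convention in malware detection but is an assumption as far as this paper is concerned.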
4 Results Analysis and Discussion
This section presents the outcomes achieved with the aforementioned ML models: A1DE,
NB, SVM, KNN, and RF. These models are evaluated using F1-score, precision, recall, and accuracy.
This study also considers two testing criteria: percentage splitting and K-fold
cross-validation. For percentage splitting, we used 70% of the data for training and the remaining
30% for testing, while for K-fold cross-validation we set K to 10. Fig. 2 presents the precision,
recall, and F1-score of each technique under the first testing criterion (70% for training and
30% for testing), and Fig. 3 presents the same measures under the second testing
criterion (10-fold cross-validation). With respect to the testing criteria, this analysis indicates that
10-fold cross-validation is preferable to the 70%/30% split.
Moreover, it also shows that KNN outperforms the other employed models under 10-fold cross-validation.
Under the percentage splitting criterion, KNN and RF show the same performance.
Figure 2: Precision, recall, and accuracy analysis using percentage splitting criterion (chart title: Precision, Recall and F1 Score)
Figure 3: Precision, recall, and accuracy analysis using 10-fold cross-validation (chart title: Precision, Recall and F1 Score)
Figs. 4 and 5 separately present the accuracy of each employed model under the two testing
criteria, percentage splitting and K-fold cross-validation. In both cases, KNN outperforms
the other employed models, with an accuracy of 99.8499% under percentage splitting
and 99.8599% under K-fold cross-validation. This analysis shows that, in the current scenario,
K-fold cross-validation is the better criterion for training and testing the models. ML models
predict output values from a variety of input values. KNN is one of the simplest ML algorithms and
is typically employed for classification. It categorizes a data point based on how its neighbors are
categorized. KNN is also known as a lazy learner (instance-based learning): it learns nothing
during the training period, and no discriminative function is generated from the training data. In
other words, no training is required; it draws only on the stored training dataset when
making real-time predictions. KNN is therefore much quicker to set up than the other techniques,
which require training before they can produce predictions, and fresh data can be incorporated
effortlessly without affecting the technique’s accuracy.
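The lazy-learning behaviour described above can be illustrated with a minimal KNN classifier in plain Python. This is a sketch under stated assumptions: the class name, Euclidean distance, k = 3, and the toy feature vectors are all illustrative choices, and the paper does not report the distance measure or the value of k used in its experiments.

```python
import math
from collections import Counter

class KNNClassifier:
    """Brute-force k-nearest-neighbor classifier (illustrative sketch)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; no model parameters are estimated.
        self.X = [tuple(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, x):
        # Rank stored samples by Euclidean distance to x, then vote among the k nearest.
        ranked = sorted(zip(self.X, self.y), key=lambda pair: math.dist(pair[0], x))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return [self.predict_one(x) for x in X]

# Toy feature vectors (hypothetical keyword counts), not rows from the dataset.
X_train = [(0, 0), (1, 0), (0, 1), (5, 4), (6, 5), (5, 6)]
y_train = ["benign", "benign", "benign", "malicious", "malicious", "malicious"]

model = KNNClassifier(k=3).fit(X_train, y_train)
print(model.predict([(0.5, 0.5), (5.5, 5.0)]))  # ['benign', 'malicious']
```

Note that in this brute-force form the work is deferred to prediction time: each query computes its distance to every stored sample, so prediction cost grows with the size of the training set.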
Figure 4: Accuracy analysis using percentage splitting testing criteria
Figure 5: Accuracy analysis using 10-fold cross-validation testing criteria
Overall, the findings reveal that the ML models perform consistently well under both assessment
methods, demonstrating their effectiveness in correctly predicting the target variable. Because the
data partitioning and model training differ between the two methods, the accuracy values produced by
percentage splitting and 10-fold cross-validation differ slightly. Nevertheless, the models’ high
accuracy demonstrates their potential for accurate predictions in the present setting.
The paper contributes to the field of cyber-security by providing insights into the effectiveness of
ML models for detecting malware in PDF files. The comparative analysis and performance evaluation
contribute to the development of robust systems for protecting against malicious activities associated
with PDF malware. The study demonstrates the effectiveness of ML models in accurately detecting
PDF malware and provides insights for developing robust systems to protect against malicious
activities. The findings suggest that KNN is a promising model for PDF malware detection, but further
research and experimentation may be required to validate and improve the results.
5 Conclusion and Future Work
In this research paper, we conducted a comparative analysis of machine learning (ML) models
for PDF malware detection, focusing on the A1DE, NB, SVM, KNN, and RF models. We utilized a
dataset obtained from the Canadian Institute for Cyber-security and employed two testing criteria:
percentage splitting and 10-fold cross-validation. Our evaluation was based on precision, recall,
F1-score, and accuracy metrics. The results showed that KNN outperformed the other models,
achieving an accuracy of 99.8599% using 10-fold cross-validation. These findings highlight the effectiveness
of ML models in accurately detecting PDF malware and provide insights for developing robust
systems to protect against malicious activities in PDF files. The research contributes to enhancing
cyber-security measures by providing a reliable model for PDF malware detection, which can help in
preventing the proliferation of sophisticated attacks through maliciously coded documents.
Although the paper has some limitations, such as a limited dataset and the lack of comparison
with other approaches, its methodology, performance evaluation, and comparative analysis contribute
to its validity. Further research and validation using diverse datasets and comparison with
alternative methods are necessary to strengthen the findings.
Moving forward, further research in the field of PDF malware detection should focus on several
key areas. First, investigating deep learning methods such as convolutional neural networks (CNNs)
and recurrent neural networks (RNNs) may improve the performance and accuracy of the models.
Additionally, incorporating natural language processing (NLP) techniques to analyze the textual
content within PDF files could provide valuable insights for malware detection. Moreover, the
development of real-time detection systems that can analyze PDF files on the fly and detect emerging threats in a timely
manner would be highly beneficial. Lastly, collaboration between researchers, industry professionals,
and cyber-security organizations is crucial to gather large-scale, diverse datasets for training and
testing purposes, ensuring the models are robust and effective against a wide range of PDF malware
variants.
Declarations: We hereby declare that the research paper titled “Comparative Analysis of Machine
Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria,”
submitted for publication, is our original work and has not been submitted elsewhere for any academic
or non-academic purpose. We confirm that all the sources used in this research paper have been
appropriately cited and acknowledged. Any references, data, or ideas obtained from other authors
are duly attributed and documented in the references section.
Acknowledgement: We would like to express our sincere gratitude to all those who contributed to the
successful completion of this research paper, “Comparative Analysis of Machine Learning Models
for PDF Malware Detection: Evaluating Different Training and Testing Criteria”. We are thankful
to the research team members, Bilal Khan (BK), Muhammad Arshad (MA), and Sarwar Shah Khan
(SSK), for their collaboration and valuable insights throughout the research process. We extend our
appreciation to the institutions that supported this research work. We are deeply grateful to the
Canadian Institute for Cyber-security for providing the dataset that formed the foundation of our
analysis. Their contributions have been instrumental in enabling us to conduct this study on PDF
malware detection and assess the performance of various ML models. Our heartfelt appreciation goes
to all the individuals and institutions that reviewed and provided constructive feedback on this research
paper. Your valuable input helped improve the quality and rigor of our work. Without the collective
efforts and support of all these individuals and organizations, this research paper would not have been
possible. Thank you all for your invaluable contributions.
Funding Statement: No specific grant from a funding organization supported this research.
Author Contributions: The authors confirm contribution to the paper as follows: study conception and
design: BK; data collection: BK and MA; analysis and interpretation of results: SSK and BK; draft
manuscript preparation: MA and SSK. All authors reviewed the results and approved the final version
of the manuscript.
Availability of Data and Materials: The data used in this research paper, “Comparative Analysis of
Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing
Criteria,” is obtained from the Canadian Institute for Cyber-security and is publicly available for
research purposes. The dataset can be accessed at the following URL:
https://www.unb.ca/cic/datasets/pdfmal-2022.html.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the
present study.
References
[1] Y. S. Jeong, J. Woo and A. R. Kang, “Malware detection on byte streams of pdf files using convolutional
neural networks,” Security and Communication Networks, vol. 2019, pp. 144–152, 2019.
[2] B. Cuan, A. Damien, C. Delaplace and M. Valois, “Malware detection in PDF files using machine learning,”
in ICETE 2018-Proc. of the 15th Int. Joint Conf. on e-Business and Telecommunications, Porto, Portugal,
vol. 2, pp. 412–419, 2018.
[3] A. Falah, S. R. Pokhrel, L. Pan and A. de Souza-Daw, “Towards enhanced PDF maldocs detection with
feature engineering: Design challenges,” Multimedia Tools and Applications, vol. 81, no. 28, pp. 41103–
41130, 2022.
[4] W. Xu, Y. Qi and D. Evans, “Automatically evading classifiers,” in Proc. of the 23rd Annual Network and
Distributed System Security Symp., San Diego, California, vol. 2016, no. February, pp. 21–24, 2016.
[5] S. Sibi Chakkaravarthy, D. Sangeetha and V. Vaidehi, “A survey on malware analysis and mitigation
techniques,” Computer Science Review, vol. 32, pp. 1–23, 2019.
[6] F. J. Abdullayeva and S. S. Ojagverdiyeva, “Multicriteria decision making using analytic hierarchy process
for child protection from malicious content on the internet,” International Journal of Computer Network &
Information Security, vol. 13, no. 3, pp. 52–61, 2021.
[7] B. Wickman, H. Hu, I. Yun, D. Jang, J. W. Lim et al., “Preventing Use-After-Free attacks with fast forward
allocation,” in 30th USENIX Security Symp. (USENIX Security 21), Vancouver, Canada, pp. 2453–2470,
2021.
[8] M. Templ and M. Sariyar, “A systematic overview on methods to protect sensitive data provided for various
analyses,” International Journal of Information Security, vol. 21, no. 6, pp. 1233–1246, 2022.
[9] W. Syafitri, Z. Shukur, U. Asma’Mokhtar, R. Sulaiman and M. A. Ibrahim, “Social engineering attacks
prevention: A systematic literature review,” IEEE Access, vol. 10, pp. 39325–39343, 2022.
[10] H. Ahmad, I. Dharmadasa, F. Ullah and M. A. Babar, “A review on C3I systems’ security: Vulnerabilities,
attacks, and countermeasures,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–38, 2023.
[11] W. Li, W. Meng, Z. Tan and Y. Xiang, “Design of multi-view based email classification for IoT systems via
semi-supervised learning,” Journal of Network and Computer Applications, vol. 128, pp. 56–63, 2019.
[12] Y. Li, X. Wang, Z. Shi, R. Zhang, J. Xue et al., “Boosting training for PDF malware classifier via active
learning,” International Journal of Intelligent Systems, vol. 37, no. 4, pp. 2803–2821, 2022.
[13] S. S. Khan, M. Khan, S. Technology and R. Naseem, “Challenges in opinion mining, comprehensive,” A
Science and Technology Journal, vol. 33, no. 11, pp. 123–135, 2018.
[14] T. Tsafrir, A. Cohen, E. Nir and N. Nissim, “Efficient feature extraction methodologies for unknown
MP4-malware detection using machine learning algorithms,” Expert Systems with Applications, vol. 219,
pp. 119615, 2023.
[15] A. R. Kang, Y. S. Jeong, S. L. Kim and J. Woo, “Malicious PDF detection model against adversarial attack
built from benign PDF containing javascript,” Applied Sciences, vol. 9, no. 22, pp. 4764, 2019.
[16] Y. Chen, S. Wang, D. She and S. Jana, “On training robust PDF malware classifiers,” in 29th USENIX
Security Symp. (USENIX Security 20), Boston, USA, pp. 2343–2360, 2020.
[17] M. Cova, C. Kruegel and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious
JavaScript code,” in Proc. of the 19th Int. Conf. on World Wide Web, North Carolina, USA, pp. 281–290,
2010.
[18] P. Laskov and N. Šrndić, “Static detection of malicious JavaScript-bearing PDF documents,” in Proc. of
the 27th Annual Computer Security Applications Conf., Orlando, Florida, USA, pp. 373–382, 2011.
[19] S. J. Khitan, A. Hadi and J. Atoum, “PDF forensic analysis system using YARA,” International Journal of
Computer Science and Network Security, vol. 17, no. 5, pp. 77–85, 2017.
[20] J. Zhang, “MLPdf: An effective machine learning based approach for PDF malware detection,” pp. 1–6,
2018. [Online]. Available: https://arxiv.org/pdf/1808.06991.pdf
[21] D. Liu, H. Wang and A. Stavrou, “Detecting malicious JavaScript in PDF through document instrumentation,”
in 2014 44th Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks, Atlanta, USA, pp.
100–111, 2014.
[22] J. A. Herrera-Silva and M. Hernández-Álvarez, “Dynamic feature dataset for ransomware detection using
machine learning algorithms,” Sensors, vol. 23, no. 3, pp. 1053, 2023.
[23] M. Dhalaria and E. Gandotra, “A hybrid approach for android malware detection and family classification,”
International Journal of Interactive Multimedia and Artificial Intelligence, vol. 6, no. 6, pp. 174–188,
2020.
[24] M. Deore and U. Kulkarni, “MDFRCNN: Malware detection using faster region proposals convolution neural
network,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 7, no. 4, pp. 146–162, 2022.
... However, advances in adversarial strategies to get around hostile document classifiers have been made. In particular, adversarial examples based on precision manipulation are carefully designed to cause misclassifications, which might pose a danger to many machine learning-based detectors [6]. Although several analysis and detection techniques have been put out to thwart these assaults, the threat posed by adversarial attacks still needs to be properly addressed. ...
... A comparative study of machine learning methods for identifying malware in PDF files is carried out by B. Khan and colleagues [6]. K-Nearest Neighbor (KNN) beats other models with an amazing 99.8599% accuracy in 10-fold cross-validation, according to the study, which uses a dataset from the Canadian Institute for Cybersecurity. ...
... To this end, this study uses the 10-fold cross-validation to split the data for training and testing purposes. By testing the model on several data subsets, it reduces the possibility of overfitting [6]. ...
Malware is an ever-present and dynamic threat to networks and computer systems in cybersecurity, and because of its complexity and evasiveness, it is challenging to identify using traditional signature-based detection approaches. The study article discusses the growing danger to cybersecurity that malware hidden in PDF files poses, highlighting the shortcomings of conventional detection techniques and the difficulties presented by adversarial methodologies. The article presents a new method that improves PDF virus detection by using document analysis and a Logistic Model Tree. Using a dataset from the Canadian Institute for Cybersecurity, a comparative analysis is carried out with well-known machine learning models, such as Credal Decision Tree, Naïve Bayes, Average One Dependency Estimator, Locally Weighted Learning, and Stochastic Gradient Descent. Beyond traditional structural and JavaScript-centric PDF analysis, the research makes a substantial contribution to the area by boosting precision and resilience in malware detection. The use of Logistic Model Tree, a thorough feature selection approach, and increased focus on PDF file attributes all contribute to the efficiency of PDF virus detection. The paper emphasizes Logistic Model Tree’s critical role in tackling increasing cybersecurity threats and proposes a viable answer to practical issues in the sector. The results reveal that the Logistic Model Tree is superior, with improved accuracy of 97.46% when compared to benchmark models, demonstrating its usefulness in addressing the ever-changing threat landscape.
... Model training and testing are the core phases of any ML-based analysis. To this end, the study focuses on 10-fold cross-validation [19,23], which is a process for assessment that splits the complete data into ten subgroups of equal sizes; one subgroup is used for testing, whereas the rest of the subgroups are used for training, continuing until each subgroup has been used for testing [24][25][26][27][28][29]. The performance of the proposed and other ML models is evaluated using standard assessment measures, including accuracy, precision, recall, F-measure, and MCC (Matthews Correlation Coefficient) [20][21][22]. ...
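The 10-fold procedure and the assessment measures named in the excerpt above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used by any of the cited studies; the function names (`k_fold_indices`, `cross_validate`, `binary_metrics`) and the `train_and_score` callback are hypothetical and introduced here only for the example.

```python
import random
from math import sqrt


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle range(n_samples) and split it into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, train_and_score, k=10, seed=0):
    """Use each fold once for testing and the other k-1 folds for training.

    train_and_score(train_idx, test_idx) is a caller-supplied (hypothetical)
    callback that fits a model on train_idx and returns a score on test_idx.
    Returns the mean score over the k rounds.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k


def binary_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, F-measure and MCC from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    pr_sum = precision + recall
    f_measure = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Toy labels and a majority-class baseline, purely for demonstration.
    labels = [i % 2 for i in range(100)]

    def majority_baseline(train_idx, test_idx):
        ones = sum(labels[j] for j in train_idx)
        guess = 1 if ones * 2 >= len(train_idx) else 0
        return sum(labels[j] == guess for j in test_idx) / len(test_idx)

    print(cross_validate(len(labels), majority_baseline, k=10))
    print(binary_metrics(tp=50, fp=10, tn=35, fn=5))
```

In practice a library routine with stratified folds would usually be preferred, since it keeps the class ratio similar in every fold; the sketch above only shows the mechanics.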
Author Profiling (AP) is a subsection of digital forensics that focuses on the detection of the author’s personal information, such as age, gender, occupation, and education, based on various linguistic features, e.g., stylistic, semantic, and syntactic. The importance of AP lies in various fields, including forensics, security, medicine, and marketing. In previous studies, many works have been done using different languages, e.g., English, Arabic, French, etc. However, the research on Roman Urdu is not up to the mark. Hence, this study focuses on detecting the author’s age and gender based on Roman Urdu text messages. The dataset used in this study is Fire’18-MaponSMS. This study proposed an ensemble model based on AdaBoostM1 and Random Forest (ABMRF) for AP using multiple linguistic features that are stylistic, character-based, word-based, and sentence-based. The proposed model is contrasted with several of the well-known models from the literature, including J48-Decision Tree (J48), Naïve Bayes (NB), K Nearest Neighbor (KNN), Composite Hypercube on Random Projection (CHIRP), NB-Updatable, RF, and AdaboostM1. The overall outcome shows the better performance of the proposed AdaboostM1 with Random Forest (ABMRF) with an accuracy of 54.2857% for age prediction and 71.1429% for gender prediction calculated on stylistic features. Regarding word-based features, age and gender were predicted with accuracies of 50.5714% and 60%, respectively. On the other hand, KNN and CHIRP show the weakest performance using all the linguistic features for age and gender prediction.
... The accuracy can be determined by dividing the total number of predicted reviews by the overall number of movie reviews (Khan and Muhammad 2023). ...
Movies have been important in our lives for many years. Movies provide entertainment, inspire, educate, and offer an escape from reality. Movie reviews help us choose better movies, but reading them all can be time-consuming and overwhelming. To make it easier, sentiment analysis can classify movie reviews into positive and negative categories. Opinion mining (OP), called sentiment analysis (SA), uses natural language processing to identify and extract opinions expressed through text. Naive Bayes, a supervised learning algorithm, offers simplicity, efficiency, and strong performance in classification tasks due to its feature independence assumption. This study evaluates the performance of four Naïve Bayes variations using two vectorization techniques, Count Vectorizer and Term Frequency–Inverse Document Frequency (TF–IDF), on two movie review datasets: IMDb Movie Reviews Dataset and Rotten Tomatoes Movie Reviews. Bernoulli Naive Bayes achieved the highest accuracy using Count Vectorizer on the IMDB and Rotten Tomatoes datasets. Multinomial Naive Bayes, on the other hand, achieved better accuracy on the IMDB dataset with TF–IDF. During preprocessing, we implemented different techniques to enhance the quality of our datasets. These included data cleaning, spelling correction, fixing chat words, lemmatization, and removing stop words. Additionally, we fine-tuned our models through hyperparameter tuning to achieve optimal results. Using TF–IDF, we observed a slight performance improvement compared to using the count vectorizer. The experiment highlights the significant role of sentiment analysis in understanding the attitudes and emotions expressed in movie reviews. By predicting the sentiments of each review and calculating the average sentiment of all reviews, it becomes possible to make an accurate prediction about a movie’s overall performance.
... The accuracy of the predictions is determined by dividing the number of predicted reviews by the total number of reviews [31]. ...
Movie reviews provide valuable insights that can help people decide which movies are worth watching and avoid wasting their time on movies they will not enjoy. Movie reviews may contain spoilers or reveal significant plot details, which can reduce the enjoyment of the movie for those who have not watched it yet. Additionally, the abundance of reviews may make it difficult for people to read them all at once; classifying all of the movie reviews will help in making this decision without wasting time reading them all. Opinion mining, also called sentiment analysis, is the process of identifying and extracting subjective information from textual data. This study introduces a sentiment analysis approach using advanced deep learning models: Extra-Long Neural Network (XLNet), Long Short-Term Memory (LSTM), and Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM). XLNet understands the context of a word from both sides, which is helpful for capturing complex language patterns. LSTM performs better in modeling long-term dependencies, while CNN-LSTM combines local and global context for robust feature extraction. Deep learning models take advantage of their ability to extract complex linguistic patterns and contextual information from raw text data. We carefully cleaned the IMDb movie reviews dataset with the goal of optimizing the results of models used in the experiment. This involves eliminating unnecessary punctuation, links, hashtags, stop words, and duplicate reviews. Lemmatization is also used for keeping consistent word forms. This cleaned IMDb dataset is evaluated on the proposed model for sentiment analysis in which XLNet performs well, achieving an impressive 93.74% accuracy on the IMDb Dataset. The findings highlight the effectiveness of deep learning models in improving sentiment analysis, showing its potential for wider applications in natural language processing.
... The accuracy is calculated by dividing the predicted reviews by the total number of reviews [29]. ...
Movie reviews are a valuable source of information for potential viewers. However, reading all of the reviews can be time-consuming and overwhelming. Summarizing all of the reviews will help you make the correct choice without wasting time reading all of the reviews. Sentiment analysis, or opinion mining, can extract subjective information from movie reviews, such as the reviewer’s overall opinion of the movie, its strengths and weaknesses, and the reviewer’s recommendations. This information can help potential viewers make informed decisions about whether or not to watch a movie. XLNet and Bidirectional Encoder Representations from Transformers (BERT) are pre-trained advanced language models that learn bidirectional relationships between words, improving performance on many natural language processing tasks. BERT uses a masked language modeling objective, while XLNet uses a permutation language modeling objective. This experiment is based on the proposed method for XLNet and BERT, two advanced techniques and popular baseline techniques using the Internet Movie Database (IMDB) Dataset of 50K reviews and the Rotten Tomatoes dataset. We pre-processed both datasets using data cleaning, the removal of duplicate reviews, lemmatization, and handling of chat words to improve baseline technique results. The results indicate that XLNet achieved the highest accuracy on both datasets. As a result of the research experiment, sentiment analysis provides insights into how emotions and attitudes are expressed in movie reviews that can be used to predict a movie’s performance based on their overall sentiment.
... The accuracy is calculated by dividing the predicted reviews by the total number of reviews [29]. ...
... The model used in this study was evaluated using training time, precision, recall, F1-score and accuracy [33]. Following are the parameters: TN: True Negative means the number of negative movie reviews that were correctly classified as negative. ...
Movies are a popular source of entertainment. Every year, a large number of movies are released. People comment on movies in the form of reviews after watching them. Since it is difficult to read all of the reviews for a movie, summarizing all of the reviews will help make this decision without wasting time in reading all of the reviews. Opinion mining, also known as sentiment analysis, is the process of extracting subjective information from textual data. Opinion mining involves identifying and extracting the opinions of individuals, which can be positive, neutral, or negative. The task of opinion mining is performed to understand people's emotions and attitudes in movie reviews. Movie reviews are an important source of opinion data because they provide insight into the general public's opinions about a particular movie. The summary of all reviews can give a general idea about the movie. This study compares baseline techniques, Logistic Regression, Random Forest Classifier, Decision Tree, K-Nearest Neighbor, Gradient Boosting Classifier, and Passive Aggressive Classifier with Linear Support Vector Machines and Multinomial Naïve Bayes on the IMDB Dataset of 50K reviews and Sentiment Polarity Dataset Version 2.0. Before applying these classifiers, in pre-processing both datasets are cleaned, duplicate data is dropped and chat words are treated for better results. On the IMDB Dataset of 50K reviews, Linear Support Vector Machines achieve the highest accuracy of 89.48%, and after hyperparameter tuning, the Passive Aggressive Classifier achieves the highest accuracy of 90.27%, while Multinomial Naïve Bayes achieves the highest accuracy of 70.69% and 71.04% after hyperparameter tuning on the Sentiment Polarity Dataset Version 2.0. This study highlights the importance of sentiment analysis as a tool for understanding the emotions and attitudes in movie reviews and predicts the performance of a movie based on the average sentiment of all the reviews.
Ransomware-related cyber-attacks have been on the rise over the last decade, disturbing organizations considerably. Developing new and better ways to detect this type of malware is necessary. This research applies dynamic analysis and machine learning to identify the ever-evolving ransomware signatures using selected dynamic features. Since most of the attributes are shared by diverse ransomware-affected samples, our study can be used for detecting current and even new variants of the threat. This research has the following objectives: (1) Execute experiments with encryptor and locker ransomware combined with goodware to generate JSON files with dynamic parameters using a sandbox. (2) Analyze and select the most relevant and non-redundant dynamic features for identifying encryptor and locker ransomware from goodware. (3) Generate and make public a dynamic features dataset that includes these selected parameters for samples of different artifacts. (4) Apply the dynamic feature dataset to obtain models with machine learning algorithms. Five platforms, 20 ransomware, and 20 goodware artifacts were evaluated. The final feature dataset is composed of 2000 registers of 50 characteristics each. This dataset allows for a machine learning detection with a 10-fold cross-evaluation with an average accuracy superior to 0.99 for gradient boosted regression trees, random forest, and neural networks.
In view of the various methodological developments regarding the protection of sensitive data, especially with respect to privacy-preserving computation and federated learning, a conceptual categorization and comparison between various methods stemming from different fields is often desired. More concretely, it is important to provide guidance for the practice, which lacks an overview over suitable approaches for certain scenarios, whether it is differential privacy for interactive queries, k -anonymity methods and synthetic data generation for data publishing, or secure federated analysis for multiparty computation without sharing the data itself. Here, we provide an overview based on central criteria describing a context for privacy-preserving data handling, which allows informed decisions in view of the many alternatives. Besides guiding the practice, this categorization of concepts and methods is destined as a step towards a comprehensive ontology for anonymization. We emphasize throughout the paper that there is no panacea and that context matters.
Command, Control, Communication, and Intelligence (C3I) systems are increasingly used in critical civil and military domains for achieving information superiority, operational efficacy, and greater situational awareness. The critical civil and military domains include, but are not limited to battlefield, healthcare, transportation, and rescue missions. Given the sensitive nature and modernization of tactical domains, the security of C3I systems has recently become a critical concern. This is because cyber-attacks on C3I systems have catastrophic consequences including loss of human lives. Despite the increasing number of cyber-attacks on C3I systems and growing concerns about C3I systems’ security, there is a paucity of a comprehensive review to systematize the body of knowledge on the security of C3I systems. Therefore, in this paper, we have gathered, analyzed, and synthesized the body of knowledge on the security of C3I systems. We have identified and reported security vulnerabilities, attack vectors, and countermeasures/defenses for C3I systems. In particular, this paper has enabled us to (i) propose a taxonomy for security vulnerabilities, attack vectors, and countermeasures; (ii) interrelate attack vectors with security vulnerabilities and countermeasures; and (iii) propose future research directions for advancing the state-of-the-art on the security of C3I systems. We believe that our findings will serve as a guideline for practitioners and researchers to advance the state-of-the-practice and state-of-the-art on the security of C3I systems.
In this paper, we perform an in-depth analysis of a large corpus of PDF maldocs to identify the key set of significantly important features and help in maldoc detection. Existing industry-based tools for the detection are inefficient and cannot prevent PDF maldocs because they are generic and depend primarily on a signature-based approach. Besides, several other methods developed by academics suffer heavily from reduced effectiveness. The feature-set using machine learning classifiers is prone to various known attacks, such as mimicry and parser confusion. Also, we discover that increasingly more malicious files i) contain evasive and obfuscated JavaScript code, ii) include hidden contents (mostly outside the objects), iii) have a corrupted document structure, and iv) usually contain short JavaScript code blocks . We utilise maldoc attacks’ evolution over a decade to highlight the essential features (e.g., concept drifts) that impact detectors and classifiers.
Social engineering is an attack on information security for accessing systems or networks. Social engineering attacks occur when victims do not recognize methods, models, and frameworks to prevent them. The current research explains user studies, constructs, evaluation, concepts, frameworks, models, and methods to prevent social engineering attacks. Unfortunately, there is no specific previous research on preventing social engineering attacks that effectively and systematically analyze it. Current prevention methods, models, and frameworks of social engineering attacks include health campaigns, human as security sensor frameworks, user-centric frameworks, and user vulnerability models. The human as a security sensor framework needs guidance that will explore cybersecurity as super-recognizers, likely policing act for a secure system. This paper intends to critically and rigorously review prior literature on the prevention methods, models, and frameworks of social engineering attacks. We conducted a systematic literature review based on Bryman & Bell’s literature review method. We found a new approach in addition to methods, frameworks, models and evaluations to prevent social engineering attacks based on our review, which is using a protocol. We found the protocol to effectively prevent social engineering attacks, such as health campaigns, the vulnerability of social engineering victims, and co-utile protocol, which can manage information sharing on a social network. We present this systematic literature review to recommend ways to prevent social engineering attacks.
Technological advancement of smart devices has opened up a new trend: Internet of Everything (IoE), where all devices are connected to the web. Large scale networking benefits the community by increasing connectivity and giving control of physical devices. On the other hand, there exists an increased ‘Threat’ of an ‘Attack’. Attackers are targeting these devices, as it may provide an easier ‘backdoor entry to the users’ network’. MALicious softWARE (MalWare) is a major threat to user security. Fast and accurate detection of malware attacks is the sine qua non of IoE, where large scale networking is involved. The paper proposes use of a visualization technique where the disassembled malware code is converted into gray images, as well as use of Image Similarity based Statistical Parameters (ISSP) such as Normalized Cross correlation (NCC), Average difference (AD), Maximum difference (MaxD), Singular Structural Similarity Index Module (SSIM), Laplacian Mean Square Error (LMSE), MSE and PSNR. A vector consisting of gray image with statistical parameters is trained using a Faster Region proposals Convolution Neural Network (F-RCNN) classifier. The experiment results are promising as the proposed method includes ISSP with F-RCNN training. Overall training time of learning the semantics of higher-level malicious behaviors is less. Identification of malware (testing phase) is also performed in less time. The fusion of image and statistical parameter enhances system performance with greater accuracy. The benchmark database from Microsoft Malware Classification challenge has been used to analyze system performance, which is available on the Kaggle website. An overall average classification accuracy of 98.12% is achieved by the proposed method.
Modern children are active Internet users. However, in the context of information abundance, they have little knowledge of which information is useful and which is harmful. To make the Internet a safe place for children, various methods are used at the international and national levels, as well as by experts, and the ways to protect children from harmful information are sought. The article proposes an approach using a multi-criteria decision-making process to prevent children from encountering harmful content on the Internet and to make the Internet more secure environment for children. The article highlights the age characteristics of children as criteria. Harmless information, Training information, Entertainment information, News, and Harmful information are considered as alternatives. Here, a decision is made by comparing the alternatives according to the given criteria. According to the trials, harmful information is rated in the last position. There is no child protection issue on the Internet using the AHP method. This research is important to protect children from harmful information in the virtual space. In the protection of minors Internet users is a reliable approach for educational institutions, parents and other subjects related to child safety.
With the increase in the popularity of mobile devices, malicious applications targeting the Android platform have greatly increased. Malware is coded so prudently that it has become very complicated to identify. The increase in the large amount of malware every day has made the manual approaches inadequate for detecting the malware. Nowadays, new malware is characterized by sophisticated and complex obfuscation techniques. Thus, static malware analysis alone is not enough for detecting it. Dynamic malware analysis, in turn, is appropriate for tackling evasion techniques but is unable to investigate all the execution paths and is also very time consuming. So, for better detection and classification of Android malware, we propose a hybrid approach which integrates the features obtained after performing static and dynamic malware analysis. This approach tackles the problem of analyzing, detecting and classifying the Android malware in a more efficient manner. In this paper, we have used a robust set of features from static and dynamic malware analysis for creating two datasets, i.e., binary and multiclass (family) classification datasets. These are made publicly available on GitHub and Kaggle with the aim to help researchers and anti-malware tool creators in enhancing or developing new techniques and tools for detecting and classifying Android malware. Various machine learning algorithms are employed to detect and classify malware using the features extracted after performing static and dynamic malware analysis. The experimental outcomes indicate that the hybrid approach enhances the accuracy of detection and classification of Android malware as compared to the case when static and dynamic features are considered alone.
Machine learning algorithms are widely used for cybersecurity applications, including spam and malware detection. In these applications, the machine learning model has to face attacks by adversarial samples. Therefore, how to train a robust machine learning model with small samples is a very active research problem. Portable Document Format (PDF) is a widely used file format, and is often utilized as a vehicle for malicious behavior. There have been various PDF malware detectors based on machine learning. However, the labeling of large‐scale data samples is time‐consuming and laborious. This paper aims to reduce the size of the training set while maintaining detection performance. We propose a novel PDF malware detection method, using active learning to boost training. Particularly, we first make clear the meaning of uncertain samples in this paper, and theoretically explain the effectiveness of these uncertain samples for malware detection. Second, we present an active‐learning based malware detection model, using mutual agreement analysis to choose the uncertain samples as the data augmentation. The detector is retrained according to the ground truth of the uncertain samples rather than the whole test samples in the previous epoch, which can not only improve the detection performance, but also reduce the training time consumption of the detector. We conduct 10 epochs of retraining experiments for comparison, using the uncertain samples and the whole test samples from the previous epoch respectively as training set augmentation. The experimental results show that our active‐learning based model can achieve the same performance as the traditional model in the tenth epoch of retraining, while the former only needs to use one thirtieth of the latter's training samples.