
Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria

August 2023
Journal of Cyber Security 5:1-11


DOI:10.32604/jcs.2023.042501




This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Tech Science Press

DOI: 10.32604/jcs.2023.042501

ARTICLE

Comparative Analysis of Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria

Bilal Khan1, Muhammad Arshad2 and Sarwar Shah Khan3,4,*

1Department of Computer Science, City University of Science and Information Technology, Peshawar, Pakistan

2Department of Computer Soware Engineering, University of Engineering and Technology, Mardan, Pakistan

3Department of Computer and Soware Technology, University of Swat, Swat, Pakistan

4Department of Computer Science, IQRA National University, Swat, Pakistan

*Corresponding Author: Sarwar Shah Khan. Email: sskhan0092@gmail.com

Received: 01 June 2023 Accepted: 03 August 2023 Published: 21 August 2023

ABSTRACT

The proliferation of maliciously coded documents as le transfers increase has led to a rise in sophisticated attacks.

Portable Document Format (PDF) les have emerged as a major attack vector for malware due to their adaptability

and wide usage. Detecting malware in PDF les is challenging due to its ability to include various harmful elements

such as embedded scripts, exploits, and malicious URLs. This paper presents a comparative analysis of machine

learning (ML) techniques, including Naive Bayes (NB), K-Nearest Neighbor (KNN), Average One Dependency

Estimator (A1DE), Random Forest (RF), and Support Vector Machine (SVM) for PDF malware detection. The study

utilizes a dataset obtained from the Canadian Institute for Cyber-security and employs dierent testing criteria,

namely percentage splitting and 10-fold cross-validation. The performance of the techniques is evaluated using F1-

score, precision, recall, and accuracy measures. The results indicate that KNN outperforms other models, achieving

an accuracy of 99.8599% using 10-fold cross-validation. The ndings highlight the eectiveness of ML models in

accurately detecting PDF malware and provide insights for developing robust systems to protect against malicious

activities.

KEYWORDS

Cyber-security; PDF malware; model training; testing

1 Introduction

Recent years have seen a sharp rise in sophisticated attacks that use maliciously coded documents as file transfers increase. Most Internet users are aware that executable files attached to emails or hosted on webpages can be dangerous. Documents, however, are an effective vehicle for distributing malware because people are far less aware of the risk they pose. PDF is the major attack vector detected for malware because it is much more adaptable than other document formats. Malicious PDF files frequently contain JavaScript or binary scripts that exploit security weaknesses to perform damaging actions [1]. Countless PDF files are available online, and some are not as innocuous as one may think. In reality, PDF files may contain a wide range of items, such as JavaScript or binary code. These

2JCS, 2023, vol.5

things can occasionally be dangerous. Because PDF files can include a variety of harmful material, including embedded scripts, exploits, and malicious URLs, detecting malware in them can be difficult. Malware may exploit a flaw in the reader software to try to infect a machine [2]. A large number of vulnerabilities were discovered in Adobe Acrobat Reader in 2017. Every reader has particular flaws, and a malicious PDF file may be able to exploit them [3]. Offices frequently use the PDF file format because of its efficiency, reliability, and interactivity. The emergence of more advanced, non-executable file-based attack technologies and techniques has made PDF security more challenging, because malicious PDF files are commonly exploited infection vectors in hostile environments [4,5]. PDF malware detection is very important for several reasons, including:

Protection against Malicious Content: PDF files are often used for document sharing and can include a variety of embedded content types, including JavaScript, links, and multimedia components. Malicious actors can use these capabilities to embed malware in PDF files, making them a vehicle for malware delivery. Detecting PDF malware helps users avoid unintentionally opening or running dangerous files [6].

Preventing Exploits: PDF files can be used to exploit vulnerabilities in PDF reader software and other applications that work with PDF files. Malicious PDFs may include exploits that use security flaws to access systems without authorization or to run malicious code. For computer systems and networks to remain secure, these attacks must be detected and stopped [7].

Protecting Sensitive Information: PDF files are frequently used to store and distribute sensitive information, such as financial information, intellectual property, or personal details. Malware embedded in PDF files may steal or leak this information, which can result in monetary loss, data breaches, or identity theft. Detecting and eliminating malware in PDF files helps protect the security and integrity of sensitive data [8].

Attacks Using Social Engineering: To deceive users into opening infected PDF files, malicious actors frequently use social engineering tactics. These files may have alluring subject lines or messages, or they may be presented as genuine documents. Detecting PDF malware shields users from these social engineering scams and guards against the potential loss of money, reputation, or operational efficiency [9].

System Security Overall: Malware attacks can have serious effects on the safety and functionality of computer systems. System crashes, data damage, unauthorized access, and the installation of further malware are all possible consequences. Maintaining the overall security and stability of computer systems and networks involves detecting and eliminating PDF malware [10].

The motivation for this research stems from the need to develop effective methods for protecting against sophisticated attacks using PDF files. The authors highlight the importance of PDF malware detection for several reasons. Firstly, detecting malware in PDF files helps protect users from unintentionally accessing or running dangerous files, safeguarding them against potential harm. Secondly, vulnerabilities in PDF reader software and other applications can be exploited through malicious PDF files, making it crucial to identify and prevent such attacks. Thirdly, PDF files often contain sensitive information that can be stolen or leaked by malware, leading to financial loss, data breaches, or identity theft. Detecting and eliminating malware from PDF files helps protect the security and integrity of sensitive data. Lastly, malicious actors often use social engineering tactics to trick users into opening infected PDF files, and detecting PDF malware can mitigate the risks associated with such attacks.

With all these considerations in mind, and as a result of the growth of ML technology in recent years, researchers have proposed a variety of models to detect numerous attacks connected to PDF files [11,12]. This study presents an analysis of several ML models: Average One Dependency Estimator (A1DE), K-Nearest Neighbor (KNN), Support Vector Machine (SVM) [13], Naive Bayes (NB), and Random Forest (RF) [14]. The models are compared on F1-score, precision, recall, and accuracy. The primary objective of this study is to develop a malware detection model capable of safeguarding systems against harmful actions caused by PDF malware.

The remaining sections of this study are organized as follows: the literature is reviewed in Section 2, the methodology and results are covered in Sections 3 and 4, and the study is concluded in Section 5.

2 Literature Review

Numerous studies have applied ML and deep learning (DL) models to the identification of PDF malware. Kang et al. described the use of the PDF in 2019 [15]. They gave a thorough analysis of the JavaScript structure and content in PDFs with embedded XML. They then built a variety of features covering encoding methods for content, file size, category, versions, content properties, item names, keywords, and readable strings in JavaScript. Chen et al. described approaches to training robust PDF malware classifiers with verifiable robustness properties in 2019 [16]. For instance, no matter how many pages of benign content are inserted into the document, the classifier must still identify PDF malware as malicious. They demonstrate how to accurately evaluate the worst-case behavior of a malware classifier with respect to particular robustness properties.

In several studies, ML approaches have been used to develop classifiers for PDF malware. Two earlier efforts that focused on the malicious JavaScript present in PDF malware were Wepawet [17] and Laskov et al. [18].

Khitan et al. [19] proposed attributes based on the lexical features of JavaScript code as well as on functions, constants, objects, methods, and keywords. Zhang et al. [20] combined the JavaScript object count, page count, and stream filtering information with the PDF structure, entity characteristics, metadata information, and content statistics. Following the observation that malicious JavaScript behaves differently from legitimate JavaScript code, Liu et al. [21] suggested a context-aware approach. This approach involves utilizing the original JavaScript code as input to the “eval” function to open the PDF file, while closely monitoring for any unusual behavior based on the given instructions.

According to Herrera-Silva et al. [22], cyberattacks using ransomware have increased over the past ten years, causing great concern among organizations, so it is critical to develop novel and enhanced techniques for detecting this type of malware. Their work employs machine learning and dynamic analysis to identify continually evolving ransomware signatures using a small number of dynamic features. Because most of these characteristics are shared by a variety of ransomware-affected samples, the approach can be used to identify current and even novel versions of the threat.

Dhalaria et al. present a hybrid method for detecting and classifying Android malware [23]. The proposed method combines static and dynamic analysis techniques to identify malicious applications and classify them into different malware families. The authors train machine learning models for malware detection and family classification using features taken from both the static and dynamic behaviours of Android apps. Experimental results demonstrate the effectiveness of the hybrid approach in accurately detecting and classifying Android malware, thereby contributing to the field of mobile security and aiding in the prevention of malicious activities on Android devices. Deore et al. presented a novel approach for detecting malware using a Faster Region Proposals Convolutional Neural Network (FRCNN) [24]. The proposed MDFRCNN model aims to increase the accuracy and efficiency of malware detection by identifying and classifying malicious regions within digital content. The authors conduct experiments and evaluate the performance of their model using various datasets, demonstrating its effectiveness in detecting malware in real-world scenarios.

3 Methodology

This study compares several ML models and model training criteria to find a better solution for PDF malware detection. The ML models are A1DE, NB, KNN, RF, and SVM; the training criteria are percentage splitting, with 70% of the data for training and 30% for testing, and 10-fold cross-validation. The overall methodology is presented in Fig. 1.

Figure 1: Methodology flow chart
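The two training and testing criteria named above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the tooling used in the study: the helper names `percentage_split` and `k_fold_indices`, the fixed random seed, and the shuffling strategy are hypothetical, and only the 70%/30% ratio, K = 10, and the record count of 10,019 come from the paper.

```python
import random

def percentage_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_fraction))
    return indices[:cut], indices[cut:]

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k folds of nearly equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Percentage splitting: one fixed 70%/30% partition.
train_idx, test_idx = percentage_split(10019)
print(len(train_idx), len(test_idx))  # 7013 3006

# 10-fold cross-validation: ten partitions, each holding out a different fold.
fold_sizes = [len(test) for _, test in k_fold_indices(10019, k=10)]
print(fold_sizes)  # nine folds of 1002 records and one of 1001
```

With percentage splitting a model is scored once on a single held-out 30%, whereas with 10-fold cross-validation every record is used for testing exactly once and the ten scores are combined.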

3.1 Dataset Explanation and Preprocessing

We collected the PDF malware detection dataset from the Canadian Institute for Cyber-security: https://www.unb.ca/cic/datasets/pdfmal-2022.html. The dataset has 33 attributes, 32 of which are independent and 1 of which is dependent. The first 11 attributes were eliminated since they had no effect during the analysis stage. These attributes are collectively referred to as general features and comprise encryption, metadata size, page number, header, picture number, text, object number, font objects, number of embedded files, and average size of all embedded media. The data are already clean, so no further preprocessing was needed.

For further analysis, we need to select the features that are best suited to the task. To this end, we select features from the structural features, which describe the PDF file in terms of its structure; they require more thorough processing and reveal information about the PDF’s general framework.

We employed the Classifier Attribute Evaluator technique with the ZeroR classifier and the Ranker search method to select these features. For accuracy estimation, 5 folds were used. The selected features are ranked as:


Selected attributes: “21,7,8,10,6,5,4,3,2,9,11,20,18,19,12,17,16,15,14,13,1: 21”.

The following attributes are present, in that order: Colours, encrypt, JS, XFA, startxref, trailer, xref, endstream, stream, endobj, launch, OpenAction, AA, EmbeddedFile, JBIG2Decode, Acroform, pageno, ObjStm, Javascript, RichMedia, and obj. Table 1 lists the selected features with their descriptions.
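The idea of ranking attributes by how well a classifier does with each attribute alone can be sketched in Python as follows. This is a loose analogue and not the authors' procedure: the names Classifier Attribute Evaluator, ZeroR, and Ranker appear to correspond to components of the Weka toolkit, which is not reproduced here. Because a majority-class baseline such as ZeroR ignores attribute values, the sketch swaps in a one-feature threshold rule so that a ranking becomes visible. The helper names, the toy data, and the random seed are all hypothetical; only the use of 5 folds comes from the text.

```python
import random

def predict(value, threshold, direction):
    """One-feature rule: direction 1 predicts class 1 above the threshold."""
    above = value > threshold
    return int(above) if direction == 1 else int(not above)

def fit_stump(train):
    """Pick the (threshold, direction) with the most correct training predictions."""
    best = (-1, None, None)
    for t in sorted({v for v, _ in train}):
        for d in (0, 1):
            correct = sum(predict(v, t, d) == y for v, y in train)
            if correct > best[0]:
                best = (correct, t, d)
    return best[1], best[2]

def feature_merit(values, labels, k=5, seed=0):
    """k-fold cross-validated accuracy of the threshold rule on one feature."""
    idx = list(range(len(values)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for i in range(k):
        train = [(values[j], labels[j])
                 for f, fold in enumerate(folds) if f != i for j in fold]
        t, d = fit_stump(train)
        correct += sum(predict(values[j], t, d) == labels[j] for j in folds[i])
    return correct / len(values)

def rank_features(columns, labels, k=5):
    """Return feature names sorted from highest to lowest merit."""
    merits = {name: feature_merit(vals, labels, k) for name, vals in columns.items()}
    return sorted(merits, key=merits.get, reverse=True)

# Toy data (invented): 0 = benign, 1 = malicious.
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
columns = {
    "Colors": [1] * 10,                     # constant, so uninformative
    "JS": [0, 0, 0, 0, 0, 1, 2, 1, 3, 2],   # separates the two classes
}
print(rank_features(columns, labels))  # ['JS', 'Colors']
```

The merit of a feature here is simply its cross-validated accuracy when used on its own, so informative features rise to the top of the list.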

Table 1: Selected features and description

| S. No. | Feature | Description |
|---|---|---|
| 1 | Xref | The stream’s size, as harmful code may be concealed within streams. |
| 2 | Trailer | Number of trailers inside the PDF. |
| 3 | Pageno | Malicious PDF files often include fewer pages, often just one blank page, because they don’t care how their material is presented. |
| 4 | Stream | Number of binary data sequences in the PDF. |
| 5 | Encrypt | Indicates whether or not the PDF file is password-protected. |
| 6 | ObjStm | Streams that contain additional objects. |
| 7 | Endstream | Keywords that signify the streams’ termination. |
| 8 | JS | Number of objects containing JavaScript code. |
| 9 | Obj | This might be a sign of an attempt to obfuscate. |
| 10 | Javascript | Number of objects that include JavaScript code, the most common type of characteristic. |
| 11 | AA | Specifies a specific action upon an event. |
| 12 | OpenAction | Specifies what action should be taken when a PDF file is opened. The majority of commonly encountered malicious PDF files employ this feature in combination with JavaScript. |
| 13 | endobj | PDFs enable a wide range of obfuscation methods, including string obfuscations in hex, octal, etc., that are frequently employed in evasion strategies. |
| 14 | Acroform | Form fields in PDF files created with Acrobat contain scripting that attackers might exploit. |
| 15 | Startxref | Number of “startxref” keywords, which indicate the location of the Xref table’s start. |
| 16 | JBIG2Decode | JBIG2Decode is a popular filter for encoding malicious content; nested filters can impede decoding and suggest evasion. |
| 17 | RichMedia | The number of rich media keywords shows the amount of Flash and embedded media. |
| 18 | Launch | The keyword “launch” refers to the act of executing a command or programme. |
| 19 | EmbeddedFile | PDF files can attach or embed other things, such as Word documents, photos, and more, which can be used maliciously. |
| 20 | XFA | XFA, an XML form architecture that permits scripting technologies that attackers may exploit, is found in some PDF files. |
| 21 | Color | Many color schemes are used in the PDF. |
| 22 | Class | Class label: benign or malicious. |
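Several of the features in Table 1 are described as counts of PDF keywords. The following minimal Python sketch shows what such a count could look like. It assumes a naive byte-level substring match and a hypothetical helper `count_keywords`; the paper does not state which extraction tool produced the dataset, and the sketch ignores obfuscated names (for example hex-escaped ones) and compressed object streams.

```python
# Hypothetical subset of the Table 1 keywords, written as PDF name tokens.
KEYWORDS = [
    b"/JS", b"/JavaScript", b"/OpenAction", b"/AA", b"/Launch",
    b"/EmbeddedFile", b"/XFA", b"/AcroForm", b"/JBIG2Decode",
    b"/RichMedia", b"/ObjStm", b"/Encrypt",
]

def count_keywords(pdf_bytes):
    """Map each keyword to its number of non-overlapping occurrences."""
    return {kw.decode("ascii"): pdf_bytes.count(kw) for kw in KEYWORDS}

# A tiny hand-written PDF fragment (invented for illustration) whose
# catalog triggers a JavaScript action when the file is opened.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog /OpenAction 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R >>\n%%EOF\n"
)

counts = count_keywords(sample)
print({name: n for name, n in counts.items() if n})
# {'/JS': 1, '/JavaScript': 1, '/OpenAction': 1}
```

In practice the file would be read with `open(path, "rb").read()`, and each count would become one column of the feature vector passed to a classifier.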

Owing to its portability, PDF has become the most widely used document format over time. Unfortunately, the ubiquity of PDFs and their sophisticated capabilities have made it possible for attackers to use them in a variety of ways. An attacker can take advantage of several crucial PDF properties to deliver a malicious payload. The dataset comprises 10,019 records, of which 5,551 are malicious and 4,468 are benign; it collects malicious samples that tend to evade the common significant features found in each class.

3.2 Model Training and Performance Evaluation

This study considers two testing methods. The first is percentage splitting of the dataset, where 70% is used for training and the remaining 30% for testing. The second is K-fold cross-validation, with K set to 10. The study presents a comparison of these testing methods. The ML techniques are compared using standard evaluation metrics: F1-score, precision, recall, and accuracy. These measures are calculated as:

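$$\text{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$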

Here, TP denotes the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives.
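Eqs. (1) to (4) can be checked numerically with a short Python sketch. The function name `classification_metrics` and the confusion-matrix counts below are invented for illustration and are not results from the study; the zero-denominator guards are an added assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Eqs. (1)-(4) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0          # Eq. (1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0             # Eq. (2)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0     # Eq. (3)
    accuracy = (tp + tn) / (tp + tn + fp + fn)                # Eq. (4)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Illustrative counts only: 90 malicious files caught, 15 missed,
# 10 benign files flagged by mistake, 85 benign files passed.
metrics = classification_metrics(tp=90, fp=10, tn=85, fn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# precision: 0.9000
# recall: 0.8571
# f1: 0.8780
# accuracy: 0.8750
```

Precision penalizes benign files flagged as malicious, recall penalizes malicious files that are missed, and the F1-score is the harmonic mean of the two.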

4 Results Analysis and Discussion

This section presents the outcomes achieved with the aforementioned ML models: A1DE, NB, SVM, KNN, and RF. The models are evaluated using F1-score, precision, recall, and accuracy. The study also considers two testing criteria: percentage splitting and K-fold cross-validation. For percentage splitting, we used 70% of the data for training and the remaining 30% for testing; for K-fold cross-validation, we set K to 10. Fig. 2 presents the precision, recall, and F1-score of each technique under the first testing criterion (70% for training and 30% for testing), and Fig. 3 presents the same measures under the second criterion (10-fold cross-validation). The analysis indicates that 10-fold cross-validation is the better choice for testing than the 70%/30% split. It also shows that KNN outperforms the other models under 10-fold cross-validation, whereas under percentage splitting KNN and RF show the same performance.

Figure 2: Precision, recall, and F1-score analysis using percentage splitting criterion

Figure 3: Precision, recall, and F1-score analysis using 10-fold cross-validation

Figs. 4 and 5 present the accuracy of each model under the two testing criteria, percentage splitting and K-fold cross-validation, respectively. In both cases, KNN outperforms the other models, with an accuracy of 99.8499% under percentage splitting and 99.8599% under K-fold cross-validation. This analysis shows that, in the current scenario, K-fold cross-validation is the better criterion for training and testing the model. ML models predict output values from a variety of input values. KNN is one of the simplest ML algorithms and is typically employed for classification: it categorizes a data point according to how its nearest neighbors are categorized. KNN is also known as a lazy learner (instance-based learning). It learns nothing during the training period, and no discriminative function is generated from the training data; in other words, no training is required. It simply stores the training dataset and draws on it when making real-time predictions. The KNN technique is much quicker than the other techniques used, since they require training whereas KNN needs none before producing predictions, and fresh data can be incorporated effortlessly without affecting the technique’s accuracy.
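The lazy-learning behaviour described above can be made concrete with a minimal Python sketch of a KNN classifier. It assumes Euclidean distance, majority voting, and K = 3; the paper does not state the distance metric or the value of K that was used, and the function name `knn_predict` and the toy feature values are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbours."""
    # No model is fitted: prediction scans the stored training data directly.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Toy feature vectors (invented): (JS count, OpenAction count).
train_X = [(0, 0), (0, 1), (1, 0), (3, 1), (2, 1), (4, 2)]
train_y = ["benign", "benign", "benign",
           "malicious", "malicious", "malicious"]

print(knn_predict(train_X, train_y, (3, 2)))  # malicious
print(knn_predict(train_X, train_y, (0, 0)))  # benign

# Incorporating fresh data only means appending to the stored examples.
train_X.append((5, 1))
train_y.append("malicious")
```

There is no separate training step here, which mirrors the point made in the text; each prediction, however, compares the query against every stored example.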


Figure 4: Accuracy analysis using percentage splitting testing criteria

Figure 5: Accuracy analysis using 10-fold cross-validation testing criteria

Overall, the findings reveal that the ML models perform consistently well under both assessment methods, demonstrating their efficiency in correctly predicting the target variable. Because the data partitioning and model training differ between the two methods, the accuracy values produced by percentage splitting and 10-fold cross-validation differ slightly. Nevertheless, the models’ excellent accuracy demonstrates their potential for accurate predictions in the present setting.

The paper contributes to the field of cyber-security by providing insights into the effectiveness of

ML models for detecting malware in PDF files. The comparative analysis and performance evaluation

contribute to the development of robust systems for protecting against malicious activities associated

with PDF malware. The study demonstrates the effectiveness of ML models in accurately detecting

PDF malware and provides insights for developing robust systems to protect against malicious

activities. The findings suggest that KNN is a promising model for PDF malware detection, but further

research and experimentation may be required to validate and improve the results.

5 Conclusion and Future Work

In this research paper, we conducted a comparative analysis of machine learning (ML) models for PDF malware detection, focusing on the A1DE, NB, SVM, KNN, and RF models. We utilized a dataset obtained from the Canadian Institute for Cyber-security and employed two testing criteria: percentage splitting and 10-fold cross-validation. Our evaluation was based on precision, recall, F1-score, and accuracy metrics. The results showed that KNN outperformed the other models, achieving an accuracy of 99.8599% using 10-fold cross-validation. These findings highlight the effectiveness of ML models in accurately detecting PDF malware and provide insights for developing robust systems to protect against malicious activities in PDF files. The research contributes to enhancing cyber-security measures by providing a reliable model for PDF malware detection, which can help in preventing the proliferation of sophisticated attacks through maliciously coded documents.

The paper has some limitations, such as a limited dataset and the lack of comparison with other approaches, although its methodology, performance evaluation, and comparative analysis contribute to its validity. Further research and validation using diverse datasets and comparison with alternative methods are necessary to strengthen the findings.

Moving forward, further research in the field of PDF malware detection should focus on several

key areas. First, investigating deep learning methods such as convolutional neural networks (CNNs) and recurrent

neural networks (RNNs) may improve the performance and accuracy of the models. Additionally,

incorporating natural language processing (NLP) techniques to analyze the textual content within

PDF files could provide valuable insights for malware detection. Moreover, the development of real-

time detection systems that can analyze PDF files on the fly and detect emerging threats in a timely

manner would be highly beneficial. Lastly, collaboration between researchers, industry professionals,

and cyber-security organizations is crucial to gather large-scale, diverse datasets for training and

testing purposes, ensuring the models are robust and effective against a wide range of PDF malware

variants.

Declarations: We hereby declare that the research paper titled “Comparative Analysis of Machine

Learning Models for PDF Malware Detection: Evaluating Different Training and Testing Criteria,”

submitted for publication, is our original work and has not been submitted elsewhere for any academic

or non-academic purpose. We confirm that all the sources used in this research paper have been

appropriately cited and acknowledged. Any references, data, or ideas obtained from other authors

are duly attributed and documented in the references section.

Acknowledgement: We would like to express our sincere gratitude to all those who contributed to the

successful completion of this research paper, “Comparative Analysis of Machine Learning Models

for PDF Malware Detection: Evaluating Different Training and Testing Criteria”. We are thankful

to the research team members, Bilal Khan (BK), Muhammad Arshad (MA), and Sarwar Shah Khan

(SSK), for their collaboration and valuable insights throughout the research process. We extend our

appreciation to the institutions that supported this research work. We are deeply grateful to the

Canadian Institute for Cyber-security for providing the dataset that formed the foundation of our

analysis. Their contributions have been instrumental in enabling us to conduct this study on PDF

malware detection and assess the performance of various ML models. Our heartfelt appreciation goes

to all the individuals and institutions that reviewed and provided constructive feedback on this research

paper. Your valuable input helped improve the quality and rigor of our work. Without the collective

efforts and support of all these individuals and organizations, this research paper would not have been

possible. Thank you all for your invaluable contributions.

Funding Statement: No specific grant from a funding organization supported this research.

Author Contributions: The authors confirm contribution to the paper as follows: study conception and

design: BK; data collection: BK and MA; analysis and interpretation of results: SSK and BK; draft


manuscript preparation: MA and SSK. All authors reviewed the results and approved the final version

of the manuscript.

Availability of Data and Materials: The data used in this research paper, “Comparative Analysis of

Machine Learning Models for PDF Malware Detection: Evaluating Different Training and Testing

Criteria,” is obtained from the Canadian Institute for Cyber-security and is publicly available for

research purposes. The dataset can be accessed at the following URL: https://www.unb.ca/cic/datasets/pdfmal-2022.html.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the

present study.

References

[1] Y. S. Jeong, J. Woo and A. R. Kang, “Malware detection on byte streams of pdf files using convolutional

neural networks,” Security and Communication Networks, vol. 2019, pp. 144–152, 2019.

[2] B. Cuan, A. Damien, C. Delaplace and M. Valois, “Malware detection in PDF files using machine learning,”

in ICETE 2018-Proc. of the 15th Int. Joint Conf. on e-Business and Telecommunications, Porto, Portugal,

vol. 2, pp. 412–419, 2018.

[3] A. Falah, S. R. Pokhrel, L. Pan and A. de Souza-Daw, “Towards enhanced PDF maldocs detection with

feature engineering: Design challenges,” Multimedia Tools and Applications, vol. 81, no. 28, pp. 41103–

41130, 2022.

[4] W. Xu, Y. Qi and D. Evans, “Automatically evading classifiers,” in Proc. of the 23rd Annual Network and

Distributed System Security Symp., San Diego, California, vol. 2016, no. February, pp. 21–24, 2016.

[5] S. Sibi Chakkaravarthy, D. Sangeetha and V. Vaidehi, “A survey on malware analysis and mitigation

techniques,” Computer Science Review, vol. 32, pp. 1–23, 2019.

[6] F. J. Abdullayeva and S. S. Ojagverdiyeva, “Multicriteria decision making using analytic hierarchy process

for child protection from malicious content on the internet,” International Journal of Computer Network &

Information Security, vol. 13, no. 3, pp. 52–61, 2021.

[7] B. Wickman, H. Hu, I. Yun, D. Jang, J. W. Lim et al., “Preventing Use-After-Free attacks with fast forward

allocation,” in 30th USENIX Security Symp. (USENIX Security 21), Vancouver, Canada, pp. 2453–2470,

2021.

[8] M. Templ and M. Sariyar, “A systematic overview on methods to protect sensitive data provided for various

analyses,” International Journal of Information Security, vol. 21, no. 6, pp. 1233–1246, 2022.

[9] W. Syafitri, Z. Shukur, U. Asma’Mokhtar, R. Sulaiman and M. A. Ibrahim, “Social engineering attacks

prevention: A systematic literature review,” IEEE Access, vol. 10, pp. 39325–39343, 2022.

[10] H. Ahmad, I. Dharmadasa, F. Ullah and M. A. Babar, “A review on C3I systems’ security: Vulnerabilities,

attacks, and countermeasures,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–38, 2023.

[11] W. Li, W. Meng, Z. Tan and Y. Xiang, “Design of multi-view based email classification for IoT systems via

semi-supervised learning,” Journal of Network and Computer Applications, vol. 128, pp. 56–63, 2019.

[12] Y. Li, X. Wang, Z. Shi, R. Zhang, J. Xue et al., “Boosting training for PDF malware classifier via active

learning,” International Journal of Intelligent Systems, vol. 37, no. 4, pp. 2803–2821, 2022.

[13] S. S. Khan, M. Khan, S. Technology and R. Naseem, “Challenges in opinion mining, comprehensive,” A

Science and Technology Journal, vol. 33, no. 11, pp. 123–135, 2018.

[14] T. Tsafrir, A. Cohen, E. Nir and N. Nissim, “Efficient feature extraction methodologies for unknown MP4-

malware detection using machine learning algorithms,” Expert Systems with Applications, vol. 219, pp.

119615, 2023.

[15] A. R. Kang, Y. S. Jeong, S. L. Kim and J. Woo, “Malicious PDF detection model against adversarial attack

built from benign PDF containing javascript,” Applied Sciences, vol. 9, no. 22, pp. 4764, 2019.


[16] Y. Chen, S. Wang, D. She and S. Jana, “On training robust PDF malware classifiers,” in 29th USENIX

Security Symp. (USENIX Security 20), Boston, USA, pp. 2343–2360, 2020.

[17] M. Cova, C. Kruegel and G. Vigna, “Detection and analysis of drive-by-download attacks and malicious JavaScript code,” in Proc. of the 19th Int. Conf. on World Wide Web, North Carolina, USA, pp. 281–290, 2010.

[18] P. Laskov and N. Šrndić, “Static detection of malicious JavaScript-bearing PDF documents,” in Proc. of the 27th Annual Computer Security Applications Conf., Orlando, Florida, USA, pp. 373–382, 2011.

[19] S. J. Khitan, A. Hadi and J. Atoum, “PDF forensic analysis system using YARA,” International Journal of Computer Science and Network Security, vol. 17, no. 5, pp. 77–85, 2017.

[20] J. Zhang, “MLPdf: An effective machine learning based approach for PDF malware detection,” pp. 1–6, 2018. [Online]. Available: https://arxiv.org/pdf/1808.06991.pdf

[21] D. Liu, H. Wang and A. Stavrou, “Detecting malicious JavaScript in PDF through document instrumentation,” in 2014 44th Annual IEEE/IFIP Int. Conf. on Dependable Systems and Networks, Atlanta, USA, pp. 100–111, 2014.

[22] J. A. Herrera-Silva and M. Hernández-Álvarez, “Dynamic feature dataset for ransomware detection using machine learning algorithms,” Sensors, vol. 23, no. 3, pp. 1053, 2023.

[23] M. Dhalaria and E. Gandotra, “A hybrid approach for Android malware detection and family classification,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 6, no. 6, pp. 174–188, 2020.

[24] M. Deore and U. Kulkarni, “MDFRCNN: Malware detection using faster region proposals convolution neural network,” International Journal of Interactive Multimedia and Artificial Intelligence, vol. 7, no. 4, pp. 146–162, 2022.
