
Invasive weed optimization with stacked long short term memory for PDF malware detection and classification

Abstract---Due to their high versatility and widespread adoption, PDF documents are widely exploited by cyber criminals for launching attacks. PDFs have long been used as an effective vehicle for spreading malware. Automated detection and classification of PDF malware are therefore essential to ensure security. Recent developments in artificial intelligence (AI) and deep learning (DL) models pave the way for automated detection of PDF malware. In this view, this article develops an Invasive Weed Optimization with Stacked Long Short Term Memory (IWO-S-LSTM) technique for PDF malware detection and classification. The presented IWO-S-LSTM model focuses on the recognition and classification of different kinds of malware that exist in PDF documents. The proposed IWO-S-LSTM model first pre-processes the data in two stages, namely categorical encoding and null value removal. Besides, an autoencoder (AE) based outlier detection approach is presented to remove outliers. In addition, the S-LSTM model is utilized to detect and classify PDF malware. Finally, the IWO algorithm is applied to fine tune the hyperparameters involved in the S-LSTM model. To determine the enhanced outcomes of the IWO-S-LSTM model, a series of simulations was executed on two benchmark datasets. The experimental outcomes highlighted the promising performance of the IWO-S-LSTM technique over the other approaches.
How to Cite:
Chandran, P. P., Hema, R. N., & Jeyakarthic, M. (2022). Invasive weed optimization with stacked long
short term memory for PDF malware detection and classification. International Journal of Health
Sciences, 6(S5), 4187-4204. https://doi.org/10.53730/ijhs.v6nS5.9540
International Journal of Health Sciences ISSN 2550-6978 E-ISSN 2550-696X © 2022.
Manuscript submitted: 27 Feb 2022, Manuscript revised: 9 April 2022, Accepted for publication: 18 June 2022
P. Pandi Chandran
Department of Computer and Information Science, Faculty of Science, Annamalai
University, Chidambaram, 608002, India
Corresponding author email: pandi.chandran@gmail.com
Hema Rajini. N
Department of CSE, Alagappa Chettiar Government College of Engineering and
Technology, Karaikudi, 630003, India
M. Jeyakarthic
Department of Computer and Information Science, Annamalai University,
Chidambaram, 608002, India
Keywords---PDFs, Malware detection, Outlier removal, Machine
learning, Deep learning, Invasive Weed optimization.
1. Introduction
PDF is one of the most popular and trusted document formats, and Adobe Reader is
the program most commonly used to open such files [1]. This popularity
encourages attackers to search for vulnerabilities and to devise new exploits
that execute arbitrary code when a file is opened with this software. Attackers
frequently embed malware in digital documents [2], including Microsoft Word and
Adobe PDF files. The PDF document has become one of the file formats most
commonly exploited in targeted attacks, and new vulnerabilities continue to be
exploited by targeted attackers. Because system compromise is undesirable [3,
4], tools and methodologies are required to determine the disposition of a
document, that is, whether it is malicious or benign. A number of studies have
shown that malicious documents are widely employed in socially engineered
phishing attacks carried out by groups of targeted, sophisticated, and
persistent attackers whose aim is espionage [5].
The use of trojan PDFs in targeted attacks, including those carried out by
groups of attackers known as Advanced Persistent Threats (APTs), adds urgency to
countering this kind of malware delivery [6]. The method of delivery and
exploitation differs significantly from case to case: in certain cases, the
document is merely used to exploit vulnerabilities in the reader application and
offers little value in terms of social engineering. For example, certain classes
of web-based attacks, namely those that leverage cross-site scripting, proceed
in this way [7]. There are several techniques for malicious document detection.
Signature matching is extensively used and is efficient for identifying
previously recognized malware on a large scale. Another technique is dynamic
analysis of the document, in which the behavior of the whole system or of a set
of programs is observed when the document is opened. In practice, signature
matching systems, including commodity antivirus scanners [8], provide
functionality specific to PDF documents so that the documents can be decoded to
reveal malicious content and document-specific exploits.
Various techniques for classifying documents have been pursued. One method is
to check for anomalies in static features extracted from the document [9]. Such
methods for detecting malicious documents require training or seeding with
characterizations of previously encountered benign or malicious documents.
Alternative methods work by monitoring the runtime behavior of a document viewer
for unexpected actions as it renders the document [10]. For example, popular
antivirus systems rely on curated databases of byte signatures for identifying
malicious documents, whereas ML approaches rely on models trained on features
(dynamic execution artifacts, weighted byte n-grams, and so on) extracted from a
corpus that contains benign and malicious documents.
This article develops an Invasive Weed Optimization with Stacked Long Short
Term Memory (IWO-S-LSTM) model for PDF malware detection and classification.
The proposed IWO-S-LSTM model first pre-processes the data in two stages, namely
categorical encoding and null value removal. Moreover, an autoencoder (AE) based
outlier detection approach is presented to remove outliers. Furthermore, the
S-LSTM model, tuned by IWO, is used to detect and classify PDF malware. To
demonstrate the enhanced outcomes of the IWO-S-LSTM model, a series of
simulations was performed using two benchmark datasets and distinct measures.
2. Related works
Yoo et al. [11] presented an ML based hybrid decision method which attains a
high detection rate with a low false positive rate. This hybrid method
integrates a random forest and a DL technique with 12 hidden layers for
determining malware and benign files, respectively. Li et al. [12] proposed an
active-learning based malware detection method, which uses mutual agreement
analysis to choose uncertain instances for data augmentation. The detector is
retrained on the ground truth of the uncertain instances rather than on all the
test instances from the preceding epoch, which not only enhances the detection
performance but also decreases the training time of the detector. Jeong et al.
[14] aimed to develop a CNN for tackling malware detection on PDF files. The
study closely inspects the structure of the input data and shows how the design
of the proposed network depends on the characteristics of the data.
Zhang [16] examined a new method based on a multilayer perceptron (MLP) neural
network, named MLPdf, for the recognition of PDF based malware. More
specifically, the MLPdf technique uses a backpropagation (BP) algorithm with
stochastic gradient descent (SGD) search for model updating. Ye et al. [17]
presented a heterogeneous DL framework composed of an autoencoder (AE) stacked
with multilayer restricted Boltzmann machines (RBMs) and a layer of associative
memory for detecting previously unknown malware. The presented DL technique
performs greedy layer-wise training for unsupervised feature learning, followed
by supervised parameter fine-tuning. Liu et al. [18] presented a new approach
integrating malware visualization and image classification for inspecting PDF
files and recognizing the ones that may be malicious. Based on the signatures of
the objects in a file, colors obtained from SimHash are assigned to generate
RGB images.
3. The Proposed Model
In this article, a new IWO-S-LSTM model has been developed for the recognition
and classification of different kinds of malware that exist in PDF documents.
The input data is first pre-processed in two stages, namely categorical encoding
and null value removal. Next, AE based outlier detection is followed by S-LSTM
based classification. Finally, the IWO algorithm is used for fine tuning the
hyperparameters involved in the S-LSTM model. Fig. 1 illustrates the overall
process of the IWO-S-LSTM technique.
Fig. 1. Overall process of IWO-S-LSTM technique
3.1. Pre-processing
In the first phase, the input data is pre-processed in two stages, namely
categorical encoding and null value elimination. First, the categorical values
are encoded as numerical values. Second, the null values that exist in the
dataset are removed.
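The two pre-processing stages can be sketched in plain Python as follows. This is a minimal illustration rather than the authors' implementation: the record layout, the column names (`pages`, `has_js`, `label`), and the use of label encoding for the categorical stage are assumptions, since the article does not name the encoder it uses.

```python
from typing import Any

Row = dict[str, Any]


def encode_categoricals(rows: list[Row]) -> list[Row]:
    """Stage 1: replace every string value with an integer code (label encoding).

    Codes are assigned per column in sorted order of the observed categories.
    Missing values (None) are left untouched for the next stage.
    """
    codebooks: dict[str, dict[str, int]] = {}
    columns = {column for row in rows for column in row}
    for column in columns:
        categories = sorted({r[column] for r in rows
                             if isinstance(r.get(column), str)})
        if categories:
            codebooks[column] = {cat: code for code, cat in enumerate(categories)}
    return [
        {c: codebooks[c][v] if isinstance(v, str) else v for c, v in row.items()}
        for row in rows
    ]


def drop_null_rows(rows: list[Row]) -> list[Row]:
    """Stage 2: remove every record that contains at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def preprocess(rows: list[Row]) -> list[Row]:
    """Apply categorical encoding first, then null value removal."""
    return drop_null_rows(encode_categoricals(rows))
```

For example, with three hypothetical records of which one has a missing `pages` value, `preprocess` returns two fully numeric records.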
3.2. AE based Outlier Detection Process
For the removal of outliers, the AE model is used in this study. An AE is an
artificial neural network that is trained to learn an appropriate representation
(coding) of the input information so as to assure some desired property [19]. It
enables us to compare input and output, say $x$ and $\hat{x}$, and computes a
loss function according to their distance, for training the net to regenerate
the input vector. Fig. 2 showcases the structure of the AE. More precisely, the
encoder maps the input vector, $x$, to a hidden representation, $h$, as follows
$$h = f(W x + b) \tag{1}$$
where $f(\cdot)$ represents the activation function, $W$ denotes a weight
matrix, and $b$ indicates a bias vector. The decoder, in turn, maps the hidden
representation to the output reconstruction as follows
$$\hat{x} = g(W' h + b') \tag{2}$$
in which $g(\cdot)$, $W'$, and $b'$ have the corresponding meanings.
Fig. 2. Structure of AutoEncoder
Hence, each training input $x^{(n)}$ is mapped to a hidden representation
$h^{(n)}$ and regenerated as $\hat{x}^{(n)}$. The parameters of the AE,
$\theta = \{W, b, W', b'\}$, are learned by minimizing the average
reconstruction error between input and output, typically evaluated by the
average squared Euclidean distance over the $N$ training inputs
$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| x^{(n)} - \hat{x}^{(n)} \right\|^2 \tag{3}$$
3.3. S-LSTM based Classification
Once the outliers are removed, the next stage is to classify the PDF documents
using the S-LSTM model. The major concept behind LSTM is that gates controlling
the data flow along the time axis can capture long-term dependencies at every
time step. Specifically, at every time step $t$, the hidden state $h_t$ is
updated by fusing the data at the same step $x_t$, the input gate $i_t$, the
forget gate $f_t$, the output gate $o_t$, the memory cell $c_t$, and the hidden
state at the previous time step $h_{t-1}$:
$$
\begin{aligned}
i_t &= \sigma(W_i x_t + V_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + V_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + V_o h_{t-1} + b_o), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + V_c h_{t-1} + b_c), \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \tag{4}
$$
in which the model parameters, including $W \in \mathbb{R}^{d \times k}$,
$V \in \mathbb{R}^{d \times d}$, and $b \in \mathbb{R}^{d}$, are learned during
training and shared across all time steps, $\sigma$ denotes the sigmoid
activation, $\odot$ indicates the elementwise product, $k$ is the dimension of
the input $x_t$, and $d$ represents a hyperparameter that sets the dimension of
the hidden states [20]. LSTM was originally designed to handle time-series data.
The output at the final time step, $h_T$, is employed for predicting the output
through a linear regression layer, as follows:
$$\hat{y} = W_r h_T \tag{5}$$
where $W_r \in \mathbb{R}^{z \times d}$ and $z$ denotes the dimensionality of the
output. Here, the cross-entropy is used as the loss function between the target
label distribution $y^{(i)}$ and the predicted label distribution
$\hat{y}^{(i)}$. Hence, it is expressed as follows
$$\mathrm{loss} = -\sum_{i=1}^{z} y^{(i)} \log \hat{y}^{(i)} \tag{6}$$
The activation function allows the network to obtain a non-linear representation
of the input signal. The ReLU is expressed as follows:
$$a^{(i)} = f(h^{(i)}) = \max(0, h^{(i)}) \tag{7}$$
Here, $h^{(i)}$ and $a^{(i)}$ denote the output of the LSTM and the activation
value of $h^{(i)}$, respectively. With the tremendous growth of computer
hardware and the series of DL approaches that have been proposed, deep
architectures have shown an effective ability for feature self-learning. Thus,
stacking LSTM layers to build a deep LSTM-based NN is useful. The major concept
of a DNN is that non-linear mapping layers between inputs and outputs are
employed for learning features. The hidden state output by one LSTM layer is
propagated forward through time and is also used as the input of the following
LSTM layer. For layer $l$, this can be formulated by the following equation
$$
\begin{aligned}
i_t^{l} &= \sigma(W_i^{l} h_t^{l-1} + V_i^{l} h_{t-1}^{l} + b_i^{l}), \\
f_t^{l} &= \sigma(W_f^{l} h_t^{l-1} + V_f^{l} h_{t-1}^{l} + b_f^{l}), \\
o_t^{l} &= \sigma(W_o^{l} h_t^{l-1} + V_o^{l} h_{t-1}^{l} + b_o^{l}), \\
c_t^{l} &= f_t^{l} \odot c_{t-1}^{l} + i_t^{l} \odot \tanh(W_c^{l} h_t^{l-1} + V_c^{l} h_{t-1}^{l} + b_c^{l}), \\
h_t^{l} &= o_t^{l} \odot \tanh(c_t^{l}).
\end{aligned} \tag{8}
$$
The input of the first layer is the raw temporal signal, that is,
$h_t^{0} = x_t$, whereas the output of the first layer is an abstraction of the
raw signal, i.e., it is considered a hierarchical feature. The advantages of the
stacked LSTM are apparent: (1) the model parameters are distributed over the
layers without increasing the storage capacity, which speeds up convergence and
refines the non-linear operations on the raw information; (2) stacking LSTM
layers allows features of the raw temporal signal to be learned from distinct
aspects at every time step. Fig. 3 depicts the framework of the Stacked LSTM
model.
Fig. 3. Framework of Stacked LSTM Model
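The gate updates and the layer stacking described in this section can be illustrated with a short forward pass in NumPy. This is only a sketch under stated assumptions: the weights are random rather than trained, the helper names (`init_layer`, `lstm_step`, `stacked_lstm_forward`, `predict_proba`) are hypothetical, a softmax is assumed on top of the linear output layer so that the cross-entropy is well defined, and treating a 31-attribute record as a sequence of 31 scalar steps is an assumption, since the article does not say how feature vectors are fed to the S-LSTM.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_layer(input_dim, hidden_dim, rng):
    """One (W, V, b) triple per gate i, f, o and for the candidate cell c."""
    return {
        gate: (rng.normal(0.0, 0.1, (hidden_dim, input_dim)),
               rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),
               np.zeros(hidden_dim))
        for gate in "ifoc"
    }


def lstm_step(params, x_t, h_prev, c_prev):
    """Single LSTM update: returns the new hidden state and memory cell."""
    def affine(gate):
        W, V, b = params[gate]
        return W @ x_t + V @ h_prev + b

    i_t = sigmoid(affine("i"))
    f_t = sigmoid(affine("f"))
    o_t = sigmoid(affine("o"))
    c_t = f_t * c_prev + i_t * np.tanh(affine("c"))
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def stacked_lstm_forward(layers, xs):
    """Run a sequence xs of shape (T, k) through all layers and return the
    top-layer hidden state at the final time step."""
    hs = [np.zeros(p["i"][2].shape) for p in layers]
    cs = [np.zeros_like(h) for h in hs]
    for x_t in xs:
        layer_input = x_t
        for l, params in enumerate(layers):
            hs[l], cs[l] = lstm_step(params, layer_input, hs[l], cs[l])
            layer_input = hs[l]  # output of layer l feeds layer l + 1
    return hs[-1]


def predict_proba(W_r, h_T):
    """Linear output layer followed by an (assumed) softmax."""
    logits = W_r @ h_T
    e = np.exp(logits - logits.max())
    return e / e.sum()


def cross_entropy(y, y_hat):
    """Cross-entropy between a target distribution and a prediction."""
    return float(-np.sum(y * np.log(y_hat + 1e-12)))
```

A two-layer stack with eight hidden units per layer, for instance, is built with `[init_layer(1, 8, rng), init_layer(8, 8, rng)]`; the number of layers and units are the kind of hyperparameters that the next subsection tunes.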
3.4. IWO based Hyperparameter Optimization
To optimally tune the hyperparameters involved in the S-LSTM technique, the IWO
algorithm is used. IWO is a numerical stochastic search technique that simulates
the natural behavior of weed colonization to optimize a function. It has proved
efficient in converging to the optimum solution by using fundamental properties
of a weed colony, for example growth, seeding, and competition [21]. The basic
steps of the algorithm are given below
1. A fixed number of seeds is spread out through the search region.
2. Each seed grows into a flowering plant and generates seeds depending on its
fitness.
3. The generated seeds are dispersed randomly through the search region and
grow into new plants.
4. This procedure repeats until the maximal number of plants is reached.
After that, only the plants with higher fitness survive and generate seeds,
and the others are removed. The procedure repeats until the maximal number
of iterations is reached, at which point the plant with the best fitness is
expected to be close to the optimum solution. The steps are described in
detail in the following.
Population Initialization: An initial population of solutions is spread out
through the multi-dimensional problem space at arbitrary positions.
Reproduction: Each plant in the population is permitted to generate seeds based
on its own fitness and on the colony's highest and lowest fitness: the number of
seeds a plant generates increases linearly from the minimal seed production
level to the maximal level.
Spatial Dispersal: Adaptation and randomness are introduced in this step. The
produced seeds are distributed arbitrarily through the search region by random
numbers with a mean equal to zero but with a varying variance. This guarantees
that the seeds are distributed randomly yet accumulate near the parent plant.
The standard deviation (SD) of the random function is reduced from a previously
determined initial value, $\sigma_{initial}$, to a final value,
$\sigma_{final}$, over the steps. In simulations, a non-linear variation has
shown satisfactory performance, as follows
$$\sigma_{iter} = \frac{(iter_{max} - iter)^{n}}{(iter_{max})^{n}} (\sigma_{initial} - \sigma_{final}) + \sigma_{final} \tag{9}$$
where $iter_{max}$ represents the maximal number of iterations, $\sigma_{iter}$
indicates the SD at the current step, and $n$ signifies the non-linear
modulation index. The experimental study in [21] suggests that IWO compares well
with other models: its performance was compared with that of other evolutionary
mechanisms, and its outcomes were reasonable for each test function. The IWO
method derives a fitness function (FF) for attaining improved PDF malware
classification performance of the S-LSTM technique. It yields a positive value
that represents the quality of a candidate solution. In this study, the
minimization of the classifier error rate is taken as the FF, as given in Eq.
(10). The best solution has the lowest error rate and the worst solution has the
highest error rate.
$$fitness(x_i) = ClassifierErrorRate(x_i) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100 \tag{10}$$
4. Experimental Validation
In this section, the performance of the IWO-S-LSTM model is validated using two
datasets, namely the CIC Evasive-PDFMal2022 [22] and Contagio [23] datasets. The
first dataset holds 10023 instances with 31 attributes and two classes. After
the outlier removal process, the number of instances becomes 5007. The second
dataset, Contagio, includes 9999 instances with 136 attributes and two classes.
Once the outliers are removed, the number of instances becomes 4382.
Fig. 4. CIC Evasive-PDFMal2022 Dataset a) CM of S-LSTM b) CM of IWO-S-LSTM
c) PRC of S-LSTM d) PRC of IWO-S-LSTM e) ROC of S-LSTM f) ROC of IWO-S-LSTM
Fig. 4 illustrates a set of results obtained by the S-LSTM and IWO-S-LSTM models
on the classification of PDF malware in the CIC Evasive-PDFMal2022 dataset. Fig.
4a indicates that the S-LSTM model has classified 4232 instances into the benign
class and 628 instances into the malicious class. Next, Fig. 4b shows that the
IWO-S-LSTM model has classified 4231 instances into the benign class and 686
instances into the malicious class. Then, Figs. 4c-4d show the precision-recall
curves of the S-LSTM and IWO-S-LSTM models. The figures report that the
IWO-S-LSTM model has obtained enhanced performance over the S-LSTM model.
Similarly, Figs. 4e-4f present the ROC analysis of the S-LSTM and IWO-S-LSTM
models. The results indicate that the IWO-S-LSTM model has attained high ROC
values.
Fig. 5. CIC Evasive-PDFMal2022 Dataset a) Accuracy of S-LSTM b) Loss of S-
LSTM c) Accuracy of IWO-S-LSTM d) Loss of IWO-S-LSTM
Fig. 5 illustrates the accuracy and loss analyses of the S-LSTM and IWO-S-LSTM
models on the classification of PDF malware in the CIC Evasive-PDFMal2022
dataset. Figs. 5a-5b show the accuracy and loss of the S-LSTM model. The results
show that the accuracy tends to increase and the loss tends to decrease as the
epoch count increases. Similarly, Figs. 5c-5d show the accuracy and loss of the
IWO-S-LSTM model, which follow the same trend.
Fig. 6. Contagio Dataset a) CM of S-LSTM b) CM of IWO-S-LSTM c) PRC of S-LSTM
d) PRC of IWO-S-LSTM e) ROC of S-LSTM f) ROC of IWO-S-LSTM
Fig. 6 demonstrates a set of results attained by the S-LSTM and IWO-S-LSTM
techniques on the classification of PDF malware in the Contagio dataset. Fig. 6a
shows that the S-LSTM approach has classified 4191 instances into the benign
class and 115 instances into the malicious class. Then, Fig. 6b shows that the
IWO-S-LSTM approach has classified 4191 instances into the benign class and 124
instances into the malicious class. Next, Figs. 6c-6d display the
precision-recall curves of the S-LSTM and IWO-S-LSTM techniques. The figures
report that the IWO-S-LSTM technique has attained improved performance over the
S-LSTM. Similarly, Figs. 6e-6f present the ROC analysis of the S-LSTM and
IWO-S-LSTM approaches. The results indicate that the IWO-S-LSTM technique has
achieved high ROC values.
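As a quick arithmetic check, the counts read from the confusion matrices can be related to accuracy. The sketch below assumes that the two counts reported for each model are the correctly classified benign and malicious instances (the diagonal of the confusion matrix) and that the totals are the post-outlier-removal sizes of 5007 and 4382 given at the start of this section. Under that assumption, the computed values agree with the accuracies reported in Tables 1 and 2.

```python
def accuracy_percent(correct_benign: int, correct_malicious: int, total: int) -> float:
    """Accuracy (%) = correctly classified instances / all instances * 100."""
    return 100.0 * (correct_benign + correct_malicious) / total


# (correct benign, correct malicious, instances after outlier removal)
counts = {
    ("CIC Evasive-PDFMal2022", "S-LSTM"): (4232, 628, 5007),
    ("CIC Evasive-PDFMal2022", "IWO-S-LSTM"): (4231, 686, 5007),
    ("Contagio", "S-LSTM"): (4191, 115, 4382),
    ("Contagio", "IWO-S-LSTM"): (4191, 124, 4382),
}

accuracies = {key: round(accuracy_percent(*value), 2)
              for key, value in counts.items()}

for (dataset, model), acc in accuracies.items():
    print(f"{dataset:24s} {model:11s} {acc:.2f}%")
# CIC Evasive-PDFMal2022: 97.06% (S-LSTM), 98.20% (IWO-S-LSTM)
# Contagio:               98.27% (S-LSTM), 98.47% (IWO-S-LSTM)
```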
Fig. 7. Contagio Dataset a) Accuracy of S-LSTM b) Loss of S-LSTM c) Accuracy of
IWO-S-LSTM d) Loss of IWO-S-LSTM
Fig. 7 shows the accuracy and loss analyses of the S-LSTM and IWO-S-LSTM
approaches on the classification of PDF malware in the Contagio dataset. Figs.
7a-7b illustrate the accuracy and loss of the S-LSTM model. The results show
that the accuracy tends to increase and the loss tends to decrease as the epoch
count increases. Likewise, Figs. 7c-7d display the accuracy and loss of the
IWO-S-LSTM model, which follow the same trend.
Table 1 provides the detailed PDF malware classification results of the S-LSTM
and IWO-S-LSTM models on the CIC Evasive-PDFMal2022 and Contagio datasets. The
experimental values indicate that the IWO-S-LSTM model has obtained improved
accuracy, precision, recall, and F1-score over the S-LSTM model on both
datasets.
Table 1 Result analysis of S-LSTM and IWO-S-LSTM technique on two datasets
Measures (%)   CIC Evasive-PDFMal2022      Contagio Dataset
               S-LSTM     IWO-S-LSTM       S-LSTM     IWO-S-LSTM
Accuracy       97.06      98.20            98.27      98.47
Precision      98.05      98.65            99.11      99.21
Recall         90.68      94.43            80.10      82.46
F1-Score       93.91      96.40            87.13      88.97
AUC-Score      98.83      98.77            99.56      99.46
Fig. 8. Result analysis of S-LSTM and IWO-S-LSTM approach on CIC Evasive-
PDFMal2022 dataset
Fig. 8 presents the classification outcomes of the S-LSTM and IWO-S-LSTM models
on the CIC Evasive-PDFMal2022 dataset. The results indicate that the S-LSTM
model has obtained accuracy, precision, recall, F1-score, and AUC of 97.06%,
98.05%, 90.68%, 93.91%, and 98.83% respectively. The IWO-S-LSTM model has
attained improved accuracy, precision, recall, and F1-score of 98.20%, 98.65%,
94.43%, and 96.40% respectively, with a comparable AUC of 98.77%.
Fig. 9. Result analysis of S-LSTM and IWO-S-LSTM approach on Contagio dataset
Fig. 9 examines the classification outcomes of the S-LSTM and IWO-S-LSTM models
on the Contagio dataset. The results indicate that the S-LSTM model has gained
accuracy, precision, recall, F1-score, and AUC of 98.27%, 99.11%, 80.10%,
87.13%, and 99.56% respectively. The IWO-S-LSTM technique has accomplished
better accuracy, precision, recall, and F1-score of 98.47%, 99.21%, 82.46%, and
88.97% respectively, with a comparable AUC of 99.46%.
To report the enhanced performance of the IWO-S-LSTM model, a brief comparison
with recent methods is made in Table 2 [24]. Fig. 10 presents the accuracy
analysis of the IWO-S-LSTM model against recent models. The figure indicates
that the ridge regression, DT, and RF models have resulted in lower performance
with accuracy of 93.50%, 93.47%, and 93.19% respectively. Next, the AdaBoost and
SGDC models have provided slightly enhanced outcomes with accuracy of 95.82% and
95.68% respectively. Though the LR model has accomplished a reasonable accuracy
of 96.33%, the IWO-S-LSTM model has outperformed the compared ones with a higher
accuracy of 98.47%.
Table 2 Comparative analysis of IWO-S-LSTM technique with existing algorithms
Methods               Accuracy (%)   Precision (%)   AUC (%)
DT Model              93.47          92.92           93.46
RF Model              93.19          93.54           92.82
AdaBoost              95.82          96.65           95.71
Logistic Regression   96.33          95.73           96.20
Ridge Regression      93.50          93.20           93.71
SGDC Model            95.68          95.57           95.76
IWO-S-LSTM            98.47          99.21           99.46
Fig. 10. Accuracy analysis of IWO-S-LSTM technique with existing algorithms
Fig. 11 examines the precision and AUC analysis of the IWO-S-LSTM model against
current models. The figure indicates that the ridge regression, DT, and RF
models have resulted in low performance with the lowest values of precision and
AUC. Next, the AdaBoost and SGDC models have provided slightly improved values
of precision and AUC. Although the LR model has attained reasonable precision
and AUC of 95.73% and 96.20%, the IWO-S-LSTM model has outperformed the compared
ones with higher precision and AUC of 99.21% and 99.46% respectively.
Fig. 11. Comparative analysis of IWO-S-LSTM technique with existing algorithms
The above-mentioned results and discussion show that the IWO-S-LSTM model has
accomplished superior PDF malware detection and classification performance.
5. Conclusion
In this article, a new IWO-S-LSTM model has been developed for the recognition
and classification of different kinds of malware that exist in PDF documents.
The input data is first pre-processed in two stages, namely categorical encoding
and null value removal. Next, AE based outlier detection is followed by S-LSTM
based classification. Finally, the IWO algorithm is used for fine tuning the
hyperparameters involved in the S-LSTM model. To determine the enhanced outcomes
of the IWO-S-LSTM technique, a series of simulations was executed on two
benchmark datasets. The experimental outcomes highlighted the promising
performance of the IWO-S-LSTM model over the other approaches. In future, hybrid
metaheuristic algorithms can be developed for improving the overall
classification efficiency of the IWO-S-LSTM model.
References
[1] Rudra, B., 2021. Study of a hybrid approach towards malware detection in
executable files. SN Computer Science, 2(4), pp.1-7.
[2] Rathore, H., Agarwal, S., Sahay, S.K. and Sewak, M., 2018, December.
Malware detection using machine learning and deep learning.
In International Conference on Big Data Analytics (pp. 402-411). Springer,
Cham.
[3] Mercaldo, F. and Santone, A., 2020. Deep learning for image-based mobile
malware detection. Journal of Computer Virology and Hacking
Techniques, 16(2), pp.157-171.
[4] Yuxin, D. and Siyi, Z., 2019. Malware detection based on deep learning
algorithm. Neural Computing and Applications, 31(2), pp.461-472.
[5] Yen, Y.S. and Sun, H.M., 2019. An Android mutation malware detection
based on deep learning using visualization of importance from
codes. Microelectronics Reliability, 93, pp.109-114.
[6] Singh, P., Tapaswi, S. and Gupta, S., 2020. Malware detection in pdf and
office documents: A survey. Information Security Journal: A Global
Perspective, 29(3), pp.134-153.
[7] Cuan, B., Damien, A., Delaplace, C. and Valois, M., 2018, July. Malware
detection in pdf files using machine learning. In SECRYPT 2018-15th
International Conference on Security and Cryptography (p. 8p).
[8] Tian, D., Ying, Q., Jia, X., Ma, R., Hu, C. and Liu, W., 2021. MDCHD: A
novel malware detection method in cloud using hardware trace and deep
learning. Computer Networks, 198, p.108394.
[9] Iadarola, G., Martinelli, F., Mercaldo, F. and Santone, A., 2021. Towards an
interpretable deep learning model for mobile malware detection and family
identification. Computers & Security, 105, p.102198.
[10] Mohammed, T.M., Nataraj, L., Chikkagoudar, S., Chandrasekaran, S. and
Manjunath, B.S., 2021, November. HAPSSA: Holistic Approach to PDF
malware detection using Signal and Statistical Analysis. In MILCOM 2021-
2021 IEEE Military Communications Conference (MILCOM) (pp. 709-714).
IEEE.
[11] Yoo, S., Kim, S., Kim, S. and Kang, B.B., 2021. AI-HydRa: Advanced hybrid
approach using random forest and deep learning for malware
classification. Information Sciences, 546, pp.420-435.
[12] Li, Y., Wang, X., Shi, Z., Zhang, R., Xue, J. and Wang, Z., 2021. Boosting
training for PDF malware classifier via active learning. International Journal
of Intelligent Systems.
[13] Gandamayu, I. B. M., Antari, N. W. S., & Strisanti, I. A. S. (2022). The level
of community compliance in implementing health protocols to prevent the
spread of COVID-19. International Journal of Health & Medical Sciences,
5(2), 177-182. https://doi.org/10.21744/ijhms.v5n2.1897
[14] Jeong, Y.S., Woo, J. and Kang, A.R., 2019. Malware detection on byte
streams of pdf files using convolutional neural networks. Security and
Communication Networks, 2019.
[15] Rinartha, K., & Suryasa, W. (2017). Comparative study for better result on
query suggestion of article searching with MySQL pattern matching and
Jaccard similarity. In 2017 5th International Conference on Cyber and IT
Service Management (CITSM) (pp. 1-4). IEEE.
[16] Zhang, J., 2018. MLPdf: an effective machine learning based approach for
PDF malware detection. arXiv preprint arXiv:1808.06991.
[17] Ye, Y., Chen, L., Hou, S., Hardy, W. and Li, X., 2018. DeepAM: a
heterogeneous deep learning framework for intelligent malware
detection. Knowledge and Information Systems, 54(2), pp.265-285.
[18] Liu, C.Y., Chiu, M.Y., Huang, Q.X. and Sun, H.M., 2021, July. PDF Malware
Detection Using Visualization and Machine Learning. In IFIP Annual
Conference on Data and Applications Security and Privacy (pp. 209-220).
Springer, Cham.
[19] Cozzolino, D. and Verdoliva, L., 2016, December. Single-image splicing
localization through autoencoder-based anomaly detection. In 2016 IEEE
International workshop on information forensics and security (WIFS) (pp. 1-6).
IEEE.
[20] Yu, L., Qu, J., Gao, F. and Tian, Y., 2019. A novel hierarchical algorithm for
bearing fault diagnosis based on stacked LSTM. Shock and Vibration, 2019.
[21] Sedighy, S.H., Mallahzadeh, A.R., Soleimani, M. and Rashed-Mohassel, J.,
2010. Optimization of printed Yagi antenna using invasive weed
optimization (IWO). IEEE Antennas and Wireless Propagation Letters, 9,
pp.1275-1278.
[22] https://www.unb.ca/cic/datasets/pdfmal-2022.html
[23] https://github.com/srndic/mimicus/tree/master/data
[24] Damaševičius, R., Venčkauskas, A., Toldinas, J. and Grigaliūnas, Š., 2021.
Ensemble-Based Classification Using Neural Networks and Machine
Learning Models for Windows PE Malware Detection. Electronics, 10,
485. https://doi.org/10.3390/electronics10040485
... Machine learning and Deep Learning, which is based on data sets and a test-train split, is a flexible and robust method that can detect malware the system has never been encountered before.Neural Networks have had great results in models with automated parameters being set due to forward and backward propagation.Older research had suggested usage of machine learning models such as SVM ,RF,LMT, Naive Bayes,Bayes Net, and J4, [5], [3]. in recent times, there were positive results with models made with CNNs , ANNS and KNNs and RNNs such as LSTM and GRU in many different combinations for an efficient robust detection. [7], [9], [12], [14]. These gave light how machine learning and neural networks were efficient in detection rather than static or dynamic analysis. ...
Conference Paper
Full-text available
As the world progresses towards a digital era, the transfer of data in Portable Document Format (PDF) has become ubiquitous. Regrettably, this format is susceptible to malware attacks and the conventional anti-malware and anti-virus software may not be able to detect PDF malware effectively. In response to this problem, the implementation of machine learning algorithms and neural networks has been proposed in the past. However, the lack of transparency in these models raises concerns regarding their ethical and responsible decision-making. To address this concern, the utilization of Explainable AI (XAI) with the SHAP framework is proposed to classify PDF files as either malicious or clean, providing both a global and local understanding of the models’ decisions. The algorithms employed in this endeavor include Stochastic Gradient Descent (SGD), XGBoost Classifier, Single Layer Perceptron, and Artificial Neural Network (ANN).
Article
Full-text available
World health problems caused by corona virus or pandemic needs to get special attention from health practitioner, scientist and community. Some health protocols initiated by the Ministry of Health are by wearing masks, maintaining distance and washing hands, avoiding crowds and limiting mobility outside home. The purpose of this study was to determine the level of community compliance at Banjar Buluh and Banjar Sakih, Guwang village, Sukawati Gianyar in implementing health protocols in their daily routines. This study employed descriptive design with cross sectional approach. The setting of this reasearch was in the area of Vanjar Buluh and Banjar Sakih, Guwang village, Sukawati – Gianyar. There were 101 respondents recruited as sample of the study through accidental sampling. The data were collected by using questionnaire. The data were analized by using Statistical Package for the Social Science (SPSS for Windows, Release 23.0). The data analyzed included descriptive analysis (mean, standard deviation, frequency, percentage and range). The results showed that the majority of the respondents in Banjar Buluh and Banjar Sakih, Guwang village, Sukawati-Gianyar were classified as non-compliance in applying health protocols.
Conference Paper
Malicious PDF documents present a serious threat to various security organizations that require modern threat intelligence platforms to effectively analyze and characterize the identity and behavior of PDF malware. State-of-the-art approaches use machine learning (ML) to learn features that characterize PDF malware. However, ML models are often susceptible to evasion attacks, in which an adversary obfuscates the malware code to avoid being detected by antivirus software. In this paper, we derive a simple yet effective holistic approach to PDF malware detection that leverages signal and statistical analysis of malware binaries. This includes combining orthogonal feature space models from various static and dynamic malware detection methods to enable generalized robustness when faced with code obfuscations. Using a dataset of nearly 30,000 PDF files containing both malware and benign samples, we show that our holistic approach maintains a high detection rate (99.92%) of PDF malware and even detects new malicious files created by simple methods that remove the obfuscation conducted by malware authors to hide their malware, which are undetected by most antiviruses.
Article
With the ever-increasing number of Internet users in this digital age, exposure to malicious attacks is increasing. Every day, large volumes of malicious content are generated to exploit 0-day vulnerabilities. There is every possibility of downloading malicious files unintentionally, which could corrupt the system and user data. With the advancements in technology and growing dependence on digital data, malicious software detection has become a crucial task. The existing approaches need modifications to support and detect the latest attacks. Recently, artificial intelligence-based malicious file detection methods have been proposed. In the past, most of the works analyzed the executable file features and visual features from their corresponding images independently. Additionally, image-based analysis has been exploited for categorical classification, i.e., finding the family once it is known to be malware. We propose a CNN-based model that extracts visual features from malware images, which outperforms existing approaches on a benchmark dataset like MalImg. We study the effect of using a hybrid feature set containing these visual features integrated with statically obtained opcode frequencies for the detection of malware. Our experiments on standard datasets demonstrate that there is no significant performance improvement using this hybrid approach.
Article
The security of information is among the greatest challenges facing organizations and institutions. Cybercrime has risen in frequency and magnitude in recent years, with new ways to steal, change and destroy information or disable information systems appearing every day. Among the types of penetration into the information systems where confidential information is processed is malware. An attacker injects malware into a computer system, after which he has full or partial access to critical information in the information system. This paper proposes an ensemble classification based methodology for malware detection. The first-stage classification is performed by a stacked ensemble of dense (fully connected) and convolutional neural networks (CNN), while the final stage classification is performed by a meta-learner. For a meta-learner, we explore and compare 14 classifiers. For a baseline comparison, 13 machine learning methods are used: K-Nearest Neighbors, Linear Support Vector Machine (SVM), Radial basis function (RBF) SVM, Random Forest, AdaBoost, Decision Tree, ExtraTrees, Linear Discriminant Analysis, Logistic, Neural Net, Passive Classifier, Ridge Classifier and Stochastic Gradient Descent classifier. We present the results of experiments performed on the Classification of Malware with PE headers (ClaMP) dataset. The best performance is achieved by an ensemble of five dense and CNN neural networks, and the ExtraTrees classifier as a meta-learner.
Article
Current anti-malware technologies have demonstrated evident weaknesses in recent years because they rely on a signature-based approach. Many alternative solutions have been proposed in the state-of-the-art literature, but in general they suffer from a high false positive ratio and are usually ineffective when obfuscation techniques are applied. In this paper we propose a method that discriminates between malicious and legitimate samples in the mobile environment and identifies both the malware family a sample belongs to and the variant inside that family. We obtain gray-scale images directly from executable samples and gather a set of features from each image to build several classifiers. We evaluate the proposed solution on a data-set of 50,000 Android (24,553 malicious among 71 families and 25,447 legitimate) and 230 Apple (115 samples belonging to 10 families) real-world samples, obtaining promising results.
Article
With the development of cloud computing, more and more enterprises and institutes have deployed important computing tasks and data into virtualization environments. Virtualization security has become very important for cloud computing. When an attacker controls a victim’s virtual machine, he (or she) may launch malware for malicious purposes in that virtual machine. To defend against malware attacks in the cloud, many virtualization-based approaches have been proposed. However, the existing methods suffer from limitations in terms of transparency and performance cost. To address these issues, we propose MDCHD, a novel malware detection solution for virtualization environments. This method first utilizes the Intel Processor Trace (IPT) mechanism to collect the run-time control flow information of the target program. Then, it converts the control flow information into color images. By doing so, we can utilize a CNN-based deep learning method to identify malware from the images. To improve the performance of our detection mechanism, we leverage Lamport’s ring buffer algorithm. In this way, the control flow information collector and security checker can work concurrently. The evaluation shows that our approach can achieve acceptable detection accuracy with a minimal performance cost.
Article
Machine learning algorithms are widely used for cybersecurity applications, including spam and malware detection. In these applications, the machine learning model has to face attacks by adversarial samples. Therefore, how to train a robust machine learning model with few samples is a very active research problem. Portable Document Format (PDF) is a widely used file format, and is often utilized as a vehicle for malicious behavior. There have been various PDF malware detectors based on machine learning. However, the labeling of large-scale data samples is time-consuming and laborious. This paper aims to reduce the size of the training set while maintaining detection performance. We propose a novel PDF malware detection method, using active learning to boost training. Particularly, we first make clear the meaning of uncertain samples in this paper, and theoretically explain the effectiveness of these uncertain samples for malware detection. Second, we present an active-learning based malware detection model, using mutual agreement analysis to choose the uncertain samples as the data augmentation. The detector is retrained according to the ground truth of the uncertain samples rather than the whole test samples in the previous epoch, which can not only improve the detection performance, but also reduce the training time consumption of the detector. We conduct 10 epochs of retraining experiments for comparison, using the uncertain samples and the whole test samples from the previous epoch respectively as training set augmentation. The experimental results show that our active-learning based model can achieve the same performance as the traditional model in the tenth epoch of retraining, while the former only needs to use one thirtieth of the latter's training samples.
Article
Mobile devices are pervading the everyday activities of our lives. Each day we store a plethora of sensitive and private information in smart devices such as smartphones or tablets, which are typically equipped with an always-on internet connection. This information is of interest to malware writers, who are developing more and more aggressive harmful code for stealing sensitive and private information from mobile devices. Considering the weaknesses exhibited by current antimalware signature-based detection, in this paper we propose a method that represents applications as images, which are used as input to an explainable deep learning model designed by the authors for Android malware detection and family identification. Moreover, we show how the analyst can use explainability to assess different models. Experimental results demonstrated the effectiveness of the proposed method, obtaining an average accuracy ranging from 0.96 to 0.97; we evaluated 8446 Android samples belonging to six different malware families and one more family for trusted samples, while also providing interpretability about the predictions performed by the model.
Article
In 2018, with the internet being treated as a utility on equal grounds as clean water or air, the underground malicious software economy is flourishing with an influx of growth and sophistication in the attacks. The use of malicious documents has increased rapidly in the last five years along with a spectrum of attacks. They offer flexibility in document structure with numerous features for attackers to exploit. Despite efforts from industry and research communities, this remains a viable security threat. In this paper, a broad classification of attacks based on malicious documents is provided along with a detailed description of the attack opportunities available using Portable Document Format (PDF) and Office documents. Detailed structures of both file formats, state-of-the-art tools and the current research in automatic detection methods are discussed.
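Several of the abstracts above describe turning a binary sample into a gray-scale image before feeding it to a classifier. None of them states the exact mapping, so the following is only a minimal sketch of one common convention: each byte becomes one pixel intensity (0 to 255), laid out row by row. The function name `bytes_to_grayscale`, the fixed `width` parameter and the zero padding of the last row are assumptions made for illustration, not details taken from any of the listed papers.

```python
import math
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 16) -> List[List[int]]:
    """Lay out raw bytes as rows of pixel intensities (hypothetical helper).

    Each byte value (0-255) is used directly as one gray-scale pixel.
    The last row is zero-padded so that every row has `width` pixels.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    rows = math.ceil(len(data) / width)
    # Pad with zero bytes so the data fills a complete rows x width grid.
    padded = data + bytes(rows * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(rows)]


# Usage: ten bytes at width 4 give three rows, the last one padded with zeros.
image = bytes_to_grayscale(bytes(range(10)), width=4)
for row in image:
    print(row)
```

In practice the image width is often chosen from the file size, and the resulting grid is passed to an image library or a CNN input pipeline; those steps are left out here because the abstracts do not specify them.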