
A Supervised Machine Learning Algorithm for Detecting Malware

Olaniyi Abiodun Ayeni
Department of Cyber Security, School of Computing
Federal University of Technology, Nigeria
Abstract
The proliferation of malware is a threat to computing
systems and their security, which motivates the use of
machine learning for malware detection. This work was
motivated by the limitations of earlier work [1], [2],
such as ‘Malware Detection Module using Machine Learning
Algorithms’. The objective of this research is to
develop a security system for the detection of malware
using supervised machine learning algorithms and to
carry out a performance evaluation. Feature selection
(Filter method) was used to reduce a dataset of 100,000
rows and 35 feature columns to 20 features, and three
classifiers were then employed: K-Nearest Neighbor,
Decision Tree and Random Forest. The classifiers were
trained and tested using the dataset (malware.csv)
obtained from Malware Detection on Kaggle. The
accuracies of K-Nearest Neighbor, Decision Tree and
Random Forest are 96.53%, 97.79% and 99.90%,
respectively. The results were also compared with the
work of other researchers [3] who used the same three
classifiers: the results of Maqsood 2020 for Random
Forest, Decision Tree and K-Nearest Neighbor are
96.39%, 100% (overfit) and 99.4%, respectively, while
the results of Sarang et al. 2013 for Random Forest,
Decision Tree and K-Nearest Neighbor are 99.57%, 99.23%
and 99.06%, respectively. This indicates that Random
Forest is the most effective of the three classifiers
for malware detection using machine learning. Moreover,
the study can serve as a base for further research in
the field of malware analysis with machine learning
methods.
Keywords: Malware, Supervised Learning, Decision
Tree, K-Nearest Neighbour, Random Forest, Feature
Method, Computer Security
1. Introduction
The idea behind malware is commonly traced to 1949,
when John von Neumann laid out the theory of
self-replicating programs, and many malicious programs
have been created since. Antivirus companies are
continually searching for the most effective ways of
identifying malware, and one of the best-known methods
is signature-based detection. Furthermore, the skill
level required for malware development is decreasing
because of the large number of attack tools available
on the internet. The high availability of
anti-detection techniques and the ability to buy
malware on the black market give anyone the opportunity
to become an attacker, regardless of skill level.
Current studies show that more attacks are being
launched by script kiddies or are automated [4].
Therefore, protecting computers against malware is an
essential cybersecurity task for individual users and
businesses, since an attack can lead to compromised
data and significant losses. Massive losses and
frequent attacks also drive the need for accurate and
timely detection methods. However, current static and
dynamic methods do not provide efficient detection,
especially when dealing with zero-day attacks. Hence,
machine learning-based techniques can be used. This
paper therefore discusses the main points and concerns
of machine learning-based malware detection and looks
for the best feature representation and classification
methods.
2. Research Motivation
The existing literature on malware detection shows
that there is a need for an efficient malware detection
system, especially since use of the internet is
becoming increasingly important.
The existing framework ‘Malware Detection Module
using Machine Learning Algorithms to Assist in
Centralized Security in Enterprise Networks’ [2]
focuses only on detection and classification and
neglects home users because it is processor heavy.
Also, in ‘Detection of malicious code by applying
machine learning classifiers on static features’, the
models of Shabtai et al. [5] were not trained properly,
which resulted in running inefficient algorithms. This
provided the motivation to create a well-trained
machine learning malware detection system with high
accuracy and a low false-positive rate that can protect
a system by flagging incoming malicious files and
preventing them from affecting the computer.
2.1. Problem Statement
With the development of technology, the amount of
malware is increasing daily. Malware is now designed
with mutation characteristics, which causes enormous
growth in the number of variants.
Also, with the help of automated malware-generation
Journal of Internet Technology and Secured Transactions (JITST), Volume 10, Issue 1, 2022
Copyright © 2022, Infonomics Society | DOI: 10.20533/jitst.2046.3723.2022.0094
764
tools, novice malware authors can now quickly
generate a new variant. With this growth,
traditional signature-based malware detection has
proven ineffective against the vast variety of
malware. However, machine learning methods for
malware detection have proven effective against new
malware.
Objectives: The objectives of this research work are
to:
a) Design a security framework for malware
detection using a supervised machine learning
algorithm.
b) Implement the design in (a).
c) Evaluate the performance of the system.
2.2. Malware Types
i. Backdoor: It is a malware type that bypasses
standard authentication procedures to access a
system. As a result, remote access is granted to
resources within an application, such as databases
and file servers, giving perpetrators the ability to
issue system commands remotely and update
malware.
ii. Rootkit: Its functionality enables the attacker to
access data with higher permissions than allowed.
For example, it can be used to give an
unauthorized user administrative access. Rootkits
hide their existence and are often unnoticeable on
the system, making detection and, therefore,
removal very hard [6].
iii. Keylogger: The idea behind this malware class is
to log all the keys pressed by the user and store all
data, including passwords, bank card numbers, and
other sensitive information [7].
iv. Ransomware: This type of malware aims to
encrypt all the data on the machine and ask a victim
to transfer some money to get the decryption key.
Usually, a machine infected by ransomware is
“frozen” as the user cannot open any file, and the
desktop picture is used to provide information on the
attacker’s demands [8].
3. Related Works
In ‘A state-of-the-art survey of malware detection
approaches using data mining techniques’, Souri and
Hosseini [1] present a survey of malware detection
approaches divided into two categories:
i. Signature-based methods
ii. Behavior-based methods
However, the survey provides neither a review of the
most recent deep learning approaches nor a taxonomy of
the types of features used in data mining techniques
for malware detection and classification.
Motivation of the research: The research is motivated
by a serious present-day threat, malicious executables.
They are designed to damage computer systems, and some
of them spread over the network without the knowledge
of the system's owner.
Objective(s): To present a survey of malware
detection approaches.
3.1. Methodology
The survey provides a summary of the current
challenges related to malware detection approaches in
data mining, presents a systematic and categorized
overview of current machine learning mechanisms in
data mining, explores the structure of the methods
that are significant in malware detection, and
discusses the important factors of malware
classification approaches in data mining so that their
problems can be addressed in the future.
3.2. Contribution to Knowledge
It sheds more light on how to approach malware
detection using machine learning.
In the paper “Malware detection using statistical
analysis of byte-level file content”, the authors [9]
used several machine learning techniques to detect
malware files. The authors claimed that their
techniques can properly classify any malware
regardless of its obfuscation, using a multi-class
classification technique to detect seven classes
including benign. The novelty of the authors’ approach
is the ability to detect obfuscated and packed
malware. The difficulty in detecting obfuscated
malware lies in the obscured structure of the malware
file itself: the malware writer intentionally rewrites
the code of the file so that it is difficult for
antimalware software to catch. The content set
collected for the experiment totals 12,111 files
(1,800 benign and 10,311 malicious). However, only 50
files per class were used as the training set. The
features generated for the experiment are statistical
features derived from byte-sequence n-grams of the
executables.
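To make the idea of byte-sequence n-gram features concrete, the following is a minimal sketch in Python. It is not the feature set used in [9]; the two statistics computed here (number of distinct n-grams and the share of the most frequent one) and the sample byte string are illustrative assumptions.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_features(data: bytes, n: int = 2) -> dict:
    """Summarize the n-gram counts with two simple, illustrative statistics."""
    counts = byte_ngram_counts(data, n)
    total = sum(counts.values())
    if total == 0:  # input shorter than n bytes
        return {"distinct": 0, "max_share": 0.0}
    return {
        "distinct": len(counts),                       # how many different n-grams occur
        "max_share": max(counts.values()) / total,     # share of the most common n-gram
    }


# Hypothetical six-byte input, used only to exercise the functions.
sample = bytes([0x4D, 0x5A, 0x90, 0x00, 0x4D, 0x5A])
features = ngram_features(sample, n=2)
print(features)
```

In practice such statistics would be computed per executable and the resulting vectors passed to a classifier.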
‘A static malware detection system using data
mining methods’ proposed a feature extraction method
based on PE headers, DLLs, and API functions, with
classification methods based on Naive Bayes, J48
Decision Trees, and Support Vector Machines. The
highest overall accuracy was achieved with the J48
algorithm (99% with PE header feature type and hybrid
PE header and API function feature type, 99.1% with API
function feature type) [10].
Motivation of the research: The research is
motivated by a serious threat, malicious executables.
They are designed to damage computer systems, and
some of them spread over the network without the
knowledge of the system's owner.
Objective(s): The objective of the research is to
create a static malware detection system with a high
accuracy rate.
Methodology: A static malware detection system was
built using data mining techniques such as information
gain, principal component analysis, and three
classifiers: SVM, J48, and Naïve Bayes. To overcome
the shortcomings of conventional anti-virus products,
static analysis methods were used to extract valuable
features of Windows PE files.
Contribution to knowledge: A static malware
detection system with a detection rate of 99.6%
was created.
Limitations: It is not suitable for home users because
it is processor heavy.
In ‘Zero-day malware detection based on
supervised learning algorithms of API call
signatures’, API functions were again used for feature
representation. The best result was achieved with the
Support Vector Machines algorithm with a normalized
poly kernel: a precision of 97.6% with a false-positive
rate of 0.025 [11].
Motivation of the research: The research is
motivated by antivirus detectors being unable to
detect new malware.
Objective(s): To develop a machine learning
framework using eight different classifiers to detect
unknown malware and to achieve a high accuracy rate.
Methodology: Large data sets were used to train the
classifiers, and the performance of the various data
mining algorithms adopted for the study was analysed
using a fully automated tool developed in the research
to conduct the experimental investigations and
evaluation.
Contribution to knowledge: The machine learning
framework developed achieved a promising accuracy
rate of 98.5%.
Limitations: API call sequences can be extracted
from most devices but not all, which limits the
algorithm to some devices.
The paper ‘Survey on the usage of machine
learning techniques for malware analysis’ [12]
identifies three main methods for detecting malicious
software:
i. Signature-based methods
ii. Heuristic-based methods
iii. Behavior-based methods
In addition, the authors [12] investigate some
features for malware detection and discuss concealment
techniques used by malware to evade detection.
Nonetheless, the research considers neither dynamic
nor hybrid approaches.
Motivation of the research: The research was
motivated by malware analysis becoming more and more
challenging, given the relentless growth of malware in
complexity and volume.
Objective(s): It aims to provide an overview of the
way machine learning has been used so far in the
context of malware analysis in Windows
environments.
Methodology: For the analysis of portable
executables, surveyed papers were systematized
according to their objectives, what information about
malware they specifically use, and what machine
learning techniques they employ to process the input
and produce the output.
Contribution to knowledge: It provided an overview
of how machine learning algorithms can be
employed in malware analysis.
Limitations: It highlighted that models that are not
properly trained result in running inefficient
algorithms and making limited predictions.
4. Machine Learning Method
The machine learning process consists of the
following five stages:
i. Data intake. First, the dataset is loaded from the
file and saved in memory.
ii. Data transformation. At this point, the data that
was loaded in stage i is transformed, cleaned, and
normalized to be suitable for the algorithm. Data is
converted so that it lies in the same range and has
the same format. Feature extraction and selection,
which are discussed further below, are also performed
at this point. In addition, the data is separated into
a 'training set' and a 'test set'. Data from the
training set is used to build the model, which is later
evaluated using the test set.
iii. Model training. At this stage, a model is built
using the selected algorithm.
iv. Model testing. The model that was built or
trained during stage iii is tested using the test data
set, and the produced result is used for building a new
model that takes previous models into account, i.e.,
"learns" from them.
v. Model deployment. At this stage, the best model
is selected (either after the defined number of
iterations or as soon as the needed result is achieved).
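The data transformation stage above (bringing features into the same range and separating training and test sets) can be sketched as follows. This is a minimal standard-library illustration, not the code used in this work; the function names, the 80/20 split, the seed and the toy rows are assumptions.

```python
import random


def min_max_normalize(rows):
    """Rescale every column to [0, 1] so that all features share one range."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


def train_test_split(rows, labels, test_fraction=0.2, seed=0):
    """Shuffle the row indices and hold out a fraction of them as the test set."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(rows) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return (
        [rows[i] for i in train_idx], [rows[i] for i in test_idx],
        [labels[i] for i in train_idx], [labels[i] for i in test_idx],
    )


# Toy data (hypothetical): five samples, two numeric features, binary labels.
raw = [[0, 10], [5, 20], [10, 30], [2, 40], [8, 50]]
labels = [0, 1, 1, 0, 1]

scaled = min_max_normalize(raw)
X_train, X_test, y_train, y_test = train_test_split(scaled, labels)
print(len(X_train), len(X_test))
```

The sketch normalizes before splitting, following the order of the stages as described; a common alternative is to compute the scaling bounds on the training portion only, so that no information from the test set influences the model.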
5. Application of Machine Learning
Methods
The purpose of this section is to analyse the data
and use it to train the prediction model. Figure 1
shows a sample of the selected features after the
feature extraction method has been applied.
Figure 1. Sample of the selected features
After the features were extracted and selected, the
algorithms were applied to the data obtained. The
machine learning methods applied are K-Nearest
Neighbors, Decision Tree, and Random Forest. The
results of the models used are shown below.
The accuracy on the training and test data is
shown in Figure 2: 0.9667 on the training data and
0.9653 on the testing data.
Figure 2. Implementation of K-Nearest Neighbor
Algorithm
Figure 3 shows the accuracy on the training and
test data: 0.9793 on the training data and 0.9779 on
the testing data.
The accuracy on the training and test data is shown
in Figure 4: 0.9999 on the training data and 0.9999 on
the testing data.
Figure 3. Implementation of Decision Tree
Algorithm
Figure 4. Implementation of Random
Forest Algorithm
Figure 5. The confusion matrix for K-Nearest
Neighbor
Figure 6. The confusion matrix for
Decision Tree
Confusion Matrix: The confusion Matrix for K-
Nearest Neighbor, Decision Tree and Random Forest
is shown in Figure 5, Figure 6 and Figure 7, respectively.
Figure 7. The confusion matrix for
Random Forest
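As an illustration of how a confusion matrix relates to the reported accuracy, the sketch below derives the four confusion counts, the accuracy and the false-positive rate from true and predicted labels. The label vectors are hypothetical toy values, not the results shown in the figures, and the function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem; `positive` marks malware."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def false_positive_rate(tp, fp, tn, fn):
    """Fraction of benign samples that were wrongly flagged as malware."""
    return fp / (fp + tn)


# Hypothetical labels: 1 = malware, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(*counts), false_positive_rate(*counts))
```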
7. Result of the Dataset Analysis
The overall accuracy of the algorithm is
calculated below:
Table 1. Performance Evaluation Compared with
other Researchers Results
Figure 8. Algorithms performance analysis and
Graphical representation of the nslkdd.csv results
Table 1 shows the performance of our research
compared with that of other researchers who used the
same machine learning algorithms (RF, DT and
KNN). This work performed better than the other
two [13], [14].
Figure 8 shows that Random Forest performed
better than the other two machine learning
algorithms.
8. Discussion
Figure 1 shows the original features in the
dataset; after the feature selection method was
applied, the features were pruned to 20. Feature
selection was used to remove redundant and
irrelevant features to improve the accuracy of the
prediction. The Python programming language was
used to perform the feature selection and apply the
machine learning methods. The process reduced the
number of features from a total of 35 to 20. The
method used in this research is the Filter method.
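A filter method scores each feature independently of any classifier and keeps the highest-scoring ones. The text does not state which scoring statistic was used, so the sketch below assumes the absolute Pearson correlation between a feature and the class label; the column names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y) if std_x and std_y else 0.0


def select_top_k(columns, labels, k):
    """Rank features by |correlation with the label| and keep the best k names."""
    scores = {name: abs(pearson(vals, labels)) for name, vals in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns; f_b is constant and carries no information.
columns = {
    "f_a": [1, 2, 3, 4],
    "f_b": [5, 5, 5, 5],
    "f_c": [1, 1, 2, 4],
}
labels = [0, 0, 1, 1]

selected = select_top_k(columns, labels, k=2)
print(selected)
```

With 35 candidate features, the same call with k=20 would yield the reduced feature set.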
Data preprocessing (data encoding and checking
for missing data) was used to prepare (clean and
organize) the raw data to make it suitable for
building and training the machine learning models.
Data preprocessing is needed because it is the first
step of the process. Typically, real-world data is
incomplete, inconsistent, inaccurate (contains errors
or outliers), and often lacks specific attribute values
or trends. Preprocessing helps to clean, format, and
organize the raw data, making it ready for the machine
learning models.
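The two preprocessing steps named above, checking for missing data and data encoding, can be sketched as follows. The record layout and the column names ("classification", "prio") are hypothetical and are not taken from the dataset description in this paper.

```python
def find_missing(records):
    """Return (row_index, column_name) pairs whose value is None or empty."""
    return [
        (i, col)
        for i, row in enumerate(records)
        for col, value in row.items()
        if value is None or value == ""
    ]


def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    mapping = {v: code for code, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical raw records as they might be read from a CSV file.
records = [
    {"classification": "malware", "prio": "3"},
    {"classification": "benign", "prio": ""},
    {"classification": "malware", "prio": "1"},
]

missing = find_missing(records)
codes, mapping = label_encode([r["classification"] for r in records])
print(missing, codes, mapping)
```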
Figure 2 presents the result of the KNN
implementation, showing the accuracy on the training
and testing data. KNN is a non-parametric
algorithm, meaning that it does not make any
assumptions about the data structure. In real-world
problems, data rarely obeys the general theoretical
assumptions, making non-parametric algorithms a
good solution for such problems. The KNN model
representation is as simple as the dataset itself: no
learning is required, and the entire training set is
stored.
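Because KNN only stores the training set, prediction amounts to finding the k stored samples closest to a query and taking a majority vote. The sketch below assumes Euclidean distance and k = 3, neither of which is stated in the text; the data points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Distance from the query to every stored sample, nearest first.
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples: class 0 near the origin, class 1 near (5, 5).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, [0.2, 0.1]))
print(knn_predict(train_X, train_y, [5.5, 5.2]))
```

Every prediction scans the whole training set, which is the cost of having no training phase.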
Figures 3 and 4 show the results of implementing
the Decision Tree and Random Forest algorithms on
the dataset. Decision trees are data structures that
have the structure of a tree. The training dataset is
used for the creation of the tree, which is
subsequently used for making predictions on the test
data. In this algorithm, the goal is to achieve the
most accurate result with the least number of
decisions that must be made. The training accuracy
of DT was 0.979 while the testing accuracy was 0.977.
Random Forests are collections of decision trees
that produce better prediction accuracy. That is why
it is called a ‘forest’: it is basically a set of
decision trees. The basic idea is to grow multiple
decision trees based on independent subsets of the
dataset. At each node, n variables out of the feature
set are selected randomly, and the best split on these
variables is found. The training accuracy for RF was
0.99 and the testing accuracy was 0.997.
Figures 5, 6 and 7 show the confusion matrices of the
three machine learning algorithms deployed (KNN,
DT, RF). Figure 8 is the graphical representation
of the performance comparison of the three
algorithms, which shows that Random Forest is the
best of the three.
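The Random Forest idea described in this section (several trees grown on random subsets of the data, a random choice of candidate variables at each node, and a combined prediction) can be illustrated with a deliberately small sketch. As a simplifying assumption, each tree here is a one-node tree (a decision stump) chosen by Gini impurity, samples are drawn by bootstrap, and the trees vote by majority; none of these details or parameter values is taken from the implementation used in this work.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def best_stump(X, y, feature_ids):
    """Best (feature, threshold) split among the candidate features, or None."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left), majority(right))
    return best


def train_forest(X, y, n_trees=15, n_features=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        ids = [rng.randrange(len(X)) for _ in X]            # bootstrap sample of rows
        feats = rng.sample(range(len(X[0])), n_features)    # random candidate variables
        stump = best_stump([X[i] for i in ids], [y[i] for i in ids], feats)
        if stump is not None:
            forest.append(stump)
    return forest


def forest_predict(forest, row):
    """Majority vote over the predictions of all stumps."""
    votes = Counter(left if row[f] <= t else right for _, f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Hypothetical, clearly separated toy data: 0 = benign, 1 = malware.
X = [[1, 10], [2, 11], [3, 12], [8, 30], [9, 31], [10, 32]]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y)
print(forest_predict(forest, [0, 9]), forest_predict(forest, [12, 35]))
```

A full implementation would grow deeper trees and repeat the random variable selection at every node; the stump keeps the example short while preserving the sampling and voting steps.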
9. Conclusion
Malware detection is a major issue if our
computing systems and their infrastructure are to be
kept secure in this modern age. A dataset from
Kaggle.com was used for this research; the dataset,
which comprises 100,000 rows and 35 columns of
features, was reduced to 20 features by a feature
selection method to improve the accuracy of the
algorithms. The results showed that Random Forest is
the best classifier for our data, with 99.9% accuracy.
10. References
[1] Souri, A., and Hosseini, R. (2018). A state-of-the-art
survey of malware detection approaches using data mining
techniques.
[2] Singhal, P., and Nataasha, R. (2015). Malware
detection module using machine learning algorithms to
assist in centralized security in enterprise networks.
International Journal of Network Security and Its
Applications, 4(1), 61-67. https://arxiv.org/abs/1205.3062
(Access Date: 23 December 2021).
[3] Ali, J., Khan, R., Ahmad, N., and Maqsood, I.
(2012). Random forests and decision trees.
International Journal of Computer Science Issues (IJCSI),
9(5).
[4] Aliyev, V. (2010). Using honeypots to study skill level
of attackers based on the exploited vulnerabilities in the
network (Corpus ID 107947677) [Master’s thesis,
Chalmers University of Technology].
https://www.semanticscholar.org/paper/usinghoneypots-to-study-skill-level-of-attackersAliyev/62cea777b89e3cc069744f5201a46d64bcafbe0
(Access Date: 21 December 2021).
[5] Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C., and
Weiss, Y. (2009). Andromaly: a behavioral malware
detection framework for android devices.
[6] Chuvakin, A. (2003). An overview of unix rootkits.
iDEFENCE inc.
[7] Lopez, W., Humberto, G., Enio, P., Erick, B., and Juan,
S. (2013). Keyloggers. Florida International University.
[8] Savage, K., Peter, C., and Hon, L. (2015). The
evolution of ransomware. Symantec corporation.
http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-evolution-ofransomware.pdf
(Access Date: 21 December 2021).
[9] Tabish, M., Zubair Shafiq, M., and Farooq, M. (2009).
Malware detection using statistical analysis of byte-level
file content. In Proceedings of the ACM SIGKDD
Workshop on Cyber Security and Intelligence Informatics,
pages 23-31. ACM.
[10] Baldangombo, U., Nyamjav, J., and Shi-Jinn, H.
(2013). A static malware detection system using data
mining methods. International Journal of Artificial
Intelligence and Application, 4(4), 113-126. DOI:
10.5121/IJAIA.2013.4411.
[11] Alazab, M., Sitalakshmi, V., Paul, W., and Moutaz, A.
(2011). Zero-day malware detection based on supervised
learning algorithms of API call signatures. In Proceedings
of the Ninth Australasian Data Mining Conference, 121,
171-182. Australian Computer Society. DOI:
10.5555/2483628.2483648.
[12] Bazrafshan, Z., Hashemi, H., Fard, S.M.H., and
Hamze, A. (2013). A survey on heuristic malware detection
techniques. Conference: Information and Knowledge
Technology (IKT).
[13] Ali, J., Khan, R., Ahmad, N., and Maqsood, I.
(2012). Random forests and decision trees.
International Journal of Computer Science Issues (IJCSI),
9(5), 272.
[14] Sarang Na, S. and Kwon, T. (2013). A Rolling
Image based Virtual Keyboard Resilient to Spyware
on Smartphones. Journal of the Korea Institute of
Information Security and Cryptology, 23(6), 1219-1223.
Journal of Internet Technology and Secured Transactions (JITST), Volume 10, Issue 1, 2022
Copyright © 2022, Infonomics Society | DOI: 10.20533/jitst.2046.3723.2022.0094
769
... Supervised Learning, particularly effective in malware detection for IoMT devices, relies on models trained with labeled data, distinguishing between malicious and benign samples. This approach is noted for its high accuracy and recall in identifying malware, with Ayeni [4] reporting a 99.1% accuracy. It's beneficial for recognizing new malware by comparing it to known types. ...
Article
Full-text available
Advancements in information and communications technology (ICT) have fundamentally transformed computing, notably through the internet of things (IoT) and its healthcare-focused branch, the internet of medical things (IoMT). These technologies, while enhancing daily life, face significant security risks, including ransomware. To counter this, the authors present a scalable, hybrid machine learning framework that effectively identifies IoMT ransomware attacks, conserving the limited resources of IoMT devices. To assess the effectiveness of their proposed solution, the authors undertook an experiment using a state-of-the-art dataset. Their framework demonstrated superiority over conventional detection methods, achieving an impressive 87% accuracy rate. Building on this foundation, the framework integrates a multi-faceted feature extraction process that discerns between benign and malign actions, with a subsequent in-depth analysis via a neural network. This advanced analysis is pivotal in precisely detecting and terminating ransomware threats, offering a robust solution to secure the IoMT ecosystem.
... Researcher [56] noted that KNN is a non-parametric algorithm, meaning that it does not make any assumptions about the data structure. In real-world problems, data rarely obeys the general theoretical assumptions, making non-parametric algorithms a good solution for such problems. ...
Thesis
Malware detection is a very crucial component of Computer Security with the current malware prevalence on the internet. With disastrous malware infiltration, early detection functions can help prevent malicious software from compromising Information and Communication Technology Systems. Different machine learning algorithms have been used to detect malicious activities on the web. The intent of using machine learning algorithm is to detect future high-risk malicious activities before they can turn into security breaches. Precisely, the majority of this research's efforts was placed on crawling 1873 top Alexa advertisement URLs for static malware analysis; Sandbox environment for analysing the malicious websites dynamically; extract dynamic and static analysis features to train K-nearest neighbors (KNN) classifier for malware detection; and applied K-Folds cross-validator to evaluate the approach. The model designed in this approach attained 99.2% accuracy and precision. The result as well showed a very low false positive and false negative rates. Thus, the designed malware detection approach can be used to support malware detection in web advertisements. Keywords: Web Advertisement, Malware Analysis, Dynamic Malware Analysis, Static Malware Analysis, Sandbox; Machine Learning; K-nearest neighbors; K-Folds Cross-validator, Classification
... Olaniyi Abiodun Ayeni [4] found an Android phone issue affecting users' data and security. Using machine learning techniques, malware can be detected, and, to some extent, the data can be protected. ...
Article
Malware analysis is used to identify suspicious file transferring in the network. It can be identified efficiently by using the reverse engineering hybrid approach. Implementing a hybrid approach depends on the feature selection because the dataset contains static and dynamic parameters. The given dataset contains 85 attributes with 10 different class labels. Since it has high dimensional and multi-classification data, existing approaches of ML could be more efficient in reducing the features. The model combines the enhanced JAYA genetic algorithm with a gradient boosting technique to identify the efficiency and a smaller number of features. Many existing approaches for feature selection either implement correlation analysis or wrapper techniques. The major disadvantages of these issues are that they are facing fitting problems with a very small number of features. With the Usage of the genetic approach, this paper has achieved 95% accuracy with 12 features, approximately 7% greater than ML approaches.
Article
Artificial intelligence and machine learning have become crucial tools in the fight against cyber attacks. With the constant evolution of technology, traditional methods of protecting networks are no longer enough. This is where AI and machine learning come into play, by analyzing vast amounts of data and detecting patterns or anomalies that might indicate a potential threat. This paper aims at understanding and analyzing the implementation of Artificial Intelligence (AI) and Machine Learning (ML) systems in enhancing cyber security. By detecting patterns and anomalies in network traffic, AI algorithms can quickly identify potential threats and reduce response time, far surpassing human capabilities. This not only saves valuable time and resources for organizations but also improves overall protection against cyber-attacks. As technology continues to advance, it is crucial that we leverage AI for cybersecurity to stay ahead in the fight against malicious actors. With proper utilization of AI and ML technologies, we can ensure a safer digital future for all users..
Article
Full-text available
In this paper, we have compared the classification results of two models i.e. Random Forest and the J48 for classifying twenty versatile datasets. We took 20 data sets available from UCI repository [1] containing instances varying from 148 to 20000. We compared the classification results obtained from methods i.e. Random Forest and Decision Tree (J48). The classification parameters consist of correctly classified instances, incorrectly classified instances, F-Measure, Precision, Accuracy and Recall. We discussed the pros and cons of using these models for large and small data sets. The classification results show that Random Forest gives better results for the same number of attributes and large data sets i.e. with greater number of instances, while J48 is handy with small data sets (less number of instances). The results from breast cancer data set depicts that when the number of instances increased from 286 to 699, the percentage of correctly classified instances increased from 69.23% to 96.13% for Random Forest i.e. for dataset with same number of attributes but having more instances, the Random Forest accuracy increased.
Article
Full-text available
A serious threat today is malicious executables. It is designed to damage computer system and some of them spread over network without the knowledge of the owner using the system. Two approaches have been derived for it i.e. Signature Based Detection and Heuristic Based Detection. These approaches performed well against known malicious programs but cannot catch the new malicious programs. Different researchers have proposed methods using data mining and machine learning for detecting new malicious programs. The method based on data mining and machine learning has shown good results compared to other approaches. This work presents a static malware detection system using data mining techniques such as Information Gain, Principal component analysis, and three classifiers: SVM, J48, and Na\"ive Bayes. For overcoming the lack of usual anti-virus products, we use methods of static analysis to extract valuable features of Windows PE file. We extract raw features of Windows executables which are PE header information, DLLs, and API functions inside each DLL of Windows PE file. Thereafter, Information Gain, calling frequencies of the raw features are calculated to select valuable subset features, and then Principal Component Analysis is used for dimensionality reduction of the selected features. By adopting the concepts of machine learning and data-mining, we construct a static malware detection system which has a detection rate of 99.6%.
Conference Paper
Full-text available
Zero-day or unknown malware is created using code obfuscation techniques that modify the parent code to produce offspring copies with the same functionality but different signatures. Current techniques reported in the literature lack the capability to detect zero-day malware with the required accuracy and efficiency. In this paper, we propose and evaluate a novel method that employs several data mining techniques to detect and classify zero-day malware with high accuracy and efficiency, based on the frequency of Windows API calls. This paper describes the methodology used to collect large data sets for training the classifiers, and analyses the performance of the data mining algorithms adopted for the study, using a fully automated tool developed in this research to conduct the experimental investigations and evaluation. From the performance results of these algorithms, we evaluate and discuss the advantages of one data mining algorithm over another for accurately detecting zero-day malware. The data mining framework employed in this research learns by analysing the behavior of existing malicious and benign code in large datasets. We have employed robust classifiers, namely the Naïve Bayes (NB) algorithm, the k-Nearest Neighbor (kNN) algorithm, the Sequential Minimal Optimization (SMO) algorithm with four different kernels (SMO-Normalized PolyKernel, SMO-PolyKernel, SMO-Puk, and SMO-Radial Basis Function (RBF)), the Backpropagation Neural Networks algorithm, and the J48 decision tree, and have evaluated their performance. Overall, the automated data mining system implemented for this study has achieved a high true positive (TP) rate of more than 98.5% and a low false positive (FP) rate of less than 0.025, which has not been achieved in the literature so far. This is much higher than the required commercial acceptance level, indicating that our novel technique is a major leap forward in detecting zero-day malware. This paper also offers future directions for researchers in exploring different aspects of obfuscation that are affecting the IT world today.
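As a rough illustration of the idea in the abstract above (classifying programs by the frequency of Windows API calls), the following sketch turns a call trace into a relative-frequency vector and labels it with a small k-nearest-neighbor vote. It is not the cited authors' tool: the API vocabulary, the toy traces, the labels, and the helper names `frequency_vector` and `knn_predict` are all assumptions made for the example.

```python
from collections import Counter
import math

# Assumed feature columns: a tiny, hand-picked set of Windows API names.
API_VOCAB = ["CreateFileW", "WriteFile", "RegSetValueExW",
             "VirtualAlloc", "CreateRemoteThread"]


def frequency_vector(calls, vocab=API_VOCAB):
    """Map a list of API call names to relative frequencies over vocab."""
    counts = Counter(calls)
    total = sum(counts[name] for name in vocab)
    if total == 0:
        # No call from the vocabulary was observed.
        return [0.0] * len(vocab)
    return [counts[name] / total for name in vocab]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest to query (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up traces for illustration only; real studies use large labeled datasets.
TRAIN_TRACES = [
    (["CreateFileW", "WriteFile", "WriteFile", "CreateFileW"], "benign"),
    (["CreateFileW", "WriteFile", "RegSetValueExW"], "benign"),
    (["CreateFileW", "CreateFileW", "WriteFile"], "benign"),
    (["VirtualAlloc", "CreateRemoteThread", "VirtualAlloc", "WriteFile"], "malicious"),
    (["VirtualAlloc", "CreateRemoteThread", "RegSetValueExW"], "malicious"),
    (["CreateRemoteThread", "VirtualAlloc", "VirtualAlloc"], "malicious"),
]
TRAIN = [(frequency_vector(calls), label) for calls, label in TRAIN_TRACES]

if __name__ == "__main__":
    unknown = ["VirtualAlloc", "VirtualAlloc", "CreateRemoteThread", "CreateFileW"]
    print(knn_predict(TRAIN, frequency_vector(unknown)))
```

Relative frequencies are used instead of raw counts so that long and short traces are comparable; other normalizations would be equally plausible.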
Malicious software is abundant in a world of innumerable computer users, who constantly face threats from sources such as the internet, local networks and portable drives. Malware ranges from low to high risk and can cause systems to malfunction or crash, or can steal data. Malware may take the form of executable or system library files carrying viruses, worms or Trojans, all aimed at breaching the security of the system and compromising user privacy. Typically, anti-virus software is based on a signature definition system that is updated from the internet and thus keeps track of known viruses. While this may be sufficient for home users, a security risk from a new virus could threaten an entire enterprise network. This paper proposes a new and more sophisticated antivirus engine that can not only scan files, but also build knowledge and flag files as potential viruses. This is done by extracting the system API calls made by various normal and harmful executables, and using machine learning algorithms to classify and hence rank files on a scale of security risk. While such a system is processor heavy, it is very effective when used centrally to protect an enterprise network, which may be more prone to such threats.
This article presents Andromaly, a framework for detecting malware on Android mobile devices. The proposed framework realizes a Host-based Malware Detection System that continuously monitors various features and events obtained from the mobile device and then applies Machine Learning anomaly detectors to classify the collected data as normal (benign) or abnormal (malicious). Since no malicious applications are yet available for Android, we developed four malicious applications and evaluated Andromaly's ability to detect new malware based on samples of known malware. We evaluated several combinations of anomaly detection algorithms, feature selection methods and numbers of top features in order to find the combination that yields the best performance in detecting new malware on Android. Empirical results suggest that the proposed framework is effective in detecting malware on mobile devices in general and on Android in particular.
Data mining techniques for malware detection have received concentrated attention in the past decade. The battle between security analysts and malware authors is never-ending as technology advances. The proposed methodologies are not adequate, because the evolutionary and complex nature of malware changes quickly and malware therefore becomes harder to recognize. This paper presents a systematic and detailed survey of malware detection mechanisms that use data mining techniques. In addition, it classifies malware detection approaches into two main categories: signature-based methods and behavior-based detection. The main contributions of this paper are: (1) providing a summary of the current challenges related to malware detection approaches in data mining, (2) presenting a systematic and categorized overview of the current approaches to machine learning mechanisms, (3) exploring the structure of the significant methods in the malware detection approach and (4) discussing the important factors of malware classification approaches in data mining. The detection approaches have been compared with each other according to their important factors. Their advantages and disadvantages are discussed in terms of data mining models, evaluation methods and proficiency. This survey helps researchers gain a general understanding of the malware detection field and helps specialists carry out subsequent studies.
Due to the fundamental features of smartphones, such as openness and mobility, a great deal of malicious software, including spyware, can be installed more easily. Since spyware can steal users' sensitive information and invade their privacy, it is necessary to provide proper security mechanisms such as secure virtual keyboards. In this paper, we propose a novel password input system to resist spyware and show how effectively it can reduce the threats.
Chuvakin, A. (2003). An overview of Unix rootkits. iDEFENSE Inc.
Savage, K., Peter, C., and Hon, L. (2015). The evolution of ransomware. Symantec Corporation. http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-evolution-of-ransomware.pdf (Access Date: 21 December 2021).