
A Supervised Machine Learning Algorithm for Detecting Malware

June 2022

10(1):764-769

DOI:10.20533/jitst.2046.3723.2022.0094

Authors:

Olaniyi Ayeni

Federal University of Technology, Akure



Olaniyi Abiodun Ayeni

Department of Cyber Security, School of Computing

Federal University of Technology, Nigeria

Abstract

The proliferation of malware threatens computing systems and their security, which motivates the use of machine learning for malware detection. This work was motivated by the limitations of [1], [2] in ‘Malware Detection Module using Machine Learning Algorithms’. The objective of this research is to develop a security system for detecting malware using supervised machine learning algorithms and to carry out a performance evaluation. Feature selection (filter method) was used to reduce a dataset of 100,000 rows and 35 feature columns to 20 features; three classifiers were then employed: K-Nearest Neighbor, Decision Tree and Random Forest. The classifiers were trained and tested using the dataset (malware.csv) obtained from Malware Detection on Kaggle. The accuracies of K-Nearest Neighbor, Decision Tree and Random Forest are 96.53%, 97.79% and 99.90%, respectively. The results were also compared with the work of other researchers [3] who used the same three classifiers: the results of Maqsood 2020 for Random Forest, Decision Tree and K-Nearest Neighbor are 96.39%, 100% (overfit) and 99.4%, respectively, while the results of Sarang et al. 2013 for Random Forest, Decision Tree and K-Nearest Neighbor are 99.57%, 99.23% and 99.06%, respectively. This indicates that Random Forest is the most effective of the three classifiers for malware detection using machine learning. Moreover, the study can serve as a base for further research in the field of malware analysis with machine learning methods.

Keywords: Malware, Supervised Learning, Decision

Tree, K-Nearest Neighbour, Random Forest, Feature

Method, Computer Security

1. Introduction

The theoretical basis for malware dates back to 1949, when John von Neumann described self-replicating programs, and many malicious programs have been created since then. Antivirus companies are continually searching for the most effective ways of identifying malware, and one of the best-known methods is signature-based detection. Furthermore, the skill level required for malware development is decreasing because of the large number of attack tools now available on the internet. The wide availability of anti-detection techniques and the ability to buy malware on the black market give anyone the opportunity to become an attacker, regardless of skill level. Current studies show that more attacks are being issued by script kiddies or are automated [4]. Therefore, protecting computers against malware is an essential cybersecurity task for individual users and businesses, since an attack can lead to compromised data and significant losses. Massive losses and frequent attacks also drive the need for accurate and timely detection methods. However, current static and dynamic methods do not provide efficient detection, especially when dealing with zero-day attacks. Hence, machine learning-based techniques can be used. This paper therefore discusses the main points and concerns of machine learning-based malware detection and looks for the best feature representation and classification methods.

2. Research Motivation

The existing literature on malware detection shows that there is a need for an efficient malware detection system, especially since use of the internet is becoming increasingly important.

Among the existing frameworks, ‘Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks’ [2] focuses only on detection and classification and neglects home users because it is processor heavy. In ‘Detection of malicious code by applying machine learning classifiers on static features’, Shabtai et al. [5], the models were not trained properly, which resulted in running inefficient algorithms. This provided the motivation to create a well-trained machine learning malware detection system with high accuracy and a low false-positive rate that can protect a user’s system by flagging incoming malicious files and preventing them from affecting the computer.

2.1. Problem Statement

With the development of technology, the amount of malware is increasing daily. Malware is now designed with mutation characteristics, which causes enormous growth in the number of variants. Also, with the help of automated malware generation tools, novice malware authors can now quickly generate new variants. With this growth, traditional signature-based malware detection has proven ineffective against the vast variety of malware. However, machine learning methods for malware detection have proven effective against new malware.

Journal of Internet Technology and Secured Transactions (JITST), Volume 10, Issue 1, 2022

Objectives: The objectives of this research work are to:

• Design a security framework for malware detection using supervised machine learning algorithms.

• Implement the designed framework.

• Evaluate the performance of the system.

2.2. Malware Types

i. Backdoor: A malware type that bypasses standard authentication procedures to access a system. As a result, remote access is granted to resources within an application, such as databases and file servers, giving perpetrators the ability to issue system commands remotely and update malware.

ii. Rootkit: Its functionality enables the attacker to access data with higher permissions than allowed. For example, it can be used to give an unauthorized user administrative access. Rootkits always hide their existence and are quite often unnoticeable on the system, making detection and, therefore, removal incredibly hard [6].

iii. Keylogger: The idea behind this malware class is

to log all the keys pressed by the user and store all

data, including passwords, bank card numbers, and

other sensitive information [7].

iv. Ransomware: This type of malware aims to

encrypt all the data on the machine and ask a victim

to transfer some money to get the decryption key.

Usually, a machine infected by ransomware is

“frozen” as the user cannot open any file, and the

desktop picture is used to provide information on the

attacker’s demands [8].

3. Related Works

In ‘A state-of-the-art survey of malware detection approaches using data mining techniques’, Souri and Hosseini [1] present a survey of malware detection approaches divided into two categories:

• Signature-based methods

• Behavior-based methods

However, the survey provides neither a review of the most recent deep learning approaches nor a taxonomy of the types of features used in data mining techniques for malware detection and classification.

The research is motivated by a serious present-day threat, malicious executables. They are designed to damage computer systems, and some of them spread over networks without the knowledge of the system’s owner.

Objective(s): To present a survey of malware detection approaches.

3.1. Methodology

The survey summarizes the current challenges of malware detection approaches in data mining, presents a systematic and categorized overview of current machine learning approaches in data mining, explores the structure of the methods that are significant in malware detection, and discusses the important factors of malware classification approaches in data mining so that their problems can be addressed in the future.

3.2. Contribution to Knowledge

The survey sheds more light on how to approach malware detection using machine learning. In the paper ‘Malware detection using statistical analysis of byte-level file content’, Tabish et al. [9] used several machine learning techniques to detect malware files. The authors claimed that their techniques can properly classify any malware regardless of its obfuscation, using a multi-class classification technique to detect seven classes, including benign. The novelty of the authors’ approach is its ability to detect obfuscated and packed malware. The difficulty in detecting obfuscated malware lies in the obscured structure of the malware file itself: the malware writer intentionally rewrites the code of the file to make it difficult for antimalware software to catch. The content set collected for the experiment totals 12,111 files (1,800 benign files and 10,311 malicious). However, only 50 files per class were used as the training set. The features generated for the experiment are statistical features derived from byte-sequence n-grams of the executables.
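To make the idea of byte-sequence n-gram features concrete, the following is a minimal sketch in Python. It is not the feature extractor of [9]; the function names and the sample bytes are illustrative assumptions, and the only statistic computed here is the relative frequency of each n-gram.

```python
from collections import Counter


def byte_ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping n-byte sequence in `data`."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_frequencies(data: bytes, n: int = 2) -> dict:
    """Relative frequency of each n-gram (one simple statistical feature)."""
    counts = byte_ngram_counts(data, n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}


# Illustrative bytes only; a real experiment would read executable files.
sample = b"\x4d\x5a\x90\x00\x4d\x5a"
bigrams = byte_ngram_counts(sample, n=2)
frequencies = ngram_frequencies(sample, n=2)
```

In practice such frequency tables are turned into fixed-length vectors (for example, one entry per frequent n-gram) before they are passed to a classifier.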

In ‘A static malware detection system using data mining methods’, Baldangombo et al. [10] proposed a feature extraction method based on PE headers, DLLs, and API functions, and classification methods based on Naive Bayes, J48 Decision Trees, and Support Vector Machines. The highest overall accuracy was achieved with the J48 algorithm (99% with the PE header feature type and the hybrid PE header and API function feature type, 99.1% with the API function feature type) [10].

Motivation of the research: The research is motivated by a serious threat, malicious executables. They are designed to damage computer systems, and some of them spread over networks without the knowledge of the system’s owner.

Objective(s): The objective of the research is to create a static malware detection system with a high accuracy rate.

Methodology: A static malware detection system was built using data mining techniques such as information gain, principal component analysis, and three classifiers: SVM, J48, and Naïve Bayes. To overcome the shortcomings of usual anti-virus products, static analysis methods were used to extract valuable features from Windows PE files.

Contribution to knowledge: A static malware detection system with a detection rate of 99.6% was created.

Limitations: It is not suitable for home users because it is processor heavy.

In ‘Zero-day malware detection based on supervised learning algorithms of API call signatures’, API functions were again used for feature representation. The best result was achieved with the Support Vector Machines algorithm with a normalized poly kernel: a precision of 97.6%, with a false-positive rate of 0.025 [11].

Motivation of the research: The research is motivated by antivirus detectors being unable to detect new malware.

Objective(s): To develop a machine learning framework using eight different classifiers to detect unknown malware and to achieve a high accuracy rate.

Methodology: Large data sets were used to train the classifiers, and the performance of the various data mining algorithms adopted for the study was analysed using a fully automated tool developed in the research to conduct the experimental investigations and evaluation.

Contribution to knowledge: The machine learning framework developed achieved a promising accuracy rate of 98.5%.

Limitations: API call sequences can be extracted from most devices but not all, which limits the algorithm to some devices.

The paper ‘Survey on the usage of machine learning techniques for malware analysis’ [12] identifies three main methods for detecting malicious software:

• Signature-based methods

• Heuristic-based methods

• Behavior-based methods

In addition, the authors [12] investigate some features for malware detection and discuss concealment techniques used by malware to evade detection. Nonetheless, this research does not consider either dynamic or hybrid approaches.

Motivation of the research: The research was motivated by malware analysis becoming more and more challenging, given the relentless growth of malware in complexity and volume.

Objective(s): It aims to provide an overview of the way machine learning has been used so far in the context of malware analysis in Windows environments.

Methodology: For the analysis of portable executables, the surveyed papers were systematized according to their objectives, the information about malware they specifically use, and the machine learning techniques they employ to process the input and produce the output.

Contribution to knowledge: It provided an overview of how machine learning algorithms can be employed in malware analysis.

Limitations: It highlighted that models that are not properly trained result in inefficient algorithms and limited predictions.

4. Machine Learning Method

The machine learning process consists of the following five stages:

i. Data intake. First, the dataset is loaded from the file and saved in memory.

ii. Data transformation. At this point, the data loaded in stage i is transformed, cleaned, and normalized to be suitable for the algorithm. Data is converted so that it lies in the same range and has the same format. Feature extraction and selection, which are discussed further, are also performed at this point. In addition, the data is separated into two sets, the 'training set' and the 'test set'. Data from the training set is used to build the model, which is later evaluated using the test set.

iii. Model training. At this stage, a model is built using the selected algorithm.

iv. Model testing. The model built or trained in stage iii is tested using the test data set, and the produced result is used for building a new model that considers previous models, i.e., "learns" from them.

v. Model deployment. At this stage, the best model is selected (either after the defined number of iterations or as soon as the needed result is achieved).
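The five stages above can be sketched end to end in a few lines of Python. This is an illustrative sketch only, not the implementation used in this work: the toy CSV text, the column names `f1`, `f2` and `label`, and the nearest-centroid classifier are assumptions chosen to keep the example dependency-free.

```python
import csv
import io
import random

# Hypothetical toy data standing in for a real dataset file.
RAW = """f1,f2,label
1.0,200,0
1.2,220,0
0.9,210,0
1.1,190,0
3.0,800,1
3.2,790,1
2.9,820,1
3.1,810,1
"""


def load(text):
    """Stage i: data intake (parse the CSV text into features and labels)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    X = [[float(r["f1"]), float(r["f2"])] for r in rows]
    y = [int(r["label"]) for r in rows]
    return X, y


def min_max(X):
    """Stage ii: rescale every feature into the range [0, 1]."""
    cols = list(zip(*X))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]


def split(X, y, test_ratio=0.25, seed=0):
    """Stage ii: shuffle and separate the data into training and test sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_ratio))
    tr, te = idx[:cut], idx[cut:]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in te], [y[i] for i in te])


def train_centroids(X, y):
    """Stage iii: build a nearest-centroid model (one mean vector per class)."""
    model = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return model


def predict(model, x):
    """Return the class whose centroid is closest to x."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(x, model[lab])))


def accuracy(model, X, y):
    """Stage iv: fraction of correctly classified samples."""
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)


X, y = load(RAW)
X = min_max(X)
X_tr, y_tr, X_te, y_te = split(X, y)
model = train_centroids(X_tr, y_tr)
test_acc = accuracy(model, X_te, y_te)  # stage v would keep the best-scoring model
```

Stage v is only indicated by the final comment: with several candidate models, the one with the highest test accuracy would be kept.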

5. Application of Machine Learning Methods

The purpose of this section is to analyse the data and use it to train the prediction model. Figure 1 shows a sample of the selected features after the feature extraction method has been applied.

Figure 1. Sample of the selected features

After the features were extracted and selected, the algorithms were applied to the data obtained. The machine learning methods applied are K-Nearest Neighbors, Decision Tree, and Random Forest. The results of the models are shown below.

• K-Nearest Neighbor: as shown in Figure 2, the accuracy is 0.9667 on the training data and 0.9653 on the testing data.

Figure 2. Implementation of K-Nearest Neighbor Algorithm

• Decision Tree: as shown in Figure 3, the accuracy is 0.9793 on the training data and 0.9779 on the testing data.

Figure 3. Implementation of Decision Tree Algorithm

• Random Forest: as shown in Figure 4, the accuracy is 0.9999 on the training data and 0.9999 on the testing data.

Figure 4. Implementation of Random Forest Algorithm

Confusion Matrix: The confusion matrices for K-Nearest Neighbor, Decision Tree and Random Forest are shown in Figure 5, Figure 6 and Figure 7, respectively.

Figure 5. The confusion matrix for K-Nearest Neighbor

Figure 6. The confusion matrix for Decision Tree

Figure 7. The confusion matrix for Random Forest
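For reference, the usual evaluation metrics can be derived from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the counts passed in are hypothetical and are not the values shown in Figures 5 to 7.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Standard metrics from a binary confusion matrix (malware = positive)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical counts for illustration only.
m = metrics_from_confusion(tn=90, fp=10, fn=5, tp=95)
```

Accuracy alone can hide a high false-positive rate on imbalanced data, which is why the confusion matrix is reported alongside it.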

7. Result of the Dataset Analysis

The overall accuracy of the algorithm is

calculated below:

Table 1. Performance Evaluation Compared with

other Researchers Results

Figure 8. Algorithms performance analysis and

Graphical representation of the nslkdd.csv results

Table 1 shows the performance of this research compared with that of other researchers who used the same machine learning algorithms (RF, DT and KNN). This work performed better than the other two [13], [14].

Figure 8 shows that Random Forest performed better than the other two machine learning algorithms.

8. Discussion

Figure 1 shows the original features in the dataset; after the feature selection method was applied, the features were pruned to 20. Feature selection was used to remove redundant and irrelevant features in order to improve the accuracy of the prediction. The Python programming language was used for performing the feature selection and applying the machine learning methods. The process reduced the number of features from a total of 35 to 20. The method used in this research is the filter method.
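A filter method scores each feature independently of any classifier and keeps the highest-scoring ones. The text does not state which scoring function was used, so the following sketch assumes the absolute Pearson correlation with the class label; the feature names and toy values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_top_k(X, y, names, k):
    """Keep the k features most correlated (in absolute value) with y."""
    scores = {name: abs(pearson([row[j] for row in X], y))
              for j, name in enumerate(names)}
    keep = sorted(names, key=lambda nm: scores[nm], reverse=True)[:k]
    cols = [names.index(nm) for nm in keep]
    return keep, [[row[j] for j in cols] for row in X], scores


# Hypothetical toy data: three features, of which one is constant.
names = ["signal", "noise", "constant"]
X = [[0, 5, 1], [1, 3, 1], [0, 4, 1], [1, 4, 1]]
y = [0, 1, 0, 1]
kept, X_reduced, scores = select_top_k(X, y, names, k=2)
```

With the 35-column dataset described above, the same call with `k=20` would return the 20 retained feature names and the reduced matrix.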

Data preprocessing (data encoding and checking for missing data) was used to prepare (clean and organize) the raw data to make it suitable for building and training the machine learning models. Data preprocessing is needed because it is the first step of the process. Typically, real-world data is incomplete, inconsistent, inaccurate (contains errors or outliers), and often lacks specific attribute values or trends. Preprocessing helps to clean, format, and organize the raw data, thereby making it ready for the machine learning models.
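The two preprocessing steps named above, checking for missing data and encoding, might look like the following minimal sketch. The column names (`hash`, `size`, `classification`) and the rows are hypothetical, and dropping incomplete rows is only one possible way of handling missing values.

```python
def count_missing(rows):
    """Number of missing (None or empty-string) values per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            counts[col] = counts.get(col, 0) + (val is None or val == "")
    return counts


def drop_incomplete(rows):
    """Keep only rows in which every column has a value."""
    return [r for r in rows
            if all(v is not None and v != "" for v in r.values())]


def label_encode(rows, col):
    """Replace the string categories in `col` with integer codes."""
    mapping = {v: i for i, v in enumerate(sorted({r[col] for r in rows}))}
    return [{**r, col: mapping[r[col]]} for r in rows], mapping


# Hypothetical rows for illustration only.
rows = [
    {"hash": "a1", "size": "120", "classification": "malware"},
    {"hash": "b2", "size": "", "classification": "benign"},
    {"hash": "c3", "size": "300", "classification": "benign"},
]
missing = count_missing(rows)
clean = drop_incomplete(rows)
encoded, mapping = label_encode(clean, "classification")
```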

Figure 2 presents the result of the KNN algorithm implementation, showing the accuracy on the training and testing data. KNN is a non-parametric algorithm, meaning that it does not make any assumptions about the data structure. In real-world problems, data rarely obeys general theoretical assumptions, making non-parametric algorithms a good solution for such problems. The KNN model representation is as simple as the dataset: no learning is required, and the entire training set is stored.
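Because KNN simply stores the training set, prediction is the whole algorithm: find the k stored points closest to the query and take a majority vote. The sketch below assumes Euclidean distance and k = 3, neither of which is stated in the text, and uses hypothetical toy points.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Majority label among the k training points nearest to x."""
    dists = sorted((math.dist(p, x), label)
                   for p, label in zip(train_X, train_y))
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy points: class 0 near the origin, class 1 near (5, 5).
train_X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
near_origin = knn_predict(train_X, train_y, [0.5, 0.5])
near_five = knn_predict(train_X, train_y, [5.5, 5.5])
```

The cost of this simplicity is that every prediction scans the full training set, which matters for a dataset of 100,000 rows.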

Figures 3 and 4 are the results of implementing the Decision Tree and Random Forest algorithms on the dataset. Decision trees are data structures that have the structure of a tree. The training dataset is used to create the tree, which is subsequently used for making predictions on the test data. In this algorithm, the goal is to achieve the most accurate result with the least number of decisions. The training accuracy of DT is 0.979, while the testing accuracy is 0.977. Random Forests are collections of decision trees that produce a better prediction accuracy. That is why the method is called a ‘forest’: it is basically a set of decision trees. The basic idea is to grow multiple decision trees based on independent subsets of the dataset. At each node, n variables out of the feature set are selected randomly, and the best split on these variables is found. The training accuracy for RF was 0.99 and the testing accuracy was 0.997.

Figures 5, 6 and 7 show the confusion matrices of the three machine learning algorithms deployed (KNN, DT, RF). Figure 8 is a graphical comparison of the performance of the three algorithms, which shows that Random Forest is the best of the three.
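The relationship between a single decision tree and a random forest can be illustrated with a small from-scratch sketch. It is not the implementation evaluated in this work: the Gini impurity criterion, the bootstrap sampling, the parameter names (`n_trees`, `max_features`, `max_depth`) and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y, features):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(y)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y, rng=None, max_features=None, depth=0, max_depth=5):
    """Grow a tree; inner nodes are tuples, leaves are (non-tuple) class labels."""
    feats = list(range(len(X[0])))
    if rng is not None and max_features:      # random feature subset at each node
        feats = rng.sample(feats, max_features)
    split = best_split(X, y, feats) if depth < max_depth else None
    if split is None:                         # leaf: majority class
        return Counter(y).most_common(1)[0][0]
    f, t = split
    lo = [i for i, row in enumerate(X) if row[f] <= t]
    hi = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            build_tree([X[i] for i in lo], [y[i] for i in lo],
                       rng, max_features, depth + 1, max_depth),
            build_tree([X[i] for i in hi], [y[i] for i in hi],
                       rng, max_features, depth + 1, max_depth))


def tree_predict(node, x):
    """Follow the splits from the root down to a leaf label."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if x[f] <= t else right
    return node


def build_forest(X, y, n_trees=25, max_features=1, seed=0):
    """Train each tree on a bootstrap sample with random feature subsets."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx],
                                 rng, max_features))
    return forest


def forest_predict(forest, x):
    """Majority vote over the predictions of all trees."""
    return Counter(tree_predict(t, x) for t in forest).most_common(1)[0][0]


# Hypothetical, well-separated toy data.
X = [[1, 1], [1, 2], [2, 1], [2, 2], [8, 8], [8, 9], [9, 8], [9, 9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
tree = build_tree(X, y)
forest = build_forest(X, y)
```

On this toy data the single tree needs only one split; the forest's advantage appears on noisy data, where averaging many trees trained on different samples reduces overfitting.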

9. Conclusion

Malware detection is a major issue if computing systems and their infrastructure are to be kept secure in this modern age. A dataset from Kaggle.com was used for this research; the dataset, which comprises 100,000 rows and 35 columns of features, was reduced to 20 features by a feature selection method to improve the accuracy of the algorithms. The results showed that Random Forest is the best classifier for this data, with an accuracy of 99.9%.

10. References

[1] Souri, A., and Hosseini, R. (2018). A state‑of‑the‑art

survey of malware detection approaches using data mining

techniques.

[2] Singhal, P., and Nataasha, R. (2015). Malware detection module using machine learning algorithms to assist in centralized security in enterprise networks. International Journal of Network Security and Its Applications, 4(1), 61-67. https://arxiv.org/abs/1205.3062 (Access Date: 23 December 2021).

[3] Jehad Ali, Rehanullah Khan, Nasir Ahmad, and Imran

Maqsood. (2012). Random forests and decision trees.

International Journal of Computer Science Issues (IJCSI)

9, 5.

[4] Aliyev, V. (2010). Using honeypots to study skill level of attackers based on the exploited vulnerabilities in the network (Corpus ID 107947677) [Master’s thesis, Chalmers University of Technology]. https://www.semanticscholar.org/paper/usinghoneypots-to-study-skill-level-of-attackersAliyev/62cea777b89e3cc069744f5201a46d64bcafbe0 (Access Date: 21 December 2021).

[5] Shabtai, A., Kanonov, U., Elovici, Y., Glezer, C. and

Weiss, Y. (2009). Andromaly: a behavioral malware

detection framework for android devices.

[6] Chuvakin, A. (2003). An overview of unix rootkits.

iDEFENCE inc.

[7] Lopez, W., Humberto, G., Enio, P., Erick, B., and Juan,

S. (2013). Keyloggers. Florida International University.

[8] Savage, K., Peter, C., and Hon, L. (2015). The evolution of ransomware. Symantec Corporation. http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-evolution-ofransomware.pdf (Access Date: 21 December 2021).

[9] Tabish, M., Zubair Shafiq, M., and Farooq. M. (2009).

Malware detection using statistical analysis of byte-level

file content. In Proceedings of the ACM SIGKDD

Workshop on Cyber Security and Intelligence Informatics,

pages 23–31. ACM.

[10] Baldangombo, U., Nyamjav J., and Shi-Jinn, H.

(2013). A static malware detection system using data

mining methods. International Journal of Artificial

Intelligence and Application, 4(4), 113-126. DOI:

10.5121/IJAIA.2013.4411.

[11] Alazab, M., Sitalakshmi, V., Paul, W., and Moutaz, A. (2011). Zero-day malware detection based on supervised learning algorithms of API call signatures. In Proceedings of the Ninth Australasian Data Mining Conference, 121, 171-182. Australian Computer Society. DOI: 10.5555/2483628.2483648.

[12] Bazrafshan, Z., Hashemi, H., Fard, S.M.H. and Hamze, A. (2013). A survey on heuristic malware detection techniques. Conference: Information and Knowledge Technology (IKT).

[13] Jehad Ali, Rehanullah Khan, Nasir Ahmad, and Imran

Maqsood. (2012). Random forests and decision trees.

International Journal of Computer Science Issues (IJCSI)

9, 5(2012), 272.

[14] Sarang Na, S. and Kwon, T. (2013). A Rolling

Image based Virtual Keyboard Resilient to Spyware

on Smartphones. Journal of the Korea Institute of

Information Security and Cryptology. 23(6):1219-1223.

Journal of Internet Technology and Secured Transactions (JITST), Volume 10, Issue 1, 2022

769

Proactive ransomware prevention in pervasive IoMT via hybrid machine learning

Article

Full-text available

May 2024

Advancements in information and communications technology (ICT) have fundamentally transformed computing, notably through the internet of things (IoT) and its healthcare-focused branch, the internet of medical things (IoMT). These technologies, while enhancing daily life, face significant security risks, including ransomware. To counter this, the authors present a scalable, hybrid machine learning framework that effectively identifies IoMT ransomware attacks, conserving the limited resources of IoMT devices. To assess the effectiveness of their proposed solution, the authors undertook an experiment using a state-of-the-art dataset. Their framework demonstrated superiority over conventional detection methods, achieving an impressive 87% accuracy rate. Building on this foundation, the framework integrates a multi-faceted feature extraction process that discerns between benign and malign actions, with a subsequent in-depth analysis via a neural network. This advanced analysis is pivotal in precisely detecting and terminating ransomware threats, offering a robust solution to secure the IoMT ecosystem.

An Approach for Detecting Malware in Web Advertisements

Thesis

Oct 2022

Emmanuel Akoch

Malware detection is a very crucial component of Computer Security with the current malware prevalence on the internet. With disastrous malware infiltration, early detection functions can help prevent malicious software from compromising Information and Communication Technology Systems. Different machine learning algorithms have been used to detect malicious activities on the web. The intent of using machine learning algorithm is to detect future high-risk malicious activities before they can turn into security breaches. Precisely, the majority of this research's efforts was placed on crawling 1873 top Alexa advertisement URLs for static malware analysis; Sandbox environment for analysing the malicious websites dynamically; extract dynamic and static analysis features to train K-nearest neighbors (KNN) classifier for malware detection; and applied K-Folds cross-validator to evaluate the approach. The model designed in this approach attained 99.2% accuracy and precision. The result as well showed a very low false positive and false negative rates. Thus, the designed malware detection approach can be used to support malware detection in web advertisements. Keywords: Web Advertisement, Malware Analysis, Dynamic Malware Analysis, Static Malware Analysis, Sandbox; Machine Learning; K-nearest neighbors; K-Folds Cross-validator, Classification

GBJOF: Gradient Boosting Integrated with Jaya Algorithm to Optimize the Features in Malware Analysis

Article

Aug 2023

Malware analysis is used to identify suspicious file transferring in the network. It can be identified efficiently by using the reverse engineering hybrid approach. Implementing a hybrid approach depends on the feature selection because the dataset contains static and dynamic parameters. The given dataset contains 85 attributes with 10 different class labels. Since it has high dimensional and multi-classification data, existing approaches of ML could be more efficient in reducing the features. The model combines the enhanced JAYA genetic algorithm with a gradient boosting technique to identify the efficiency and a smaller number of features. Many existing approaches for feature selection either implement correlation analysis or wrapper techniques. The major disadvantages of these issues are that they are facing fitting problems with a very small number of features. With the Usage of the genetic approach, this paper has achieved 95% accuracy with 12 features, approximately 7% greater than ML approaches.

Malware Detection Techniques based on Machine Learning

Article

Nov 2023

Dhruv Singh Rajput

Artificial intelligence and machine learning have become crucial tools in the fight against cyber attacks. With the constant evolution of technology, traditional methods of protecting networks are no longer enough. This is where AI and machine learning come into play, by analyzing vast amounts of data and detecting patterns or anomalies that might indicate a potential threat. This paper aims at understanding and analyzing the implementation of Artificial Intelligence (AI) and Machine Learning (ML) systems in enhancing cyber security. By detecting patterns and anomalies in network traffic, AI algorithms can quickly identify potential threats and reduce response time, far surpassing human capabilities. This not only saves valuable time and resources for organizations but also improves overall protection against cyber-attacks. As technology continues to advance, it is crucial that we leverage AI for cybersecurity to stay ahead in the fight against malicious actors. With proper utilization of AI and ML technologies, we can ensure a safer digital future for all users..

Machine Learning-Based Model for Intrusion Detection System

Article

Dec 2023

Random Forests and Decision Trees

Article

Full-text available

Sep 2012

In this paper, we have compared the classification results of two models i.e. Random Forest and the J48 for classifying twenty versatile datasets. We took 20 data sets available from UCI repository [1] containing instances varying from 148 to 20000. We compared the classification results obtained from methods i.e. Random Forest and Decision Tree (J48). The classification parameters consist of correctly classified instances, incorrectly classified instances, F-Measure, Precision, Accuracy and Recall. We discussed the pros and cons of using these models for large and small data sets. The classification results show that Random Forest gives better results for the same number of attributes and large data sets i.e. with greater number of instances, while J48 is handy with small data sets (less number of instances). The results from breast cancer data set depicts that when the number of instances increased from 286 to 699, the percentage of correctly classified instances increased from 69.23% to 96.13% for Random Forest i.e. for dataset with same number of attributes but having more instances, the Random Forest accuracy increased.

A Static Malware Detection System Using Data Mining Methods

Article

Full-text available

Aug 2013

A serious threat today is malicious executables. It is designed to damage computer system and some of them spread over network without the knowledge of the owner using the system. Two approaches have been derived for it i.e. Signature Based Detection and Heuristic Based Detection. These approaches performed well against known malicious programs but cannot catch the new malicious programs. Different researchers have proposed methods using data mining and machine learning for detecting new malicious programs. The method based on data mining and machine learning has shown good results compared to other approaches. This work presents a static malware detection system using data mining techniques such as Information Gain, Principal component analysis, and three classifiers: SVM, J48, and Na\"ive Bayes. For overcoming the lack of usual anti-virus products, we use methods of static analysis to extract valuable features of Windows PE file. We extract raw features of Windows executables which are PE header information, DLLs, and API functions inside each DLL of Windows PE file. Thereafter, Information Gain, calling frequencies of the raw features are calculated to select valuable subset features, and then Principal Component Analysis is used for dimensionality reduction of the selected features. By adopting the concepts of machine learning and data-mining, we construct a static malware detection system which has a detection rate of 99.6%.

Zero-day Malware Detection based on Supervised Learning Algorithms of API call Signatures

Conference Paper

Full-text available

Dec 2011

Zero-day or unknown malware is created using code obfuscation techniques that modify the parent code to produce offspring copies with the same functionality but different signatures. Current techniques reported in the literature lack the capability to detect zero-day malware with the required accuracy and efficiency. In this paper, we propose and evaluate a novel method that employs several data mining techniques to detect and classify zero-day malware with high accuracy and efficiency, based on the frequency of Windows API calls. The paper describes the methodology used to collect large data sets for training the classifiers and analyses the performance of the data mining algorithms adopted for the study, using a fully automated tool developed in this research to conduct the experimental investigations and evaluation. From the performance results of our experimental analysis, we evaluate and discuss the advantages of one data mining algorithm over another for accurately detecting zero-day malware. The data mining framework employed in this research learns by analysing the behavior of existing malicious and benign code in large datasets. We have employed robust classifiers, namely the Naïve Bayes (NB) algorithm, the k-Nearest Neighbor (kNN) algorithm, the Sequential Minimal Optimization (SMO) algorithm with four different kernels (SMO-Normalized PolyKernel, SMO-PolyKernel, SMO-Puk, and SMO-Radial Basis Function (RBF)), the Backpropagation Neural Networks algorithm, and the J48 decision tree, and have evaluated their performance. Overall, the automated data mining system implemented for this study achieved a high true positive (TP) rate of more than 98.5% and a low false positive (FP) rate of less than 0.025, which has not been achieved in the literature so far. This is much higher than the required commercial acceptance level, indicating that our novel technique is a major leap forward in detecting zero-day malware. This paper also offers future directions for researchers in exploring the different aspects of obfuscation that affect the IT world today.

Malware Detection Module using Machine Learning Algorithms to Assist in Centralized Security in Enterprise Networks

Article

Feb 2012

Malicious software is abundant in a world of innumerable computer users, who constantly face threats from various sources such as the internet, local networks and portable drives. Malware ranges from low to high risk and can cause systems to function incorrectly, steal data and even crash. Malware may take the form of executables or system library files such as viruses, worms and Trojans, all aimed at breaching the security of the system and compromising user privacy. Typically, anti-virus software is based on a signature definition system that keeps updating from the internet and thus keeps track of known viruses. While this may be sufficient for home users, a security risk from a new virus could threaten an entire enterprise network. This paper proposes a new and more sophisticated antivirus engine that can not only scan files but also build knowledge and flag files as potential viruses. This is done by extracting the system API calls made by various normal and harmful executables and using machine learning algorithms to classify, and hence rank, files on a scale of security risk. While such a system is processor heavy, it is very effective when used centrally to protect an enterprise network, which may be more prone to such threats.

"Andromaly": A behavioral malware detection framework for android devices

Article

Feb 2012

This article presents Andromaly, a framework for detecting malware on Android mobile devices. The proposed framework realizes a Host-based Malware Detection System that continuously monitors various features and events obtained from the mobile device and then applies Machine Learning anomaly detectors to classify the collected data as normal (benign) or abnormal (malicious). Since no malicious applications are yet available for Android, we developed four malicious applications and evaluated Andromaly's ability to detect new malware based on samples of known malware. We evaluated several combinations of anomaly detection algorithms, feature selection methods and numbers of top features in order to find the combination that yields the best performance in detecting new malware on Android. Empirical results suggest that the proposed framework is effective in detecting malware on mobile devices in general and on Android in particular.

A state‑of‑the‑art survey of malware detection approaches using data mining techniques

Article

Jan 2018

Data mining techniques have been a focus of malware detection research over the past decade. The battle between security analysts and malware authors is everlasting as innovation grows. The proposed methodologies are inadequate because the evolutionary and complex nature of malware changes quickly, making it harder to recognize. This paper presents a systematic and detailed survey of malware detection mechanisms that use data mining techniques. In addition, it classifies the malware detection approaches into two main categories: signature-based methods and behavior-based detection. The main contributions of this paper are: (1) providing a summary of the current challenges related to malware detection approaches in data mining, (2) presenting a systematic and categorized overview of the current approaches to machine learning mechanisms, (3) exploring the structure of the significant methods in the malware detection approach and (4) discussing the important factors of malware classification approaches in data mining. The detection approaches are compared with each other according to these factors, and their advantages and disadvantages are discussed in terms of data mining models, evaluation methods and proficiency. This survey helps researchers gain a general comprehension of the malware detection field and helps specialists carry out subsequent studies.

A Rolling Image based Virtual Keyboard Resilient to Spyware on Smartphones

Article

Dec 2013

Due to the fundamental features of smartphones, such as openness and mobility, a great deal of malicious software, including spyware, can be installed more easily. Since spyware can steal a user's sensitive information and invade privacy, it is necessary to provide proper security mechanisms such as secure virtual keyboards. In this paper, we propose a novel password input system to resist spyware and show how effectively it can reduce the threats.

Using honeypots to study skill level of attackers based on the exploited vulnerabilities in the network

Article

Vusal Aliyev

An overview of unix rootkits

Jan 2003

A Chuvakin

Chuvakin, A. (2003). An overview of unix rootkits. iDEFENCE inc.

The evolution of ransomware

Dec 2015

K Savage
C Peter
L Hon

Savage, K., Peter, C., and Hon, L. (2015). The evolution of ransomware. Symantec Corporation. http://www.symantec.com/content/en/us/enterprise/media/security_response/whitepapers/the-evolution-of-ransomware.pdf (Access Date: 21 December 2021).
