International Journal of Advanced Science and Technology
Vol. 29, No. 4s, (2020), pp. 1947-1954
ISSN: 2005-4238 IJAST
Copyright ⓒ 2020 SERSC 1947
Detect and Classify Zero Day Malware Efficiently In Big Data Platform
V. R. Niveditha1, T. V. Ananthan 2, S.Amudha3, Dahlia Sam4 and S.Srinidhi5
1Research Scholar Department of Computer Science and Engineering,
2Professor Department of Computer Science and Engineering,
3Assistant Professor Department of Computer Science and Engineering,
4Assistant Professor Department of Information Technology,
5B.Tech Department of Computer Science and Engineering,
1, 2,3,4,5 Dr. M.G.R. Educational and Research Institute,
Chennai 600095, Tamil Nadu, India
Abstract
Malware has long been one of the most prominent cyber threats on the Internet. It is expanding rapidly
in volume, velocity and variety, which overwhelms the conventional methods used to identify and
recognize malware. Effective analytics methods are required to match the size and complexity of such a
data-intensive environment. On a Big Data platform, such methods help malware researchers carry out
the time-consuming process of systematically investigating malicious events. Security researchers want
to use Machine Learning (ML) algorithms together with big data techniques to evaluate and track
unknown malware at large scale. These techniques cope with the dynamic and wide flux of malicious
binaries, which helps researchers address the emerging threat environment. This paper proposes a big
data framework in which static and dynamic malware detection techniques are efficiently merged in
order to accurately classify and identify zero-day malware. The framework is tested and evaluated on a
dataset of 0.1 million sample files, with 0.03 million clean files and 0.13 million malicious binaries
covering a wide variety of malware families. The results show that SVM attained the best accuracy of
93.03% for detecting malware and benign types using 10-fold cross validation.
Keywords: malware, big data, zero day malware, malicious binaries
1. Introduction
Malware is software designed to access secure or sensitive code or data without the permission of the
user, or to damage the operating system kernel [1, 2]. Malware includes worms, computer viruses,
potentially unwanted programs, and other programs which may damage a machine. Worldwide, the spread
of such viruses on the Internet is affecting numerous companies and people. Many malicious events on
the network are new incidents triggered by unknown variants of existing malware whose behavior
detectors fail to recognize [3]. Such malware is referred to as zero-day or novel malware, because
there may be zero days between the first intrusion of the unknown malware and the moment it is
identified. Similarly, these threats are called zero-day threats (attacks). The widely used approaches
to malware identification fall under two primary methods, namely static analysis and dynamic analysis
[4–7]. Malware analysis dissects malicious software to clarify its functionality, capabilities and
intent. Static analysis examines the malicious binary code without executing it. Dynamic analysis, on
the other hand, tracks the actions of the malicious program while it runs in a simulated environment
[8]. While the ML field has been developed to identify unknown malware in a timely manner, it faces
difficulties because malware data keeps evolving and the number of malware samples is huge, as
attackers continuously come up with novel strategies to fool the detectors. Malware identification has
become a major big data problem in the threat environment. Big data analytics has gained significant
attention from technology analysts and practitioners in recent times. The main aim is to reduce
reaction time and improve performance using machine learning, data analytics, big data and
decision-making strategies, with increasing human interaction in detecting zero-day malware threats. It
will assist in the near real-time updating of anti-malware applications to deal with emerging malware
threats. Past data, with its particular implications, can help deal with potential attacks and deliver
cyber information. Currently, Apache Spark is reported to offer ease of use and improved performance
compared with Apache Hadoop [9], and it is one of Apache's most successful Big Data projects. In this
research work, a big data system built on top of Apache Spark is proposed for large-scale malware
detection using the Machine Learning Library (MLlib). It is tested and analyzed on a large dataset,
and the experimental results are examined.
The organization of this paper is as follows. Section 2 surveys related work and its methodological
contributions, Section 3 describes the proposed methodology for zero-day malware discovery, Section 4
discusses predicting malware and benign files using classification algorithms, and Section 5 concludes
the work.
2. Literature Review
Kouser (2018) developed and applied a system utilizing Apache Spark and the Hadoop Distributed File
System (HDFS). The system is tested on a dataset that contains more than 1 million samples. The work
shows that the identification result can be enhanced by enabling human analysis of malware samples
[10]. Shalini (2018) notes that the role of a malware analyst is extremely labor intensive and
dynamic. With Big Data and the Internet of Things (IoT), current automatic methods are effective only
for detecting known malware, so unknown malware goes unidentified amid an ever-increasing number of
attacks (Jayasuruthi et al., 2018) [11, 12]. Even with automatic data analysis tools that replicate
the analyst's process as much as possible, intermediate findings still need to be checked and
interpreted by domain experts. Bou-Harb et al. (2014) describe how various recent studies have
examined simulation strategies to dramatically speed up the phase of malware identification, as in
Cao et al. (2015) [13, 14]. Deepak Gupta and Rinkle Rani (2018) propose a scalable framework,
developed on top of Apache Spark and using its MLlib, to identify zero-day malware [15]. Venkatraman
and Alazab (2018) propose a visualization technique to recognize malware that is hard to distinguish
and challenging to detect with current methods. Overall, the high classification accuracy obtained
with their approach can be seen visually, because various malware families display substantially
dissimilar patterns of behavior [16]. TaeGuen et al. (2019) describe an Android malware detection
technique that uses a multimodal deep neural network to combine numerous features with specific
properties. It uses several static features to represent the properties of several parts of an
application. It is also noted that more complex features may be added to this research in future [17].
Patidar et al. (2017) describe how behavior may determine what information is being used or required,
and which information or resources. For instance, a web browser will not know whether the application
or the requested authorization is actually wanted; otherwise an attacker may plan to execute it, attack
the computer and expose it to unknown surroundings using network technology, or extract personal
details without the awareness of the user [18][19]. Patidar and Khandelwal (2019) suggest a strategy
for zero-day malware identification using ML [20].
From the literature review, it is observed that most of the current approaches proposed to identify
malware activities are not flexible and thus cannot accommodate the increasing amount and complexity of
malware families.
3. Proposed Methodology
3.1 Data Preparation
The proposed malware detection architecture is tested on a dataset of 0.1 million files, comprising
0.13 million malware samples and 0.03 million clean files, targeting the Windows OS. The malware
samples used in our dataset are gathered from various sources such as No think (no think), VX Heaven
(vxheaven), Virus Share (virus share), etc. Figure 1 shows the number of malware samples submitted to
VirusTotal for review in the years 2013–2018. VirusTotal provides scanning results from 57 antivirus
engines. The malware samples can be summarized with the antivirus engines that identify them as
harmful binaries. Our dataset includes over 3000 malware families according to the scan results from
a free antivirus, AVG (www.avg.com). The top 15 malware families in our dataset are recorded together
with their counts. The clean files included in our dataset are files retrieved from the System32
folders of successive versions of the Windows OS.
Figure 1. Malware dataset samples between 2013 and 2018.
3.2 Malware Feature Extraction
The detection accuracy of malware methods depends on how well they can isolate and compare the
behavior patterns shown by malicious code. In general, malware-intrusion approaches and methods of
attack may be categorized as static, dynamic, and hybrid. While static approaches use manipulations of
code syntax, dynamic approaches use process modifications. In certain instances, hybrid approaches mix
both code modifications and process modifications. Malware code writers follow one of the key types of
automated generation of novel malware, which leads to zero-day occurrences, for simple and fast
deployment as described below:
Installation or bundling of applications (static): Malicious code is inserted into host applications
or loaded into external components by exploiting an update vulnerability. Each time the software or
module is used, the malicious code runs, is loaded into the device and affects the program.
Static: Malware reaches new targets together with an existing target.
Dynamic: Malware may operate from a remote location, seeking novel targets for its attack.
System or data manipulation: Malicious code is inserted into additional OS components or data in order
to obtain further rights.
Disguise: This technique is used to mask the identity of other applications, data or device resources
or to avoid devices, applications or protection settings from being disabled.
Payload: This approach is used to transfer or transmit information to third parties.
[Figure 1 plot: number of malware samples (y-axis, 0 to 35000) per year (x-axis, 2013 to 2018)]
Our proposed approach is based on the assumption that visualization should be used to support both
human behavioral analysis of a malware sample and zero-day malware detection based on accurate
classification of malware. Malware classification uses malware sample similarity to identify specific
behaviors exhibited by well-known malware families.
3.3 Impact Analysis of Malware Feature
This section looks at the usefulness of the features used to identify and track malware. We construct
a classification model using logistic regression from Apache Spark's scalable MLlib to explain the
dataset and evaluate the features used for malware classification. This yields a list of malware
features along with their weights. Relying on the weight alone is not enough to explain the
significance of a feature for classification purposes, as a feature might have gained a high weight
but may be constant across the dataset samples. Such features cannot differentiate between malware and
clean files. Furthermore, we examined the contribution of the low-level components of a feature to
study the significance of that feature. For this, the system used a ranking technique, shown in
equation 1, to measure the value of a set of features, where n signifies the total number of features.
$$\text{Variation} = \sqrt{\operatorname{var}\left(\sum_{n \in l} x_n W_n\right)} \tag{1}$$
where $x_n$ denotes the number of instances and $W_n$ denotes the weight of the n-th length of a feature.
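As a rough illustration of why a weight alone is not enough, the sketch below reads equation 1 as the standard deviation, taken across samples, of the weighted sum of a feature's components. This reading, the function name `variation`, and the toy numbers are assumptions made for illustration and are not taken from the paper.

```python
from math import sqrt
from statistics import pvariance


def variation(samples, weights):
    """Square root of the variance of sum_n x_n * W_n, taken across samples.

    samples: list of per-sample component vectors (the x_n values of one feature).
    weights: list of learned weights W_n, one per component.
    """
    weighted_sums = [sum(x * w for x, w in zip(row, weights)) for row in samples]
    return sqrt(pvariance(weighted_sums))


# Hypothetical data: a heavily weighted feature that is constant in every sample ...
constant_feature = [[1.0], [1.0], [1.0], [1.0]]
# ... and a lightly weighted feature that actually varies between samples.
varying_feature = [[0.0], [1.0], [0.0], [1.0]]

high_weight_constant = variation(constant_feature, [5.0])
low_weight_varying = variation(varying_feature, [0.5])
```

Under this reading, the constant feature scores zero despite its large weight, while the varying feature receives a positive score, which matches the argument that constant features cannot separate malware from clean files.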
4. Zero-Day Malware Detection
Data mining and machine learning are the latest techniques being used for detection and classification
of malware. ML algorithms may characterize a file's behavior as either malicious or benign based on
information gathered from the file using static or dynamic analysis. By applying various ML
algorithms, a classification model is built through training with a labeled dataset and can then
readily classify new data. Therefore, a malware detector has been developed based on attributes,
obtained after conducting static and dynamic malware analysis, that have the potential to identify
new malware. The experiments use classifiers from Apache Spark's scalable MLlib, namely Naïve Bayes
and SVM. These ML methods are widely used in the literature to identify and recognize zero-day
malware. The supervised ML algorithms are described below.
Naive Bayes (NB) is a classifier, based on Bayes' theorem, that determines the probability that a data
sample belongs to a particular class. It operates under the assumption that all features independently
contribute to the estimate of the class probability (Meng et al. 2016), i.e. the presence of one
feature in a class is not related to the presence of another.
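The independence assumption can be made concrete with a small from-scratch sketch. It is a minimal Bernoulli-style Naive Bayes with add-one smoothing, not the MLlib implementation used in the experiments; the three binary features and the six training rows are hypothetical.

```python
from collections import defaultdict
from math import log


def train_nb(rows, labels):
    """Estimate class priors and smoothed per-feature Bernoulli likelihoods."""
    n_features = len(rows[0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n_features)
    for row, label in zip(rows, labels):
        class_counts[label] += 1
        for i, value in enumerate(row):
            feature_counts[label][i] += value
    model = {}
    for label, count in class_counts.items():
        prior = count / len(rows)
        # P(feature_i = 1 | class) with add-one (Laplace) smoothing.
        probs = [(feature_counts[label][i] + 1) / (count + 2) for i in range(n_features)]
        model[label] = (prior, probs)
    return model


def predict_nb(model, row):
    """Return the class with the highest posterior log-probability."""
    best_label, best_score = None, float("-inf")
    for label, (prior, probs) in model.items():
        score = log(prior)
        # Independence assumption: per-feature log-likelihoods are simply added.
        for value, p in zip(row, probs):
            score += log(p if value else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical binary features: [calls_network_api, writes_registry, has_valid_signature]
rows = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 1]]
labels = ["malware", "malware", "malware", "benign", "benign", "benign"]
model = train_nb(rows, labels)
```

Calling `predict_nb(model, [1, 1, 0])` scores each class by adding one log-likelihood term per feature, which is exactly where the assumption that features do not influence one another enters.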
SVM is a classifier that plots each data item in an n-dimensional feature space, where the value of
each feature serves as a coordinate. Then, an optimal linear hyperplane is sought that separates the
files of one class from those of the other. This hyperplane is determined using the training tuples
and the margins they define.
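A linear SVM of this kind can be sketched with sub-gradient descent on the regularised hinge loss. This is only an illustrative stand-in for the MLlib classifier; the learning rate, regularisation strength, epoch count, the label convention (+1 for malware, -1 for benign) and the two-dimensional toy points are all assumptions.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=20):
    """Fit w and b by sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Inside the margin or misclassified: adjust the hyperplane
                # so that this point's margin increases.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Safely classified: apply only the regularisation shrinkage.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict_svm(w, b, x):
    """Classify by the side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable feature vectors.
points = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
```

Only points whose margin falls below 1 change the hyperplane, which mirrors the statement that the hyperplane is determined by the training tuples and the margins they define.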
4.1 Evaluation Parameter
The most essential aspect of an ML technique is evaluating its performance. The results feed an
iterative learning process of refining the parameters, which helps to provide a deeper understanding
of the technique. The ML algorithms are tested using different performance metrics. The experimental
findings, obtained with 10-fold cross validation for detecting malware and benign types, are shown in
Table 1.
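For readers unfamiliar with k-fold cross validation, the sketch below shows how sample indices can be split so that every sample is used for testing exactly once. The helper name `k_fold_indices` and the sample count of 25 are hypothetical, and shuffling or class stratification, which a real evaluation would normally apply first, is omitted for brevity.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical dataset of 25 samples split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

A classifier would be trained on each `train` list and scored on the matching `test` list, and the ten scores would then be averaged.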
[Chart: Prediction of Benign. TPR, FPR, TNR and FNR (y-axis, 0 to 1.2) per classification algorithm (NB, SVM)]
Table 1. Predicting malware and benign using classification algorithm

Classification algorithm   Class name   TPR     FPR     TNR     FNR     Accuracy (%)
NB                         Malware      0.835   0       1       0.165   87.13
SVM                        Malware      0.917   0.006   0.986   0.072   93.03
NB                         Benign       1       0.156   0.736   0       87.13
SVM                        Benign       0.975   0.081   0.048   0.006   93.03
Figures 2 and 3 represent the FPR/FNR and the accuracy of malware detection across the corresponding
classifiers. Among the two classifiers, the findings reveal that the SVM classifier is the better fit
to our malware classification dataset, followed by NB.
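The four rates and the accuracy follow the standard confusion-matrix definitions, sketched below. The counts passed in are hypothetical and are not the counts behind Table 1.

```python
def confusion_rates(tp, fp, tn, fn):
    """Compute standard confusion-matrix rates from raw counts."""
    return {
        "TPR": tp / (tp + fn),  # true positive rate (recall)
        "FPR": fp / (fp + tn),  # false positive rate
        "TNR": tn / (tn + fp),  # true negative rate (specificity)
        "FNR": fn / (fn + tp),  # false negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts: 100 malware files (90 caught) and 100 clean files (5 flagged).
rates = confusion_rates(tp=90, fp=5, tn=95, fn=10)
```

By these definitions TPR and FNR sum to 1, as do TNR and FPR, which is a quick consistency check when reading a results table.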
Figure 2. Classification algorithms compared using confusion matrix parameters for malware prediction
Figure 3. Classification algorithms compared using confusion matrix parameters for benign prediction
[Chart: Prediction of malware. TPR, FPR, TNR and FNR (y-axis, 0 to 1.2) per classification algorithm (NB, SVM)]
Figure 4. Comparison of various classifiers based on Accuracy
Results and Discussion
The proposed framework is validated and analyzed for identifying zero-day malware using a sample
dataset that includes a large number of malware families submitted to VirusTotal over the six-year
span 2013 to 2018. The experimental findings indicate that SVM gives the highest accuracy, 93.03%,
with the lowest FPR/FNR, followed by NB with an accuracy of 87.13%.
5. Conclusion
Malware samples are increasing at a remarkable pace, and their identification has already been
recognized as a big data problem. It is important to note that the ability to gather raw data is not
the critical issue. It is the analysis that turns data into information and therefore gives security
analysts more value. The proposed framework addresses the problems and concerns relevant to early
zero-day malware identification. It has the potential to process the data in real time to identify
zero-day malware attacks and offer stakeholders prompt corrective measures. The results of the two
classifiers are compared, and it is observed that SVM performs better at identifying malware. The
outcomes show that SVM attained the best accuracy of 93.03% for detecting malware and benign types
using 10-fold cross validation. The proposed architecture may be extended to cloud infrastructure in
future research. A hybrid solution may be adopted to enable both local cluster and cloud data
processing to further improve the efficiency of the analysis.
References
[1] J. Aycock, “Computer Viruses and Malware,” in Advances in Information Security, Springer-Verlag,
New York, NY, USA, 1st edition, 2006.
[2] G. Mohamed and N. B. Ithnin, “Survey on Representation Techniques for Malware Detection,”
System American Journal of Applied Sciences, 2017.
[3] Praveen Sundar, P.V., Ranjith, D., Vinoth Kumar, V. et al. Low power area efficient adaptive FIR
filter for hearing aids using distributed arithmetic architecture. Int J Speech Technol (2020).
https://doi.org/10.1007/s10772-020-09686-y.
[4] Umamaheswaran, S., Lakshmanan, R., Vinothkumar, V. et al. New and robust composite micro
structure descriptor (CMSD) for CBIR. International Journal of Speech Technology (2019),
doi:10.1007/s10772-019-09663-0.
[5] Karthikeyan, T., Sekaran, K., Ranjith, D., Vinoth kumar, V., Balajee, J.M. (2019) “Personalized
Content Extraction and Text Classification Using Effective Web Scraping Techniques”, International
Journal of Web Portals (IJWP), 11(2), pp.41-52
[6] Vinoth Kumar, V., Arvind, K.S., Umamaheswaran, S., Suganya, K.S (2019), “Hierarchal Trust
Certificate Distribution using Distributed CA in MANET”, International Journal of Innovative
Technology and Exploring Engineering, 8(10), pp. 2521-2524.
[7] Maithili, K , Vinothkumar, V, Latha, P (2018). “Analyzing the security mechanisms to prevent
unauthorized access in cloud and network security” Journal of Computational and Theoretical
Nanoscience, Vol.15, pp.2059-2063.
[8] V.Vinoth Kumar, Ramamoorthy S (2017), “A Novel method of gateway selection to improve
throughput performance in MANET”, Journal of Advanced Research in Dynamical and Control
Systems,9(Special Issue 16), pp. 420-432
[9] Dhilip Kumar V, Vinoth Kumar V, Kandar D (2018), “Data Transmission Between Dedicated Short-
Range Communication and WiMAX for Efficient Vehicular Communication” Journal of
Computational and Theoretical Nanoscience, Vol.15, No.8, pp.2649-2654.
[10] Kouser, R.R., Manikandan, T., Kumar, V.V (2018), “Heart disease prediction system using artificial
neural network, radial basis function and case based reasoning” Journal of Computational and
Theoretical Nanoscience, 15, pp. 2810-2817.
[11] Shalini A, Jayasuruthi L, Vinoth Kumar V, “Voice Recognition Robot Control using Android
Device” Journal of Computational and Theoretical Nanoscience, 15(6-7), pp. 2197-2201
[12] Jayasuruthi L,Shalini A,Vinoth Kumar V.,(2018) ” Application of rough set theory in data mining
market analysis using rough sets data explorer” Journal of Computational and Theoretical
Nanoscience, 15(6-7), pp. 2126-213
[13] E. Bou-Harb, M. Debbabi, and C. Assi, “Cyber scanning: A comprehensive survey,” IEEE
Communications Surveys & Tutorials, vol. 16, no. 3, pp. 1496–1519, 2014.
[14] N. Cao, L. Lu, Y.-R. Lin, F. Wang, and Z. Wen, “SocialHelix: visual analysis of sentiment
divergence in social media,” Journal of Visualization, vol. 18, no. 2, pp. 221–235, 2015.
[15] Deepak Gupta and Rinkle Rani, “Big Data Framework for Zero-Day Malware Detection”,
Cybernetics and Systems, DOI: 10.1080/01969722.2018.1429835,2018.
[16] Sitalakshmi Venkatraman and Mamoun Alazab, “Use of Data Visualisation for Zero-Day Malware
Detection”, Security and Communication Networks, Article ID 1728303, 13 pages, 2018.
[17] TaeGuen Kim, BooJoong Kang, Mina Rho, Sakir Sezer, Eul Gyu Im, “A Multimodal Deep Learning
Method for Android Malware Detection Using Various Features”, IEEE Transactions on Information
Forensics and Security, Vol. 14, No. 3, March 2019.
[18] C.P. Patidar, Neha Verma, “Comparison of Visual Content for Different Browsers,” International
Journal of Computer Science and Engineering, vol. 6, no. 4, pp. 177, April 2018. Accessed on:
October 9, 2018.
[19] C.P. Patidar, Meena Sharma, Varsha Sharda, “Detection of Cross Browser Inconsistency by
Comparing Extracted Attributes,” International Journal of Scientific Research and Engineering, vol.
5, no. 1, pp. 2-3, Feb 2017.
[20] C.P. Patidar and Harshita Khandelwal, “Zero Day Attack Detection Using Machine Learning
Techniques”, IJRAR, Volume 6, Issue 1, January 2019.