Machine Learning Classifiers for Android Malware Detection

Authors:
  • GLS Institute of Computer Technology

Prerna Agrawal and Bhushan Trivedi
Abstract With the growing popularity of Android devices, the platform has become
more prone to malware attacks. Many tools are available for scanning Android
malware, but most of them perform only static analysis and require a lot of
resources and manual overhead. This study aims to improve the detection of Android
malware by using Machine Learning Classifiers. The paper analyzes different Android
malware detection techniques that use different Machine Learning Classifiers and
discusses their strengths, weaknesses, and future scope. It concludes that, among
the classifiers studied, Random Forest has the greatest accuracy compared to SVM
and Naive Bayes, and that Random Forest, SVM, and Naive Bayes are the classifiers
most often used for performance evaluation.
Keywords Machine learning · Android malware · Static analysis · Malware
detection · Android mobile security · Dynamic analysis
1 Introduction
Smartphones are now used extensively. As new technologies become easier to use,
smartphones are becoming a basic need of the end-user [1]. In 2016, Google's
Android led the market with an 82% share [1], and around 1.5 billion smartphones
were sold to end-users. Because the Android system is so popular, it is also more
vulnerable to malware attacks. Avast reported a 40% increase in cyber-attacks on
Android since 2016 [1]. A total of 316 weaknesses were found in the Android OS in
2017, more than in any other operating system [2].
P. Agrawal (✉) · B. Trivedi
Faculty of Computer Technology (MCA), GLS University, Ellisbridge, Ahmedabad, Gujarat, India
e-mail: prerna.agrawal@glsuniversity.ac.in
B. Trivedi
e-mail: bhushan.trivedi@glsuniversity.ac.in
© Springer Nature Singapore Pte Ltd. 2021
N. Sharma et al. (eds.), Data Management, Analytics and Innovation,
Advances in Intelligent Systems and Computing 1174,
https://doi.org/10.1007/978-981-15-5616-6_22
In [3], various online Android malware scanning tools are studied and briefly
compared. The comparison shows that most existing Android malware scanning tools
perform static analysis and take a long time to scan a single file [3]. These tools
also require manual overhead and heavy resources to perform the scanning [3]. In
this situation, machine learning is a suitable solution for detecting malware.
Machine Learning Classifiers make it possible to automate the malware detection
system, which improves detection precision and reduces the time, heavy resource
usage, and manual overhead [1]. A study and detailed comparison of Android malware
detection using different Machine Learning Classifiers is therefore needed.
The paper is organized as follows: Sect. 2 describes the related work on detecting
Android malware using Machine Learning Classifiers. Section 3 describes the
different Machine Learning Classifiers used. Section 4 provides a comparative study
of detecting Android malware using Machine Learning Classifiers. Section 5
concludes the paper.
2 Related Work
Researchers have proposed many approaches for detecting Android malware using
different Machine Learning Classifiers. The Android malware detection techniques
are static analysis, dynamic analysis, and hybrid analysis [2].
Static analysis reverse engineers the APK file, focusing on the Android Manifest
file, to detect malware [2]. Monica [4] uses static analysis, applies different
Machine Learning Classifiers to the extracted features, and improves static malware
detection. Koli [1] uses static analysis, applies different Machine Learning
Classifiers to the features, and proposes a system named RanDroid. Mathew [5] uses
static analysis, applies different Machine Learning Classifiers to the features,
and proposes a system based on examining permissions. Justin [12] uses static
analysis, applies different Machine Learning Classifiers, and proposes an original
machine learning-based malware detection system. Zarni [7] uses static analysis,
applies different Machine Learning Classifiers to the features, and proposes a
framework for classifying Android applications.
Dynamic analysis focuses mainly on the runtime behavior of an application [2].
Ham [8] uses dynamic analysis, applies different Machine Learning Classifiers to
runtime features, and recommends a feature selection method that reduces the false
detection rate for malware. Chang [9] uses dynamic analysis, applies different
Machine Learning Classifiers to runtime features, and proposes a Robotium program.
Chieh [11] uses dynamic analysis, applies different Machine Learning Classifiers to
runtime features, and proposes a framework named DroidDolphin. Yu [10] uses dynamic
analysis, applies different Machine Learning Classifiers to runtime features, and
proposes a malware detection system.
3 Machine Learning Classifiers
Machine Learning Classifiers are mainly divided into two categories: supervised
learning and unsupervised learning [1, 4, 5, 7–12]. Supervised learning, also known
as predictive learning, predicts the class of unknown objects based on prior
class-related information about similar objects [6]. Unsupervised learning, also
known as descriptive learning, finds patterns in unknown objects by grouping
similar objects together [6].
According to the studies [1, 4, 5, 7–12], the Machine Learning Classifiers mainly
used are as follows.
3.1 Naive Bayesian
Naive Bayesian is used for classification tasks that assign class labels to problem
instances [12, 13]. It needs only a small amount of training data to estimate the
parameters required for classification. Naive Bayesian classifiers are simple
linear classifiers and are known for their straightforward and accurate results
[6]. The strengths of this classifier are that it is simple and fast to compute, it
performs well with noisy and missing data, it works with both small and large
amounts of training data, and it obtains accurate results in a straightforward way
[6]. Its weakness is that the assumption of equal importance and independence of
features often does not hold; if the dataset contains a large number of numeric
features, the accuracy and reliability of the output become limited [6]. Text
classification, spam filtering, and online sentiment analysis are some applications
of the Naive Bayesian classifier [6].
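To illustrate how such a classifier assigns class labels, the following is a minimal sketch of a Bernoulli Naive Bayes classifier over binary permission features, written in plain Python. The feature columns, the toy data, and the function names train_bernoulli_nb and predict_nb are hypothetical and are not taken from any of the surveyed systems. Laplace smoothing (the alpha parameter) is an added assumption that avoids zero probabilities.

```python
import math

def train_bernoulli_nb(samples, labels, alpha=1.0):
    """Estimate class priors and per-feature probabilities with Laplace smoothing."""
    n_features = len(samples[0])
    model = {}
    for c in sorted(set(labels)):
        rows = [x for x, y in zip(samples, labels) if y == c]
        prior = len(rows) / len(samples)
        # P(feature_j = 1 | class c), smoothed so it is never exactly 0 or 1
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (prior, probs)
    return model

def predict_nb(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for xj, p in zip(x, probs):
            # Independence assumption: per-feature log-likelihoods simply add up
            score += math.log(p if xj else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: columns = [SEND_SMS, READ_CONTACTS, INTERNET]
X = [[1, 1, 1], [1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["malware", "malware", "malware", "benign", "benign", "benign"]
nb_model = train_bernoulli_nb(X, y)

print(predict_nb(nb_model, [1, 1, 1]))
print(predict_nb(nb_model, [0, 0, 1]))
```

The per-feature products in predict_nb are exactly where the independence assumption discussed above enters the computation.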
3.2 Support Vector Machine
Support Vector Machine (SVM) is a model used for linear classification and
regression that is based on the concept of a surface called a hyperplane, which
draws a boundary between data instances plotted in a multidimensional feature space
[6]. It is used to separate the data instances belonging to different classes. The
strengths of SVM are that it can be used for both regression and classification, it
is robust, and its predictions are very accurate [6]. The weaknesses of SVM are
that in its basic form it applies only to binary classification, it is very
complex, it is very slow with large datasets, and it is memory-intensive [6].
Cancer detection and face detection in images are some applications of the SVM
classifier [6].
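The idea of a separating hyperplane can be sketched with a small linear SVM trained by sub-gradient descent on the regularised hinge loss. This is an illustrative sketch only: the two-dimensional toy data, the hyperparameters (lr, lam, epochs), and the function names are assumptions, and the surveyed systems may use other kernels or solvers.

```python
def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=200):
    """Fit w and b by sub-gradient descent on the L2-regularised hinge loss.

    Labels in y must be +1 or -1.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: move the hyperplane
                w = [wi - lr * (lam * wi - label * xi) for wi, xi in zip(w, x)]
                b += lr * label
            else:
                # Point is safely classified: apply only the regularisation shrink
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict_svm(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable toy data (+1 = malware, -1 = benign)
X_svm = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -2], [-2, -3]]
y_svm = [1, 1, 1, -1, -1, -1]
w_svm, b_svm = train_linear_svm(X_svm, y_svm)

print(predict_svm(w_svm, b_svm, [3, 3]))
print(predict_svm(w_svm, b_svm, [-3, -3]))
```

The hyperplane here is the set of points where w · x + b = 0; the sign of that expression decides the class, which is why the basic formulation is binary.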
3.3 Random Forest
Random Forest is an ensemble classifier that combines many decision tree
classifiers [6]. A set of decision trees is created from randomly selected subsets
of the dataset [14]. Once the forest has been generated, a majority vote combines
the outputs of the different trees [6, 14]. The strengths of Random Forest are that
it works well on large and expansive datasets; it has a robust method for
estimating missing data and maintains precision when a large proportion of the data
is missing; it has techniques for balancing errors when the class population in the
dataset is unbalanced; it estimates which features are most important in the
overall classification; generated forests can be saved for future use on other
data; and it can be used for both classification and regression [6]. The weaknesses
of Random Forest are that it is difficult to interpret because it combines multiple
decision trees, and it is much more expensive than a simple model such as a single
decision tree [6].
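The two ingredients described above, random selection of training subsets and majority voting, can be sketched as follows. The three "trees" are hypothetical hand-written stand-ins for trained decision trees, and the feature names are assumptions made for illustration; a real Random Forest would train each tree on its own bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as done before training each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(trees, x):
    """Combine the predictions of all trees by taking the most common label."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained trees; each inspects a single feature.
# x = [uses_sms, uses_reflection, uses_native_code]
trees = [
    lambda x: "malware" if x[0] else "benign",
    lambda x: "malware" if x[1] else "benign",
    lambda x: "malware" if x[2] else "benign",
]

rows = [[1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0]]
sample = bootstrap_sample(rows, random.Random(0))  # seeded for reproducibility

print(majority_vote(trees, [1, 1, 0]))  # two of three trees vote "malware"
print(majority_vote(trees, [0, 0, 1]))  # two of three trees vote "benign"
```

An odd number of trees is used so that a two-class vote cannot tie.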
3.4 Logistic Regression
Logistic Regression is used both in classification and regression [6]. It is a kind
of regression analysis used to predict the outcome of a categorical dependent
variable. It is used for binary classification [15]. The strengths of Logistic
Regression are that it is very efficient, does not need high computational
resources, does not require the input features to be scaled, gives accurate
predictions, and is simple and easy to implement [15]. The weaknesses of Logistic
Regression are that it cannot solve non-linear problems and that it does not work
well unless all the independent variables are clearly identified [15].
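A minimal sketch of binary logistic regression, trained by batch gradient descent on the log-loss, is shown below. The single feature (a hypothetical count of dangerous permissions per application), the toy labels, and the learning rate and epoch count are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit weight w and bias b by batch gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every sample under the current parameters
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data: x = number of dangerous permissions, y = 1 for malware
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
w_lr, b_lr = train_logistic(xs, ys)

p_low = sigmoid(w_lr * 0 + b_lr)   # estimated malware probability at x = 0
p_high = sigmoid(w_lr * 5 + b_lr)  # estimated malware probability at x = 5
print(p_low, p_high)
```

The model outputs a probability, and a threshold (commonly 0.5) turns it into a binary label, which is why the method suits two-class problems such as malware versus benign.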
3.5 K-Means Clustering
K-means is a partitioning-based clustering technique in machine learning [6]. It is
known as a centroid-based technique. In K-means, each of the n data points is
assigned to one of K clusters, where K is a user-defined parameter giving the
desired number of clusters [6]. The strengths of K-means clustering are that it is
very flexible and fits most scenarios and complexities, and its performance and
efficiency are very high [6]. The weaknesses of K-means clustering are that it
involves random chance and may not produce an optimal set of clusters in some
cases, and that the user needs some experience to guess the number of natural
clusters for an efficient outcome [6].
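The assignment and update steps of K-means can be sketched in plain Python as follows. To keep the example reproducible, the initial centroids are passed in explicitly rather than chosen at random, which is an assumption of this sketch; the toy points and function names are likewise hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data with two well-separated groups, K = 2
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
km_labels, km_centroids = kmeans(points, centroids=[points[0], points[3]])
print(km_labels)
print(km_centroids)
```

With a poor choice of initial centroids the same loop can settle on a different, worse partition, which is the random-chance weakness noted above.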
4 Comparative Study of Detecting Android Malware Using
Machine Learning Classifiers
In this section, a detailed comparison of approaches for detecting Android malware
using machine learning techniques is presented [1, 4, 5, 7–12]. The comparison
parameters are Paper, Analysis Type, Input, Dataset Type, Final Dataset, Machine
Learning Type, Machine Learning Classifiers, Detection Rate, Performance Evaluation
Criteria, Comparison with other Machine Learning Classifiers, and Proposed
Approach. Table 1 shows the detailed comparison for detecting Android malware using
Machine Learning Classifiers.
4.1 Analysis Type
This parameter defines the type of analysis performed by the system. It can be
static, dynamic, or hybrid analysis. Monica [4], Koli [1], Mathew [5], Justin [12],
and Zarni [7] perform static analysis. Ham [8], Chang [9], Chieh [11], and Yu [10]
perform dynamic analysis.
4.2 Input
This parameter defines the type of input taken by each system. Monica [4] takes
Permissions and Intents as input. Ham [8] takes Native Size, other_shared, VMPeak,
VMData, VMLib, Dalvik_Rss, cpu_usage, RxBytes, and Send_sms as input. Chang [9]
takes Permissions, Intent Receivers, Network Activities, and File read/write
permissions as input. Koli [1] takes Requested Permissions, Vulnerable API Calls,
Dynamic Code, Reflection Code, Cryptographic Code, Database, and Native Code as
input. Mathew [5] takes Permissions as input. Justin [12] takes Permissions as
input. Chieh [11] takes runtime logs of applications as input. Zarni [7] takes
Permissions as input. Yu [10] takes System calls as input.
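Before a classifier can use permissions as input, they are typically turned into a numeric feature vector. The sketch below shows one common encoding, a 0/1 vector over a fixed permission vocabulary. The vocabulary and the helper name encode_permissions are hypothetical; the surveyed papers may extract and encode their features differently.

```python
# Hypothetical feature vocabulary; a real system would derive it from its dataset.
PERMISSION_VOCAB = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.RECEIVE_BOOT_COMPLETED",
]

def encode_permissions(requested, vocab=PERMISSION_VOCAB):
    """Return a 0/1 vector marking which vocabulary permissions are requested."""
    requested = set(requested)
    return [1 if perm in requested else 0 for perm in vocab]

# Permissions as they might be listed in one application's manifest
app_permissions = ["android.permission.SEND_SMS", "android.permission.INTERNET"]
vector = encode_permissions(app_permissions)
print(vector)
```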
4.3 Dataset Type
This parameter defines whether the data used for the experiments in the system is a
training dataset or a real dataset. Monica [4], Koli [1], Mathew [5], Justin [12],
Chieh [11], and Yu [10] use a training dataset for their experiments. Ham [8],
Chang [9], and Zarni [7] do not specify the dataset type.
Table 1 Comparison of detecting Android malware using Machine Learning Classifiers
| Paper | Analysis type | Input | Dataset type | Final dataset | ML type | ML classifiers | Detection rate | Performance evaluation criteria | Comparison with other ML classifiers | Proposed approach |
|---|---|---|---|---|---|---|---|---|---|---|
| Monica [4] | Static | Permissions, intents | Training | 500 benign applications and 500 malicious applications | Supervised learning | Cubic SVM | 91.7% | Not mentioned | Linear discriminant SVM, weighted KNN, complex tree, linear SVM, coarse KNN | Improves static malware detection |
| Ham [8] | Dynamic | Native size, other_shared, VMPeak, VMLib, Dalvik_Rss, RxBytes, VMData, send_sms, cpu_usage | Not specified | 11,268 benign applications and 3526 malicious applications | Supervised learning | Naïve Bayesian, random forest, logistic regression, SVM | 99% with random forest | FPR, TPR | 10-fold cross-validation | Feature selection method and reduction of false detection of malware |
| Chang [9] | Dynamic | Permissions, intent receivers, network activities, file read/write permissions | Not specified | Not specified | Supervised learning | K-fold cross-validation | 97% | FPR, TPR, accuracy | Random forest, J48, LMT, LogitBoost, bagging, KNN, KStar, PART, BayesNet | A Robotium program |
| Koli [1] | Static | Requested permissions, vulnerable API calls, dynamic code, reflection code, native code, cryptographic code, database | Training | 120 benign applications and 175 malicious applications | Supervised learning | SVM | 97.7% | FPR, accuracy, recall rate, precision, F-measure | Decision tree, Naïve Bayes, random forest | A system named RanDroid |
| Mathew [5] | Static | Permissions | Training | 2444 benign applications and 870 malicious applications | Supervised learning | SVM | 80% | Not specified | Neural networks, classification trees, fuzzy clustering, random forest of decision trees | Android malware detection technique built on examining permissions |
| Justin [12] | Static | Permissions | Training | 2081 benign applications and 91 malicious applications | Supervised learning | One-class SVM | Not specified | Not specified | Not specified | A malware detection system based on machine learning |
| Chieh [11] | Dynamic | Runtime logs of applications | Training | 32,000 benign applications and 32,000 malicious applications | Supervised learning | SVM | 86.1% | Recall rate, FPR, precision rate, accuracy, F-score | BayesNet, Naïve Bayes, J48, random forest, multilayer perceptron, logistic | A dynamic malware analysis framework named DroidDolphin |
| Zarni [7] | Static | Permissions | Not mentioned | 700 applications | Unsupervised learning | K-means clustering | 91.75% with random forest | FPR, TPR, TP, FP, FN, TN, overall accuracy | Random forest, J48, CART | A framework for classifying Android applications |
| Yu [10] | Dynamic | System calls | Training | 96 benign applications and 92 malware applications | Supervised learning | SVM, Naïve Bayes | 78% | Detection rate, error rate, training time, detection time | Not specified | A malware detection system that uses behavior-based detection |
4.4 Final Dataset
This parameter describes the composition of the final dataset used by each system.
Monica [4] uses 500 benign applications and 500 malicious applications. Ham [8]
uses 11,268 benign applications and 3526 malicious applications. Chang [9] does not
specify the final dataset. Koli [1] uses 120 benign applications and 175 malicious
applications. Mathew [5] uses 2444 benign applications and 870 malicious
applications. Justin [12] uses 2081 benign applications and 91 malicious
applications. Chieh [11] uses 32,000 benign applications and 32,000 malicious
applications. Zarni [7] uses 700 applications. Yu [10] uses 96 benign applications
and 92 malware applications.
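The dataset sizes listed above differ widely in class balance, which affects how a reported accuracy should be read. As a small worked example, the sketch below computes the share of malicious samples in each dataset and the accuracy that a trivial model would reach by always predicting the larger class. The counts are the ones listed in this section; the helper names are hypothetical, and the majority-class baseline is an illustration added here, not a result reported by the surveyed papers.

```python
# (benign, malicious) counts as listed in this section
final_datasets = {
    "Monica [4]": (500, 500),
    "Ham [8]": (11268, 3526),
    "Koli [1]": (120, 175),
    "Mathew [5]": (2444, 870),
    "Justin [12]": (2081, 91),
    "Chieh [11]": (32000, 32000),
    "Yu [10]": (96, 92),
}

def malicious_share(benign, malicious):
    """Fraction of the dataset that is malicious."""
    return malicious / (benign + malicious)

def majority_baseline_accuracy(benign, malicious):
    """Accuracy of a trivial model that always predicts the larger class."""
    return max(benign, malicious) / (benign + malicious)

for paper, (benign, malicious) in final_datasets.items():
    print(
        f"{paper}: {malicious_share(benign, malicious):.1%} malicious, "
        f"majority baseline {majority_baseline_accuracy(benign, malicious):.1%}"
    )
```

On a strongly unbalanced dataset the baseline is already high, so measures such as TPR and FPR say more about detection quality than accuracy alone.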
4.5 Machine Learning Type
This parameter defines the type of machine learning used. It can be supervised
learning, unsupervised learning, or reinforcement learning [6]. Monica [4], Ham
[8], Chang [9], Koli [1], Mathew [5], Justin [12], Chieh [11], and Yu [10] use
supervised learning. Zarni [7] uses unsupervised learning.
4.6 Machine Learning Classifiers
This parameter defines the Machine Learning Classifiers or algorithms used in the
system. Monica [4] uses Cubic Support Vector Machine (SVM). Ham [8] uses Naive
Bayes, Random Forest, Logistic Regression, and Support Vector Machine (SVM). Chang
[9] is listed with K-fold cross-validation, which is a validation procedure rather
than a classifier. Koli [1] uses a Support Vector Machine (SVM). Mathew [5] uses a
Support Vector Machine (SVM). Justin [12] uses a one-class Support Vector Machine
(SVM). Chieh [11] uses a Support Vector Machine (SVM). Zarni [7] uses K-Means
Clustering. Yu [10] uses Naïve Bayesian and Support Vector Machine (SVM).
4.7 Detection Rate
This parameter shows the detection rate for detecting malware accurately. In Monica
[4], the detection rate is 91.7%. In Ham [8], the detection rate is 99% with Random
Forest classifier. In Chang [9], the detection rate is 97%. In Koli [1], the detection rate
is 97.7%. In Mathew [5], the detection rate is 80%. In Chieh [11], the detection rate
is 86.1%. In Zarni [7], the detection rate is 91.75% with Random Forest classifier.
In Yu [10], the detection rate is 78%.
4.8 Performance Evaluation Criteria
This parameter defines the measures used as performance evaluation criteria for the
Machine Learning Classifiers. Ham [8] uses False Positive Rate (FPR) and True
Positive Rate (TPR). Chang [9] uses FPR, TPR, and Accuracy. Koli [1] uses FPR,
Accuracy, Recall rate, Precision, and F-measure. Chieh [11] uses Recall rate, FPR,
Precision rate, Accuracy, and F-Score. Zarni [7] uses TP, FP, TN, FN, TPR, FPR, and
Overall Accuracy. Yu [10] uses Detection Rate, Error Rate, Training Time, and
Detection Time.
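Most of the criteria named above derive from the four confusion-matrix counts TP, FP, TN, and FN. The sketch below computes them using the standard definitions; the example counts are hypothetical and are not taken from any surveyed paper, and the function assumes that every denominator is non-zero.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute common detection metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes occur and at least
    one sample is predicted positive).
    """
    tpr = tp / (tp + fn)        # true positive rate, also called recall
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {
        "TPR": tpr,
        "FPR": fpr,
        "precision": precision,
        "accuracy": accuracy,
        "F-measure": f_measure,
    }

# Hypothetical counts: 100 malware samples and 100 benign samples
metrics = evaluation_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```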
4.9 Comparison with Other Machine Learning Classifiers
This parameter lists the other Machine Learning Classifiers against which each
system is compared using the performance evaluation criteria. Monica [4] uses
Coarse KNN, Weighted KNN, Complex Tree, Linear SVM, and Linear Discriminant SVM.
Ham [8] is listed with 10-fold cross-validation, which is a validation procedure
rather than a classifier. Chang [9] uses Random Forest, J48, LMT, LogitBoost,
Bagging, KNN, KStar, PART, and BayesNet. Koli [1] uses Decision Tree, Naïve Bayes,
and Random Forest. Mathew [5] uses Neural Networks, Classification Trees, Fuzzy
Clustering, and Random Forest of decision trees. Chieh [11] uses BayesNet, Naïve
Bayes, J48, Random Forest, Multilayer Perceptron, and Logistic. Zarni [7] uses
Random Forest, J48, and CART.
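Comparisons of this kind are usually run under K-fold cross-validation, in which the dataset is split into K parts and each part serves once as the test set. The following sketch generates the index splits only; the function name is hypothetical, the folds are contiguous, and no shuffling is done, whereas in practice samples are normally shuffled (and often stratified by class) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each classifier is trained on the train indices and scored on the test indices of every fold, and the scores are then averaged.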
4.10 Proposed Approach
This parameter defines the approaches proposed by the different researchers. In
Monica [4], static malware detection is improved by comparing different Machine
Learning Classifiers on a Manifest file dataset. In Ham [8], a feature selection
method is proposed and experiments are done to reduce the false detection rate for
malware. In Chang [9], a Robotium program in an Android sandbox is proposed, which
triggers the Android application automatically and monitors its behavior. Koli [1]
proposed a system named RanDroid, which detects malicious applications in the
Android system by using machine learning techniques. In Mathew [5], an Android
malware detection technique is developed by examining permissions. Justin [12]
proposed an original machine learning-based malware detection system for the
Android OS. Chieh [11] proposed a dynamic malware analysis framework named
DroidDolphin, which uses Big Data analysis, GUI-based testing, and machine learning
to detect malicious Android applications. Zarni [7] proposed a framework that uses
machine learning techniques to classify Android applications for malware detection.
In Yu [10], a malware detection system is proposed that uses a behavior-based
detection approach.
Based on the comparative study of detecting Android malware using Machine Learning
Classifiers, it can be concluded that every approach has some limitations. In
Monica [4], the dataset is very small and the detection rate is not high. The
classifiers depend only on the Manifest file, and the approach uses only static
analysis and lacks dynamic analysis. In Ham [8], the detection rate varies widely
across the different Machine Learning Classifiers. In Chang [9], very few features
are selected for analysis. In Koli [1], the dataset is small with few features. The
quality of the detection model depends critically on the availability of malicious
and benign applications, and it is good only for a small, random set of
applications. It uses only static analysis and lacks dynamic analysis. In Mathew
[5], the dataset is very small with few features, and the detection rate is not
high. It uses only static analysis and lacks dynamic analysis. In Justin [12], the
dataset is very small, and the approach uses only static analysis and lacks dynamic
analysis. In Chieh [11], the detection rate is not high. It takes up to 5 min to
run an APK file and do the analysis, so it is time-consuming and less efficient.
Also, it cannot detect malware that uses anti-emulation techniques. In Zarni [7],
the detection rate is not high and the dataset is very small with few features. It
uses only static analysis and lacks dynamic analysis. In Yu [10], the detection
rate is not high and the dataset is very small.
5 Conclusion
Based on the above study, it can be concluded that the accuracy of malware
detection is higher with the Random Forest classifier than with the SVM and Naive
Bayesian classifiers. Random Forest, SVM, and Naive Bayesian are the Machine
Learning Classifiers most often used for performance evaluation. A generalized
malware detection model using Machine Learning Classifiers is still lacking. A
generalized model that combines supervised and unsupervised Machine Learning
Classifiers should therefore be proposed to increase the efficiency and accuracy of
detection with a large dataset and more features. Random Forest, SVM, and Naive
Bayes classifiers should also be used for the performance evaluation of that model.
References
1. Koli, J. D. (2018). RanDroid: Android malware detection using random machine learning classifiers. In: International Conference on Technologies for Smart City Energy Security and Power (ICSESP), IEEE, Mar 2018.
2. Agrawal, P., & Trivedi, B. (2019). A survey on android malware and their detection techniques. In: Third International Conference on Electrical, Computer and Communication Technologies (ICECCT), IEEE, Feb 2019.
3. Agrawal, P., & Trivedi, B. (2019). Analysis of android malware scanning tools. International Journal of Computer Sciences and Engineering, 7(3), 807–810.
4. Kumaran, M., & Li, W. (2016). Lightweight malware detection based on machine learning algorithms and the android manifest file. In: MIT Undergraduate Research Technology Conference (URTC), IEEE, Nov 2016.
5. Leeds, M., & Atkison, T. (2016). Preliminary results of applying machine learning algorithms to android malware detection. In: International Conference on Computational Intelligence (ICCI), IEEE, Dec 2016.
6. Dutt, S., Chandramouli, S., & Das, A. K. (2019). Machine Learning (1st ed.). India: Pearson.
7. Aung, Z., & Zaw, W. (2013). Permission-based android malware detection. International Journal of Scientific and Technology Research, 2(3).
8. Ham, H. S., & Choi, M. J. (2013). Analysis of android malware detection performance using machine learning classifiers. In: International Conference on ICT Convergence (ICTC), IEEE, Oct 2013.
9. Chang, W. L., & Wu, W. (2016). An android behaviour-based malware detection method using machine learning. In: International Conference on Signal Processing, Communications, and Computing (ICSPCC), IEEE, Aug 2016.
10. Yu, W., & Zhang, H. (2013). On behaviour-based detection of malware on android platform. In: Communication and Information System Security Symposium (Globecom), IEEE, Dec 2013.
11. Wu, W. C., & Hung, S. H. (2014). DroidDolphin: A dynamic android malware detection using big data and machine learning. In: Research in Adaptive and Convergent Systems (RACS), ACM, Oct 2014.
12. Sahs, J., & Khan, L. (2012). A machine learning approach to android malware detection. In: European Intelligence and Security Informatics Conference (EISIC), IEEE, Aug 2012.
13. Naïve Bayesian Classifier. https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c.
14. Random Forest Classifier. https://medium.com/machine-learning-101/chapter-5-random-forest-classifier-56dc7425c3e1.
15. Logistic Regression Classifier. https://machinelearning-blog.com/2018/04/23/logistic-regression-101/.
... Emphasis is placed on feature extraction from PE files for enhanced detection [15]. The rapid expansion of Android devices also instigated research into malware detection on this platform using both machine and deep learning, which yields high accuracy rates [5,16,17]. The integration of static and dynamic analysis methods, like OPEM, offers increased detection capabilities [18]. ...
Article
Full-text available
Recent advancements in cybersecurity threats and malware have brought into question the safety of modern software and computer systems. As a direct result of this, artificial intelligence-based solutions have been on the rise. The goal of this paper is to demonstrate the efficacy of memory-optimized machine learning solutions for the task of static analysis of software metadata. The study comprises an evaluation and comparison of the performance metrics of three popular machine learning solutions: artificial neural networks (ANN), support vector machines (SVMs), and gradient boosting machines (GBMs). The study provides insights into the effectiveness of memory-optimized machine learning solutions when detecting previously unseen malware. We found that ANNs shows the best performance with 93.44% accuracy classifying programs as either malware or legitimate even with extreme memory constraints.
... The method for detecting the malware should be capable enough to detect the malware correctly or with high accuracy, and secondly, to protect the mobile device or user from the unseen attack by the intruder [14]. The Androidbased malware detection techniques can be divided into two parts, i.e., dynamic detection approaches and static malware detection approaches [15,16]. The static malware detection approaches make use of static or historical data, and the model is built on the basis of existing data [17]. ...
Article
Full-text available
Mobile connectivity and smart devices are spreading worldwide. As a result, the use of mobile devices and applications is rising exponentially. Therefore, nowadays hackers target such smart devices to steal information and misuse it for malicious purposes. It becomes absolutely essential to protect sensitive information such as app. permissions, login credentials, browse history, media contents etc. from intruders. Security can be breached easily if smart techniques are not devised to safeguard mobile data. In this article, an attempt is made to classify the different types of malware and to protect the sensitive information on Android devices that significantly reduce network congestion and improve network throughput by increasing data transmission. The proposed hybrid approach consists of AdaBoost, random forest and deep learning methods jointly classify the sophisticated malware. The empirical results indicate that this achieves better classification and detection accuracy and is capable of identifying the potential threat more efficiently.
... In another work, Agrawal et al. [6] addressed the challenge of detecting new and generic malware using conventional methods, which proved to be inadequate. To overcome the lack of readily available machine learning datasets for malware detection, they generated their own dataset by collecting a variety of malware files from renowned malware projects. ...
Chapter
Privacy is a myth, a statement persistently encountered when talking about the world of Internet. Malwares are a constant, ominous threat to data which cripples the cyberspace today. The myth of digital privacy began with the conception and subsequent proliferation of malwares. Any device connected through the Internet is a potential target and runs the risk of its security being breached and information being compromised. In this paper, a benchmarked dataset Big 2015 is used for the malware classification experiment. Seven different machine learning models namely Random Forest, Support Vector Machines, Logistic Regression, Naïve Bayes, AdaBoost, Gradient Boost and Bagging, are used to train and test the dataset and to establish the one that performs the best. The performance metrics put in place are Accuracy, Precision, Recall and F1-score. It is seen that ensemble machine learning approach, namely Random Forest, Bagging and Gradient Boost performed better in accordance to the performance parameters considered.
Chapter
Malware, a term derived from malicious software, includes any specially designed software that provides unauthorized access to computer systems and networks to disrupt devices. It assumes a critical role in emphasizing the significance of security within Android operating systems. As our world increasingly depends on smartphones for diverse activities, including communication, banking, and accessing sensitive information, the potential risks posed by malware grow more pronounced. Android devices can fall victim to the infiltration of malicious software, resulting in compromised user privacy, personal data theft, and financial harm. The prevalence of malware serves as a powerful reminder that robust security measures are indispensable for Android systems. It compels users and developers to remain vigilant, continuously update their devices, and employ effective antivirus and anti-malware solutions. By comprehending the potential dangers associated with malware, users can adopt safe browsing practices, steer clear of suspicious downloads, and safeguard their devices, ensuring a secure and dependable Android experience. Machine learning (ML) assumes a pivotal role in the realm of malware detection, delivering significant benefits and advancements in cybersecurity. In this study, we have developed a machine learning–based malware detection system that exhibits enhanced detection accuracy, adaptive and dynamic protection mechanisms, and improved zero-day threat detection. According to the experimental results of the research conducted, it shows the efficiency of the proposed models.
Article
Full-text available
Malicious malware targeting Android systems has alarmingly increased due to the quick spread of Android devices. For these devices to be secure and to protect the private data of users, Android virus detection is essential. The selection of features, model performance, and efficiency are issues with existing Android malware detection techniques. To overcome these drawbacks, we suggest a unique method for identifying malicious Android data that combines Tree Seed Optimization with Support Vector Machines (TSO-SVM).TSO is a nature-inspired optimization technique that looks for the best feature subsets by simulating the tree's seed dispersal process. The efficiency and effectiveness of SVM-based classification are increased by our method's use of TSO to choose the most instructive features from the Android malware dataset. To normalize the features of the Android application dataset before training, we use a data-cleaning method known as Z-Score normalization. Our Android malware detection solution uses Independent Component Analysis (ICA) as a feature reduction method. Our test results show how well the TSO-SVM technique works at detecting Malicious Android data. In terms of accuracy, precision, recall, and F1-Score for malicious detection, the suggested model achieves 97.12%, 96.35%, 97.88%, and 96.84%, respectively. The proposed technique successfully solves the problem of suboptimal classification accuracy in the presence of dynamic and changing malware threats. The results of this work highlight the potential of TSO techniques for enhancing the security of Android-based devices and present a promising direction for further investigation in the area of mobile security.
Conference Paper
In this paper, we propose an Android behavior-based malware detection method using machine learning. We improve an Android application sandbox, Droidbox, by inserting a view-identification automatic trigger program that can click through mobile applications in a meaningful order. Using the Droidbox results, we collect behaviors such as network activity, file reads and writes, and permissions as feature data, then use different machine learning algorithms to classify malware and evaluate their performance. Experiments on a large number of malware and benign application samples show that our method achieves high accuracy.
Conference Paper
Because of the exponential growth in smart mobile devices, malware attacks on them have been increasing and pose serious threats to their users. To address this issue, we develop a malware detection system that uses a behavior-based approach to detect a large number of unknown malware samples. To detect malware accurately, we examine system calls to capture the runtime behavior of software as it interacts with the operating system, and adopt machine learning approaches such as Support Vector Machine (SVM) and Naive Bayes to learn the dynamic behavior of software execution. Using real-world malware and benign samples, we conduct experiments on Android devices and evaluate the effectiveness of our system with respect to the learning algorithm, the size of the training set, the length of n-grams, and the overhead of the training and detection processes. Our experimental data demonstrate the effectiveness of the proposed system in detecting malware.
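The idea of turning a system call trace into n-gram features, as described in this abstract, can be sketched as follows. This is an illustrative assumption, not the authors' code: the helper names `syscall_ngrams` and `to_feature_vector`, the toy trace, and the choice of relative frequencies as feature values are all hypothetical.

```python
from collections import Counter


def syscall_ngrams(trace, n):
    """Count every run of n consecutive system calls in a trace."""
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))


def to_feature_vector(counts, vocabulary):
    """Map n-gram counts to relative frequencies over a fixed vocabulary."""
    total = sum(counts.values())
    return [counts[gram] / total if total else 0.0 for gram in vocabulary]


# Hypothetical trace recorded while an app runs
trace = ["open", "read", "write", "open", "read", "close"]
counts = syscall_ngrams(trace, 2)
vocabulary = sorted(counts)  # in practice, built from the whole training set
vector = to_feature_vector(counts, vocabulary)
```

Vectors of this form could then be passed to a classifier such as SVM or Naive Bayes. The n-gram length `n` trades context against sparsity: longer n-grams capture more of the call order but make the vocabulary grow quickly, which is why the abstract treats it as an evaluation parameter.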
Conference Paper
With the recent emergence of mobile platforms capable of executing increasingly complex software, and the growing use of mobile platforms in sensitive applications such as banking, there is a rising danger from malware targeted at mobile devices. Detecting such malware presents unique challenges because of the limited resources available and the limited privileges granted to the user, but it also presents a unique opportunity in the required metadata attached to each application. In this article, we present a machine learning-based system for the detection of malware on Android devices. Our system extracts a number of features and trains a One-Class Support Vector Machine in an offline (off-device) manner, in order to leverage the higher computing power of a server or cluster of servers.
Conference Paper
As mobile devices have come to support a wide range of services and content, a great deal of personal information, such as private SMS messages and bank account details, is now stored on them. Attackers have therefore extended their reach from the traditional PC and Internet environment to mobile devices. Previous studies evaluated the malware detection performance of machine learning classifiers by collecting and analyzing event, system call, and log information generated on Android mobile devices. However, monitoring unnecessary features without understanding the Android architecture and malware characteristics causes resource consumption overhead on Android devices and a low malware detection rate. In this paper, we propose new feature sets that address these problems in mobile malware detection, and we analyze the malware detection performance of machine learning classifiers using them.