The 10th International Symposium on Digital Forensics and Security (ISDFS 2022), Istanbul, Turkey, June 6-7, 2022 (Accepted version)
A deep learning-enhanced botnet detection system based on Android manifest text mining
Suleiman Y. Yerima 1, 2 and YiMin To 2
1 Cyber Technology Institute
School of Computer Science and Informatics
Faculty of Computing, Engineering and Media,
De Montfort University, Leicester, United Kingdom
syerima@dmu.ac.uk
2 School of Computer Science and Informatics
Faculty of Computing, Engineering and Media,
De Montfort University, Leicester, United Kingdom
P2668280@my365.dmu.ac.uk
Abstract: Android botnets remain a significant threat to mobile and IoT systems and networks as they continue to infect millions of devices worldwide. Therefore, there is a need to develop more effective solutions to tackle their spread. Hence, in this paper we propose a system for detecting Android botnets through automated text mining of the manifest files obtained from apps. The proposed method utilizes NLP techniques to extract features from the manifest files, and a deep learning-based classification model is used to detect botnet applications. The classification model is implemented using CNN and a traditional machine learning classifier such as SVM, Random Forest or KNN. We performed experiments to evaluate the proposed system with 3858 Android applications consisting of 1929 botnet and 1929 benign samples. The results showed the best overall performance with the CNN-SVM hybrid model, which had an average accuracy of 96.9%, thus outperforming the singular machine learning classifiers.
Keywords: Mobile botnets; text mining; Android malware; machine learning; natural language processing; deep learning
I. INTRODUCTION
Recent threat reports have shown that mobile malware is on the rise [1]. Android, being an open source mobile and IoT operating system, is vulnerable to attacks by botnets. A botnet consists of Internet-connected devices under the control of one or more botmasters, who can configure the bots to perform various attacks such as distributed denial of service, spam distribution and phishing. In fact, botnets are regarded as the Swiss army knife of cybercriminals due to their versatility, the variety of attacks they can be utilized for, and their ability to receive and respond to commands from a command and control (C&C) server.
The rise in the threat of Android botnets, highlighted by the Chamois botnet attack which infected millions of Android devices worldwide [2], [3], calls for more effective detection solutions. In recent years, static and dynamic analysis-based approaches have been proposed for detecting malware on the Android platform. Both of these methods have their pros and cons. Researchers have also extensively leveraged machine learning (ML) to provide automated detection based on the use of static and dynamic features extracted from Android applications.
Several researchers have proposed static, dynamic and machine learning based systems for Android botnet detection. For example, [4] proposed a cloud-based Android botnet detection system based on dynamic analysis that uses strace, netflow, logcat, sysdump and tcpdump. In [5], the Android botnet identification system (ABIS) was presented. It uses static and dynamic features to train machine learning classifiers to distinguish between botnet and clean applications.
Despite the good performance exhibited by previously proposed ML-based botnet detection methods, there is a need for more effective feature extraction to provide scalable solutions. The majority of the previous works adopt a manual feature extraction approach, which is in turn highly dependent on specialist domain knowledge. However, natural language processing (NLP) or text mining techniques can provide researchers with a less manual feature engineering process that does not depend on domain knowledge. Motivated by this, we propose a system to detect Android botnet malware through automated text mining of the manifest files. Our proposed approach utilizes NLP techniques to extract features from the processed manifest file text and builds an efficient model using a hybrid of CNN and traditional machine learning classifiers.
We employed 3858 Android applications (1929 botnet and 1929 clean) to evaluate the proposed system. The botnet samples are from the well-known ISCX botnet dataset available from the Canadian Institute for Cybersecurity (CIC). The results of our experiments showed that the best singular classifier from the popular ML algorithms that were tested reached up to 95.5% accuracy, demonstrating the effectiveness of the NLP-based approach. The hybrid models improved the performance of the botnet detection system; several hybrid model configurations (where CNN was used with a different ML classifier) obtained greater than 95% overall accuracy.
The rest of the paper is organized as follows: In section II, we review related work. In section III we discuss our methodology, while in section IV experimental results are presented and evaluated. Finally, we conclude the paper and outline future work in section V.
II. RELATED WORK
In this section, we briefly review existing work on Android botnet detection. The experiments presented later in this paper were based on the ISCX botnet samples, which were first made available by Kadir et al. in [6]. In their paper, an in-depth analysis of the C&C and built-in URLs of Android botnets was conducted. They shared their 1929 botnet samples, categorized into 14 families, which can be obtained from [7]. In [8], Karim et al. presented a static analysis system called DeDroid for mobile botnet detection and evaluated the system using the Drebin malware dataset. This study led to the conclusion that 90% of the malware samples in the dataset were botnets. In [9], network features were used with machine learning to enable detection of mobile botnets. The features used include TCP/UDP packet size, frame duration, and source/destination IP address. Five ML classifiers were employed, namely KNN, Decision Tree, Naïve Bayes, SVM and Neural Network.
Hijawi et al. [10] presented an ML-based detection framework for Android botnets using permissions and their protection levels. They performed experiments using 1635 ISCX botnet apps and 1635 benign apps and evaluated Naïve Bayes, Decision Trees, MLP and Random Forest classifiers. Random Forest had the highest overall accuracy of 97.3%. In [11], an ML-based botnet classification system called ABC (Android botnet classification), which uses requested permissions as features with Information Gain feature selection, was proposed. Naïve Bayes, Random Forest, and J48 classifiers were used, and Random Forest had the highest botnet TPR of 94.6% and the lowest FPR of 9.9%. The experiments were performed using 1505 ISCX botnet samples and 850 benign applications.
Yusof et al. proposed a botnet classification system based on API calls and permissions and used feature selection to select 16 permissions and 31 API calls to train ML classifiers using WEKA [25]. In their paper [12], they presented experiments with SVM, KNN, Naïve Bayes, J48 and Random Forest using 6282 benign and malicious samples. Random Forest obtained the best results. The work was extended in [13] by including system calls in the feature set. The best results obtained were 99.4% TPR, 12.5% FPR, and 97.9% overall accuracy.
In [14], a framework for image-based detection of Android botnets called Bot-IMG was proposed. Histogram of Oriented Gradients was used to train ML classifiers to distinguish between botnets and benign applications. Experiments were performed on 1929 ISCX botnet apps and 2500 benign apps. The system achieved 93.1% accuracy using 10-fold cross validation and 95.3% accuracy with an 80:20 training-testing split. In [15], permissions were used to generate images based on a co-occurrence matrix, and the images were used to train a CNN model to classify apps as benign or botnet. The system achieved 97.2% accuracy on 1800 ISCX botnet applications and 3650 benign applications. Moodi et al. [16] utilized traffic features for the detection of Android botnets based on SVM. They presented an approach called smart adaptive particle swarm optimization support vector machine (SAPSO-SVM) based on the top 20 traffic features from the 28-SABD Android botnet dataset. In [17], 342 static features consisting of permissions, intents, extra executable files, API calls, and commands were used in conjunction with deep learning classifiers such as CNN-LSTM, LSTM, GRU and DNN. This study was also based on the 1929 ISCX botnet samples, with an additional 4873 clean samples.
Anwar et al. [18] used MD5 hashes, broadcast receivers, permissions and background services as features for detecting mobile botnet attacks. Their experiments were conducted on 1400 ISCX botnet samples together with 1400 benign apps, yielding 95.1% classification accuracy. In [19], an image-based approach to detecting Android botnets was investigated. A mix of features including Histogram of Oriented Gradients and Byte Histograms were extracted from the images and combined with permissions, and these were used to train machine learning classifiers. The best results observed were 96% accuracy from 10-fold cross validation, and 97.5% accuracy using an 80:20 training-testing split.
A review of the previous works reveals that NLP or text mining-based techniques are relatively less popular in Android botnet or malware detection. The few works that exist in this realm include [20] and [21]. Furthermore, text mining/NLP techniques have been applied to malicious JavaScript code detection [22], [23], intrusion detection via proxy logs [24], as well as PE malware detection [27], [28].
In [20], a source code mining approach was proposed for Android botnet detection. Dex2jar was used to decompile the app executable to Java source code and NLP techniques were applied to the code. The authors utilized TextToWordVector with TF-IDF and StringToWordVector with TF-IDF using WEKA [25]. They evaluated several classifiers including Naive Bayes, KNN, J48, SVM and Random Forest. However, their work was based on only 21 Android app samples. In [21], document embedding using Paragraph Vector Distributed Bag of Words (PV-DBoW) was proposed as features to train ML classifiers to detect Android malware. The embeddings are created from text obtained from manifest and dex files within the apps. CNN, SVM, and LR were trained with the embedding vectors and evaluated on 2234 Android apps. However, the paper evaluated only a few ML classifiers, and the individual performances on the benign and malware classes were not presented.
Unlike the system described in [10], which showed good performance on the ISCX botnet dataset, the text mining-based approach proposed in this paper does not depend on domain knowledge (i.e. understanding permission protection levels). Moreover, our proposed system requires less feature extraction effort compared to the more manual systems presented in [12], [13], [17], [18], [19] and [26], for example. In section III, we describe the proposed system in greater detail.
III. METHODOLOGY
A. Dataset
The Android botnet dataset known as the ISCX botnet dataset is used in this paper. The dataset consists of 1929 botnet samples from 14 families. The dataset is originally from [6] and has been utilized in several previous works such as [10], [11], [14], [15], [17], [18] and [19]. To build and evaluate the proposed models via supervised learning, we obtained an additional 1929 benign samples to give a balanced dataset consisting of a total of 3858 Android app samples.
B. Android application manifest text mining
The manifest file in an Android application is a required XML file containing important metadata about the application. This includes the package name and the names of activities, services, broadcast receivers and content providers. Other information includes Android version support, hardware features support, permissions, intents, etc. To enable the text mining of the manifest file, the Android Asset Packaging Tool (AAPT) was used to extract the contents of the AndroidManifest.xml file into a .txt file, as illustrated in Figure 1. The command ‘aapt dump xmltree’ was used to extract the contents of the manifest file to the .txt file. A pre-processing script is then run on the .txt file to remove redundant and unwanted characters in preparation for further text mining operations.
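The extraction step described above can be sketched in Python as follows. This is a minimal illustration, not the script used in the experiments: it assumes the `aapt` binary from the Android SDK build tools is on the PATH, and the function names and output file layout are hypothetical.

```python
import subprocess
from pathlib import Path


def build_aapt_command(apk_path: str) -> list[str]:
    """Return the argument list that dumps an APK's manifest as an XML tree."""
    return ["aapt", "dump", "xmltree", apk_path, "AndroidManifest.xml"]


def dump_manifest(apk_path: str, out_dir: str) -> Path:
    """Run aapt and save the XML tree dump as <out_dir>/<apk name>.txt.

    Assumes `aapt` is installed; raises CalledProcessError if aapt fails.
    """
    result = subprocess.run(
        build_aapt_command(apk_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # manifests may contain bytes that are not valid UTF-8
        check=True,
    )
    out_path = Path(out_dir) / (Path(apk_path).stem + ".txt")
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path
```

Calling `dump_manifest("sample.apk", "dumps")` would then produce `dumps/sample.txt`, one text file per application, ready for the pre-processing script.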
First, the script filters out all the unnecessary characters and numbers from the XML tree dump. It removes redundant words such as the descriptors ‘uses-permission’, ‘android:name’, etc. Also, lengthy expressions such as intents and permissions were truncated by removing the common preceding characters and leaving only the parts that uniquely identify the intent or permission. For example, android.permission.RESTART_PACKAGES is shortened to RESTART_PACKAGES and android.intent.action.SMS_RECEIVED is reduced to SMS_RECEIVED. After reducing the text in the manifest file to keep only the potentially informative items, the remaining text was tokenized, i.e. split into individual words, and each token or word is later converted into an element of a vector that is used to characterize the manifest file.
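A minimal sketch of this kind of cleaning and tokenization is given below. The layout of `SAMPLE_DUMP` is an assumption about what an `aapt dump xmltree` output looks like, and the prefix list, regular expressions and function name are hypothetical choices made for illustration; the actual pre-processing script may differ.

```python
import re

# Assumed shape of an aapt XML tree dump (illustrative only).
SAMPLE_DUMP = '''
E: uses-permission (line=4)
  A: android:name(0x01010003)="android.permission.RESTART_PACKAGES" (Raw: "android.permission.RESTART_PACKAGES")
E: action (line=9)
  A: android:name(0x01010003)="android.intent.action.SMS_RECEIVED" (Raw: "android.intent.action.SMS_RECEIVED")
'''

# Common leading parts that carry no discriminating information (hypothetical list).
PREFIXES = (
    "android.permission.",
    "android.intent.action.",
    "android.intent.category.",
)

# Attribute values are taken from the '(Raw: "...")' part of each attribute line.
RAW_VALUE = re.compile(r'\(Raw: "([^"]*)"\)')


def manifest_tokens(dump_text):
    """Drop descriptors and numbers, strip common prefixes, return word tokens."""
    tokens = []
    for value in RAW_VALUE.findall(dump_text):
        for prefix in PREFIXES:
            if value.startswith(prefix):
                value = value[len(prefix):]
                break
        # Keep only letter/underscore runs, which discards digits and punctuation.
        tokens.extend(re.findall(r"[A-Za-z_]+", value))
    return tokens


print(manifest_tokens(SAMPLE_DUMP))
```

Joining the returned tokens with spaces gives the single-line document that represents one application in the later vectorization step.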
The tokenized words contained in the files from the training set are also used to create the ‘vocabulary’ or ‘bag of words’ from which the training vectors are created. The vocabulary words were stored in a separate text file. Each application was represented by a single text document that contained the reduced and tokenized text, and an array was created from these documents to hold the contents of each document in a single line (as one element of the array). Thus, the number of items in the array was equal to the number of processed documents.
The texts_to_matrix function from the Tokenizer class of the Keras text pre-processing Python library was used to process all the lines in the array to produce three different types of features from the bag of words. These features were binary, frequency and TF-IDF. The dimensions of the feature vectors were equal to the length of the vocabulary created from the training set. The top 25 words in the vocabulary produced from a training set are shown in Table I. While creating the vocabulary, only the words that had more than 5 occurrences in all of the documents from the training set were kept. Those with fewer than 5 were eliminated. This resulted in a total of 958 unique words in the vocabulary created from this training set.
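To make the vocabulary filtering and the binary and frequency features concrete, the following library-free sketch mirrors the idea in plain Python. It is a stand-in for the Keras Tokenizer rather than a reproduction of it: the function names are hypothetical, the cut-off is read as "strictly more than `min_count` occurrences", and the frequency features are taken to be raw counts, all of which are assumptions.

```python
from collections import Counter


def build_vocabulary(train_docs, min_count=5):
    """Return the words occurring more than `min_count` times in the training set,
    ordered from most to least frequent."""
    counts = Counter(word for doc in train_docs for word in doc.split())
    return [word for word, count in counts.most_common() if count > min_count]


def to_vector(doc, vocabulary, mode="binary"):
    """Encode one document against the vocabulary.

    mode="binary": 1 if the word is present, else 0.
    mode="count":  number of times the word appears in the document.
    """
    counts = Counter(doc.split())
    if mode == "binary":
        return [1 if counts[word] else 0 for word in vocabulary]
    if mode == "count":
        return [counts[word] for word in vocabulary]
    raise ValueError(f"unknown mode: {mode}")


# Toy usage with a low threshold so that the tiny corpus yields a vocabulary.
docs = ["INTERNET SEND_SMS", "INTERNET", "INTERNET READ_SMS SEND_SMS"]
vocab = build_vocabulary(docs, min_count=1)
print(vocab, to_vector("SEND_SMS SEND_SMS CAMERA", vocab, "count"))
```

Words outside the vocabulary, such as CAMERA in the toy call, are simply ignored, so every application is mapped to a vector whose length equals the vocabulary size.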
Fig. 1: Manifest file pre-processing using AAPT to extract contents and scripts to remove redundant and unnecessary text.
Table I: Top 25 words from a training set vocabulary (bag of words) extracted from the manifests of botnet and benign apps.
Top vocabulary words in the training set    Occurrences
MAIN                                        3506
LAUNCHER                                    3295
INTERNET                                    3293
ACCESS_NETWORK_STATE                        2759
READ_PHONE_STATE                            2565
WRITE_EXTERNAL_STORAGE                      2210
BOOT_COMPLETED                              1847
RECEIVE_BOOTCOMPLETED                       1751
DEFAULT                                     1479
WAKELOCK                                    1401
ACCESS_WIFI_STATE                           1322
SEND_SMS                                    1308
READ_CONTACTS                               1282
VIBRATE                                     1248
READ_SMS                                    1168
RECEIVE_SMS                                 1077
CALL_PHONE                                  886
ACCESS_FINE_LOCATION                        874
WRITE_SMS                                   847
ACCESS_COARSE_LOCATION                      823
WRITE_CONTACTS                              788
WRITE_SETTINGS                              636
PHONE_STATE                                 560
INSTALL_SHORTCUT                            553
USER_PRESENT                                541
Fig. 2: Word cloud for benign subset.
Fig. 3: Word cloud for botnet subset.
From Table I, we can see that the most prevalent word in the vocabulary set was ‘MAIN’, which was seen 3506 times in all documents within the training set. Other popular words include: LAUNCHER (3295), INTERNET (3293), ACCESS_NETWORK_STATE (2759), READ_PHONE_STATE (2565), WRITE_EXTERNAL_STORAGE (2210) and SEND_SMS (1308). Notice that most of these words correspond to app permissions. Words that represent intents and other components were also present in the vocabulary set.
We generated a word cloud from training documents for each class, and Figures 2 and 3 illustrate the most prevalent words in the respective classes. The word prevalence is denoted by the sizes of the words, i.e. the larger the size, the more times the word appears in the (processed) documents. Figure 3 is the word cloud generated from the documents in the botnet category, while Figure 2 was generated from documents in the benign category. We can see that the word ‘SEND’ is prevalent in the botnet class but is not prominent in the benign class, as seen from Figure 2. The word ACCESS_DOWNLOAD_MANAGER is also prominent within the botnet category (Figure 3) but not in the benign category (Figure 2).
C. Machine learning models implementation
Our proposed approach is based on hybrid learning with deep and traditional shallow learning algorithms to produce the botnet detection model. In this method, the bag of words (BoW) vectors are not used to directly train the traditional machine learning classifiers. Instead, we use the BoW vectors to train a Convolutional Neural Network (CNN) consisting of a number of layers, and then extract a representative vector from the CNN layers. This process is shown in Figure 4. CNN is used to enable the recognition of any hidden patterns and relationships that might exist between the words found in the manifest file which may not be directly captured by the BoW vectors.
Fig. 4: Proposed hybrid learning model for botnet detection using BoW vectors extracted from the manifest text documents.
The CNN consisted of four ReLU-activated convolutional layers, each followed by a MaxPooling layer. The input dimension to the first convolutional layer is equal to the size of the BoW vector as derived from the training set. The number of filters used in the convolutional layers was 32, while the size of the filters was 3. For training, we used the Adam optimizer and ran the CNN model for up to 150 epochs, with an early stopping callback that terminated the training after 20 epochs of non-improvement in the validation loss. A training-validation split of 90:10 was utilized. After the training, a flat vector is derived from the output of the final MaxPooling layer. In a typical CNN model, the final classification layer would be a sigmoid-activated layer with a single neuron to enable binary classification. Instead of deriving our final classification from a sigmoid-activated layer, we extract the flat vectors (of reduced dimensionality) obtained from the final MaxPooling layer.
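To illustrate what one convolution, ReLU and max-pooling stage does to a BoW vector before flattening, the following plain-Python sketch runs a single stage with hand-picked filters. It is a toy under stated assumptions, not the trained network: the kernels are fixed rather than learned, only one stage is shown instead of four, 'valid' padding and a pooling size of 2 are assumed, and all function names are hypothetical.

```python
def conv1d_relu(signal, kernel, bias=0.0):
    """'Valid' 1-D convolution (sliding dot product) followed by ReLU."""
    width = len(kernel)
    outputs = []
    for start in range(len(signal) - width + 1):
        window = signal[start:start + width]
        total = sum(x * w for x, w in zip(window, kernel)) + bias
        outputs.append(max(0.0, total))  # ReLU keeps only positive responses
    return outputs


def max_pool(signal, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def flat_features(bow_vector, kernels):
    """Apply conv -> ReLU -> max-pool per filter and concatenate the results."""
    flat = []
    for kernel in kernels:
        flat.extend(max_pool(conv1d_relu(bow_vector, kernel)))
    return flat


# Toy binary BoW vector and two filters of size 3 (the paper uses 32 learned filters).
bow = [1, 0, 1, 1, 0, 1]
print(flat_features(bow, kernels=[[1, 1, 1], [1, -1, 0]]))
```

In the full model, each later convolutional layer would take all channels of the previous stage as input, and the flat vector taken after the last pooling layer is what gets passed on to the traditional classifier.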
As shown in Figure 4, the vector derived from the CNN layers is concatenated with the original BoW vector and, after normalization, feature reduction is applied through the use of a feature selection algorithm such as Chi Square or Information Gain. This further reduces the dimensionality while selecting the best features for classifier training. The final selected features are used to train a high-performing traditional machine learning classifier such as SVM, RF, DT or KNN. In our experiments we used Chi Square to select the top 200 features before applying them to the final ML classifier.
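The normalization and Chi Square selection steps can be sketched as below. This is an illustrative pure-Python version under assumptions: min-max scaling is assumed as the normalization (the text does not name one), the chi-square statistic is computed from per-class feature sums against their expected values, and the function names are hypothetical.

```python
def min_max_scale(X):
    """Scale every column to [0, 1]; constant columns become all zeros."""
    columns = list(zip(*X))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(v - lo) / sp for v, lo, sp in zip(row, lows, spans)] for row in X]


def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class labels."""
    classes = sorted(set(y))
    n_samples = len(y)
    scores = []
    for j in range(len(X[0])):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = feature_total * (y.count(c) / n_samples)
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Return the indices of the k best-scoring features and the reduced matrix."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in X]


# Toy data: each row stands for a concatenated [BoW + CNN] vector; label 1 = botnet.
X = [[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]
print(select_top_k(min_max_scale(X), y, k=1))
```

With k set to 200, the reduced matrix returned by `select_top_k` would play the role of the input to the final SVM, RF, DT or KNN classifier.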
IV. EXPERIMENTS AND RESULTS
In this section we present the results of the study undertaken to evaluate our proposed system. The first set of experiments was with the BoW vectors and traditional machine learning classifiers. Further experiments were performed with several configurations of the proposed hybrid method, and the results are compared with the baseline BoW vectors approach. Note that all of the results presented were obtained using 90% of the dataset for training and 10% for testing. This was done 10 times using 10 different equal-sized segments of the dataset for testing (i.e. 10-fold cross validation), and the average was taken to obtain the results presented in the tables.
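The evaluation protocol above can be sketched as a simple fold generator. This is an assumption-laden illustration: the shuffling seed, the lack of stratification by class and the function name are not taken from the paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding gives k folds whose sizes differ by at most one sample.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# With 3858 apps, each test fold holds 385 or 386 samples (about 10%).
for train, test in k_fold_indices(3858, k=10):
    print(len(train), len(test))
```

The reported figures would then correspond to metrics computed on each of the 10 test folds and averaged.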
A. Results of the ML models with BoW binary vectors
Table II shows results from models trained with binary features. These are categorical features denoted by a ‘0’ or ‘1’ that capture the presence or absence of the tokenized words in the vocabulary. The length of the feature vector was equal to the number of tokens (words) in the bag of words/vocabulary. In the table, we present the results of Gaussian Naïve Bayes (GNB), KNN, Decision Tree (DT), SVM and Random Forest (RF). GNB performed poorly on the binary BoW vectors, yielding only 55.7% accuracy and showing a huge disparity in the True Positive Rates (TPR), with 98.5% for benign and only 12.4% for botnet. In terms of overall accuracy, KNN (with 94.5%) performed better than SVM (92.7%) and DT (93.2%). However, the best performance was obtained with RF, i.e. 97.3% TPR for botnet, 93.7% TPR for benign, and overall accuracy of 95.5%.
Table II: Results of singular ML models with binary features.
Classifier    TPR (clean)    Accuracy    F1
GNB           0.985          0.557       0.456
KNN           0.922          0.945       0.945
DT            0.941          0.932       0.932
SVM           0.918          0.927       0.927
RF            0.937          0.955       0.955
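For reference, the per-class true positive rates, accuracy and F1 reported in the tables can be computed from predictions as in the sketch below. The function name is hypothetical, and the F1 shown is the score for the positive (botnet) class; whether the tables report this or an averaged F1 is not stated, so that choice is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return per-class TPR, accuracy and positive-class F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    tpr_positive = ratio(tp, tp + fn)   # e.g. botnet detection rate
    tpr_negative = ratio(tn, tn + fp)   # e.g. clean apps correctly recognized
    precision = ratio(tp, tp + fp)
    return {
        "tpr_positive": tpr_positive,
        "tpr_negative": tpr_negative,
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * tpr_positive, precision + tpr_positive),
    }


# Toy example with label 1 = botnet and label 0 = clean.
print(binary_metrics([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 0, 0, 1, 0]))
```

Reporting the two class-wise rates separately, as done for GNB in the text, exposes cases where a reasonable-looking rate on one class hides very poor detection on the other.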
B. Results of the ML models with BoW frequency vectors
Table III shows results obtained from models trained with BoW frequency features. These features capture how often the tokenized words in the vocabulary appear in a document. It can be seen that for the KNN, DT, RF and SVM classifiers, the results were below those of the BoW binary features shown in Table II. For the RF classifier, the results are quite close. Again, the RF classifier gave the best overall accuracy of 95.4%, compared to DT, which was the next best classifier with 92.6% overall accuracy.
Table III: Results of singular models with frequency features.
Classifier    TPR (clean)    Accuracy    F1
GNB           0.984          0.586       0.505
KNN           0.931          0.924       0.925
DT            0.937          0.926       0.926
SVM           0.913          0.922       0.921
RF            0.936          0.954       0.954
C. Results of the ML models with tf-idf BoW vectors
Table IV shows results from models trained with TF-IDF features. The TF-IDF features provide a weighting that determines the importance of a word or token based on a metric derived from its frequency in a document (representing one application manifest) and the number of documents (i.e. applications) where the word or token appears. The TF-IDF weight of a word w in a document j is given by $N_j(w) \cdot \log(D_T / D(w))$, where $N_j(w)$ is the number of times the word w appears in document j, $D_T$ is the total number of documents, and $D(w)$ is the number of documents containing the word w. From Table IV we can see that TF-IDF did not improve the overall accuracies over the binary features for any of the classifiers. The best classifier was RF with 95.5%, which is the same as obtained with the binary features.
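The weighting formula above can be worked through in a few lines of Python. This sketch implements the formula exactly as written, with the natural logarithm assumed; library implementations often apply smoothed variants, so their values may differ. The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocabulary):
    """Weight each vocabulary word w in document j by N_j(w) * log(D_T / D(w))."""
    tokenized = [doc.split() for doc in docs]
    d_total = len(tokenized)                                   # D_T
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))  # D(w)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)                               # N_j(w)
        vectors.append([
            counts[w] * math.log(d_total / doc_freq[w]) if doc_freq[w] else 0.0
            for w in vocabulary
        ])
    return vectors


docs = ["INTERNET SEND_SMS SEND_SMS", "INTERNET"]
print(tfidf_vectors(docs, ["INTERNET", "SEND_SMS"]))
```

In the toy corpus, INTERNET appears in every document and therefore gets a weight of zero, while SEND_SMS, which appears twice in one document only, gets the weight 2 log 2.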
Table IV: Results of singular ML models with TF-IDF features.
Classifier    TPR (clean)    Accuracy    F1
GNB           0.985          0.557       0.456
KNN           0.926          0.943       0.943
DT            0.940          0.931       0.931
SVM           0.918          0.927       0.927
RF            0.935          0.955       0.955
D. Enhanced performance using the proposed deep learning based scheme
In Table V, we present the results of the proposed hybrid scheme described in section III. Because the BoW binary vectors showed better results compared to frequency and TF-IDF, we used them as the input to the hybrid model, as described in Figure 4. From the results it is clear that the CNN-based feature extraction scheme improved the results for all the classifiers. Except for CNN-GNB, all four of the other hybrid classifiers obtained greater than 95% overall accuracy. The best results were from the CNN-SVM model, which had 97.6% TPR for botnet and 95.9% TPR for benign. The overall accuracies of CNN-DT (95.9%), CNN-RF (96.8%) and CNN-SVM (96.9%) were better than the best result from the singular RF classifier with binary BoW vectors, which yielded 95.5% accuracy.
Table V: CNN-ML manifest text mining results.
Classifier    TPR (botnet)    TPR (clean)    Accuracy    F1
CNN-GNB       0.246           0.993          0.622       0.560
CNN-KNN       0.967           0.937          0.952       0.952
CNN-DT        0.961           0.956          0.959       0.959
CNN-SVM       0.976           0.959          0.969       0.969
CNN-RF        0.976           0.959          0.968       0.968
E. Pre-processing and model training overhead
The pre-processing steps were performed on an Ubuntu Linux machine with an Intel Core i7 3.3GHz CPU and 8GB RAM. The time taken by our scripts to pre-process the manifest file, extract tokens into .txt documents and derive the BoW vectors amounted to about 4 minutes 37 seconds for 3858 applications, which is an average of 0.072 seconds per application. This indicates that the approach incurs quite a low pre-processing overhead and is thus feasible to implement in practice. The average training time for the hybrid models using 3472 instances (90%) was about 95 seconds (mainly dominated by the CNN vector extraction part), while the testing time was much less and varied depending on the ML classifier. Testing time for CNN-DT was the lowest with an average of 0.0026 milliseconds per instance, while that of CNN-RF was the highest, averaging around 0.0192 milliseconds per instance.
V. CONCLUSIONS AND FUTURE WORK
In this paper we proposed and investigated an Android botnet detection method based on app manifest text mining and a hybrid learning classification model. The classification model utilizes deep learning, i.e. CNN, to extract additional features that are subsequently concatenated with the original BoW binary features to improve the efficiency of the model. The advantage of employing text mining is the ability to perform more effective feature extraction compared to the manual methods which are more prevalent in the current literature. The results of our experiments performed on 3858 applications showed that the proposed hybrid model enhances the accuracy obtained with the best singular classifiers. The CNN-SVM hybrid model yielded 96.9% accuracy with a 97.6% botnet detection rate, which are comparable to state-of-the-art results. Future work will explore avenues to enhance performance, including text mining of additional components of the Android applications.
REFERENCES
[1] McAfee. McAfee Labs Threat Report 06.21. [online]:
https://www.mcafee.com/enterprise/en-us/assets/reports/rp-threats-jun-
2021.pdf [accessed 10 April 2022].
[2] Chris Brook “Google Eliminates Android Adfraud Botnet Chamois”
Threat Post. March 2017. [online]: https://threatpost.com/google-
eliminates-android-adfraud-botnet-chamois/124311/ [accessed 10 April
2022].
[3] Fahmida, Y. Rashid “Chamois: The Big Botnet You Didn’t Hear about”
April 2019 Decipher, by Duo Security. [Online]:
https://duo.com/decipher/chamois-the-big-botnet-you-didnt-hear-about
[accessed 10 April 2022].
[4] S. Jadhav, S. Dutia, K. Calangutkar, T. Oh, Y. H. Kim, and J. N. Kim,
“Cloud-based android botnet malware detection system,” in 2015 17th
International Conference on Advanced Communication Technology
(ICACT), 2015, pp. 347352.
[5] C. Tansettanakorn, S. Thongprasit, S. Thamkongka, and V. Visoot-
tiviseth, “ABIS: A prototype of Android Botnet Identification System,”
in 2016 Fifth ICT International Student Project Conference (ICT-ISPC),
(ICT-ISPC), Nakhonpathom, Thailand, 2728 May 2016; pp. 15.
[6] Kadir, A.F.A., Stakhanova, N., Ghorbani, A.A., 2015. Android botnets:
What urls are telling us, in: International Conference on Network and
System Security, Springer. pp. 7891.
[7] ISCX Android botnet dataset. [online]: https://www.unb.ca/cic/datasets
/android-botnet.html. [Accessed 03 March 2022]
[8] A. Karim, R. Salleh, and S. A. A. Shah, “Dedroid: A mobile botnet
detection approach based on static analysis,” in 2015 IEEE 15th Intl Conf
on Scalable Computing and Communications and Its Associated
Workshops (UIC-ATC-ScalCom), 2015, pp. 13271332.
[9] X. Meng, and G. Spanoudakis, "MBotCS: A mobile botnet detection
system based on machine learning. Lecture Notes in Computer Science,
9572, 2016, pp. 274-291. doi: 10.1007/978-3-319-31811-0_17
[10] W. Hijawi, J. Alqatawna, and H. Faris, “Toward a detection framework
for android botnet,” in 2017 International Conference on New Trends in
Computing Sciences (ICTCS), 2017, pp. 197202.
[11] Z. Abdullah, M. M. Saudi and N. B. Anuar,”ABC: Android Botnet
Classification Using Feature Selection and Classification Algorithms.
Adv. Sci. Lett. 2017, 23, 47174720.
[12] M. Yusof, M. M. Saudi, and F. Ridzuan, A new mobile botnet
classification based on permission and API calls,” In Proceedings of the
Seventh International Conference on Emerging Security Technologies
(EST), Canterbury, UK, 68 September 2017; pp. 122127.
[13] M. Yusof, M. M. Saudi, and F. Ridzuan, Mobile Botnet Classification
by using Hybrid Analysis,” Int. J. Eng. Technol. 2018, 7, 103108.
[14] S. Y. Yerima, and A. Bashar, Bot-IMG: A framework for image-based
detection of Android botnets using machine learning. In Proc. of the 18th
ACS/IEEE International Conf. on Computer Systems and Applications
(AICCSA 2021), Tangier, Morocco, 30 Nov. to 3 Dec., 2021; pp. 17.
[15] S. Hojjatinia, S. Hamzenejadi, and H. Mohseni, “Android botnet de-
tection using convolutional neural networks,” in 2020 28th Iranian
Conference on Electrical Engineering (ICEE), 2020, pp. 16.
[16] M. Moodi, M. Ghazvini, H. Moodi, and B. Ghawami, A smart adaptive
particle swarm optimizationsupport vector machine: Android botnet
detection application. J. Supercomput. 2020, 76, 98549881.
[17] S. Y. Yerima, M. Alzaylaee, A. Shajan, and P. Vinod, “Deep learning
techniques for Android botnet detection,” Electronics, vol. 10, no. 519,
2021. https://doi.org/10.3390/electronics10040519
[18] S. Anwar, J. M. Zain, Z. Inayat, R. U. Haq, A. Karim, and A. N. Jabir, "A
static approach towards mobile botnet detection," In Proc. 3rd Int.
Conference on Electronic Design (ICED), 2016: IEEE, pp. 563-567.
[19] S. Y. Yerima, and A. Bashar, A novel Android botnet detection system
using image-based and manifest file features,” Electronics, vol. 11, no.
486, 2022. https://doi.org/10.3390/electronics11030486
[20] B. Alothman and P. Rattadilok ‘Android botnet detection: An integrated
source code mining aproach’ 12th International Conference for Internet
Technology and Secured Transactions (ICITST),11-14 Dec.,Cambridge,
UK, 2017, IEEE, pp 111-115
[21] U. Raghav, E. Martinez-Marroquin and W. Ma, "Static analysis for
Android Malware detection with document vectors," 2021 International
Conference on Data Mining Workshops (ICDMW), 2021, pp. 805-812,
doi: 10.1109/ICDMW53433.2021.00104.
[22] M. Mimura, and Y. Suga, “Filtering malicious javascript code with
doc2vec on an imbalanced dataset,” In Proc. 2019 14th Asia Joint
Conference on Information Security (AsiaJCIS), pp. 24–31, 2019.
https://doi.org/10.1109/AsiaJCIS.2019.000-9
[23] S. Ndichu, S. Kim, S. Ozawa, T. Misu, and K. Makishima, “A machine
learning approach to detection of javascript-based attacks using ast
features and paragraph vectors,” Appl. Soft Comput. 84, 105721 (2019).
[24] M. Mimura, and H. Tanaka, “A linguistic approach towards intrusion
detection in actual proxy logs,” In Proc. 20th international conference on
information and communications security, ICICS 2018, Lille, France,
Oct. 29-31, 2018, pp. 708–718.
[25] M. Hall et al., “The WEKA data mining software: An update,” ACM
SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, Jun. 2009.
[26] S. Y. Yerima and S. Khan, “Longitudinal performance analysis of
machine learning based Android malware detectors,” In Proceedings of
the 2019 International Conference on Cyber Security and Protection of
Digital Services (Cyber Security), Oxford, UK, 3–4 June 2019.
[27] Y. Nagano, and R. Uda, “Static analysis with paragraph vector for
malware detection,” In Proc. 11th International Conf. on Ubiquitous
Information Management and Communication, IMCOM 2017, Article
no. 80, pp. 1-7. https://doi.org/10.1145/3022227.3022306
[28] E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE header, malware
detection with minimal domain knowledge,” In Proc. of the 10th ACM
Workshop on Artificial Intelligence and Security, AISec@CCS 2017,
Dallas, TX, USA, November 3, 2017, pp. 121–132.
https://doi.org/10.1145/3128572.3140442