
A deep learning-enhanced botnet detection system based on Android manifest text mining

June 2022


DOI:10.1109/ISDFS55398.2022.9800817

Conference: The 10th International Symposium on Digital Forensics and Security (ISDFS 2022), Istanbul, Turkey, June 6-7, 2022

Authors:

Suleiman Y. Yerima

British University in Dubai



The 10th International Symposium on Digital Forensics and Security (ISDFS 2022), Istanbul, Turkey, June 6-7, 2022 (Accepted version)

A deep learning-enhanced botnet detection system based on Android manifest text mining

Suleiman Y. Yerima 1, 2 and YiMin To 2

1 Cyber Technology Institute

School of Computer Science and Informatics

Faculty of Computing, Engineering and Media,

De Montfort University, Leicester, United Kingdom

syerima@dmu.ac.uk

2 School of Computer Science and Informatics

Faculty of Computing, Engineering and Media,

De Montfort University, Leicester, United Kingdom

P2668280@my365.dmu.ac.uk

Abstract— Android botnets remain a significant threat to mobile and IoT systems and networks as they continue to infect millions of devices worldwide. Therefore, there is a need to develop more effective solutions to tackle their spread. In this paper, we propose a system for detecting Android botnets through automated text mining of the manifest files obtained from apps. The proposed method uses NLP techniques to extract features from the manifest files, and a deep learning-based classification model is used to detect botnet applications. The classification model is implemented using a CNN and a traditional machine learning classifier such as SVM, Random Forest or KNN. We performed experiments to evaluate the proposed system with 3858 Android applications consisting of 1929 botnet and 1929 benign samples. The results showed the best overall performance with the CNN-SVM hybrid model, which had an average accuracy of 96.9%, thus outperforming the singular machine learning classifiers.

Keywords—Mobile botnets; text mining; Android malware; machine learning; natural language processing; deep learning

I. INTRODUCTION

Recent threat reports have shown that mobile malware is on the rise [1]. Android, being an open source mobile and IoT operating system, is vulnerable to attacks by botnets. A botnet consists of Internet-connected devices under the control of botmaster(s) who can configure the ‘bots’ to perform various attacks such as distributed denial of service, spam distribution, phishing, etc. In fact, botnets are regarded as the Swiss army knife of cybercriminals due to their versatility, the variety of attacks they can be utilized for, and their ability to receive and respond to commands from a command and control (C&C) server.

The rise in the threat of Android botnets, highlighted by the Chamois botnet attack which infected millions of Android devices worldwide [2], [3], calls for more effective detection solutions. In recent years, static and dynamic analysis-based approaches have been proposed for detecting malware on the Android platform. Both of these methods have their pros and cons. Researchers have also extensively leveraged machine learning (ML) to provide automated detection based on the use of static and dynamic features extracted from Android applications.

Several researchers have proposed static, dynamic and machine learning based systems for Android botnet detection. For example, [4] proposed a cloud-based Android botnet detection system based on dynamic analysis that uses strace, netflow, logcat, sysdump and tcpdump. In [5], the Android botnet identification system (ABIS) was presented. It uses static and dynamic features to train machine learning classifiers to distinguish between botnet and clean applications.

Despite the good performance exhibited by previously proposed ML-based botnet detection methods, there is a need for more effective feature extraction to provide scalable solutions. The majority of previous works adopt a manual feature extraction approach, which is in turn highly dependent on specialist domain knowledge. However, natural language processing (NLP) or text mining techniques can provide researchers with a less manual feature engineering process that does not depend on domain knowledge. Motivated by this, we propose a system to detect Android botnet malware through automated text mining of the manifest files. Our proposed approach utilizes NLP techniques to extract features from the processed manifest file text and builds an efficient model using a hybrid of CNN and traditional machine learning classifiers.

We employed 3858 Android applications (1929 botnet and 1929 clean) to evaluate the proposed system. The botnet samples are from the well-known ISCX botnet dataset available from the Canadian Institute for Cybersecurity (CIC). The results of our experiments showed that the best singular classifier from the popular ML algorithms that were tested reached up to 95.5% accuracy, demonstrating the effectiveness of the NLP based approach. The hybrid models improved the performance of the botnet detection system; several hybrid model configurations (where CNN was used with a different ML classifier) obtained greater than 95% overall accuracy.

The rest of the paper is organized as follows: In section II, we review related work. In section III we discuss our methodology, while in section IV experimental results are presented and evaluated. Finally, we conclude the paper and outline future work in section V.


II. RELATED WORK

In this section, we briefly review the existing related Android botnet detection works. The experiments presented later in this paper were based on the ISCX botnet samples which were first made available by the study conducted by Kadir et al. in [6]. In their paper, an in-depth analysis of the C&C and built-in URLs of Android botnets was conducted. They shared their 1929 botnet samples categorized into 14 families, which can be obtained from [7]. In [8], Karim et al. presented a static analysis system called DeDroid for mobile botnet detection and evaluated the system using the Drebin malware dataset. This study led to the conclusion that 90% of the malware samples in the dataset were botnets. In [9], network features were used with machine learning to enable detection of mobile botnets. The features used include TCP/UDP packet size, frame duration, and source/destination IP address. Five ML classifiers were employed, i.e. KNN, Decision Tree, Naïve Bayes, SVM and Neural Network.

Hijawi et al. [10] presented an ML-based detection framework for Android botnets using permissions and their protection levels. They performed experiments using 1635 ISCX botnet apps and 1635 benign apps and evaluated Naïve Bayes, Decision Trees, MLP and Random Forest classifiers. Random Forest had the highest overall accuracy of 97.3%. In [11], an ML-based botnet classification system called ABC (Android botnet classification), which uses requested permissions as features with Information Gain feature selection, was proposed. Naïve Bayes, Random Forest, and J48 classifiers were used and Random Forest had the highest botnet TPR of 94.6% and the lowest FPR of 9.9%. The experiments were performed using 1505 ISCX botnet samples and 850 benign applications.

Yusof et al. proposed a botnet classification system based on API calls and permissions and used feature selection to select 16 permissions and 31 API calls to train ML classifiers using WEKA [25]. In their paper [12], they presented experiments with SVM, KNN, Naïve Bayes, J48 and Random Forest using 6282 benign and malicious samples. Random Forest obtained the best results. The work was extended in [13] by including system calls in the feature set. The best results obtained were 99.4% TPR, 12.5% FPR, and 97.9% overall accuracy.

In [14], a framework for image-based detection of Android botnets called Bot-IMG was proposed. Histogram of Oriented Gradients was used to train ML classifiers to distinguish between botnets and benign applications. Experiments were performed on 1929 ISCX botnet apps and 2500 benign apps. The system achieved 93.1% accuracy using 10-fold cross validation and 95.3% accuracy with an 80:20 training-testing split. In [15], permissions were used to generate images based on a co-occurrence matrix and the images were used to train a CNN model to classify apps as benign or botnet. The system achieved 97.2% accuracy on 1800 ISCX botnet applications and 3650 benign applications. Moodi et al. [16] utilized traffic features for the detection of Android botnets based on SVM. They presented an approach called smart adaptive particle swarm optimization support vector machine (SAPSO-SVM) based on the top 20 traffic features from the 28-SABD Android botnet dataset. In [17], 342 static features consisting of permissions, intents, extra executable files, API calls, and commands were used in conjunction with deep learning classifiers such as CNN-LSTM, LSTM, GRU and DNN. This study was also based on the 1929 ISCX botnet samples with an additional 4873 clean samples.

Anwar et al. [18] used MD5 hashes, broadcast receivers, permissions and background services as features for detecting mobile botnet attacks. Their experiments were conducted on 1400 ISCX botnet samples together with 1400 benign apps, yielding 95.1% classification accuracy. In [19], an image-based approach to detecting Android botnets was investigated. A mix of features including Histogram of Oriented Gradients and Byte Histograms were extracted from the images and combined with permissions, and these were used to train machine learning classifiers. The best results observed were 96% accuracy from 10-fold cross validation, and 97.5% accuracy using an 80:20 training-testing split. A review of the previous works has revealed that NLP or text mining-based techniques are relatively less popular in Android botnet or malware detection. A few of the works that exist in this realm include [20] and [21]. Furthermore, text mining/NLP techniques have been applied to malicious JavaScript code detection [22], [23], intrusion detection via proxy logs [24], as well as PE malware detection [27], [28].

In [20], a source code mining approach was proposed for Android botnet detection. Dex2jar was used to decompile the app executable to Java source code and NLP techniques were applied to the code. The authors utilized TextToWordVector with TF-IDF and StringToWordVector with TF-IDF using WEKA [25]. They evaluated several classifiers including Naive Bayes, KNN, J48, SVM and Random Forest. However, their work was based on only 21 Android app samples. In [21], document embedding using Paragraph Vector – Distributed Bag of Words (PV-DBoW) was proposed as features to train ML classifiers to detect Android malware. The embeddings are created from text obtained from manifest and dex files within the apps. CNN, SVM, and LR were trained with the embedding vectors and evaluated on 2234 Android apps. However, the paper only evaluated a few ML classifiers and the individual performances of the benign and malware classes were not presented.

Unlike the system described in [10], which showed good performance on the ISCX botnet dataset, the text mining-based approach proposed in this paper does not depend on domain knowledge (i.e. understanding permission protection levels). Moreover, our proposed system requires less feature extraction effort compared to the more manual systems presented in [12], [13], [17], [18], [19] and [26] for example. In section III, we describe the proposed system in greater detail.

III. METHODOLOGY

A. Dataset

The Android botnet dataset known as the ISCX botnet dataset is used in this paper. The dataset consists of 1929 botnet samples from 14 families. The dataset is originally from [6] and has been utilized in several previous works such as [10], [11], [14], [15], [17], [18] and [19]. In order to build and evaluate the proposed models via supervised learning, we obtained an additional 1929 benign samples to give a balanced dataset consisting of a total of 3858 Android app samples.

B. Android application manifest text mining

The manifest file in an Android application is a required XML file containing important metadata about the application. This includes the package name and the names of activities, services, broadcast receivers and content providers. Other information includes Android version support, hardware features support, permissions, intents, etc. To enable the text mining of the manifest file, the Android Asset Packaging Tool (AAPT) was used to extract the contents of the AndroidManifest.xml file into a .txt file with the command ‘aapt dump xmltree’, as illustrated in Figure 1. A pre-processing script is then run on the .txt file to remove redundant and unwanted characters in preparation for further text mining operations.
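The extraction step can be sketched in Python as follows. This is a minimal illustration and not the authors' script: it assumes the `aapt` binary is on the PATH, and the helper names `aapt_dump_command` and `dump_manifest` as well as the output file naming are hypothetical.

```python
import subprocess
from pathlib import Path


def aapt_dump_command(apk_path: str) -> list[str]:
    """Build the aapt argument list that dumps the manifest as an XML tree."""
    return ["aapt", "dump", "xmltree", apk_path, "AndroidManifest.xml"]


def dump_manifest(apk_path: str, out_dir: str) -> Path:
    """Run aapt on one APK and save the XML tree dump as <apk name>.txt.

    Assumes aapt is installed; raises CalledProcessError if aapt fails.
    """
    result = subprocess.run(
        aapt_dump_command(apk_path),
        capture_output=True,
        text=True,
        check=True,
    )
    out_path = Path(out_dir) / (Path(apk_path).stem + ".txt")
    out_path.write_text(result.stdout, encoding="utf-8")
    return out_path
```

Calling `dump_manifest` once per APK would yield one .txt document per application, which is the input expected by the pre-processing script described next.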

First, the script filters out all the unnecessary characters and numbers from the XML tree dump. It removes redundant words, such as the descriptors ‘uses-permission’, ‘android:name’ etc. Also, lengthy expressions such as intents and permissions were truncated by removing the common preceding characters and leaving only the parts that uniquely identify the intent or permission. For example, android.permission.RESTART_PACKAGES is shortened to RESTART_PACKAGES and android.intent.action.SMS_RECEIVED is reduced to SMS_RECEIVED. After reducing the text in the manifest file to keep only the potentially informative items, the remaining text was tokenized, i.e. split into individual words, and each token or word is later converted into an element of a vector that is used to characterize the manifest file.
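A rough sketch of this reduction and tokenization step is given below. The authors' script is not published in the paper, so the regular expression, the rule of keeping the text after the last dot, and the function names `shorten` and `manifest_tokens` are assumptions made for illustration only.

```python
import re

# Assumption: attribute values in the aapt dump appear as ="..." pairs;
# descriptors such as 'uses-permission' and 'android:name' sit outside them.
VALUE_RE = re.compile(r'="([^"]+)"')


def shorten(value: str) -> str:
    """Keep only the part after the last dot, e.g. drop 'android.permission.'."""
    return value.rsplit(".", 1)[-1]


def manifest_tokens(dump_text: str) -> list[str]:
    """Return the shortened attribute values found in one manifest dump."""
    tokens = []
    for value in VALUE_RE.findall(dump_text):
        token = shorten(value.strip())
        if token and not token.isdigit():  # discard empty and purely numeric items
            tokens.append(token)
    return tokens


sample = (
    'E: uses-permission (line=7)\n'
    '  A: android:name(0x01010003)="android.permission.RESTART_PACKAGES"\n'
    'E: action (line=20)\n'
    '  A: android:name(0x01010003)="android.intent.action.SMS_RECEIVED"\n'
)
print(manifest_tokens(sample))
```

On the two attribute lines in `sample`, the sketch yields the shortened forms RESTART_PACKAGES and SMS_RECEIVED used as examples in the text.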

The tokenized words contained in the files from the training set are also used to create the ‘vocabulary’ or ‘bag of words’ from which the training vectors are created. The vocabulary words were stored in a separate text file. Each application was represented by a single text document that contained the reduced and tokenized text, and an array was created from these documents to hold the contents of each document in a single line (as one element of the array). Thus, the number of items in the array was equal to the number of processed documents.

The texts_to_matrix function from the Tokenizer class of the Keras text pre-processing Python library was used to process all the lines in the array to produce three different types of features from the bag of words. These features were binary, frequency and tf-idf. The dimensions of the feature vectors were equal to the length of the vocabulary created from the training set. The top 25 words in the vocabulary produced from a training set are shown in Table I. While creating the vocabulary, only the words that had more than 5 occurrences in all of the documents from the training set were kept. Those with fewer than 5 were eliminated. This resulted in a total of 958 unique words in the vocabulary created from this training set.
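The vocabulary construction and the binary and count-based vectors can be illustrated with the standard library alone. The sketch below is a stand-in for the Keras texts_to_matrix call rather than a reproduction of it; the function names `build_vocabulary` and `vectorize` are hypothetical, and the threshold follows the "more than 5 occurrences" rule stated above.

```python
from collections import Counter


def build_vocabulary(train_docs, min_occurrences=5):
    """Keep words occurring more than min_occurrences times across all documents."""
    counts = Counter(token for doc in train_docs for token in doc)
    return sorted(word for word, n in counts.items() if n > min_occurrences)


def vectorize(doc, vocabulary, mode="binary"):
    """Turn one tokenized document into a vector with len(vocabulary) elements."""
    counts = Counter(doc)
    if mode == "binary":
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    if mode == "count":
        return [counts[word] for word in vocabulary]
    raise ValueError("mode must be 'binary' or 'count'")


# Toy example with a threshold of 1 instead of 5 to keep the data small.
docs = [["INTERNET", "SEND_SMS"], ["INTERNET"], ["INTERNET", "INTERNET"]]
vocab = build_vocabulary(docs, min_occurrences=1)
print(vocab, vectorize(docs[2], vocab, "count"))
```

Because the vocabulary is built from the training set only, words that appear solely in test documents are ignored at vectorization time, which keeps the vector length fixed.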

Fig. 1: Manifest file pre-processing using AAPT to extract contents and scripts to remove redundant and unnecessary text.

Table I: Top 25 words from a training set vocabulary (bag of words) extracted from the manifests of botnet and benign apps.

Top vocabulary words in the training set    Occurrences
MAIN                                        3506
LAUNCHER                                    3295
INTERNET                                    3293
ACCESS_NETWORK_STATE                        2759
READ_PHONE_STATE                            2565
WRITE_EXTERNAL_STORAGE                      2210
BOOT_COMPLETED                              1847
RECEIVE_BOOTCOMPLETED                       1751
DEFAULT                                     1479
WAKELOCK                                    1401
ACCESS_WIFI_STATE                           1322
SEND_SMS                                    1308
READ_CONTACTS                               1282
VIBRATE                                     1248
READ_SMS                                    1168
RECEIVE_SMS                                 1077
CALL_PHONE                                  886
ACCESS_FINE_LOCATION                        874
WRITE_SMS                                   847
ACCESS_COARSE_LOCATION                      823
WRITE_CONTACTS                              788
WRITE_SETTINGS                              636
PHONE_STATE                                 560
INSTALL_SHORTCUT                            553
USER_PRESENT                                541


Fig. 2: Word cloud for benign subset.

Fig. 3: Word cloud for botnet subset.

From Table I, we can see that the most prevalent word in the vocabulary set was ‘MAIN’, which was seen 3506 times in all documents within the training set. Other popular words include: LAUNCHER (3295), INTERNET (3293), ACCESS_NETWORK_STATE (2759), READ_PHONE_STATE (2565), WRITE_EXTERNAL_STORAGE (2210), SEND_SMS (1308). Notice that most of these words correspond to app permissions. In the vocabulary set, words that represent intents and other components were also present.

We generated a word cloud from training documents for each class, and Figures 2 and 3 illustrate the most prevalent words in the respective classes. The word prevalence is denoted by the sizes of the words, i.e. the larger the size the more times the word appears in the (processed) documents. Figure 3 is the word cloud generated from the documents in the botnet category, while Figure 2 was generated from documents in the benign category. We can see that the word ‘SEND’ is prevalent in the botnet class but it is not prominent in the benign class, as seen from Figure 2. The word ACCESS_DOWNLOAD_MANAGER is also prominent within the botnet category (Figure 3) but not in the benign category (Figure 2).
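The per-class word prevalence that a word cloud visualizes can be approximated by simple frequency counts. The following sketch is illustrative only; the function name `top_words_by_class` and the toy documents are hypothetical and do not come from the dataset.

```python
from collections import Counter


def top_words_by_class(docs, labels, n=5):
    """Return the n most frequent tokens for each class label."""
    counts = {}
    for tokens, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(tokens)
    return {label: counter.most_common(n) for label, counter in counts.items()}


# Toy tokenized manifests with their class labels.
docs = [["SEND_SMS", "INTERNET"], ["SEND_SMS"], ["INTERNET", "VIBRATE"]]
labels = ["botnet", "botnet", "benign"]
print(top_words_by_class(docs, labels, n=1)["botnet"])
```

Comparing the ranked lists of the two classes highlights words that are frequent in one class but rare in the other, which is the contrast the word clouds are meant to show.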

C. Machine learning models implementation

Our proposed approach is based on hybrid learning with deep and traditional shallow learning algorithms to produce the botnet detection model. In this method, the bag of words (BoW) vectors are not used to directly train the traditional machine learning classifiers. Instead, we use the BoW vectors to train a Convolutional Neural Network (CNN) consisting of a number of layers, and then extract a representative vector from the CNN layers. This process is shown in Figure 4. CNN is used to enable the recognition of any hidden patterns and relationships that might exist between the words found in the manifest file which may not be directly captured by the BoW vectors.

Fig. 4: Proposed hybrid learning model for botnet detection using BoW vectors extracted from the manifest text documents.

The CNN layers consisted of four ReLU-activated convolutional layers, each followed by a MaxPooling layer. The input dimension to the first convolutional layer is equal to the size of the BoW vector as derived from the training set. The number of filters used in the convolutional layers was 32, while the size of the filters was 3. For the learning, we use the ‘Adam’ optimizer and the CNN model is run for up to 150 training epochs, with an early stopping callback that terminates the training after 20 epochs of non-improvement in the validation loss. A training-validation split of 90:10 was utilized. After the training, a flat vector is derived from the output of the final MaxPooling layer. In a typical CNN model, the final classification layer would be a sigmoid-activated layer with a single neuron to enable binary classification. Instead of deriving our final classification from a sigmoid-activated layer, we extract the flat vectors (of reduced dimensionality) obtained from the final MaxPooling layer.
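The early stopping rule described above (at most 150 epochs, stop after 20 epochs without improvement in validation loss) can be expressed as a small framework-independent sketch. The function name `stopping_epoch` is hypothetical, and the loss values are made up for illustration; in practice a deep learning library's built-in callback would apply the same logic during training.

```python
def stopping_epoch(val_losses, patience=20, max_epochs=150):
    """Return the 1-based epoch at which training would stop.

    Training halts once `patience` consecutive epochs pass without the
    validation loss dropping below the best value seen so far.
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Made-up curve: improves for two epochs, then plateaus.
losses = [1.0, 0.9] + [0.95] * 30
print(stopping_epoch(losses))  # stops 20 epochs after the best epoch (epoch 2)
```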

As shown in Figure 4, the vector derived from the CNN layers is concatenated with the original BoW vector and, after normalization, feature reduction is applied through the use of a feature selection algorithm such as Chi Square or Information Gain. This further reduces the dimensionality while selecting the best features for classifier training. The final selected features are used to train a high-performing traditional machine learning classifier such as SVM, RF, DT or KNN. In our experiments we used Chi Square to select the top 200 features before applying them to the final ML classifier.
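The concatenation, normalization and Chi Square selection stages can be sketched in plain Python as below. Several details are assumptions, since the paper does not specify them: min-max scaling is used for the normalization, the chi-square statistic is computed from class-wise feature sums, and all function names are hypothetical. The toy data selects 2 features where the paper selects 200.

```python
def concatenate(bow_vectors, cnn_vectors):
    """Join each BoW vector with the CNN-derived vector of the same app."""
    return [list(b) + list(c) for b, c in zip(bow_vectors, cnn_vectors)]


def min_max_normalize(X):
    """Scale every feature column to [0, 1]; constant columns become 0."""
    cols = list(zip(*X))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in X]


def chi2_scores(X, y):
    """Chi-square score per non-negative feature from class-wise sums."""
    n = len(X)
    scores = []
    for j in range(len(X[0])):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in sorted(set(y)):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = total * y.count(c) / n
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(X, y, k):
    """Keep the k highest-scoring feature columns (original order preserved)."""
    scores = chi2_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in X], keep


bow = [[1, 1], [1, 1], [0, 1], [0, 1]]   # toy binary BoW vectors
cnn = [[0.0], [0.0], [0.0], [2.0]]       # toy CNN-derived vectors
y = [1, 1, 0, 0]                         # 1 = botnet, 0 = benign
X = min_max_normalize(concatenate(bow, cnn))
X_selected, kept = select_top_k(X, y, k=2)
print(kept)
```

The reduced matrix `X_selected` is what would be passed to the final classifier (SVM, RF, DT or KNN) for training.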


IV. EXPERIMENTS AND RESULTS

In this section we present the results of the study undertaken to evaluate our proposed system. The first set of experiments was with the BoW vectors and traditional machine learning classifiers. Further experiments were performed with several configurations of the proposed hybrid method and the results are compared with the baseline BoW vectors approach. Note that all of the results presented were obtained using 90% of the dataset for training and 10% for testing. This was done 10 times using 10 different equal sized segments of the dataset for testing and the average was taken to obtain the results presented in the tables.
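The evaluation protocol just described corresponds to 10-fold cross validation. A minimal sketch of the index splitting is shown below; the function name `k_fold_indices`, the shuffling and the fixed seed are assumptions, since the paper does not state how samples were assigned to segments.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumption: samples are shuffled once
    # Spread the remainder so that fold sizes differ by at most one sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(3858, k=10))
print([len(test) for _, test in folds])
```

With 3858 samples the segments cannot be exactly equal, so this sketch produces test folds of 385 or 386 samples; each sample is used for testing exactly once and the per-fold metrics are then averaged.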

A. Results of the ML models with BoW binary vectors

Table II shows results from models trained with binary features. These are categorical features denoted by a ‘0’ or ‘1’ that capture the presence or absence of the tokenized words in the vocabulary. The length of the feature vector was equal to the number of tokens (words) in the bag of words/vocabulary. In the table, we present the results of Gaussian Naïve Bayes (GNB), KNN, Decision Tree (DT), SVM and Random Forest (RF). GNB performed poorly on the binary BoW vectors, yielding only 55.7% accuracy and showing a huge disparity in the True Positive Rates (TPR), with 98.5% for benign and only 12.4% for botnet. In terms of overall accuracy, KNN (with 94.5%) performed better than SVM (92.7%) and DT (93.2%). However, the best performance was obtained with RF, i.e. 97.3% TPR for botnet, 93.7% TPR for benign, and overall accuracy of 95.5%.
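The per-class TPR and overall accuracy reported in the tables can be computed as in the following sketch. The function names and the toy predictions are hypothetical and only illustrate the definitions of the metrics.

```python
def tpr_per_class(y_true, y_pred):
    """True positive rate for each class: correct predictions / class size."""
    totals, hits = {}, {}
    for truth, pred in zip(y_true, y_pred):
        totals[truth] = totals.get(truth, 0) + 1
        if truth == pred:
            hits[truth] = hits.get(truth, 0) + 1
    return {c: hits.get(c, 0) / totals[c] for c in totals}


def accuracy(y_true, y_pred):
    """Fraction of all samples that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy balanced test fold: one botnet sample is misclassified as clean.
y_true = ["botnet"] * 4 + ["clean"] * 4
y_pred = ["botnet", "botnet", "botnet", "clean"] + ["clean"] * 4
print(tpr_per_class(y_true, y_pred), accuracy(y_true, y_pred))
```

On a perfectly balanced test set the overall accuracy equals the mean of the two class TPRs, which explains why a classifier such as GNB, with one very low TPR, ends up with an accuracy only slightly above 50%.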

Table II: Results of singular ML models with binary features.

Classifier    TPR (botnet)    TPR (clean)    Accuracy
GNB           0.124           0.985          0.557       0.456
KNN           0.970           0.922          0.945
DT            0.922           0.941          0.932
SVM           0.937           0.918          0.927
RF            0.973           0.937          0.955

B. Results of the ML models with BoW frequency vectors

Table III shows results obtained from models trained with BoW frequency features. These features capture how often tokenized words in the vocabulary appear in a document. It can be seen that for the KNN, DT, RF and SVM classifiers, the results were below those of the BoW binary features shown in Table II. For the RF classifier, the results are quite close. Again, the RF classifier gave the best overall accuracy result of 95.4%, compared to DT which was the next best classifier with 92.6% overall accuracy.

Table III: Results of singular models with frequency features.

Classifier    TPR (botnet)    TPR (clean)    Accuracy
GNB           0.182           0.984          0.586       0.505
KNN           0.921           0.931          0.924       0.925
DT            0.916           0.937          0.926
SVM           0.931           0.913          0.922       0.921
RF            0.971           0.936          0.954

C. Results of the ML models with tf-idf BoW vectors

Table IV shows results from models trained with TF-IDF features. The TF-IDF features provide a weighting that determines the importance of a word or token based on a metric derived from its frequency in a document (representing one application manifest) and the number of documents (i.e. applications) where the word or token appears. The TF-IDF weight of a word w in a document j is given by $N_j(w) \cdot \log\left(D_T / D(w)\right)$, where $N_j(w)$ is the number of times the word w appears in document j, $D_T$ is the total number of documents, and $D(w)$ is the number of documents containing the word w. From Table IV we can see that TF-IDF did not improve the overall accuracies over the binary features for any of the classifiers. The best classifier was RF with 95.5%, which is the same as obtained with the binary features.
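As a worked illustration of the TF-IDF weight, the sketch below computes the product of the in-document count of a word and the logarithm of the total number of documents divided by the number of documents containing that word. The natural logarithm is an assumption, since the paper does not state the base, and library implementations may use smoothed variants of this weighting. The function name `tf_idf_weight` and the toy documents are hypothetical.

```python
import math


def tf_idf_weight(word, doc, docs):
    """Weight of `word` in `doc`: count in doc * log(total docs / docs with word)."""
    count_in_doc = doc.count(word)
    docs_with_word = sum(1 for d in docs if word in d)
    if docs_with_word == 0:
        return 0.0  # word never occurs, so it carries no weight
    return count_in_doc * math.log(len(docs) / docs_with_word)


docs = [["SEND_SMS", "INTERNET"], ["INTERNET"], ["INTERNET", "INTERNET"], ["SEND_SMS"]]
print(tf_idf_weight("INTERNET", docs[2], docs))  # 2 * log(4/3)
print(tf_idf_weight("SEND_SMS", docs[0], docs))  # 1 * log(4/2)
```

A word that appears in every document receives a weight of zero under this formula, so ubiquitous manifest tokens contribute nothing to the TF-IDF vector.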

Table IV: Results of singular ML models with TF-IDF features.

Classifier    TPR (botnet)    TPR (clean)    Accuracy
GNB           0.124           0.985          0.557       0.456
KNN           0.960           0.926          0.943
DT            0.922           0.940          0.931
SVM           0.937           0.918          0.927
RF            0.973           0.935          0.955

D. Enhanced performance using the proposed deep learning based scheme

In Table V, we present the results of the proposed hybrid scheme described in section III. Because the BoW binary vectors showed better results compared to frequency and TF-IDF, we used them as the input to the hybrid model as described in Figure 4. From the results it is clear that the CNN based feature extraction scheme improved the results for all the classifiers. Except for CNN-GNB, all four of the other hybrid classifiers obtained greater than 95% overall accuracy. The best results were from the CNN-SVM model, which had 97.6% TPR for botnet and 95.9% TPR for benign. The overall accuracies of CNN-DT (95.9%), CNN-RF (96.8%) and CNN-SVM (96.9%) were better than the best result from the singular RF classifier with binary BoW vectors, which yielded 95.5% accuracy.

Table V: CNN-ML manifest text mining results.

Classifier    TPR (botnet)    TPR (clean)    Accuracy
CNN-GNB       0.246           0.993          0.622       0.560
CNN-KNN       0.967           0.937          0.952
CNN-DT        0.961           0.956          0.959
CNN-SVM       0.976           0.959          0.969
CNN-RF        0.976           0.959          0.968

E. Pre-processing and model training overhead

The pre-processing steps were performed on an Ubuntu Linux machine with an Intel Core i7 3.3GHz CPU and 8GB RAM. The time taken by our scripts to pre-process the manifest file, extract tokens into .txt documents and derive the BoW vectors amounted to about 4 minutes 37 seconds for 3858 applications, which is an average of 0.072 seconds per application. This indicates that the approach incurs quite a low pre-processing overhead and is thus feasible to implement in practice. The average training time for the hybrid models using 3472 instances (90%) was about 95 seconds (mainly dominated by the CNN vector extraction part), while the testing time was much less, and varied depending on the ML classifier. Testing time for CNN-DT was the lowest with an average of 0.0026 milliseconds per instance, while that of CNN-RF was the highest, averaging around 0.0192 milliseconds per instance.

V. CONCLUSIONS AND FUTURE WORK

In this paper we proposed and investigated an Android botnet detection method based on app manifest text mining and a hybrid learning classification model. The classification model utilizes deep learning, i.e. CNN, to extract additional features that are subsequently concatenated to the original BoW binary features to improve the performance of the model. The advantage of employing text mining is the ability for more effective feature extraction compared to the manual methods which are more prevalent in the current literature. The results of our experiments performed on 3858 applications showed that the proposed hybrid model enhances the accuracy obtained with the best singular classifiers. The CNN-SVM hybrid model yielded 96.9% accuracy with a 97.6% botnet detection rate, which are comparable to the state-of-the-art results. Future work will explore avenues to enhance performance, including text mining of additional components of the Android applications.

REFERENCES

[1] McAfee. McAfee Labs Threat Report 06.21. [Online]: https://www.mcafee.com/enterprise/en-us/assets/reports/rp-threats-jun-2021.pdf [accessed 10 April 2022].

[2] Chris Brook, “Google Eliminates Android Adfraud Botnet Chamois,” Threat Post, March 2017. [Online]: https://threatpost.com/google-eliminates-android-adfraud-botnet-chamois/124311/ [accessed 10 April 2022].

[3] Fahmida Y. Rashid, “Chamois: The Big Botnet You Didn’t Hear about,” Decipher, by Duo Security, April 2019. [Online]: https://duo.com/decipher/chamois-the-big-botnet-you-didnt-hear-about [accessed 10 April 2022].

[4] S. Jadhav, S. Dutia, K. Calangutkar, T. Oh, Y. H. Kim, and J. N. Kim, “Cloud-based android botnet malware detection system,” in 2015 17th International Conference on Advanced Communication Technology (ICACT), 2015, pp. 347–352.

[5] C. Tansettanakorn, S. Thongprasit, S. Thamkongka, and V. Visoottiviseth, “ABIS: A prototype of Android Botnet Identification System,” in 2016 Fifth ICT International Student Project Conference (ICT-ISPC), Nakhonpathom, Thailand, 27–28 May 2016, pp. 1–5.

[6] A. F. A. Kadir, N. Stakhanova, and A. A. Ghorbani, “Android botnets: What urls are telling us,” in International Conference on Network and System Security, Springer, 2015, pp. 78–91.

[7] ISCX Android botnet dataset. [Online]: https://www.unb.ca/cic/datasets/android-botnet.html [accessed 03 March 2022].

[8] A. Karim, R. Salleh, and S. A. A. Shah, “Dedroid: A mobile botnet detection approach based on static analysis,” in 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), 2015, pp. 1327–1332.

[9] X. Meng, and G. Spanoudakis, "MBotCS: A mobile botnet detection

system based on machine learning. Lecture Notes in Computer Science,

9572, 2016, pp. 274-291. doi: 10.1007/978-3-319-31811-0_17

[10] W. Hijawi, J. Alqatawna, and H. Faris, “Toward a detection framework

for android botnet,” in 2017 International Conference on New Trends in

Computing Sciences (ICTCS), 2017, pp. 197–202.

[11] Z. Abdullah, M. M. Saudi and N. B. Anuar,”ABC: Android Botnet

Classification Using Feature Selection and Classification Algorithms”.

Adv. Sci. Lett. 2017, 23, 4717–4720.

[12] M. Yusof, M. M. Saudi, and F. Ridzuan, “A new mobile botnet

classification based on permission and API calls,” In Proceedings of the

Seventh International Conference on Emerging Security Technologies

(EST), Canterbury, UK, 6–8 September 2017; pp. 122–127.

[13] M. Yusof, M. M. Saudi, and F. Ridzuan, “Mobile Botnet Classification

by using Hybrid Analysis,” Int. J. Eng. Technol. 2018, 7, 103–108.

[14] S. Y. Yerima, and A. Bashar, “Bot-IMG: “A framework for image-based

detection of Android botnets using machine learning”. In Proc. of the 18th

ACS/IEEE International Conf. on Computer Systems and Applications

(AICCSA 2021), Tangier, Morocco, 30 Nov. to 3 Dec., 2021; pp. 1–7.

[15] S. Hojjatinia, S. Hamzenejadi, and H. Mohseni, “Android botnet de-

tection using convolutional neural networks,” in 2020 28th Iranian

Conference on Electrical Engineering (ICEE), 2020, pp. 1–6.

[16] M. Moodi, M. Ghazvini, H. Moodi, and B. Ghawami, “A smart adaptive particle swarm optimization–support vector machine: Android botnet detection application,” J. Supercomput. 2020, 76, 9854–9881.

[17] S. Y. Yerima, M. Alzaylaee, A. Shajan, and P. Vinod, “Deep learning techniques for Android botnet detection,” Electronics, vol. 10, no. 519, 2021. https://doi.org/10.3390/electronics10040519

[18] S. Anwar, J. M. Zain, Z. Inayat, R. U. Haq, A. Karim, and A. N. Jabir, “A static approach towards mobile botnet detection,” in Proc. 3rd Int. Conference on Electronic Design (ICED), 2016, IEEE, pp. 563–567.

[19] S. Y. Yerima and A. Bashar, “A novel Android botnet detection system using image-based and manifest file features,” Electronics, vol. 11, no. 486, 2022. https://doi.org/10.3390/electronics11030486

[20] B. Alothman and P. Rattadilok, “Android botnet detection: An integrated source code mining approach,” in 12th International Conference for Internet Technology and Secured Transactions (ICITST), 11–14 Dec., Cambridge, UK, 2017, IEEE, pp. 111–115.

[21] U. Raghav, E. Martinez-Marroquin, and W. Ma, “Static analysis for Android malware detection with document vectors,” in 2021 International Conference on Data Mining Workshops (ICDMW), 2021, pp. 805–812. doi: 10.1109/ICDMW53433.2021.00104

[22] M. Mimura and Y. Suga, “Filtering malicious JavaScript code with Doc2Vec on an imbalanced dataset,” in Proc. 2019 14th Asia Joint Conference on Information Security (AsiaJCIS), 2019, pp. 24–31. https://doi.org/10.1109/AsiaJCIS.2019.000-9

[23] S. Ndichu, S. Kim, S. Ozawa, T. Misu, and K. Makishima, “A machine learning approach to detection of JavaScript-based attacks using AST features and paragraph vectors,” Appl. Soft Comput. 84, 105721 (2019).

[24] M. Mimura and H. Tanaka, “A linguistic approach towards intrusion detection in actual proxy logs,” in Proc. 20th International Conference on Information and Communications Security, ICICS 2018, Lille, France, Oct. 29–31, 2018, pp. 708–718.

[25] M. Hall et al., “The WEKA data mining software: An update,” ACM SIGKDD Explor. Newslett., vol. 11, no. 1, pp. 10–18, Jun. 2009.

[26] S. Y. Yerima and S. Khan, “Longitudinal performance analysis of machine learning based Android malware detectors,” in Proceedings of the 2019 International Conference on Cyber Security and Protection of Digital Services (Cyber Security), Oxford, UK, 3–4 June 2019.

[27] Y. Nagano and R. Uda, “Static analysis with paragraph vector for malware detection,” in Proc. 11th International Conf. on Ubiquitous Information Management and Communication, IMCOM 2017, Article no. 80, pp. 1–7. https://doi.org/10.1145/3022227.3022306

[28] E. Raff, J. Sylvester, and C. Nicholas, “Learning the PE header, malware detection with minimal domain knowledge,” in Proc. of the 10th ACM Workshop on Artificial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pp. 121–132. https://doi.org/10.1145/3128572.3140442
