FIGURE 3 - uploaded by Domhnall Carlin
Automated data collection system model.

Source publication
Article
The arms race between distributors of malware and those seeking to provide defenses has so far favored the former. Signature detection methods have been unable to cope with the onslaught of new binaries aided by rapidly developing obfuscation techniques. Recent research has focused on the analysis of low-level opcodes, both static and dynamic, as a...

Context in source publication

Context 1
... examining executables at the opcode level, these can be observed. As shown in fig.3, the system automates processing of the executables and extraction of runtraces. ...

Similar publications

Article
Malware and its variants continue to pose a threat to network security. Machine learning has been widely used in the field of malware classification, but some emerging studies, such as attention mechanisms, are rarely applied in this field. In this paper, we analyze the correspondence between bytecode and disassembly of malware, and propose a new f...
Preprint
With the increase of IoT devices and technologies coming into service, Malware has risen as a challenging threat with increased infection rates and levels of sophistication. Without strong security mechanisms, a huge amount of sensitive data is exposed to vulnerabilities, and therefore, easily abused by cybercriminals to perform several illegal act...
Article
Machine learning has already become one of the most widely used techniques in the field of computer science, and it has been widely applied in image processing, natural language processing, network security and other fields. However, there have been many security threats that need to be overcome on current machine learning algorithms and training da...
Article
In computer security, botnets still represent a significant cyber threat. Concealing techniques such as the dynamic addressing and the Domain Generation Algorithms (DGAs) require an improved and more effective detection process. To this extent, this data descriptor presents a collection of over 30 million manually-labeled algorithmically generated...
Article
As the bring your own device (BYOD) to work trend grows, so do the network security risks. This fast-growing trend has huge benefits for both employees and employers. With malware, spyware and other malicious downloads, tricking their way onto personal devices, organizations need to consider their information security policies. Malicious programs c...

Citations

... The classifiers exhibited superior accuracy compared to the scanner, doubling its performance. In another study by Kolter and Maloof, Boolean n-gram analysis was applied to the hexadecimal representation of malware executables [24], yielding excellent accuracy using a boosted decision tree classifier with an area under the receiver operating characteristic (ROC) curve of 0.996. Moskovitch et al. [25] collected data on 323 features per second across five unseen worm variants and achieved an average detection rate of over 90% accuracy using only 20 features, with certain worm variants surpassing 99% accuracy. ...
Article
In this study, the methodology of cyber-resilience in small and medium-sized organizations (SMEs) is investigated, and a comprehensive solution utilizing prescriptive malware analysis, detection and response using open-source solutions is proposed for detecting new emerging threats. By leveraging open-source solutions and software, a system specifically designed for SMEs with up to 250 employees is developed, focusing on the detection of new threats. Through extensive testing and validation, as well as efficient algorithms and techniques for anomaly detection, safety, and security, the effectiveness of the approach in enhancing SMEs’ cyber-defense capabilities and bolstering their overall cyber-resilience is demonstrated. The findings highlight the practicality and scalability of utilizing open-source resources to address the unique cybersecurity challenges faced by SMEs. The proposed system combines advanced malware analysis techniques with real-time threat intelligence feeds to identify and analyze malicious activities within SME networks. By employing machine-learning algorithms and behavior-based analysis, the system can effectively detect and classify sophisticated malware strains, including those previously unseen. To evaluate the system’s effectiveness, extensive testing and validation were conducted using real-world datasets and scenarios. The results demonstrate significant improvements in malware detection rates, with the system successfully identifying emerging threats that traditional security measures often miss. The proposed system represents a practical and scalable solution using containerized applications that can be readily deployed by SMEs seeking to enhance their cyber-defense capabilities.
... Carlin et al. [33] proposed a new run-trace dataset of more than 100,000 labeled samples to address these shortcomings and made the dataset itself available to the research community. Xu et al. [34] provide DroidEvolver, an Android malware detection system that can automatically and continuously update itself when malware is detected without any human intervention. ...
... Previous work in this direction [33] has shown that static analysis cannot unravel the obfuscated code. The previous studies mentioned above have the following limitations: higher memory consumption, limited dataset, higher detection rate on limited dataset, high computational complexity, implementation of a selection approach, with limited features, limited implementation of 100% classification algorithms on datasets, and no advanced malware detection capabilities. ...
Article
Malware detection refers to the process of detecting the presence of malware on a host system, or that of determining whether a specific program is malicious or benign. Machine learning-based solutions first gather information from applications and then use machine learning algorithms to develop a classifier that can distinguish between malicious and benign applications. Researchers and practitioners have long paid close attention to the issue. Most previous work has addressed the differences in feature importance or the computation of feature weights, which is unrelated to the classification model used, and therefore, the implementation of a selection approach with limited feature hiccups, and increases the execution time and memory usage. BFEDroid is a machine learning detection strategy that combines backward, forward, and exhaustive subset selection. This proposed malware detection technique can be updated by retraining new applications with true labels. It has higher accuracy (99%), lower memory consumption (1680), and a shorter execution time (1.264SI) than current malware detection methods that use feature selection.
... Dynamic sequences of opcodes extracted at runtime were used to train and test different malware detection models (LSTM-RNN, Bi-LSTM, and CNN). The dataset presented in [222] was used for analysis while the TensorFlow framework was used in the implementation. A CNN-based technique that detects malware attacks in a metamorphic environment was implemented using TensorFlow and Keras [202]. ...
Preprint
Malware is one of the most common and severe cyber-attacks today. Malware infects millions of devices and can perform several malicious activities including mining sensitive data, encrypting data, crippling system performance, and many more. Hence, malware detection is crucial to protect our computers and mobile devices from malware attacks. Deep learning (DL) is one of the emerging and promising technologies for detecting malware. The recent high production of malware variants against desktop and mobile platforms makes DL algorithms powerful approaches for building scalable and advanced malware detection models as they can handle big datasets. This work explores current deep learning technologies for detecting malware attacks on the Windows, Linux, and Android platforms. Specifically, we present different categories of DL algorithms, network optimizers, and regularization methods. Different loss functions, activation functions, and frameworks for implementing DL models are presented. We also present feature extraction approaches and a review of recent DL-based models for detecting malware attacks on the above platforms. Furthermore, this work presents major research issues on malware detection including future directions to further advance knowledge and research in this field.
... In this direction, researchers had many promising results on malware detection based on machine learning using static or dynamic features. Dynamic features include memory usage [30], instruction traces [9], network traffic [11], API call trace [10], [12]. The effectiveness of dynamic analysis is highly dependent on the malware execution environment. ...
Article
Our world has recently witnessed the explosive growth of IoT networks as one of the pillars of the 4th industrial revolution. Malware on IoT devices has grown accordingly in number and sophistication. Therefore, it is necessary to come up with more efficient approaches to IoT malware detection with machine learning models that can be used in solutions using limited resources. In this paper, we study and evaluate the efficiency of using term frequency-inverse document frequency (TF-IDF) weights as a feature selection method, combined with an effective machine learning model, in IoT malware detection based on opcode sequence features. We performed experiments on a MIPS ELF dataset that included 4,511 malicious samples in four main classes and 4,393 benign programs. Experimental results show that our proposed method performs very well on this dataset, with detection and classification accuracies of 99.8% and 95.8% respectively, while the models use only the 20 opcodes that have the highest weight values.
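To make the opcode-selection idea in the abstract above concrete, the following is a minimal sketch of ranking opcodes by a TF-IDF weight and keeping the top k. It is not the authors' implementation: the function names, the choice to sum per-sample TF-IDF scores, the unsmoothed IDF formula, and the toy MIPS-like traces are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_opcode_weights(samples):
    """Return {opcode: summed TF-IDF weight} over all samples.

    `samples` is a list of opcode sequences (lists of mnemonic strings).
    Assumption: per-sample TF-IDF scores are summed to rank opcodes.
    """
    n_docs = len(samples)
    # Document frequency: number of samples in which each opcode occurs.
    doc_freq = Counter()
    for seq in samples:
        doc_freq.update(set(seq))

    weights = Counter()
    for seq in samples:
        counts = Counter(seq)
        total = len(seq)
        for opcode, count in counts.items():
            tf = count / total                          # term frequency
            idf = math.log(n_docs / doc_freq[opcode])   # unsmoothed IDF
            weights[opcode] += tf * idf
    return dict(weights)


def top_k_opcodes(samples, k=20):
    """Return the k opcodes with the highest weight (ties broken by name)."""
    weights = tfidf_opcode_weights(samples)
    return sorted(weights, key=lambda op: (-weights[op], op))[:k]


# Hypothetical toy traces, not taken from any real dataset.
toy_samples = [
    ["lw", "addiu", "sw", "jal", "lw"],
    ["lw", "addiu", "xor", "xor", "sw"],
    ["lw", "addiu", "sw", "syscall"],
]
toy_weights = tfidf_opcode_weights(toy_samples)
selected = top_k_opcodes(toy_samples, k=2)
```

In this toy input, opcodes that occur in every sample (`lw`, `addiu`, `sw`) receive weight zero, so the ranking favors opcodes that are frequent in some samples but not universal. That is the property that makes a small top-k set plausible as a feature space.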
... Experimental results reveal that dynamic analysis achieved significant results in the number of cases. Carlin et al. [11] created a dynamic-analysis-based dataset of 100,000 labeled samples. They developed a malware model based on sequence-based and count-based data. ...
... Earlier research in this direction [11,68] has shown that static analysis fails to unravel obfuscated code. The previous research mentioned above has the following limitations: limited data set, higher detection rate with limited data set, high computation burden, limited feature selection approaches, limited classification algorithms implemented using a 100% labelled data set, and inability to detect sophisticated malware. ...
... Figure 3 demonstrates the representation of U-matrix. For U-matrix, we employ a heat color map to represent the distance between the weight vectors ... [Table 9: Feature selection approaches implemented in this study. Additional information related to the Gain-ratio feature selection approach, Principal Component Analysis (PCA) and Logistic regression is given in Appendix B.] ...
Article
Android has gained popularity due to its open-source nature and the number of freely available apps in its official play store. Appropriate functioning of Android apps depends upon the permission or set of permissions which an app demands at the time of installation and run-time. By taking advantage of these permissions or sets of permissions, cybercriminals are developing malware-infected apps daily. In this study, we propose a framework named “SOMDROID” that works on the principle of an unsupervised machine learning algorithm. To develop an effective and efficient Android malware detection model, we collect 500,000 distinct Android apps from promised repositories and extract 1844 unique features. Further, to select significant features or feature sets, we applied six different feature ranking approaches in this study. With the selected feature or feature sets, we implement the Self-Organizing Map (SOM) algorithm of Kohonen and measure four distinct performance parameters, i.e., Intra-cluster distance, Inter-cluster distance, Accuracy and F-measure. Empirical results reveal that our proposed framework is able to detect 98.7% of malware that belongs to unknown families, and in addition the detection rate is higher by 2% when compared to commercial anti-virus scanners and frameworks proposed in the literature.
... The monitoring process reveals process creation, file, and registry manipulation, and modifications of memory values, registers, and variables [22]. For instance, researchers [24] proposed a method to distinguish benign programs from malicious programs using the features of memory and registers usage, while another previous study [25] developed a method that performed dynamic analysis on virtual machines (VMs) to extract program run-time traces from both benign and malicious executables. Besides, in previous studies [26][27][28], network traffic features such as Hyper Text Transfer Protocol (HTTP) and Domain Name System (DNS) requests, host-based events, and metadata such as IP addresses, ports, and packet counts have been actively utilized to detect and classify packets as normal or threats, while other studies [29][30][31][32][33] have presented a dynamic approach to detect malicious programs using Application Programming Interface (API) call traces. ...
Article
Intrusion detection systems (IDS) play a vital role in traffic flow monitoring on Internet of Things networks by providing a secure network traffic environment and blocking unwanted traffic packets. Various IDS approaches have been proposed previously based on data mining, fuzzy techniques, genetic, neuro-genetic, particle swarm intelligence, rough sets, and conventional machine learning (ML). However, these methods are not energy efficient and do not perform accurately due to the inappropriate feature selection or the use of full features of datasets. In general, datasets contain more than 10 features. Any ML-based lightweight IDS trained with full features turns into an inefficient and heavyweight IDS. This case challenges IoT networks that suffer from power efficiency problems. Therefore, lightweight (energy-efficient), accurate, and high-performance IDS are paramount instead of inefficient and heavyweight IDS. To address these challenges, a new approach that can help to determine the most effective and optimal feature pairs of datasets which enable the development of lightweight IDS was proposed. For this purpose, 10 ML algorithms and the recent BoT-IoT(2018) dataset were selected. 12-best-features recommended by the developers of this dataset were used in this study. 66 unique feature pairs were generated from the 12-best-features. Next, 10 full feature-based IDS were developed by training the 10 ML algorithms with the 12-full-features. Similarly, 660 feature pair-based lightweight IDS were developed by training the 10 ML algorithms via each feature pair out of the 66 feature pairs. Moreover, the 10 IDS trained with 12-best-features and the 660 IDS trained via 66 feature pairs were compared to each other based on the ML algorithmic groups. Then, the feature pair-based lightweight IDS that achieved the accuracy level of the ten full-feature-based IDS were selected. This way, the optimal and efficient feature pairs, and the lightweight IDS were determined. 
The most lightweight IDS achieved more than 90% detection accuracy.
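The counts quoted in the abstract above follow from simple combinatorics: 12 features yield C(12, 2) = 66 unordered pairs, and training each of 10 algorithms on each pair gives 660 pair-based models. A minimal sketch that checks this arithmetic is shown below; the feature names are placeholders, not the actual BoT-IoT feature names.

```python
from itertools import combinations

N_FEATURES = 12    # size of the "best features" set quoted in the abstract
N_ALGORITHMS = 10  # number of ML algorithms quoted in the abstract

# Placeholder names; the real dataset's feature names are not given here.
feature_names = [f"feature_{i}" for i in range(1, N_FEATURES + 1)]

# Every unordered pair of distinct features.
feature_pairs = list(combinations(feature_names, 2))

# One model per (algorithm, feature pair) combination.
n_pair_models = len(feature_pairs) * N_ALGORITHMS
```

Because pairs are unordered, `(feature_1, feature_2)` and `(feature_2, feature_1)` count once, which is why the total is 66 rather than 132.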
... In an effort to assist the development of much more effective heuristic-based malware detection methods, [6] introduces a dataset that consists of dynamically yielded opcodes belonging to various malware and benign executables. These opcodes were obtained from runtime trace using virtual machine-aided dynamic analysis. ...
... Consequently, this runtime opcode dataset is valuable to develop an advanced malware detection system, provided that the dataset is combined with effective predictive modeling techniques. This system has the potential to be effective against novel malware as well as various concealment and code obfuscation strategies employed by the malware developers according to [6]. The skeleton of our work originates from this dataset. ...
... The comparison could be important since different deep learning algorithms may have distinct inductive biases. Secondly, it is crucial to state that the original dataset from [6] has disproportionately more malware than benign instances. This inadequacy can be critical because class imbalance may severely disturb classification performance by exacerbating false positive error rate. ...
Article
Thousands of new malware codes are developed every day. Signature-based methods, which are employed by common malware detectors, are susceptible to code obfuscation and novel malware. In this paper, we present an alternative method for malware detection, which makes use of assembly opcode sequences obtained during runtime. First, for sequential opcode data, we utilize natural language processing and deep learning techniques to facilitate the extraction of deeper behavioral features. Due to these features, this method can be impervious to code obfuscation and effective against novel malware. Finally, these features are fed to various machine learning algorithms for classification. The experiments on a more class balanced dataset of 26869 samples demonstrated that MCC (Matthew’s correlation coefficient) score as high as 0.95 is achievable with this approach. The MCC score results for the experiments conducted on imbalanced and artificially balanced datasets are 0.81 and 0.83, respectively.
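Since the abstract above reports results as MCC rather than accuracy, a short sketch of how the coefficient is computed from a binary confusion matrix may help. The function name and the example counts are illustrative assumptions, not values from the paper; returning 0.0 when the denominator is zero is a common convention, also an assumption here.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Convention (assumed): undefined cases are reported as 0.0.
        return 0.0
    return numerator / denominator


# Made-up imbalanced example: 90 malware and 10 benign samples,
# where half of the benign samples are misclassified as malware.
tp, tn, fp, fn = 90, 5, 5, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = mcc_from_counts(tp, tn, fp, fn)
```

In this made-up case accuracy is 0.95 while MCC is roughly 0.69, which illustrates why MCC is often preferred when one class dominates the dataset.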
... The MHMM technique is utilized by the Gaussian mixture model (GMM) whenever the quantity of mixture components is known and the borders of perceived data are unlimited, i.e., (−∞, +∞). A disadvantage of the proposed system is that it requires a large number of normal and attack instances to accurately estimate the BMM and HMM parameters; moreover, the system also needs a new function, enabling the algorithm that adjusts the sliding window, to be implemented [21]. Some other similar works for malware detection using HMMs have been conducted in [76,77]. ...
Article
Network anomaly detection systems (NADSs) play a significant role in every network defense system as they detect and prevent malicious activities. Therefore, this paper offers an exhaustive overview of different aspects of anomaly-based network intrusion detection systems (NIDSs). Additionally, contemporary malicious activities in network systems and the important properties of intrusion detection systems are discussed as well. The present survey explains important phases of NADSs, such as pre-processing, feature extraction and malicious behavior detection and recognition. In addition, with regard to the detection and recognition phase, recent machine learning approaches including supervised, unsupervised, new deep and ensemble learning techniques have been comprehensively discussed; moreover, some details about currently available benchmark datasets for training and evaluating machine learning techniques are provided by the researchers. In the end, potential challenges together with some future directions for machine learning-based NADSs are specified.
... This allows one to identify when and how often distinct AVs do not agree on naming strains. This evaluation is important because the use of inconsistent AV labels may even decrease AV classification accuracy [9]. Whereas theoretically AV labels should be standardized by CARO, in practice, non-standard extensions are often implemented by vendors. ...
Article
Security evaluation is an essential task to identify the level of protection accomplished in running systems or to aid in choosing better solutions for each specific scenario. Although antiviruses (AVs) are one of the main defensive solutions for most end-users and corporations, AV’s evaluations are conducted by few organizations and often limited to compare detection rates. Moreover, other important factors of AVs’ operating mode (e.g., response time and detection regression) are usually underestimated. Ignoring such factors create an “understanding gap” on the effectiveness of AVs in actual scenarios, which we aim to bridge by presenting a broader characterization of current AVs’ modes of operation. In our characterization, we consider distinct file types, operating systems, datasets, and time frames. To do so, we daily collected samples from two distinct, representative malware sources and submitted them to the VirusTotal (VT) service for 30 consecutive days. In total, we considered 28,875 unique malware samples. For each day, we retrieved the submitted samples’ detection rates and assigned labels, resulting in more than 1M distinct VT submissions overall. Our experimental results show that: (i) phishing contexts are a challenge for all AVs, turning malicious Web pages detectors less effective than malicious files detectors; (ii) generic procedures are insufficient to ensure broad detection coverage, incurring in lower detection rates for particular datasets (e.g., country-specific) than for those with world-wide collected samples; (iii) detection rates are unstable since all AVs presented detection regression effects after scans in different time frames using the same dataset and (iv) AVs’ long response times in delivering new signatures/heuristics create a significant attack opportunity window within the first 30 days after we first identified a malicious binary. 
To address the effects of our findings, we propose six new metrics to evaluate the multiple aspects that impact the effectiveness of AVs. With them, we hope to help corporate (and domestic) users better evaluate the solutions that fit their needs.
... Dynamic traces are a more robust measure of the program's behavior, since code packers and encrypters can obfuscate and hinder the code instructions from static analysis. Carlin et al. (2017a) presented an approach that performs dynamic analysis on virtual machines to extract program runtime traces from both benign and malicious executables. They analyzed the sequence of opcodes executed to detect malware by testing two algorithms: (1) a Random Forest classifier to classify all count-based data and (2) a Hidden Markov model to classify data based on temporal relations in the opcode sequences. ...
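The two views of a run trace mentioned in the snippet above, count-based and sequence-based, can be illustrated with a small sketch. This is not the cited authors' pipeline: the function names and the toy trace are hypothetical, and the sequence view here is a plain first-order transition estimate (an observable Markov chain), which is simpler than a Hidden Markov model because it has no hidden states.

```python
from collections import Counter, defaultdict


def opcode_counts(trace):
    """Count-based view: how often each opcode appears in the trace."""
    return Counter(trace)


def transition_probabilities(trace):
    """Sequence-based view: estimated P(next opcode | current opcode)."""
    pair_counts = defaultdict(Counter)
    for current, nxt in zip(trace, trace[1:]):
        pair_counts[current][nxt] += 1

    probabilities = {}
    for current, followers in pair_counts.items():
        total = sum(followers.values())
        probabilities[current] = {
            nxt: count / total for nxt, count in followers.items()
        }
    return probabilities


# Hypothetical toy run trace, not taken from any real executable.
toy_trace = ["push", "mov", "mov", "call", "mov", "pop"]
counts = opcode_counts(toy_trace)
transitions = transition_probabilities(toy_trace)
```

A count vector such as `counts` discards ordering and suits classifiers like a Random Forest, whereas `transitions` keeps the temporal relations between consecutive opcodes that sequence models rely on.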
Article
The struggle between security analysts and malware developers is a never-ending battle with the complexity of malware changing as quickly as innovation grows. Current state-of-the-art research focuses on the development and application of machine learning techniques for malware detection due to their ability to keep pace with malware evolution. This survey aims at providing a systematic and detailed overview of machine learning techniques for malware detection and in particular, deep learning techniques. The main contributions of the paper are: (1) it provides a complete description of the methods and features in a traditional machine learning workflow for malware detection and classification, (2) it explores the challenges and limitations of traditional machine learning and (3) it analyzes recent trends and developments in the field with special emphasis on deep learning approaches. Furthermore, (4) it presents the research issues and unsolved challenges of the state-of-the-art techniques and (5) it discusses the new directions of research. The survey helps researchers to have an understanding of the malware detection field and of the new developments and directions of research explored by the scientific community to tackle the problem.