Table 8 - uploaded by Mehedy Masud
Assembly code sequence for binary 4-gram "00005068"

Source publication
Article
We present a scalable and multi-level feature extraction technique to detect malicious executables. We propose a novel combination of three different kinds of features at different levels of abstraction. These are binary n-grams, assembly instruction sequences, and Dynamic Link Library (DLL) function calls; extracted from binary executables, disass...
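The binary n-gram features mentioned in this abstract can be pictured with a short sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: it slides a window of n bytes over a byte string and counts each window as a hex string, the same notation as the 4-gram "00005068" in the table caption. The function name `byte_ngrams` and the sample bytes are hypothetical.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 4) -> Counter:
    """Count every contiguous n-byte window in `data`, keyed by uppercase hex."""
    # range() is empty when the input is shorter than n, so no windows are produced.
    return Counter(data[i:i + n].hex().upper() for i in range(len(data) - n + 1))


# Hypothetical byte string for illustration; not taken from any real executable.
sample = bytes.fromhex("0000506800005068FF")
counts = byte_ngrams(sample, n=4)
print(counts["00005068"])  # number of times this 4-gram occurs in the sample
```

Keying the counts by hex string keeps the features human-readable, at the cost of more memory than keying by raw bytes.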

Similar publications

Article
Selected recent visualization applications that concern the retrieval process are presented. The usage scenarios range from small-format mobile applications to large-format displays on high-resolution screens, and from integrated workplaces for the individual user to the use of interactive ob...

Citations

... Finally, a rule-based classification algorithm was used for malware detection. Assembly-based instructions were used by Masud et al. [4], which were then transformed to bytecodes of 4-gram. Furthermore, five unique static features were generated using feature engineering techniques. ...
Preprint
Deep neural networks can be trained in various domains using increasingly large batch sizes without compromising the efficiency. However, this massive data parallelism may differ from domain to domain. It is computationally challenging to train large deep neural networks on large datasets. In response, there has been a surge of interest in utilizing large batch size values during the optimization process, as it enables faster training of these networks, thereby facilitating distributed processing. However, this approach also presents a well-known problem called the "generalization gap," which can result in a degraded performance across multiple datasets. Currently, there is limited understanding of how to determine the optimal batch size. To address this issue, we propose an adaptive tuning algorithm that dynamically adjusts the batch size. Our algorithm consists of four stages: gradient warm-up, loss derivation, calculation of weighted loss using historical batch size data, and batch size updating. We demonstrate the superior performance of our algorithm compared with the traditional constant-batch size approach by comparing it with multiple system-call datasets of varying sizes.
... The method showed a small improvement in classification accuracy over existing methods. Mohammad et al. [2] developed a multi-level method that uses binary n-gram and statistical methods to extract important features. This method addresses the limitations of existing methods by extracting more features and by being more efficient. ...
Conference Paper
Feature extraction is the process of transforming raw data into features that are more relevant for machine learning algorithms. The goal of feature extraction is to find a set of features that can be used to accurately predict the target variable. The specific features that are extracted will depend on the specific application. For example, features that are extracted for the purpose of diagnosing arrhythmias will be different from the features that are extracted for the purpose of assessing myocardial infarction. A generalized new algorithm for feature extraction could be helpful for all complex feature extraction data sets. In this paper, we propose a random selection process to generate the required number of new features with the help of existing specific features of the electrocardiogram (ECG) signal. We have named this novel feature extraction method the Random Feature Explorer (RFE). The proposed method was tested and evaluated using PhysioNet's MIT-BIH datasets. The results indicate that the suggested method achieved an accuracy of 99.79% in arrhythmia classification. We have made the source code for our proposed method available on GitHub for open access and reproducibility. The code can be accessed at https://bit.ly/3NnrH4A
... Static analysis is widely used to detect malicious software [36], [52]. Additionally, using off-the-shelf machine learning techniques led to exceptional success for academic and industry-based security researchers [5], [6], [28]- [30], [33], [35], [38], [41]. Investigation of static features to detect ransomware has been well studied for both mobile devices [4], [12], [14], [18], [31], [34], [56] (especially for Android OS) and Windows-based platforms [43], [51], [63], [64]. ...
Conference Paper
The everlasting fight between security researchers and ransomware authors, including cyber criminals who leverage ransomware to cripple organizations worldwide, has continued to evolve as novel techniques are used to evade ransomware detection. The victim not only endures paramount financial loss from business downtime for several days and/or paying ransom to regain control of their environment but also becomes at risk of being exposed to the stolen digital assets out on the Internet. To tackle these threats against ransomware, our research project aims to identify (1) structural similarities among 2,436 cryptographic Windows ransomware samples per calendar year between 2017 and 2021 and (2) structural dissimilarities against 3,014 benign applications using machine learning classifiers. We base our analysis on PE metadata for similarity analysis and binary classification tasks. With the Cosine Index, we capture 71% – 87.80% and 66% – 82.30% of similarities based on imports and function names feature spaces, respectively. On the other hand, after designing four experimental settings, Random Forest outperforms other applied classifiers by achieving 91.75%, 91.99%, 90.47%, and 91.05% at best for accuracy, precision, recall, and F1 scores, respectively, for ransomware detection.
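As a rough illustration of the similarity measure named in this abstract, the sketch below computes cosine similarity between two sets of imported DLL names, treating each set as a binary indicator vector. This is an assumption about how such a comparison could be set up, not the paper's exact formulation; the function name and the import lists are hypothetical.

```python
import math


def cosine_similarity(a: set, b: set) -> float:
    """Cosine similarity of two sets viewed as binary indicator vectors."""
    if not a or not b:
        return 0.0  # convention chosen here for empty inputs
    # For 0/1 vectors the dot product is the overlap size and each norm is sqrt(set size).
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical import lists for two samples.
imports_x = {"kernel32.dll", "advapi32.dll", "user32.dll", "crypt32.dll"}
imports_y = {"kernel32.dll", "advapi32.dll", "ws2_32.dll", "crypt32.dll"}
similarity = cosine_similarity(imports_x, imports_y)
print(round(similarity, 2))
```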
... Then, the occurrences of N-grams are counted to convert the text into the occurrence number (Masud et al., 2008). The method based on character-level N-grams is proposed since the sequence of characters of a corpus in the artificial language is more informative than the sequence of words, as shown in Appendix B. This is because the textual data representing the mixture proportions, particle size distribution, and chemical composition have both digits and letters, as indicated in Section 2.3. ...
Article
Existing machine learning-based approaches to investigate and design concrete mainly use the mixture design variables to predict concrete properties and do not consider the physicochemical properties of ingredients such as the particle size distribution and chemical composition of various binders and aggregates. This paper presents an approach to discover the intrinsic relationships between the physicochemical properties of the ingredients and mechanical properties of concrete. Specifically, this research creates an artificial language to represent concrete mixtures and the physicochemical information of their ingredients, develops a feature extraction method based on character-level N-grams, and proposes a method to configure deep learning models automatically. The proposed approach has been implemented to predict the compressive strength of complex concrete mixtures, assess the importance of variables, and discover chemical reactions, showing high accuracy and high generalizability. This research advances the capabilities of understanding the underlying reactions for complex concrete mixtures and designing low-carbon cost-effective concrete.
... world applications, the most popular approach to obtain multi-views unsupervised learning is through centralized algorithms. This includes approaches like feature concatenation [65,95,131,179,180,194,284], feature hashing [22,125], Multiple Kernel Learning (MKL) [6], feature composition [84] and metric composition [86,137,138,206]. ...
Thesis
Historically, malware (MW) analysis has heavily resorted to human savvy for manual signature creation to detect and classify MW. This procedure is very costly and time consuming, thus unable to cope with modern cyber threat scenario. The solution is to widely automate MW analysis. Toward this goal, MW classification allows optimizing the handling of large MW corpora by identifying resemblances across similar instances. Consequently, MW classification figures as a key activity related to MW analysis, which is paramount in the operation of computer security as a whole. This thesis addresses the problem of MW classification taking an approach in which human intervention is spared as much as possible. Furthermore, we steer clear of subjectivity inherent to human analysis by designing MW classification solely on data directly extracted from MW analysis, thus taking a data-driven approach. Our objective is to improve the automation of malware analysis and to combine it with machine learning methods that are able to autonomously spot and reveal unwitting commonalities within data. We phased our work in three stages. Initially we focused on improving MW analysis and its automation, studying new ways of leveraging symbolic execution in MW analysis and developing a distributed framework to scale up our computational power. Then we concentrated on the representation of MW behavior, with painstaking attention to its accuracy and robustness. Finally, we fixed attention on MW clustering, devising a methodology that has no restriction in the combination of syntactical and behavioral features and remains scalable in practice. As for our main contributions, we revamp the use of symbolic execution for MW analysis with special attention to the optimal use of SMT solver tactics and hyperparameter settings; we conceive a new evaluation paradigm for MW analysis systems; we formulate a compact graph representation of behavior, along with a corresponding function for pairwise similarity computation, which is accurate and robust; and we elaborate a new MW clustering strategy based on ensemble clustering that is flexible with respect to the combination of syntactical and behavioral features.
... Schultz et al. used n-grams, printable strings, and DLL imports with machine learning techniques for malware detection [46]. Masud et al. used byte n-grams, assembly instructions, and DLL function calls [20]. Ye et al. used interpretable strings such as API execution calls and important semantic strings [54]. ...
Article
Executable files still remain popular to compromise the endpoint computers. These executable files are often obfuscated to avoid anti-virus programs. To examine all suspicious files from the Internet, dynamic analysis requires too much time. Therefore, a fast filtering method is required. With the recent development of natural language processing (NLP) techniques, printable strings became more effective to detect malware. The combination of the printable strings and NLP techniques can be used as a filtering method. In this paper, we apply NLP techniques to malware detection. This paper reveals that printable strings with NLP techniques are effective for detecting malware in a practical environment. Our dataset consists of more than 500,000 samples obtained from multiple sources. Our experimental results demonstrate that our method is effective to not only subspecies of the existing malware, but also new malware. Our method is effective against packed malware and anti-debugging techniques.
... In the context of the proposed approach, the n-gram technique involves the extraction of contiguous sequences of n items from a given sequence of machine instruction. In [26], four-gram was used to combine sequences of six initial bytes into a sequence that could store a very large amount of data. The proposed method also extracts very large sets of unique features from each dataset; however, since the number of datasets is large, only two-gram is considered. ...
... As shown in Fig. 1, the disassembler disassembles the converted AOT file into the readable text files segment, address, raw bytes, and instruction [24,26,27]. Fig. 2 shows each part of a disassembled output file. ...
... All disassembly information can be used as features for malware detection [24]. Although most systems [24,26,27] use hexadecimal, raw bytes, or the entirety of the disassembled data, we choose only instructions and segments as the feature sets for extraction, as others were considerably more expensive in terms of the file size and optimization overhead as shown in Fig. 3. ...
Article
The Android operating system has become a leading smartphone platform for mobile and other smart devices, which in turn has led to a diversity of malware applications. The amount of research on Android malware detection has increased significantly in recent years and many detection systems have been proposed. Despite these efforts, however, most systems can be thwarted by sophisticated Android malware adopting obfuscation or native code to avoid discovery by anti-virus tools. In this paper, we propose a new static analysis technique to address the problems of obfuscating and native malware applications. The proposed system provides a unified technique for extracting features from applications and native libraries using a selection algorithm that can extract a small set of unique and effective features for detecting malware applications rapidly and with a high detection rate. Evaluation using large Android malware detection datasets obtained from various sources confirmed that the proposed approach achieves very promising results in terms of improved accuracy, low false positive rate, and high detection rate.
... Most researchers extract asm features from disassembled malware for analysis, including opcode count [28,43,47], string n-grams [30], assembly instruction sequences, metadata of PE files, imported Dlls and function call graph [21,26]. David et al. [14]. ...
Article
Malware threats and privacy protection are two of the biggest challenges in the cloud computing environment. Many studies have focused on the accuracy of malware detection, but they did not sufficiently take into account the privacy protection of cloud tenants. This paper proposes a novel malware detection model, based on semi-supervised transfer learning (SSTL) for the cloud, that consists of detection, prediction, and transfer components. To protect the privacy of tenants in the public cloud, a byte classifier based on a recurrent neural network (RNN) for its detection component is designed to detect malware. However, because it is limited by the scarcity of training samples, the accuracy of the byte classifier is only 94.72% after supervised learning. An asm classifier is proposed for the prediction component, and it achieves 99.69% accuracy. The transfer component invokes the prediction component to classify an unlabeled dataset, and it combines the predicted labels and byte features of the unlabeled dataset into a new training dataset. Through the advantages of semi-supervised learning, the new dataset is transferred to the byte classifier for training again. The test results on the Kaggle malware datasets show that semi-supervised transfer learning improved the accuracy of the detection component from 94.72% to 96.9%. The improved malware detection method can not only do a better job of resolving the privacy concerns of tenants in the public cloud than other similar methods, but it can also detect malware more accurately.
... In the security field of malware protection, many computer scientists have found opportunities to combine these cutting-edge technologies. From machine learning applications applied on malware detection [1,2], to deep learning applications used to implement anomaly detection [6,7], to the recent adversarial machine learning [8,9,10], malware detection has found the efficient access point with these automatic learning technologies. But there are still very urgent questions in the analysis process. ...
... Masud [6] created a mixed feature set with assembly instruction sequences, binary n-grams, and DLL call information. For the extracted feature set, they used the information gain method to implement feature selection first, and then use the ensemble SVM and enhanced DT to classify. ...
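The information gain criterion mentioned in this excerpt can be sketched as follows. This is a generic textbook formulation (label entropy minus the entropy remaining after splitting on a feature), written under the assumption of discrete features and labels; it is not taken from the cited work, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data: 1 = malicious, 0 = benign; each feature marks presence of a hypothetical n-gram.
labels = [1, 1, 1, 0, 0, 0]
predictive = [1, 1, 1, 0, 0, 0]  # present exactly in the malicious samples
noisy = [1, 0, 1, 0, 1, 0]       # only weakly related to the label

gain_predictive = information_gain(predictive, labels)
gain_noisy = information_gain(noisy, labels)
print(gain_predictive > gain_noisy)  # the predictive feature ranks higher
```

Feature selection would then keep the features with the highest gain; how many to keep is a tuning choice that this excerpt does not specify.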
... In Goodfellow [6], Goodfellow mentioned that the prospect of deep learning is to find rich hierarchical models of representative probabilities. The most compelling success today involves discriminant models, usually those that map high-dimensional, rich sensory inputs to category labels. ...
Preprint
The poisoning attack, as one of the adversarial machine learning attacks, has become a severe threat to many artificial intelligence systems which use machine learning (ML) and deep learning (DL). Those systems can be re-trained using data collected during operations. With such a poisoning attack, the attackers can possibly evade those systems by disrupting the retraining. Furthermore, it has influenced the domain of cyber security [1, 2], which threatens many next-generation intrusion and anomaly detection systems. Especially, adversarial machine learning combined with malware has been developed [3], which increases the possibility for malware to evade detection systems. Such attacks targeting DL models are difficult to defend against. Even though there are some academic outcomes for that [3, 1, 2], there are not enough strategic tools in industry yet because of wide attack surfaces [4]. Moreover, there is not enough research about malware detection under poisoning attacks. We took malware as our experiment object to learn the influence of poisoning attacks and also to explore strategic defence technology under 2D data. In order to prevent such an attack efficiently, we explored a combined approach with a generative neural network (GAN) and ensemble training. Moreover, we simulated the whole-stage process from data collection, feature extraction and detection to final defence. During our experiments, we found that some components, including the combination of merged classifiers, different hidden layers, number of units and test size, all influence the prevention performance against poisoning attacks. With some specific configurations, including automatic ratios for merged classifiers, optimized numbers of hidden layers, number of units and test size, the detection model for our generated ransomware dataset can achieve steady accuracy above 95% under adversarial one dimensional (2D) input. In order to verify the robustness, we tested our model with another open source malware dataset [5]. The detection accuracy on the malware dataset is near 100%, which is even better than on the ransomware one. The tests we conducted here are all grey-box attacks, assuming some pre-known knowledge for adversaries excluding some core configuration and parameter choices.
... The tool used some program property features to determine if the program is malicious or not. Masud et al. [24] combined three types of features, binary n-grams, assembly instruction sequences, and dynamic link libraries (DLLs) to detect malicious executables. Then, a classifier for malware detection was built based on SVM and Boosted J48. ...
Preprint
Malware threats are a serious problem for computer security, and the ability to detect and classify malware is critical for maintaining the security level of a computer. Recently, a number of researchers are investigating techniques for classifying malware families using malware visualization, which convert the binary structure of malware into grayscale images. Although there have been many reports that applied CNN to malware visualization image classification, it has not been revealed how to pick out a model that fits a given malware dataset and achieves higher classification accuracy. We propose a strategy to select a Deep learning model that fits the malware visualization images. Our strategy uses the fine-tuning method for the pre-trained CNN model and a dataset that solves the imbalance problem. We chose the VGG19 model based on the proposed strategy to classify the Malimg dataset. Experimental results show that the classification accuracy is 99.72 %, which is higher than other previously proposed malware classification methods.