The classified X86 instructions

Source publication
Article
Code obfuscation plays a significant role in producing new obfuscated malicious programs, generally called malware variants, from previously encountered malware. However, traditional signature-based detection methods struggle to recognize the latest obfuscated malware. This paper proposes a method to identify the malwa...

Context in source publication

Context 1
... vertex in the function-call graph will be colored according to the instructions used in this function. X86 instructions are classified into 15 classes according to their functions, as shown in Table 1. A 15-bit color variable is defined to describe the color of each vertex, and its initial value is 0. Each bit corresponds to a certain class of instructions. ...
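The coloring scheme in this excerpt can be sketched as a bitmask. This is a minimal illustration, not the authors' implementation: Table 1 is not reproduced on this page, so the class indices and the mnemonic-to-class mapping below are hypothetical placeholders, and only three of the 15 classes are filled in.

```python
# The excerpt states there are 15 instruction classes, one bit per class.
NUM_CLASSES = 15

# ASSUMPTION: placeholder mapping from mnemonic to class index (bit 0..14).
# The real classes are defined in the paper's Table 1, which is not shown here.
MNEMONIC_CLASS = {
    "mov": 0, "push": 0, "pop": 0,   # assumed "data transfer" class
    "add": 1, "sub": 1, "imul": 1,   # assumed "arithmetic" class
    "jmp": 2, "call": 2, "ret": 2,   # assumed "control transfer" class
}


def color_vertex(mnemonics):
    """Return a 15-bit color: bit i is set if class i occurs in the function."""
    color = 0  # the initial value is 0, as stated in the excerpt
    for mnemonic in mnemonics:
        cls = MNEMONIC_CLASS.get(mnemonic.lower())
        if cls is not None:  # mnemonics outside the placeholder map are skipped
            color |= 1 << cls
    return color


def format_color(color):
    """Render the color as a fixed-width 15-character bit string."""
    return format(color, "0{}b".format(NUM_CLASSES))
```

For example, a function containing only `mov` and `add` would get a color with bits 0 and 1 set under this placeholder mapping, and a function with no recognized instructions keeps the initial color 0.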

Citations

... It is often used in data matching and search applications [37], but we apply it here to measure code similarity.
- Function Calls Similarity: This family measures the similarity between different code fragments based on the functions and procedures in the code fragments [39].
- Graph-based Similarity: It calculates similarity based on a graph's relations, which could represent various data structures and dependencies [40]. ...
Chapter
Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of the current state-of-the-art techniques, their strengths, and their weaknesses. To do that, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim
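As an illustration of the function-call family of measures mentioned in the excerpt for this entry, the following sketch compares two Python fragments by the sets of function names they call. It is an assumed example, not the measure implemented in the cited study: the use of Python's `ast` module, the Jaccard index, and the convention that two fragments with no calls score 1.0 are all choices made here for illustration.

```python
import ast


def called_names(source):
    """Collect the names of functions and methods called in a Python fragment."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name):         # plain call: foo(...)
                names.add(func.id)
            elif isinstance(func, ast.Attribute):  # method call: obj.foo(...)
                names.add(func.attr)
    return names


def function_call_similarity(src_a, src_b):
    """Jaccard index of the two fragments' called-function sets."""
    a, b = called_names(src_a), called_names(src_b)
    if not a and not b:
        return 1.0  # ASSUMPTION: two call-free fragments count as identical
    return len(a & b) / len(a | b)
```

Two fragments that call exactly the same functions score 1.0, and fragments with no calls in common score 0.0, regardless of how the rest of the code differs.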
... Searching program clones between x86 or ARM binaries over a large program repository is necessary when the original program written in source code is unavailable, which happens with commercial off-the-shelf (COTS), legacy programs, firmware or malware. For example, detecting malware clones is a major issue [4,18,57,73], as most malware are actually variants of a few major families active for more than five years. Another application is the identification of libraries [3,20,32,36,69,70], which is both a software engineering issue and a cybersecurity issue due to vulnerabilities inside dynamically linked libraries. ...
... Given its potential applications and challenges, the field of similarity detection has been extremely active over the last two decades, starting from the pioneering work of Dullien in 2004 [22,23] on call-graph isomorphisms and the popular Bin-Diff tool for recognizing similar binary functions among two related executables. Other approaches include for example symbolic methods [28], graph edit distances [34,44] and matching techniques [4,73]. Interestingly, the last five years have seen a strong trend toward machine learning based approaches to binary function similarity [19,52,55,74,77]. ...
... We set up a strong comprehensive evaluation framework (14 competitors and 3 baselines) to systematically compare PSS with state-of-the-art methods, covering string based methods [69,70], graph edit distance [27,34], N-grams [33], vector embedding [19,52,55,74], standard spectral methods [27] and matching algorithms [4,73]. Our experiments cover our own dataset of diverse open-source ... (Footnote 2: According to Haq and Caballero [31], since 2014, among 40 binary code similarity approaches, only 7 approaches have taken programs as input.) ...
Conference Paper
We consider the problem of program clone search, i.e. given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program -- with potential applications in terms of reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a blooming in code similarity techniques, yet most of them focus on function-level similarity and function clone search, while we are interested in program-level similarity and program clone search. Actually, our study shows that prior similarity approaches are either too slow to handle large program repositories, or not precise enough, or yet not robust against slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, PSS one-time spectral feature extraction is tailored for large repositories, making it a perfect fit for program clone search. We have compared the different approaches with extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness.
... Searching program clones between x86 or ARM binaries over a large program repository is necessary when the original program written in source code is unavailable, which happens with commercial off-the-shelf (COTS), legacy programs, firmware or malware. For example, detecting malware clones is a major issue [4,18,57,73], as most malware are actually variants of a few major families active for more than five years. Another application is the identification of libraries [3,20,32,36,69,70], which is both a software engineering issue and a cybersecurity issue due to vulnerabilities inside dynamically linked libraries. ...
... Given its potential applications and challenges, the field of similarity detection has been extremely active over the last two decades, starting from the pioneering work of Dullien in 2004 [22,23] on call-graph isomorphisms and the popular Bin-Diff tool for recognizing similar binary functions among two related executables. Other approaches include for example symbolic methods [28], graph edit distances [34,44] and matching techniques [4,73]. Interestingly, the last five years have seen a strong trend toward machine learning based approaches to binary function similarity [19,52,55,74,77]. ...
... We set up a strong comprehensive evaluation framework (14 competitors and 3 baselines) to systematically compare PSS with state-of-the-art methods, covering string based methods [69,70], graph edit distance [27,34], N-grams [33], vector embedding [19,52,55,74], standard spectral methods [27] and matching algorithms [4,73]. Our experiments cover our own dataset of diverse open-source projects along with classical Coreutils, Diffutils, Findutils, and Binutils packages along two dimensions (optimization levels and code versions) for a total of 950 programs. ...
Preprint
We consider the problem of program clone search, i.e. given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to our target program - with potential applications in terms of reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a blooming in code similarity techniques, yet most of them focus on function-level similarity while we are interested in program-level similarity. Consequently, these recent approaches are not directly suited to program clone search, being either too slow to handle large code bases, not precise enough, or not robust against slight variations introduced by compilation or source code versions. We introduce Programs Spectral Similarity (PSS), the first spectral analysis dedicated to program-level similarity. PSS reaches a sweet spot in terms of precision, speed and robustness. Especially, its one-time spectral feature extraction is tailored for large repositories of programs, making it a perfect fit for program clone search.
... Malware authors have incorporated a series of obfuscation techniques such as encryption, polymorphism, oligomorphism, metamorphism and several other mutation methods [4,17,20,32,34,39,40] in order to mutate their products and so avoid traditional byte-level signature-based detection techniques. Hence, a deeper theoretical background, built on a wider investigation of such mutation techniques, is needed in order to propose more elaborate countermeasures against malicious software. ...
Article
In this work we present a graph-based approach for behavior-based malware detection and classification utilizing Group Relation Graphs (GrG), which result from grouping disjoint vertices of System-call Dependency Graphs obtained through dynamic taint analysis of a program's execution. In this approach we use the order in which each edge appears in the GrG to capture the sequential dependencies between the System-call groups invoked during the execution of a program, proposing the so-called Group Sequence Graphs (GsG). Using the proposed approach, we investigate further valuable structural characteristics of the graphs, augmenting the GrG with additional information that increases its potential for representing mutated malware samples. We develop an integrated behavior-based malware detection and classification system that incorporates the proposed approach and uses different types of structural characteristics of GsG graphs, namely the Relational, the Quantitative and the Qualitative characteristics. We evaluate its potential for distinguishing malicious from benign samples and for indexing the malicious ones into known malware families, demonstrating it on a set of malicious samples from a wide variety of known malware families.
... However, from the set of newly discovered malware samples only a few are brand-new, while most of the specimens are variants (mutations) of already existing malware samples. A series of obfuscation techniques [5,24,27,38,43,45,49,50] have been developed by malware authors in order to avoid traditional signature-based detection techniques. Among the techniques most frequently used by malware authors are Polymorphism [43], where an endless number of new decryptors is produced using different encryption methods to encrypt the body of the malware and code obfuscation techniques [45,50] are deployed to mutate the decryptor, and Metamorphism [50], where an additional mutation module called the metamorphic engine is responsible for the malware's mutation, modifying its structure while retaining its functionality each time it is replicated. ...
Article
In this work, we present a graph-based method for the detection and classification of malicious software samples utilizing the Max-Flows exhibited through their corresponding behavioral graphs. In the proposed approach, we utilize the Max-Flows exhibited in the behavioral graphs that represent the interaction of software samples with their host environment, in order to depict the flow of information between System-call Groups. Having obtained the System-call Dependency Graphs of the samples under consideration, we construct the corresponding Group Relation Graphs and proceed with the construction of the so-called Flow Maps, another representation of Group Relation Graphs that depicts the Max-Flows among their vertices. Additionally, we provide a detailed presentation of the architecture and the core components of our proposed approach for malware detection and classification, also discussing several technical aspects of its implementation and deployment. Finally, we conduct a series of five-fold cross validation experiments in order to evaluate the potential of our proposed approach in detecting and classifying malicious samples, and discuss the experimental results.
... Software obfuscation is a technique that obscures the structure and/or behavior of software code without impacting its expected functionality such that the code is rendered hard to understand, analyze, or reverse engineer [1] [2]. Software obfuscation techniques are developed for both malicious (e.g., evading automatic static code inspection) and benign (e.g., protecting code privacy or intellectual property) purposes [3]. For example, malware authors use software obfuscation to evade detection and thwart inspection and removal. ...
Article
Software obfuscation techniques are increasingly being used to prevent attackers from exploiting security flaws and launching successful attacks. With research on software obfuscation techniques rapidly growing, many software obfuscation techniques with varying quality and strength have been proposed in the literature. However, the literature on obfuscation techniques has not yet been coherently collated and reviewed. This research paper aims to present an overview of state-of-the-art software obfuscation techniques, focusing on quality and strength. A systematic analysis and synthesis of literature published between 2010 and April 2021 has been performed to identify the common measures to quantify obfuscation and their measures, the publication venue, and the home country of the researchers. We have identified the obfuscation quality attributes, such as potency, resilience, cost, stealth, and similarity, that are the most widely used metrics to evaluate the quality of obfuscation techniques. In addition, different measures have been used to quantify these qualities, such as complexity (to measure potency), human effort (to measure resilience), efficiency (to estimate cost), and multiclass performance metrics, distance measures, and matching method (to quantify similarity). These measures were then categorized into sub-measures. The literature lacks research in the following two areas: empirical research using a case study strategy, i.e., real-world datasets, and measurements of obfuscation stealth. Researchers did not address stealth as clearly as they addressed potency, cost, and similarity.
... Xu et al. [16] proposed a malicious code detection method based on function call graphs. This method first extracted function call graphs from mobile applications. ...
Article
As mobile phones are widely used in social network communication, they attract numerous malicious attacks, which seriously threaten users’ personal privacy and data security. To improve the resilience to attack technologies, structural information analysis has been widely applied in mobile malware detection. However, the rapid evolution of mobile applications has brought an impressive growth of their internal structure in scale and attack technologies. This makes the timely analysis of structural information and malicious feature generation a heavy burden. In this paper, we propose a new Android malware identification approach based on malicious subgraph mining to improve the detection performance of large-scale graph structure analysis. Firstly, function call graphs (FCGs), sensitive permissions, and application programming interfaces (APIs) are generated from the decompiled files of malware. Secondly, two kinds of malicious subgraphs are generated from malware’s decompiled files and put into the feature set. Finally, test applications’ safety can be automatically identified and classified into malware families by matching their FCGs with malicious structural features. To evaluate our approach, a dataset of 11,520 malware and benign applications is established. Experimental results indicate that our approach has better performance than three previous works and Androguard.
... However, these strategies have some potential disadvantages. Obfuscation can be used to evade signature-based detection [26], while anomaly-based detection is costly and often yields an unacceptably high false positive rate [15]. Malware detection based on machine learning models may overcome these weaknesses. ...
Preprint
Research in the field of malware classification often relies on machine learning models that are trained on high-level features, such as opcodes, function calls, and control flow graphs. Extracting such features is costly, since disassembly or code execution is generally required. In this paper, we conduct experiments to train and evaluate machine learning models for malware classification, based on features that can be obtained without disassembly or execution of code. Specifically, we visualize malware samples as images and employ image analysis techniques. In this context, we focus on two machine learning models, namely, Convolutional Neural Networks (CNN) and Extreme Learning Machines (ELM). Surprisingly, we find that ELMs can achieve accuracies on par with CNNs, yet ELM training requires less than 2% of the time needed to train a comparable CNN.
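The idea of viewing a sample as an image without disassembling or executing it can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's preprocessing pipeline: the row width of 16, the zero padding of the last row, and the function name `bytes_to_image` are all choices made here for the example.

```python
def bytes_to_image(data, width=16):
    """Reshape raw bytes into rows of `width` grayscale intensities (0..255)."""
    if width <= 0:
        raise ValueError("width must be positive")
    # ASSUMPTION: the final partial row is padded with zero bytes.
    padded = data + bytes(-len(data) % width)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]
```

Each byte of the file becomes one pixel, so the resulting list of rows can be passed to any image model that accepts a 2D intensity array.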
... Machine learning techniques not only detect known malware but also act as a knowledge base for the detection of new malware and malware variants. The different machine learning techniques for the detection of malware are Association Rule [23], Naive Bayes [26], Decision Tree [28], Data Mining [24], Neural Networks [26] and Hidden Markov Models [12]. ...