The classified X86 instructions

Source Code Clone Detection Using Unsupervised Similarity Measures

Chapter

Apr 2024

Jorge Martinez-Gil

Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of the current state-of-the-art techniques, their strengths, and their weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim

Source Code Clone Detection Using Unsupervised Similarity Measures

Preprint

Jan 2024

Jorge Martinez-Gil

Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis of unsupervised similarity measures for source code clone detection. The goal is to give an overview of the current state-of-the-art techniques, their strengths, and their weaknesses. To that end, we compile the existing unsupervised strategies and evaluate their performance on a benchmark dataset to guide software engineers in selecting appropriate methods for their specific use cases. The source code of this study is available at https://github.com/jorge-martinez-gil/codesim

Scalable Program Clone Search through Spectral Analysis

Conference Paper

Nov 2023

We consider the problem of program clone search: given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program, with potential applications in reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a surge in code similarity techniques, yet most of them focus on function-level similarity and function clone search, while we are interested in program-level similarity and program clone search. Our study shows that prior similarity approaches are either too slow to handle large program repositories, not precise enough, or not robust against slight variations introduced by compilers, source code versions or light obfuscations. We propose a novel spectral analysis method for program-level similarity and program clone search called Programs Spectral Similarity (PSS). In a nutshell, the one-time spectral feature extraction of PSS is tailored for large repositories, making it a perfect fit for program clone search. We have compared the different approaches with extensive benchmarks, showing that PSS reaches a sweet spot in terms of precision, speed and robustness.

Scalable Program Clone Search Through Spectral Analysis

Preprint

Oct 2022

We consider the problem of program clone search: given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to our target program, with potential applications in reverse engineering, program clustering, malware lineage and software theft detection. Recent years have witnessed a surge in code similarity techniques, yet most of them focus on function-level similarity while we are interested in program-level similarity. Consequently, these recent approaches are not directly suited to program clone search, being either too slow to handle large code bases, not precise enough, or not robust against slight variations introduced by compilation or source code versions. We introduce Programs Spectral Similarity (PSS), the first spectral analysis dedicated to program-level similarity. PSS reaches a sweet spot in terms of precision, speed and robustness. In particular, its one-time spectral feature extraction is tailored for large repositories of programs, making it a perfect fit for program clone search.

Behavior-based detection and classification of malicious software utilizing structural characteristics of group sequence graphs

Article

Jun 2022

In this work we present a graph-based approach for behavior-based malware detection and classification that utilizes Group Relation Graphs (GrG), which result from grouping disjoint vertices of the System-call Dependency Graphs obtained through dynamic taint analysis over the execution of a program. In this approach we use the order in which each edge appears in the GrG to capture the sequential dependencies between the System-call groups invoked during the execution of a program, proposing the so-called Group Sequence Graphs (GsG). Using the proposed approach, we investigate further valuable structural characteristics of the graphs, augmenting the GrG with information that increases its potential for representing mutated malware samples. We develop an integrated behavior-based malware detection and classification system that incorporates the proposed approach and utilizes different types of structural characteristics of GsG graphs, namely the Relational, the Quantitative and the Qualitative characteristics. We evaluate its potential for distinguishing malicious from benign samples and for indexing the malicious ones into known malware families, demonstrating this potential on a set of malicious samples from a wide variety of known malware families.

Detection and classification of malicious software utilizing Max-Flows between system-call groups

Article

Jun 2022

In this work, we present a graph-based method for the detection and classification of malicious software samples utilizing the Max-Flows exhibited through their corresponding behavioral graphs. In the proposed approach, we utilize the Max-Flows exhibited in the behavioral graphs that represent the interaction of software samples with their host environment, in order to depict the flow of information between System-call Groups. After obtaining the System-call Dependency Graphs of the samples under consideration, we construct the corresponding Group Relation Graphs and proceed with the construction of the so-called Flow Maps, another representation of Group Relation Graphs that depicts the Max-Flows among their vertices. Additionally, we provide a detailed description of the architecture and the core components of our proposed approach for malware detection and classification, and discuss several technical aspects regarding its implementation and deployment. Finally, we conduct a series of five-fold cross-validation experiments to evaluate the potential of our proposed approach in detecting and classifying malicious samples, and discuss the experimental results.

Measuring Software Obfuscation Quality—A Systematic Literature Review

Article

Jul 2021

Software obfuscation techniques are increasingly being used to prevent attackers from exploiting security flaws and launching successful attacks. With research on software obfuscation techniques rapidly growing, many software obfuscation techniques with varying quality and strength have been proposed in the literature. However, the literature on obfuscation techniques has not yet been coherently collated and reviewed. This research paper aims to present an overview of state-of-the-art software obfuscation techniques, focusing on quality and strength. A systematic analysis and synthesis of literature published between 2010 and April 2021 has been performed to identify the common measures to quantify obfuscation and their measures, the publication venue, and the home country of the researchers. We have identified the obfuscation quality attributes, such as potency, resilience, cost, stealth, and similarity, that are the most widely used metrics to evaluate the quality of obfuscation techniques. In addition, different measures have been used to quantify these qualities, such as complexity (to measure potency), human effort (to measure resilience), efficiency (to estimate cost), and multiclass performance metrics, distance measures, and matching method (to quantify similarity). These measures were then categorized into sub-measures. The literature lacks research in the following two areas: empirical research using a case study strategy, i.e., real-world datasets, and measurements of obfuscation stealth. Researchers did not address stealth as clearly as they addressed potency, cost, and similarity.

A Mobile Malware Detection Method Based on Malicious Subgraphs Mining

Article

Apr 2021

As mobile phones are widely used for social network communication, they attract numerous malicious attacks, which seriously threaten users’ personal privacy and data security. To improve resilience to attack technologies, structural information analysis has been widely applied in mobile malware detection. However, the rapid improvement of mobile applications has brought an impressive growth of their internal structure in scale and attack technologies. This makes the timely analysis of structural information and malicious feature generation a heavy burden. In this paper, we propose a new Android malware identification approach based on malicious subgraph mining to improve the detection performance of large-scale graph structure analysis. First, function call graphs (FCGs), sensitive permissions, and application programming interfaces (APIs) are generated from the decompiled files of malware. Second, two kinds of malicious subgraphs are generated from malware’s decompiled files and put into the feature set. Finally, the safety of test applications can be automatically identified, and the applications classified into malware families, by matching their FCGs with malicious structural features. To evaluate our approach, a dataset of 11,520 malware and benign applications is established. Experimental results indicate that our approach has better performance than three previous works and Androguard.

CNN vs ELM for Image-Based Malware Classification

Preprint

Mar 2021

Research in the field of malware classification often relies on machine learning models that are trained on high-level features, such as opcodes, function calls, and control flow graphs. Extracting such features is costly, since disassembly or code execution is generally required. In this paper, we conduct experiments to train and evaluate machine learning models for malware classification, based on features that can be obtained without disassembly or execution of code. Specifically, we visualize malware samples as images and employ image analysis techniques. In this context, we focus on two machine learning models, namely, Convolutional Neural Networks (CNN) and Extreme Learning Machines (ELM). Surprisingly, we find that ELMs can achieve accuracies on par with CNNs, yet ELM training requires less than 2% of the time needed to train a comparable CNN.

Analysis and Detection of Evolutionary Malware: A Review

Article

Feb 2021
