Fig 13 - uploaded by Yulei Sui
Content may be subject to copyright.
C++ code and its corresponding LLVM IR.

Source publication
Article
Full-text available
We present Supa, a value-flow-based, demand-driven, flow- and context-sensitive pointer analysis with strong updates for C and C++ programs. Supa enables computing points-to information via value-flow refinement in environments with small time and memory budgets. We formulate Supa by solving a graph-reachability problem on an inter-procedural...

Similar publications

Preprint
Full-text available
Distance Metric Learning (DML) has drawn much attention over the last two decades. A number of previous works have shown that it performs well in measuring the similarities of individuals given a set of correctly labeled pairwise data by domain experts. These important and precisely-labeled pairwise data are often highly sensitive in real world (e....
Conference Paper
Full-text available
Graph neural networks (GNNs) have been successfully used to analyze non-Euclidean network data. Recently, a number of works have emerged to investigate the robustness of GNNs by adding adversarial noise into the graph topology, where gradient-based attacks are widely studied due to their inherent efficiency and high effectiveness. However, the grad...
Preprint
Full-text available
Humans excel in solving complex reasoning tasks through a mental process of moving from one idea to a related one. Inspired by this, we propose Subgoal Search (kSubS) method. Its key component is a learned subgoal generator that produces a diversity of subgoals that are both achievable and closer to the solution. Using subgoals reduces the search s...
Preprint
Full-text available
Directed graphical models (DGMs) are a class of probabilistic models that are widely used for predictive analysis in sensitive domains, such as medical diagnostics. In this paper we present an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. Our solution optimiz...
Preprint
Full-text available
In most cases deep learning architectures are trained disregarding the amount of operations and energy consumption. However, some applications, like embedded systems, can be resource-constrained during inference. A popular approach to reduce the size of a deep learning architecture consists in distilling knowledge from a bigger network (teacher) to...

Citations

... To avoid dependency conflicts between third-party libraries, developers can leverage various tools, such as PyEGo [64], smartPip [65], and Watchman [66], which can assist in automatically managing and resolving version conflicts between different third-party libraries. • Implication #6: To improve the framework version compatibility of DL projects, developers should be mindful of API usage constraints or use static value-flow analysis [67] to reason about the constraints. For example, when using the view() operation to manipulate a tensor in PyTorch projects, it is suggested to invoke the contiguous() function beforehand to ensure that the data is stored contiguously in memory. ...
Conference Paper
Full-text available
Deep learning (DL) is becoming increasingly important and widely used in our society. DL projects are mainly built upon DL frameworks, which frequently evolve due to the introduction of new features or bug fixing. Consequently, compatibility issues are commonly seen in DL projects. The compatible framework versions may differ across DL projects, i.e., for a specific framework version, one project runs normally while the other crashes, even if the client code uses the same framework API. Existing studies mainly focus on analyzing the API evolution of Python libraries and the related compatibility issues. However, the difference in framework version compatibility (DFVC) among DL projects has rarely been systematically studied. In this paper, we conduct an empirical study on 90 PyTorch and 50 TensorFlow projects collected from GitHub. By upgrading and downgrading the framework versions, we obtain compatible versions for each project and further investigate the root causes of the different compatible framework versions across projects. We summarize seven root causes: Python version, absence of using the same breaking API, import path, parameter, third-party library, resource, and API usage constraint. We further present six implications based on our empirical findings. Our study can facilitate DL practitioners to gain a better understanding of the DFVC among DL projects.
... - In building value-flow graphs, Falcon outperforms Svf [85], Sfs [36], and Dsa [47], achieving on average 17×, 25×, and 4.4× speedups, respectively. - Compared with Supa [84,86], the state-of-the-art demand-driven flow- and context-sensitive pointer analysis for C/C++, Falcon is 54× faster in answering thin slicing queries, and it improves the precision by 1.6×. - In comparison with Cred [96], a state-of-the-art path-sensitive value-flow analysis for bug hunting, Falcon is on average 6× faster, and finds more real bugs (21 vs. 12) with a lower false-positive rate (25% vs. 47.8%). ...
... A recent analysis [96] has leveraged the idea of sparsity to refine flow-insensitive results into a path-sensitive one on demand. It first constructs flow-insensitive def-use chains with a pre-analysis, which then enables the primary path-sensitive analysis to be performed sparsely [84,86,96]. For instance, as shown in Fig. 1(b), the two edges between the store and the load state that the value of one pointer can flow to the other via the memory objects o1 or o2, implying that the load may be data-dependent on the store. ...
... The pre-computed def-use chains enable the primary path-sensitive analysis to be performed sparsely [84,86,96]. ...
Preprint
Full-text available
This paper presents a scalable path- and context-sensitive data-dependence analysis. The key is to address the aliasing-path-explosion problem via a sparse, demand-driven, and fused approach that piggybacks the computation of pointer information with the resolution of data dependence. Specifically, our approach decomposes the computational efforts of disjunctive reasoning into 1) a context- and semi-path-sensitive analysis that concisely summarizes data dependence as the symbolic and storeless value-flow graphs, and 2) a demand-driven phase that resolves transitive data dependence over the graphs. We have applied the approach to two clients, namely thin slicing and value flow analysis. Using a suite of 16 programs ranging from 13 KLoC to 8 MLoC, we compare our techniques against a diverse group of state-of-the-art analyses, illustrating significant precision and scalability advantages of our approach.
... For example, SVF was developed to perform scalable and precise interprocedural static value-flow analysis for C programs (Sui and Xue 2016). SUPA, proposed by Sui and Xue (2018), focuses on performing strong updates on demand, flow- and context-sensitively, for analyzing C and C++ programs. There are also tools for analyzing Java and Python programs (Feng et al. 2018; Gharibi et al. 2018). ...
Article
Full-text available
The prediction of bug types provides useful insights into the software maintenance process. It can improve the efficiency of software testing and help developers adopt corresponding strategies to fix bugs before releasing software projects. Typically, the prediction tasks are performed through machine learning classifiers, which rely heavily on labeled data. However, for a software project that has insufficient labeled data, it is difficult to train the classification model for predicting bug types. Although labeled data of other projects can be used as training data, the results of the cross-project prediction are often poor. To solve this problem, this paper proposes a cross-project bug type prediction framework based on transfer learning. Transfer learning breaks the assumption of traditional machine learning methods that the training set and the test set should follow the same distribution. Our experiments show that the results of cross-project bug type prediction have significant improvement by adopting transfer learning. In addition, we have studied the factors that influence the prediction results, including different pairs of source and target projects, and the number of bug reports in the source project.
... Flow-sensitive analysis [7]-[10] considers the order in which statements are executed. Traditional data-flow analysis is flow-sensitive, whereas flow-insensitive analysis appears mainly in points-to analysis [11], [12]. ...
Article
Full-text available
With the rapid growth of the Internet-of-Things (IoT), security issues for the IoT are becoming increasingly serious. Memory leaks are a common and harmful software defect for IoT programs running on resource-limited devices. Static analysis is an effective method for memory leak detection; however, because the existing methods cannot fully describe the memory state of IoT programs at run time, false positives and false negatives frequently occur. To improve the precision of memory leak detection, we propose an abstract memory model SeqMM to describe sequential storage structures. SeqMM differs from other abstract memory models in its ability to handle both points-to analysis and numerical analysis of pointers, which contributes to eliminating false positives in defect detection. In addition, based on the analysis of the sequential storage structure, we introduce the analysis of its operations in C programs, including transfer operations and predicate operations. Moreover, we present a memory leak detection algorithm by determining the state of the program points related to allocated memory blocks. The experimental results of five real projects indicate that the false positive rates of DTSC_SeqMM, Klocwork12 and DTSC_RSTVL are 29.0%, 15.0% and 40.6% respectively, and the corresponding false negative rates are 0%, 22.7% and 13.6%.
... Pointer Analysis. Substantial progress has been made for whole-program [23,33,48] and demand-driven [20,47,51] pointer analyses, with flow-sensitivity [19,31], call-site-sensitivity [40,61], object-sensitivity [37,55] and type-sensitivity [25,45]. These recent advances in both precision and scalability have resulted in their widespread adoption in detecting memory bugs [2,17], such as memory leaks [9,52], null dereferences [34,36], uninitialized variables [35,60], buffer overflows [10,30], and typestate verification [12,16]. ...
Chapter
Full-text available
We address the problem of verifying the temporal safety of heap memory at each pointer dereference. Our whole-program analysis approach is undertaken from the perspective of pointer analysis, allowing us to leverage the advantages of and advances in pointer analysis to improve precision and scalability. A dereference ω, say, via pointer q is unsafe iff there exists a deallocation ψ, say, via pointer p such that on a control-flow path ρ, p aliases with q (with both pointing to an object o representing an allocation) and ψ reaches ω on ρ via control flow. Applying directly any existing pointer analysis, which is typically solved separately with an associated control-flow reachability analysis, will render such verification highly imprecise, since ∃ does not distribute over ∧. For precision, we solve the aliasing and reachability conditions conjointly, with a control-flow path ρ containing an allocation o, a deallocation ψ and a dereference ω abstracted by a tuple of three contexts. For scalability, a demand-driven full context-sensitive (modulo recursion) pointer analysis, which operates on pre-computed def-use chains with adaptive context-sensitivity, is used to infer the condition, without losing soundness or precision. Our evaluation shows that our approach can successfully verify the safety of 81.3% (or 93,141/114,508) of all the dereferences in a set of ten C programs totalling 1,166 KLOC.
... and memory-related bugs (e.g., null pointer dereference: bug ID-14030, memory leak: bug ID-13518), it is useful to apply some static code analysis tools for detecting these types of bugs [27]-[29]. ...
Article
Regression bugs are a type of bug that causes a feature that previously worked correctly to stop working after a certain software commit. This paper presents a systematic study of regression bug chains, an important but unexplored phenomenon of regression bugs. Our paper is based on the observation that a commit c1, which fixes a regression bug b1, may accidentally introduce another regression bug b2. Likewise, commit c2 repairing b2 may cause another regression bug b3, resulting in a bug chain, i.e., b1 → c1 → b2 → c2 → b3. We have conducted a large-scale study by collecting 1579 regression bugs and 2630 commits from 57 Linux versions (from 2.6.12 to 4.9). The relationships between regression bugs and commits are modeled as a directed bipartite network. Our major contributions and findings are fourfold: 1) a novel concept of regression bug chains and their formulation; 2) compared to an isolated regression bug, a bug on a regression bug chain is much more difficult to repair, costing 2.4× more fixing time, involving 1.3× more developers and 2.8× more comments; 3) 85.8% of bugs on the chains in Linux reside in Drivers, ACPI, Platform Specific/Hardware, and Power Management; and 4) 83% of the chains affect only a single Linux subsystem, while 68% of the chains propagate across Linux versions.
... While there exist many tools to construct call graphs for several programming languages, such as C, C++, and Java [2,17,40], the methods and tools that address call graph construction for Python are scarce and limited in functionality, due to the dynamic nature of Python. Call graph construction can be either static, performed at compile time to approximate all system functionalities, or dynamic, performed at runtime to represent a single run of the system based on its input. ...
Conference Paper
Full-text available
Program comprehension is an imperative and indispensable prerequisite for several software tasks, including testing, maintenance, and evolution. In practice, understanding a software system requires investigating the high-level system functionality and mapping it to its low-level implementation, i.e., source code. The implementation of a software system can be captured using a call graph. A call graph represents the system's functions and their interactions at a single level of granularity. While call graphs can facilitate understanding the inner system functionality, developers are still required to manually map the high-level system functionality to its call graph. This manual mapping process is expensive and time-consuming, and creates a cognitive gap between the system's high-level functionality and its implementation. In this paper, we present an innovative approach that can automatically (1) construct and visualize the static call graph for a system written in Python, (2) cluster the execution paths of the call graph into hierarchical abstractions, and (3) label the clusters according to their major functional behaviors. The goal is to bridge the cognitive gap between the high-level system functionality and its call graph, which can further facilitate system comprehension. To validate our approach, we conducted four case studies including code2graph, Detectron, Flask, and Keras. The results demonstrated that our approach can feasibly construct call graphs and hierarchically cluster them into abstraction levels with proper labels.
Article
This paper presents a scalable path- and context-sensitive data dependence analysis. The key is to address the aliasing-path-explosion problem when enforcing a path-sensitive memory model. Specifically, our approach decomposes the computational efforts of disjunctive reasoning into 1) a context- and semi-path-sensitive analysis that concisely summarizes data dependence as the symbolic and storeless value-flow graphs, and 2) a demand-driven phase that resolves transitive data dependence over the graphs, piggybacking the computation of fully path-sensitive pointer information with the resolution of data dependence of interest. We have applied the approach to two clients, namely thin slicing and value-flow bug finding. Using a suite of 16 C/C++ programs ranging from 13 KLoC to 8 MLoC, we compare our techniques against a diverse group of state-of-the-art analyses, illustrating the significant precision and scalability advantages of our approach.
Article
Value-flow analysis is a fundamental technique in program analysis, benefiting various clients, such as memory corruption detection and taint analysis. However, existing efforts suffer from a low potential speedup that leads to a deficiency in scalability. In this work, we present a parallel algorithm, Octopus, to collect path conditions for realizable paths efficiently. Octopus builds on realizability decomposition to collect the intraprocedural path conditions of different functions simultaneously on demand and obtain realizable path conditions by concatenation, which achieves a high potential speedup in parallelization. We implement Octopus as a tool and evaluate it over 15 real-world programs. The experiment shows that Octopus significantly outperforms the state-of-the-art algorithms. Particularly, it detects NPD bugs for the project llvm with 6.3 MLoC within 6.9 minutes under the 40-thread setting. We also state and prove several theorems to demonstrate the soundness, completeness, and high potential speedup of Octopus. Our empirical and theoretical results demonstrate the great potential of Octopus in supporting various program analysis clients. The implementation has been officially deployed at Ant Group, scaling the nightly code scan for massive FinTech applications.
Article
C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks. We propose a new monitoring algorithm, namely Smatus, short for smart status, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small status node for each memory object. A status node records the status value and reference count of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. Smatus maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer's referent but also the address of the referent's status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is smart in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, Smatus represents the most comprehensive approach of its kind. We have evaluated Smatus by using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress testing benchmarks. In terms of effectiveness (detecting different types of memory errors), Smatus outperforms state-of-the-art tools, Google's AddressSanitizer, SoftBoundCETS and Valgrind, as it is capable of detecting more errors.
In terms of performance (the time and memory overheads), Smatus outperforms SoftBoundCETS and Valgrind in terms of both lower time and memory overheads incurred, and is on par with AddressSanitizer in terms of the time and memory overhead tradeoff made (with much lower memory overheads incurred).