Fig 13 - uploaded by Yulei Sui
Content may be subject to copyright.
C++ code and its corresponding LLVM IR.

Source publication
Article
Full-text available
We present Supa, a value-flow-based, demand-driven, flow- and context-sensitive pointer analysis with strong updates for C and C++ programs. Supa enables computing points-to information via value-flow refinement in environments with small time and memory budgets. We formulate Supa by solving a graph-reachability problem on an inter-procedural...

Similar publications

Preprint
Full-text available
Distance Metric Learning (DML) has drawn much attention over the last two decades. A number of previous works have shown that it performs well in measuring the similarities of individuals given a set of correctly labeled pairwise data by domain experts. These important and precisely-labeled pairwise data are often highly sensitive in real world (e....
Conference Paper
Full-text available
Graph neural networks (GNNs) have been successfully used to analyze non-Euclidean network data. Recently, a number of works have emerged to investigate the robustness of GNNs by adding adversarial noise into the graph topology, where gradient-based attacks are widely studied due to their inherent efficiency and high effectiveness. However, the grad...
Preprint
Full-text available
Humans excel in solving complex reasoning tasks through a mental process of moving from one idea to a related one. Inspired by this, we propose Subgoal Search (kSubS) method. Its key component is a learned subgoal generator that produces a diversity of subgoals that are both achievable and closer to the solution. Using subgoals reduces the search s...
Preprint
Full-text available
Directed graphical models (DGMs) are a class of probabilistic models that are widely used for predictive analysis in sensitive domains, such as medical diagnostics. In this paper we present an algorithm for differentially private learning of the parameters of a DGM with a publicly known graph structure over fully observed data. Our solution optimiz...
Preprint
Full-text available
In most cases deep learning architectures are trained disregarding the amount of operations and energy consumption. However, some applications, like embedded systems, can be resource-constrained during inference. A popular approach to reduce the size of a deep learning architecture consists in distilling knowledge from a bigger network (teacher) to...

Citations

... To avoid dependency conflicts between third-party libraries, developers can leverage various tools, such as PyEGo [64], smartPip [65], and Watchman [66], which can assist in automatically managing and resolving version conflicts between different third-party libraries. • Implication #6: To improve the framework version compatibility of DL projects, developers should be mindful of API usage constraints or use static value-flow analysis [67] to reason about the constraints. For example, when using the view() operation to manipulate a tensor in PyTorch projects, it is suggested to invoke the contiguous() function beforehand to ensure that the data is stored contiguously in memory. ...
Conference Paper
Full-text available
Deep learning (DL) is becoming increasingly important and widely used in our society. DL projects are mainly built upon DL frameworks, which frequently evolve due to the introduction of new features or bug fixing. Consequently, compatibility issues are commonly seen in DL projects. The compatible framework versions may differ across DL projects, i.e., for a specific framework version, one project runs normally while the other crashes, even if the client code uses the same framework API. Existing studies mainly focus on analyzing the API evolution of Python libraries and the related compatibility issues. However, the difference in framework version compatibility (DFVC) among DL projects has rarely been systematically studied. In this paper, we conduct an empirical study on 90 PyTorch and 50 TensorFlow projects collected from GitHub. By upgrading and downgrading the framework versions, we obtain compatible versions for each project and further investigate the root causes of the different compatible framework versions across projects. We summarize seven root causes: Python version, absence of using the same breaking API, import path, parameter, third-party library, resource, and API usage constraint. We further present six implications based on our empirical findings. Our study can facilitate DL practitioners to gain a better understanding of the DFVC among DL projects.
... - In building value-flow graphs, Falcon outperforms Svf [85], Sfs [36], and Dsa [47], achieving on average 17×, 25×, and 4.4× speedups, respectively. - Compared with Supa [84,86], the state-of-the-art demand-driven flow- and context-sensitive pointer analysis for C/C++, Falcon is 54× faster in answering thin slicing queries, and it improves the precision by 1.6×. - In comparison with Cred [96], a state-of-the-art path-sensitive value-flow analysis for bug hunting, Falcon is on average 6× faster, and finds more real bugs (21 vs. 12) with a lower false-positive rate (25% vs. 47.8%). ...
... A recent analysis [96] has leveraged the idea of sparsity to refine flow-insensitive results into a path-sensitive one on demand. It first constructs flow-insensitive def-use chains with a pre-analysis, which then enables the primary path-sensitive analysis to be performed sparsely [84,86,96]. For instance, as shown in Fig. 1(b), the two edges between the store and the load state that the value of one pointer can flow to the other via the memory objects o1 or o2, implying that the load may be data-dependent on the store. ...
... The pre-computed def-use chains enable the primary path-sensitive analysis to be performed sparsely [84,86,96]. ...
Preprint
Full-text available
This paper presents a scalable path- and context-sensitive data-dependence analysis. The key is to address the aliasing-path-explosion problem via a sparse, demand-driven, and fused approach that piggybacks the computation of pointer information with the resolution of data dependence. Specifically, our approach decomposes the computational efforts of disjunctive reasoning into 1) a context- and semi-path-sensitive analysis that concisely summarizes data dependence as the symbolic and storeless value-flow graphs, and 2) a demand-driven phase that resolves transitive data dependence over the graphs. We have applied the approach to two clients, namely thin slicing and value flow analysis. Using a suite of 16 programs ranging from 13 KLoC to 8 MLoC, we compare our techniques against a diverse group of state-of-the-art analyses, illustrating significant precision and scalability advantages of our approach.
... For example, SVF was developed to perform scalable and precise interprocedural static value-flow analysis for C programs (Sui and Xue 2016). SUPA, proposed by Sui and Xue (2018), focuses on performing strong updates on demand, flow- and context-sensitively, for analyzing C and C++ programs. There are also tools for analyzing Java and Python programs (Feng et al. 2018; Gharibi et al. 2018). ...
Article
Full-text available
The prediction of bug types provides useful insights into the software maintenance process. It can improve the efficiency of software testing and help developers adopt corresponding strategies to fix bugs before releasing software projects. Typically, the prediction tasks are performed through machine learning classifiers, which rely heavily on labeled data. However, for a software project that has insufficient labeled data, it is difficult to train the classification model for predicting bug types. Although labeled data of other projects can be used as training data, the results of the cross-project prediction are often poor. To solve this problem, this paper proposes a cross-project bug type prediction framework based on transfer learning. Transfer learning breaks the assumption of traditional machine learning methods that the training set and the test set should follow the same distribution. Our experiments show that the results of cross-project bug type prediction have significant improvement by adopting transfer learning. In addition, we have studied the factors that influence the prediction results, including different pairs of source and target projects, and the number of bug reports in the source project.
... Flow-sensitive analysis [7]-[10] considers the order in which statements are executed. Traditional data-flow analysis is flow-sensitive, whereas flow-insensitive analysis appears mainly in points-to analysis [11], [12]. ...
Article
Full-text available
With the rapid growth of the Internet-of-Things (IoT), security issues for the IoT are becoming increasingly serious. Memory leaks are a common and harmful software defect for IoT programs running on resource-limited devices. Static analysis is an effective method for memory leak detection; however, because the existing methods cannot fully describe the memory state of IoT programs at run time, false positives and false negatives frequently occur. To improve the precision of memory leak detection, we propose an abstract memory model SeqMM to describe sequential storage structures. SeqMM differs from other abstract memory models in its ability to handle both points-to analysis and numerical analysis of pointers, which contributes to eliminating false positives in defect detection. In addition, based on the analysis of the sequential storage structure, we introduce the analysis of its operations in C programs, including transfer operations and predicate operations. Moreover, we present a memory leak detection algorithm by determining the state of the program points related to allocated memory blocks. The experimental results of five real projects indicate that the false positive rates of DTSC_SeqMM, Klocwork12 and DTSC_RSTVL are 29.0%, 15.0% and 40.6% respectively, and the corresponding false negative rates are 0%, 22.7% and 13.6%.
... Pointer Analysis. Substantial progress has been made for whole-program [23,33,48] and demand-driven [20,47,51] pointer analyses, with flow-sensitivity [19,31], call-site-sensitivity [40,61], object-sensitivity [37,55] and type-sensitivity [25,45]. These recent advances in both precision and scalability have resulted in their widespread adoption in detecting memory bugs [2,17], such as memory leaks [9,52], null dereferences [34,36], uninitialized variables [35,60], buffer overflows [10,30], and typestate verification [12,16]. ...
Chapter
Full-text available
We address the problem of verifying the temporal safety of heap memory at each pointer dereference. Our whole-program analysis approach is undertaken from the perspective of pointer analysis, allowing us to leverage the advantages of and advances in pointer analysis to improve precision and scalability. A dereference ω, say, via pointer q is unsafe iff there exists a deallocation ψ, say, via pointer p such that on a control-flow path ρ, p aliases with q (with both pointing to an object o representing an allocation) and ψ reaches ω on ρ via control flow. Applying directly any existing pointer analysis, which is typically solved separately with an associated control-flow reachability analysis, will render such verification highly imprecise, since ∃ does not distribute over ∧. For precision, we solve the aliasing and reachability conditions conjointly, with a control-flow path ρ containing an allocation o, a deallocation ψ and a dereference ω abstracted by a tuple of three contexts. For scalability, a demand-driven full context-sensitive (modulo recursion) pointer analysis, which operates on pre-computed def-use chains with adaptive context-sensitivity, is used to infer the condition, without losing soundness or precision. Our evaluation shows that our approach can successfully verify the safety of 81.3% (or 93,141/114,508) of all the dereferences in a set of ten C programs totalling 1,166 KLOC.
... and memory-related bugs (e.g., null pointer dereference: bug ID-14030, memory leak: bug ID-13518), it is useful to apply some static code analysis tools for detecting these types of bugs [27]-[29]. ...
Article
Regression bugs are a type of bug that causes a feature that previously worked correctly to stop working after a certain software commit. This paper presents a systematic study of regression bug chains, an important but unexplored phenomenon of regression bugs. Our paper is based on the observation that a commit c1, which fixes a regression bug b1, may accidentally introduce another regression bug b2. Likewise, commit c2 repairing b2 may cause another regression bug b3, resulting in a bug chain, i.e., b1 → c1 → b2 → c2 → b3. We have conducted a large-scale study by collecting 1579 regression bugs and 2630 commits from 57 Linux versions (from 2.6.12 to 4.9). The relationships between regression bugs and commits are modeled as a directed bipartite network. Our major contributions and findings are fourfold: 1) a novel concept of regression bug chains and their formulation; 2) compared to an isolated regression bug, a bug on a regression bug chain is much more difficult to repair, costing 2.4× more fixing time, involving 1.3× more developers and 2.8× more comments; 3) 85.8% of bugs on the chains in Linux reside in Drivers, ACPI, Platform Specific/Hardware, and Power Management; and 4) 83% of the chains affect only a single Linux subsystem, while 68% of the chains propagate across Linux versions.
... While there exist many tools to construct call graphs for several programming languages, such as C, C++, and Java [2,17,40], the methods and tools that address call graph construction for Python are scarce and limited in functionality, due to the dynamic nature of Python. Call graph construction can be either static, performed at compile time to approximate all system functionalities, or dynamic, performed at runtime to represent a single run of the system based on its input. ...
Conference Paper
Full-text available
Program comprehension is an imperative and indispensable prerequisite for several software tasks, including testing, maintenance, and evolution. In practice, understanding a software system requires investigating the high-level system functionality and mapping it to its low-level implementation, i.e., source code. The implementation of a software system can be captured using a call graph. A call graph represents the system's functions and their interactions at a single level of granularity. While call graphs can facilitate understanding the inner system functionality, developers are still required to manually map the high-level system functionality to its call graph. This manual mapping process is expensive and time-consuming, and creates a cognitive gap between the system's high-level functionality and its implementation. In this paper, we present an innovative approach that can automatically (1) construct and visualize the static call graph for a system written in Python, (2) cluster the execution paths of the call graph into hierarchical abstractions, and (3) label the clusters according to their major functional behaviors. The goal is to bridge the cognitive gap between the high-level system functionality and its call graph, which can further facilitate system comprehension. To validate our approach, we conducted four case studies including code2graph, Detectron, Flask, and Keras. The results demonstrated that our approach can feasibly construct call graphs and hierarchically cluster them into abstraction levels with proper labels.
Article
This paper presents a scalable path- and context-sensitive data dependence analysis. The key is to address the aliasing-path-explosion problem when enforcing a path-sensitive memory model. Specifically, our approach decomposes the computational efforts of disjunctive reasoning into 1) a context- and semi-path-sensitive analysis that concisely summarizes data dependence as the symbolic and storeless value-flow graphs, and 2) a demand-driven phase that resolves transitive data dependence over the graphs, piggybacking the computation of fully path-sensitive pointer information with the resolution of data dependence of interest. We have applied the approach to two clients, namely thin slicing and value-flow bug finding. Using a suite of 16 C/C++ programs ranging from 13 KLoC to 8 MLoC, we compare our techniques against a diverse group of state-of-the-art analyses, illustrating the significant precision and scalability advantages of our approach.
Article
Value-flow analysis is a fundamental technique in program analysis, benefiting various clients, such as memory corruption detection and taint analysis. However, existing efforts suffer from a low potential speedup that leads to a deficiency in scalability. In this work, we present a parallel algorithm, Octopus, to collect path conditions for realizable paths efficiently. Octopus builds on realizability decomposition to collect the intraprocedural path conditions of different functions simultaneously on demand and obtain realizable path conditions by concatenation, which achieves a high potential speedup in parallelization. We implement Octopus as a tool and evaluate it over 15 real-world programs. The experiment shows that Octopus significantly outperforms the state-of-the-art algorithms. Particularly, it detects NPD bugs for the project llvm with 6.3 MLoC within 6.9 minutes under the 40-thread setting. We also state and prove several theorems to demonstrate the soundness, completeness, and high potential speedup of Octopus. Our empirical and theoretical results demonstrate the great potential of Octopus in supporting various program analysis clients. The implementation has been officially deployed at Ant Group, scaling the nightly code scan for massive FinTech applications.
Article
C is a dominant programming language for implementing system and low-level embedded software. Unfortunately, the unsafe nature of its low-level control of memory often leads to memory errors. Dynamic analysis has been widely used to detect memory errors at runtime. However, existing monitoring algorithms for dynamic analysis are not yet satisfactory as they cannot deterministically and completely detect some types of errors, e.g., segment confusion errors, sub-object overflows, use-after-frees and memory leaks. We propose a new monitoring algorithm, namely Smatus, short for smart status, that improves memory safety by performing comprehensive dynamic analysis. The key innovation is to maintain at runtime a small status node for each memory object. A status node records the status value and reference count of an object, where the status value denotes the liveness and segment type of this object, and the reference count tracks the number of pointer variables pointing to this object. Smatus maintains at runtime a pointer metadata for each pointer variable, to record not only the base and bound of a pointer's referent but also the address of the referent's status node. All the pointers pointing to the same referent share the same status node in their pointer metadata. A status node is smart in the sense that it is automatically deleted when it becomes useless (indicated by its reference count reaching zero). To the best of our knowledge, Smatus represents the most comprehensive approach of its kind. We have evaluated Smatus by using a large set of programs including the NIST Software Assurance Reference Dataset, MSBench, MiBench, SPEC and stress testing benchmarks. In terms of effectiveness (detecting different types of memory errors), Smatus outperforms state-of-the-art tools, Google's AddressSanitizer, SoftBoundCETS and Valgrind, as it is capable of detecting more errors.
In terms of performance (the time and memory overheads), Smatus outperforms SoftBoundCETS and Valgrind in terms of both lower time and memory overheads incurred, and is on par with AddressSanitizer in terms of the time and memory overhead tradeoff made (with much lower memory overheads incurred).