Jidong Zhai's research while affiliated with Tsinghua University and other places


Publications (137)


Graph-Centric Performance Analysis for Large-Scale Parallel Applications
  • Article

July 2024 · 4 Reads · IEEE Transactions on Parallel and Distributed Systems

[...] · Jidong Zhai

Performance analysis is essential for understanding the performance behaviors of parallel programs and detecting performance bottlenecks. However, complex interconnections across several types of performance bugs, as well as inter-process communications and data dependences, make efficient performance analysis even more difficult. Although many performance tools have been developed, accurately identifying the underlying performance bottlenecks in such complex scenarios requires specific in-depth analysis, and implementing each analytic task often demands significant human effort and analysis expertise. To alleviate the complexity of developing specific performance analytic tasks, we present a programmable performance analysis tool, called PerFlow. In PerFlow, a step-by-step performance analysis process is represented as an Analysis Flow Diagram, which is constructed from several performance analysis sub-tasks, namely passes, that can be defined by developers or provided by PerFlow's built-in analysis pass library. Furthermore, we define a Performance Abstraction Graph to describe the performance behavior of a parallel program, where the edges indicate the interactions between parallel units, so that the analytic sub-tasks become graph analysis tasks. PerFlow provides a rich set of Python APIs for developing analytic tasks. Several case studies of real-world applications with up to 700K lines of code demonstrate the effectiveness of PerFlow. The results indicate that PerFlow makes it much easier to implement specific performance analytic tasks, and that these tasks are performed automatically and efficiently to detect underlying performance bottlenecks.

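As a rough illustration of the pass-based workflow described above, the Python sketch below composes two hypothetical analysis passes over a toy performance abstraction graph. All names and data layouts here are illustrative assumptions of ours, not PerFlow's actual API.

# Hypothetical sketch of a pass-based analysis flow in the spirit of an
# Analysis Flow Diagram. Every class and function below is an
# illustrative assumption, not PerFlow's actual API.

class PerfAbstractionGraph:
    """Vertices represent code regions of parallel units; edges represent
    interactions such as communication or data dependence."""
    def __init__(self, vertices, edges):
        self.vertices = vertices  # {vid: {"time": float, "per_proc_times": [...]}}
        self.edges = edges        # [(src, dst, {"kind": "comm" or "dep"})]

def hotspot_pass(graph, top_k=5):
    """Rank vertices by aggregated time to locate hotspots."""
    ranked = sorted(graph.vertices.items(),
                    key=lambda kv: kv[1]["time"], reverse=True)
    return [vid for vid, _ in ranked[:top_k]]

def imbalance_pass(graph, hotspots, factor=1.5):
    """Flag hotspots whose per-process times diverge widely."""
    flagged = []
    for vid in hotspots:
        times = graph.vertices[vid].get("per_proc_times", [])
        if times and max(times) > factor * (sum(times) / len(times)):
            flagged.append(vid)
    return flagged

# Chaining passes mirrors a step-by-step analysis flow, e.g.:
#   graph = build_graph_from_profile(...)   # hypothetical profile loader
#   report = imbalance_pass(graph, hotspot_pass(graph))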

Figure 1. An overview of Korch.
Figure. Examples of some frequently used primitives.
Optimal Kernel Orchestration for Tensor Programs with Korch
  • Preprint
  • File available

June 2024

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and thus miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7× on V100 GPUs and up to 1.6× on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.

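To make the constrained-optimization step concrete, here is a deliberately simplified sketch of selecting kernels for a set of primitives as a binary linear program, using the off-the-shelf PuLP solver. The candidate kernels, their costs, and the single exact-cover constraint are illustrative assumptions; Korch's actual formulation is considerably richer.

# Simplified sketch of kernel orchestration as a binary linear program.
# Candidates, costs, and constraints are illustrative assumptions only.
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

# After operator fission, the program is a set of primitives; each
# candidate kernel covers a subset of them at some estimated cost.
primitives = {"p1", "p2", "p3", "p4"}
candidates = {
    "k_fused_12": ({"p1", "p2"}, 1.8),   # fusing p1 and p2
    "k_single_1": ({"p1"}, 1.2),
    "k_single_2": ({"p2"}, 1.1),
    "k_fused_34": ({"p3", "p4"}, 2.0),
    "k_single_3": ({"p3"}, 1.4),
    "k_single_4": ({"p4"}, 1.3),
}

prob = LpProblem("kernel_orchestration", LpMinimize)
x = {k: LpVariable(k, cat=LpBinary) for k in candidates}

# Objective: minimize total estimated kernel execution cost.
prob += lpSum(cost * x[k] for k, (_, cost) in candidates.items())

# Each primitive must be executed by exactly one selected kernel.
for p in primitives:
    prob += lpSum(x[k] for k, (covered, _) in candidates.items()
                  if p in covered) == 1

prob.solve()
chosen = [k for k in candidates if x[k].value() == 1]
print(chosen)  # with these costs, the fused kernels win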

G-Learned Index: Enabling Efficient Learned Index on GPU

June 2024 · 39 Reads · IEEE Transactions on Parallel and Distributed Systems

AI and GPU technologies have been widely applied to solve big data problems, and the total data volume worldwide reached about 200 zettabytes in 2022. Efficiently indexing the required content within such massive data has become a serious challenge. Recently, a promising learned index has been proposed to address this challenge: it achieves extremely high efficiency while retaining marginal space overhead. However, we notice that previous learned indexes have mainly focused on CPU architectures, ignoring the advantages of GPUs. Because traditional indexes like B-Tree, LSM, and bitmap have greatly benefited from GPU acceleration, combining a learned index with GPUs has great potential to reach tremendous speedups. In this paper, we propose a GPU-based learned index, called G-Learned Index, to significantly improve the performance of learned index structures. The primary challenges in developing G-Learned Index lie in using thousands of GPU cores (minimizing synchronization and branch divergence), designing data structures for parallel operations, and exploiting memory bandwidth (limiting memory transactions and using the multi-level memory hierarchy). To overcome these challenges, we develop a series of novel techniques, including efficient thread organization, succinct data structures, and heterogeneous memory hierarchy utilization. Compared to the state-of-the-art learned index, G-Learned Index achieves an average speedup of 174× (and 107× over its parallel version). Meanwhile, we attain 2× lower query time than the state-of-the-art GPU B-Tree. Our further exploration of range queries shows that G-Learned Index is 17× faster than a CPU multi-dimensional learned index. We have made G-Learned Index available at https://anonymous.4open.science/r/G-Learned-Index-8D89.
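To illustrate the core learned-index lookup that G-Learned Index parallelizes on GPUs, here is a minimal CPU-side Python sketch: a model predicts a key's position, and a bounded search corrects the prediction. The single linear model and error bound are illustrative assumptions; the paper's GPU-specific thread organization and succinct structures are not reproduced here.

# Minimal CPU sketch of a learned-index lookup: predict, then correct
# within a bounded error window. Illustrative assumptions only.
import bisect
import numpy as np

keys = np.sort(np.random.default_rng(0).integers(0, 1_000_000, 10_000))

# Fit one linear "segment" mapping key -> position.
positions = np.arange(len(keys))
slope, intercept = np.polyfit(keys, positions, 1)
err = int(np.max(np.abs(slope * keys + intercept - positions))) + 1

def lookup(key):
    """Predict a position, then binary-search only the error window."""
    pred = int(slope * key + intercept)
    lo = max(0, pred - err)
    hi = min(len(keys), pred + err + 1)
    i = lo + bisect.bisect_left(keys[lo:hi].tolist(), key)
    return i if i < len(keys) and keys[i] == key else None

print(lookup(int(keys[1234])))  # position of an existing key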


FreshGNN: Reducing Memory Access via Stable Historical Embeddings for Graph Neural Network Training

May 2024 · 7 Reads · 2 Citations · Proceedings of the VLDB Endowment

A key performance bottleneck when training graph neural network (GNN) models on large, real-world graphs is loading node features onto a GPU. Due to limited GPU memory, expensive data movement is necessary to store these features on alternative devices with slower access (e.g., CPU memory). Moreover, the irregularity of graph structures leads to poor data locality, which further exacerbates the problem. Consequently, existing frameworks capable of efficiently training large GNN models usually incur significant accuracy degradation because of the shortcuts they rely on. To address these limitations, we instead propose FreshGNN, a general-purpose GNN mini-batch training framework that leverages a historical cache for storing and reusing GNN node embeddings instead of re-computing them by fetching raw features at every iteration. Critical to its success, the corresponding cache policy is designed, using a combination of gradient-based and staleness criteria, to selectively screen those embeddings that are relatively stable and can be cached from those that need to be re-computed, reducing estimation errors and subsequent downstream accuracy loss. When paired with complementary system enhancements to support this selective historical cache, FreshGNN accelerates training on large graph datasets such as ogbn-papers100M and MAG240M by 3.4× to 20.5× and reduces memory access by 59%, with less than 1% impact on test accuracy.
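As a schematic of the selective caching idea, the sketch below combines a staleness bound with a gradient-magnitude test to decide which embeddings to reuse. The thresholds, data layout, and policy details are illustrative assumptions of ours rather than FreshGNN's actual implementation.

# Schematic selective historical-embedding cache: reuse only fresh,
# stable embeddings. Thresholds and structure are assumptions.
class HistoricalEmbeddingCache:
    def __init__(self, max_staleness=10, grad_threshold=1e-3):
        self.store = {}  # node_id -> (embedding, iteration_cached)
        self.max_staleness = max_staleness
        self.grad_threshold = grad_threshold

    def get(self, node_id, cur_iter):
        """Reuse a cached embedding only while it is still fresh."""
        entry = self.store.get(node_id)
        if entry and cur_iter - entry[1] <= self.max_staleness:
            return entry[0]
        return None  # caller must recompute from raw features

    def maybe_put(self, node_id, embedding, grad_norm, cur_iter):
        """Cache only stable embeddings: a small gradient suggests the
        embedding will change little over upcoming iterations."""
        if grad_norm < self.grad_threshold:
            self.store[node_id] = (embedding, cur_iter)
        else:
            self.store.pop(node_id, None)

# In a training loop (pseudocode): a cache hit skips raw-feature I/O.
#   h = cache.get(v, it)
#   if h is None:
#       h = compute_embedding(v)  # hypothetical recompute path
#   cache.maybe_put(v, h, grad_norm_of(v), it)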





Optimizing DNNs With Partially Equivalent Transformations and Automated Corrections

December 2023 · 11 Reads · 7 Citations · IEEE Transactions on Computers

Deep neural network (DNN) applications are typically represented as tensor programs. To boost the performance of DNN computations, existing works adopt fully equivalent transformations for tensor program optimization, guaranteeing equivalence on every element of a tensor. However, as a tensor contains thousands of elements, such optimization misses opportunities that allow in-equivalence on a small minority of elements. In this work, we propose PET, the first work to introduce partially equivalent transformations for optimizing tensor programs. To maintain the functional equivalence of tensor programs, PET automatically finds and corrects the in-equivalent positions by leveraging the multi-linearity of DNN computations. PET further uses a mutation manager to improve search efficiency. Evaluation results show that PET achieves up to 1.98× and 2.20× speedups on NVIDIA Tesla A100 and V100, respectively, compared with existing DNN frameworks, by introducing the new optimization opportunities of partially equivalent transformations.
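The following toy NumPy example mimics the "transform, then correct" idea on a 1-D "same" convolution: a circular convolution serves as the fast transformed form, which is equivalent everywhere except the two boundary outputs; those positions are then recomputed directly. The example is our own illustration, not PET's mechanism; PET discovers and corrects such in-equivalent positions automatically for tensor programs.

# Toy "partially equivalent transformation + correction" in NumPy.
import numpy as np

def conv_same_reference(x, w):
    """Zero-padded 'same' convolution (correlation form), length-3 w."""
    xp = np.pad(x, 1)
    return np.array([np.dot(w, xp[i:i + 3]) for i in range(len(x))])

def conv_same_via_circular(x, w):
    # Transformed form: a circular convolution (cheap via np.roll here,
    # FFTs at scale). Equivalent everywhere except the two boundaries.
    y = w[0] * np.roll(x, 1) + w[1] * x + w[2] * np.roll(x, -1)
    # Correction: recompute the known in-equivalent positions directly.
    y[0] = w[1] * x[0] + w[2] * x[1]       # left edge, zero padding
    y[-1] = w[0] * x[-2] + w[1] * x[-1]    # right edge, zero padding
    return y

x = np.arange(8.0)
w = np.array([0.25, 0.5, 0.25])
assert np.allclose(conv_same_reference(x, w), conv_same_via_circular(x, w))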


BladeDISC: Optimizing Dynamic Shape Machine Learning Workloads via Compiler Approach

November 2023 · 72 Reads · 2 Citations · Proceedings of the ACM on Management of Data

Compiler optimization plays an increasingly important role in boosting the performance of machine learning models for data processing and management. With increasingly complex data, the dynamic tensor shape phenomenon emerges for ML models. However, existing ML compilers either handle only static shape models or exhibit a series of performance problems in both operator fusion optimization and code generation under dynamic shapes. This paper tackles the main challenges of dynamic shape optimization: fusion optimization without shape values, and code generation supporting arbitrary shapes. To tackle the fundamental challenge of absent shape values, it systematically abstracts and excavates the shape information and designs a cross-level symbolic shape representation. With the insight that fusion optimization relies on tensor shape relationships between adjacent operators rather than exact shape values, it proposes a dynamic shape fusion approach based on shape information propagation. To generate code that adapts to arbitrary shapes efficiently, it proposes a combined compile-time and runtime code generation approach. Finally, it presents a complete optimization pipeline for dynamic shape models and implements an industrial-grade ML compiler, named BladeDISC. The extensive evaluation demonstrates that BladeDISC outperforms PyTorch, TorchScript, TVM, ONNX Runtime, XLA, Torch Inductor (dynamic shape), and TensorRT by up to 6.95×, 6.25×, 4.08×, 2.04×, 2.06×, 7.92×, and 4.16× (3.54×, 3.12×, 1.95×, 1.47×, 1.24×, 2.93×, and 1.46× on average) in end-to-end inference speedup on A10 and T4 GPUs, respectively. BladeDISC's source code is publicly available at https://github.com/alibaba/BladeDISC.
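The key insight, that fusion needs shape relationships rather than concrete shape values, can be sketched with a toy symbolic-shape propagation in Python. The representation below is an illustrative assumption of ours, not BladeDISC's actual IR.

# Toy symbolic shape propagation: dimensions are symbols, elementwise
# ops preserve them, and fusibility is decided by symbol equality.
def propagate(op, in_shapes):
    """Return the symbolic output shape of an op."""
    if op in ("relu", "add", "mul"):   # elementwise: shape-preserving
        return in_shapes[0]
    if op == "matmul":                 # (s0, s1) x (s1, s2) -> (s0, s2)
        (m, k1), (k2, n) = in_shapes
        assert k1 == k2, "inner dims must be the same symbol"
        return (m, n)
    raise NotImplementedError(op)

def can_fuse(shape_a, shape_b):
    """Elementwise ops with identical symbolic shapes can share one loop
    nest, so they are fusible even though the dims are unknown until
    runtime."""
    return shape_a == shape_b

x = ("s0", "s1")            # runtime-determined batch and feature dims
y = propagate("relu", [x])
z = propagate("add", [y])   # add of y with a same-shaped tensor
print(can_fuse(y, z))       # True: fusible without knowing s0 or s1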



Citations (47)


... To prevent the omission of potential butterflies, they enumerate both out-wedges and in-out-wedges, forming all possible acyclic orientations of a butterfly from the two kinds of wedges. Recently, GraphSet [68] further optimizes graph mining through equivalent set transformation, which aims to eliminate most control flow and reduce computation overhead. As the triangle and the butterfly are the two most widely studied motifs, we will next focus on reviewing the works that were explicitly developed for counting/listing the two motifs in massive graphs. ...

Reference:

Parallelization of butterfly counting on hierarchical memory
GraphSet: High Performance Graph Mining through Equivalent Set Transformations
  • Citing Conference Paper
  • November 2023

... The attention of researchers towards the use of IRT as an auxiliary tool in the analysis of the human body is increasing, particularly in the application of the musculoskeletal system [7,8] within different age ranges [9] or daily activities [10]. Moreover, the application has also extended to other fields, such as sport [11] and posture [12], but also in pathological subjects with metabolic alterations [13], breast cancer [14], rheumatic diseases, and osteoarthritis [15,16], and even in the field of musculoskeletal disorders [17]. IRT has been applied to various fields of musculoskeletal disorders, such as scoliosis [18], arthritis [19], and low back pain [20]. ...

Breast cancer pre-clinical screening using infrared thermography and artificial intelligence: a prospective, multicentre, diagnostic accuracy cohort study

International Journal of Surgery

... In the field of NLP, a common way to reduce the difficulty of reasoning is to generate answers with an intermediate meaning representation (IMR) (Gan et al. 2021;Nie et al. 2022;Paul et al. 2023). Such methods first generate IMRs of questions and then use external tools (e.g., algorithms, interpreters) to generate the answer results based on IMRs. ...

GraphQ IR: Unifying the Semantic Parsing of Graph Query Languages with One Intermediate Representation
  • Citing Conference Paper
  • January 2022

... • CompressGraph finds a 2× speedup over Ligra+ [26]. • Teseo finds frequent speedups of at least 1.5× over other graph containers [33]. However, it is likely that much of the improvement seen in these works comes from factors other than the graph container itself. ...

CompressGraph: Efficient Parallel Graph Analytics with Rule-Based Compression
  • Citing Article
  • May 2023

Proceedings of the ACM on Management of Data

... Although it is impossible to improve the compute density of a single gate operation (however, it is possible to increase the overall compute density by fusing several gate operations together [47]), one could still greatly increase the computational efficiency by vectorizing the inner matrix-vector multiplication and making better use of the cache. To achieve this, we aggregate 8 vectors of size 4 for the inner operation of Algorithm. ...

UniQ: A Unified Programming Model for Efficient Quantum Circuit Simulation
  • Citing Conference Paper
  • November 2022

... Moreover, Tensorflow-XLA [35] is a domain-specific compiler for linear algebra. FreeTensor [40] is a domain-specific language that supports irregular tensor programs. Besides optimizing each single operator for DNN inference, Rammer [30] and IOS [16] propose to parallelize independent operators in a network, and TASO [23] applies auto-generated rewriting rules to optimize DNN in graph level. ...

FreeTensor: a free-form DSL with holistic optimizations for irregular tensor programs
  • Citing Conference Paper
  • June 2022

... In order to further reduce write amplification, they propose to incorporate delta encoding within a background re-compression process. Zhang et al. develop a storage engine called CompressDB [45]. Since it is integrated directly into file systems, it is able to support various database systems, one of which is LevelDB. ...

CompressDB: Enabling Efficient Compressed Data Direct Processing for Various Databases
  • Citing Conference Paper
  • June 2022

... Moreover, [45] presents federated learning as a way to train models across distributed clients in edge environments, thus enhancing collaborative learning whilst preserving data privacy, whereas [46] describes an efficient query processing engine for edge devices called FineQuery, which improves both latency and bandwidth. In this paper, a toroidal topology based on a dynamic set of k-ary grids is presented, building on the one described in [47]. ...

Exploring Query Processing on CPU-GPU Integrated Edge Device
  • Citing Article
  • December 2022

IEEE Transactions on Parallel and Distributed Systems

... While various robotics frameworks exist, few prioritize user experience and accessibility. Relevant export-oriented robotics frameworks are TalKRoBots [6], ROS [7], ROS 2 [8], E2M [9], Zoro [10], PDRA [11], YARP [12], and the RT-Middleware [13]. In contrast, recent developments in IoT have introduced user-friendly tools like Node-RED [14], designed to simplify IoT application integration and democratize technology for a broader audience. ...

Zoro: A robotic middleware combining high performance and high reliability
  • Citing Article
  • August 2022

Journal of Parallel and Distributed Computing

... Recently, the new generation Sunway Supercomputer that consists of numerous SW26010pro processors has shown great potential in supporting AI-based workloads [10][11][12][13]. However, it is non-trivial to apply the FlashAttention algorithm on the new generation Sunway Supercomputer. ...

BaGuaLu: targeting brain scale pretrained models with over 37 million cores
  • Citing Conference Paper
  • April 2022