Article

VCUDA: GPU accelerated high performance computing in virtual machines

... VMGL [28] is a GPU virtualization technology for graphics processing that supports remote rendering to handle OpenGL-based tasks in virtual machines. The RPC-based GPU virtualization technologies used in this paper, such as vCUDA [29] and rCUDA [30], are API-forwarding approaches built on RPC communication: the user virtual machine uses a modified GPU API to request GPU operations from the VM or host system that owns the GPU. This forms a server-client structure in which the user VM sends the API, its parameters, and data to the GPU owner through internal RPC communication, and the GPU owner processes the requested task and returns the result to the user VM. ...
... vCUDA [29], rCUDA [30], and virtio-CL [26] are RPC-based GPU virtualization solutions for cloud environments. RPC-based GPU virtualization has a server-client structure composed of GPGPU task requesters and responders, and it redirects GPGPU commands to the actual GPU through a modified GPGPU API. ...
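As a rough illustration of this API-forwarding pattern, the sketch below shows a hypothetical guest-side stub that serializes a cudaMalloc request over a pre-established socket to the GPU-owning VM. The opcode values, wire format, and the remote_cudaMalloc name are illustrative assumptions, not the actual vCUDA or rCUDA protocol.

```cpp
// Hypothetical guest-side stub: the application links against this library
// instead of the real CUDA runtime, so a cudaMalloc() call is serialized and
// sent to the RPC server running in the GPU-owning VM.
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>

enum class ApiId : uint32_t { CudaMalloc = 1, CudaFree = 2 };  // assumed opcodes

struct RpcHeader {
    ApiId    id;
    uint64_t payload_bytes;
};

// 'sock' is a connected socket to the GPU owner. The returned "device pointer"
// is only meaningful on the server side; the guest treats it as an opaque handle.
int remote_cudaMalloc(int sock, uint64_t* remote_dev_ptr, size_t bytes) {
    RpcHeader hdr{ApiId::CudaMalloc, sizeof(uint64_t)};
    uint64_t sz = bytes;
    send(sock, &hdr, sizeof(hdr), 0);
    send(sock, &sz, sizeof(sz), 0);

    // The server invokes the real cudaMalloc() and replies with (status, ptr).
    int32_t status = -1;
    recv(sock, &status, sizeof(status), MSG_WAITALL);
    recv(sock, remote_dev_ptr, sizeof(*remote_dev_ptr), MSG_WAITALL);
    return status;
}
```

A matching server-side loop would decode the header, call the real CUDA runtime, and send back the status and handle; kernel launches and memory copies would follow the same request/response pattern.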
Article
Full-text available
In remote procedure call (RPC)-based graphic processing unit (GPU) virtualization environments, GPU tasks requested by multiple user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because the thread executing the computation on general GPUs cannot arbitrarily stop the task or trigger context switching, GPU monopoly may be prolonged owing to a long-running general-purpose computing on graphics processing unit (GPGPU) task. Furthermore, when scheduling tasks on the GPU, the time for which each user VM uses the GPU is not considered. Thus, in cloud environments that must provide fair use of computing resources, equal use of GPUs among user VMs cannot be guaranteed. We propose a GPGPU task scheduling scheme based on thread division processing that supports even GPU use by multiple VMs processing GPGPU tasks in an RPC-based GPU virtualization environment. Our method divides the threads of the GPGPU task into several groups and controls the execution time of each thread group to prevent a specific GPGPU task from monopolizing the GPU for a long time. The efficiency of the proposed technique is verified through an experiment in an environment where multiple VMs simultaneously perform GPGPU tasks.
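The thread-division idea can be pictured with a simple host-side sketch (an illustration, not the authors' implementation): the task's grid is launched in slices of thread blocks, and synchronizing after each slice bounds how long the task holds the GPU.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative worker kernel: each thread handles one element, offset by the
// slice's starting block so that several smaller launches cover the full grid.
__global__ void work(float* data, int n, int block_offset) {
    int i = (blockIdx.x + block_offset) * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Launch the task as a sequence of grid slices ("thread groups"). Synchronizing
// after each slice bounds how long this task occupies the GPU, so a scheduler
// could interleave other VMs' work between slices.
void run_in_slices(float* d_data, int n, int blocks_per_slice) {
    const int threads = 256;
    const int total_blocks = (n + threads - 1) / threads;
    for (int off = 0; off < total_blocks; off += blocks_per_slice) {
        int blocks = std::min(blocks_per_slice, total_blocks - off);
        work<<<blocks, threads>>>(d_data, n, off);
        cudaDeviceSynchronize();   // end of this thread group's turn on the GPU
        // ...a fairness check or a short yield could be inserted here...
    }
}
```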
... GPUs are many-core architectures; using Nvidia's CUDA [3] platform, the raw data generated from gigabytes of well logs, coupled with parallel algorithms designed for GPU execution, aims to reduce the time and complexity needed to determine the hydrocarbon-bearing potential of a reservoir. Using the raw data is a viable alternative to using wireline logs or paper-based printouts of the generated waves. ...
... Resistivity is the formation's resistivity and is sampled by an induction-type resistivity tool [2]. The petrophysical evaluation of the log data with respect to the main properties, such as lithology, porosity, permeability, and water saturation, is essential for the evaluation of the reservoir formation [3]. Figure 1 shows a code snippet of the petrophysical properties under evaluation. ...
Conference Paper
Oil and gas companies keep exploring every new possible method to increase the likelihood of finding a commercial hydrocarbon-bearing prospect. Well logging generates gigabytes of data from various probes and sensors. After processing, a prospective reservoir will indicate areas of oil, gas, water, and reservoir rock. Incorporating High Performance Computing (HPC) methodologies allows thousands of potential wells to be assessed for their hydrocarbon-bearing potential. This study presents the use of Graphics Processing Unit (GPU) computing as another method of analyzing probable reserves. Raw well log data from the Kansas Geological Society (1999-2018) forms the basis of the data analysis. Parallel algorithms are developed that make use of Nvidia's Compute Unified Device Architecture (CUDA). The results highlight a 5 times speedup using an Nvidia GeForce GT 330M GPU compared to an Intel Core i7 740QM Central Processing Unit (CPU). The processed results display depth-wise areas of shale and rock formations as well as water, oil, and/or gas reserves.
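To give a flavor of the kind of per-sample petrophysical calculation that maps well onto CUDA, the sketch below applies Archie's water-saturation equation to each depth sample in parallel; the array names and constants are illustrative and not taken from the paper.

```cpp
#include <cuda_runtime.h>

// Illustrative per-depth-sample kernel: Archie's water-saturation equation,
//   Sw = ((a * Rw) / (phi^m * Rt))^(1/n),
// with the commonly used constants a = 1, m = 2, n = 2. Inputs are porosity
// (phi) and deep resistivity (Rt) per sample; Rw is the formation-water
// resistivity.
__global__ void water_saturation(const float* phi, const float* rt,
                                 float rw, float* sw, int n_samples) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_samples) return;
    const float a = 1.0f, m = 2.0f, n = 2.0f;
    float denom = powf(phi[i], m) * rt[i];
    sw[i] = (denom > 0.0f) ? powf((a * rw) / denom, 1.0f / n) : 1.0f;
}
```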
... Given the important role of CUDA in heterogeneous computing, many researchers are also committed to virtualizing GPUs based on CUDA and extending it to distributed systems. Examples include vCUDA [17], rCUDA [3], and the virtualized GPU computing platform [6] proposed by Yang Jingwei et al. The implementation approach is similar to that of dOpenCL, where the client side takes over the CUDA API calls and communicates with the server-side proxy. ...
Article
Full-text available
Heterogeneous computing has been developing continuously in the field of high-performance computing because of its high performance and energy efficiency. More and more accelerators have emerged, such as GPU, FPGA, DSP, AI accelerator, and so on. Usually, the accelerator is connected to the host CPU as a peripheral device to form a tightly coupled heterogeneous computing node, and then, a parallel system is constructed by multiple nodes. This organization is computationally efficient, but not flexible. When new accelerators appear, it is difficult to join the system that has been built. At the hardware level, we create an array of accelerators and connect them to the existing system through a high-speed network. At the software level, we dynamically organize computing resources from various arrays to build a virtual heterogeneous computing node. This approach also includes a standard programming environment. Therefore, it is a more flexible, elastic, and scalable heterogeneous computing organization. In this paper, a supernode OpenCL implementation is proposed for hybrid parallel computing systems, in which virtual supernodes can be dynamically constructed between different computing arrays, and a standard OpenCL environment is implemented based on RDMA communication of high-speed interconnection, which can be combined with the system-level MPI programming environment, thereby realizing the large-scale parallel computing of the hybrid array. SNCL is compatible with existing MPI/OpenCL programs without the need for additional modifications. Experiments show that the runtime overhead of the supernode OpenCL environment is very low, and it is suitable for deploying applications with high computing density and large data scale between different arrays to utilize their computing power without affecting scalability.
... vCUDA [14] is a GPGPU virtualization solution that enables applications running in virtual machine instances to take advantage of hardware acceleration by multiplexing the GPGPU. vCUDA uses API call interception and redirection, leveraging VMRPC, a specialized remote procedure call system for VMs. ...
... Below we detail the scheduling solutions designed to target these features. [127,168]. GPU sharing across different inference requests can significantly improve the resource utilization by better leveraging GPUs' parallel compute capability. However, it can also incur the risks of violating the latency requirements due to the uncertain running time. ...
Preprint
Full-text available
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources. Recently, substantial schedulers are proposed to tailor for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we prospect several promising future research directions. More detailed summary with the surveyed paper and code links can be found at our project website: https://github.com/S-Lab-SystemGroup/Awesome-DL-Scheduling-Papers
... With several advancements in the electronic hardware industry, High Performance Computing (HPC) applications utilize HPC servers. These servers use several conventional processors and graphical processing units (GPUs) [1]- [3]. These advancements play a significant role in increasing the computational power which is very essential for many critical HPC applications. ...
Article
Full-text available
Molecular dynamics (MD) simulations involve computations of forces between atoms and the total energy of the chemical systems. The scientific community depends on high-end servers for such computations, which are generally sequential and highly power hungry, thereby restricting these computations from reaching experimentally relevant large systems. This work explores the concept of parallelizing the code and accelerating it using high-level synthesis (HLS) based Field Programmable Gate Arrays (FPGAs). This work proposes a hardware and software interface to implement parallel algorithms in an FPGA framework, and communication between the software and hardware interface is implemented. The forces of Au147 obtained through the ANN-based interatomic potentials in the proposed model show an acceleration of 1.5 times compared with an expensive server with several nodes. Taking this work forward can result in a lab-on-a-chip application, and this could potentially be applied to several large, experimentally relevant chemical systems.
... A high-end NVIDIA GPU device could cost as much as 2 to 5× that of a high-end Intel Xeon CPU, and, in data centers, a GPU VM (virtual machine) instance could be 10× more expensive than a regular one. These practical observations demonstrate the necessity of efficient mechanisms to share GPUs among different workloads [2][3][4][5][6][7][8], thereby increasing utilization, saving on energy consumption, and improving the cost-efficiency as well as throughput for these systems. ...
Preprint
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and challenging problem. Although services like MPS support simultaneous execution of multiple co-operative kernels on a single device, they do not solve the above problem for uncooperative kernels, MPS being oblivious to the resource needs of each kernel. We propose a fully automated compiler-assisted scheduling framework. The compiler constructs GPU tasks by identifying kernel launches and their related GPU operations (e.g. memory allocations). For each GPU task, a probe is instrumented in the host-side code right before its launch point. At runtime, the probe conveys the information about the task's resource requirements (e.g. memory and compute cores) to a scheduler, such that the scheduler can place the task on an appropriate device based on the task's resource requirements and devices' load in a memory-safe, resource-aware manner. To demonstrate its advantages, we prototyped a throughput-oriented scheduler based on the framework, and evaluated it with the Rodinia benchmark suite and the Darknet neural network framework on NVIDIA GPUs. The results show that the proposed solution outperforms existing state-of-the-art solutions by leveraging its knowledge about applications' multiple resource requirements, which include memory as well as SMs. It improves throughput by up to 2.5x for Rodinia benchmarks, and up to 2.7x for Darknet neural networks. In addition, it improves job turnaround time by up to 4.9x, and limits individual kernel performance degradation to at most 2.5%.
... Shi et al. present a similar solution for GPU acceleration, termed vCUDA [63]. vCUDA encapsulates runtime APIs into RPC calls to achieve API interception and redirection. ...
Preprint
Full-text available
The Internet is responsible for accelerating growth in several fields such as digital media, healthcare, and the military. Furthermore, the Internet was founded on the principle of allowing clients to communicate with servers. Serverless computing, however, is one field that tries to break free from this paradigm. Event-driven compute services allow users to build more agile applications using capacity provisioning and a pay-for-value billing model. This paper provides a formal account of the research contributions in the field of serverless computing.
... vCUDA uses runtime API interception and redirection to provide GPU access to virtual machines [16]. Similar to the previous tools, vCUDA redirects API calls of CUDA applications in the virtual machine to a server process running on the host, which in turn forwards them to the CUDA driver. ...
Article
Full-text available
In high-performance computing and cloud computing, the introduction of heterogeneous computing resources such as GPU accelerators has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket: a transparent and low-overhead solution to GPU virtualization that, due to its open-source nature, enables future research into other virtualization techniques. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable GPU tasks to be distributed dynamically and flexibly across computing nodes and GPU resources to be used by multiple tenants, thereby improving flexibility and utilization for high-performance and cloud computing.
... As a prevalent high-performance architecture, GPU has been widely used in various areas for different purposes such as accelerating big-data processing [10][11][12][13] and assisting operating systems [14][15][16][17][18][19] as a buffer cache. Fig. 3 shows a typical NVIDIA CUDA GPU architecture composed of a set of streaming multiprocessors (SMs) and a GPU main memory (shared among all SMs). ...
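The per-device properties mentioned above, the SM count and the size of the shared global memory, can be queried directly with the CUDA runtime; the short device-query sketch below prints them for each installed GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print the number of streaming multiprocessors (SMs) and the size of the
// GPU main memory shared by all SMs, for every device in the system.
int main() {
    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);
    for (int d = 0; d < dev_count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, %d SMs, %.1f GiB global memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```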
Article
A* search is a best-first search algorithm that is widely used in pathfinding and graph traversal. To meet the ever-increasing demand of performance, various high-performance architectures (e.g., multi-core CPU and GPU) have been explored to accelerate the A* search. However, the current GPU based A* search approaches are merely designed based on single-GPU architecture. Nowadays, the amount of data grows at an exponential rate, making it inefficient or even infeasible for the current A* to process the data sets entirely on a single GPU. In this paper, we propose DA*, a parallel A* search algorithm based on the multi-GPU architecture. DA* enables the efficient acceleration of the A* algorithm using multiple GPUs with effective graph partitioning and data communication strategies. To make the most of the parallelism of multi-GPU architecture, in the state extension phase, we adopt the method of multiple priority queues for the open list, which allows multiple states being calculated in parallel. In addition, we use the parallel hashing of replacement and frontier search mechanism to address node duplication detection and memory bottlenecks respectively. The evaluation shows that DA* is effective and efficient in accelerating A* based computational tasks on the multi-GPU system. Compared to the state-of-the-art A* search algorithm based on a single GPU, our algorithm can achieve up to 3× performance speedup with four GPUs.
... vCUDA [17] is another CUDA-based accelerator virtualization solution for HPC clusters. This solution achieves accelerator virtualization through RPC rather than a middleware layer. ...
Preprint
Full-text available
Edge computing offers the distinct advantage of harnessing compute capabilities on resources located at the edge of the network to run workloads of relatively weak user devices. This is achieved by offloading computationally intensive workloads, such as deep learning from user devices to the edge. Using the edge reduces the overall communication latency of applications as workloads can be processed closer to where data is generated on user devices rather than sending them to geographically distant clouds. Specialised hardware accelerators, such as Graphics Processing Units (GPUs) available in the cloud-edge network can enhance the performance of computationally intensive workloads that are offloaded from devices on to the edge. The underlying approach required to facilitate this is virtualization of GPUs. This paper therefore sets out to investigate the potential of GPU accelerator virtualization to improve the performance of deep learning workloads in a cloud-edge environment. The AVEC accelerator virtualization framework is proposed that incurs minimum overheads and requires no source-code modification of the workload. AVEC intercepts local calls to a GPU on a device and forwards them to an edge resource seamlessly. The feasibility of AVEC is demonstrated on a real-world application, namely OpenPose using the Caffe deep learning library. It is observed that on a lab-based experimental test-bed AVEC delivers up to 7.48x speedup despite communication overheads incurred due to data transfers.
... For example, GPUs can achieve significant speedup for graph processing applications [1], and GPU-assisted network traffic processing in software routers outperforms the multicore-based counterparts [2]. In addition, GPUs can also be used to accelerate big-data processing [3], [4], [5], [6], and assist operating systems [7], [8], [9], [10], [11], [12] as a buffer cache. Rich thread-level parallelism in GPU has motivated co-running GPU kernels on a single GPU. ...
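A minimal example of co-running kernels on one GPU uses CUDA streams: two independent kernels launched into separate streams may be scheduled concurrently when enough SM resources are free. The kernels below are illustrative.

```cpp
#include <cuda_runtime.h>

// Two independent kernels; when each one leaves SMs idle, launching them into
// separate streams lets the hardware overlap their execution on a single GPU.
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void offset(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

void co_run(float* d_x, float* d_y, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, s1>>>(d_x, n);    // may run concurrently with...
    offset<<<blocks, threads, 0, s2>>>(d_y, n);   // ...this kernel
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```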
Article
Full-text available
Rich thread-level parallelism of GPU has motivated co-running GPU kernels on a single GPU. However, when GPU kernels co-run, it is possible that one kernel can leverage buffer overflow to attack another kernel running on the same GPU. Existing work has either large performance overhead or limited capability in detecting buffer overflow. In this paper, we introduce GMODx, a runtime software system that can detect GPU buffer overflow. GMODx performs always-on monitoring on allocated memory based on canary-based design. First, for fine-grained memory management, GMODx introduces a set of byte arrays to store buffer information for overflow detection. Techniques, including lock-free accesses to the byte arrays, delayed memory free, efficient memory reallocation, and garbage collection for the byte arrays, are proposed to achieve high performance. Second, for coarse-grained memory management, GMODx utilizes unified memory to delegate the always-on monitoring to the CPU. To reduce performance overhead, we propose several techniques, including customized list data structure and specific optimizations against the unified memory. Our experiments show that GMODx is capable of detecting buffer overflow for the fine-grained memory management without performance loss, and that it incurs small runtime overhead (4.2% on average and up to 9.7%) for the coarse-grained memory management.
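The canary-based design can be illustrated with a much simpler sketch than GMODx itself: a guard word placed just past an allocation is checked later, and a changed value signals an overflow. The function names and canary value below are assumptions for illustration, not the GMODx implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// A guard word is placed just past the user's buffer at allocation time; if a
// kernel writes past the end, the canary changes and the host detects it.
constexpr uint32_t kCanary = 0xDEADBEEF;

// Allocate n_floats elements plus one trailing canary word on the device.
float* guarded_alloc(size_t n_floats) {
    float* p = nullptr;
    cudaMalloc(&p, n_floats * sizeof(float) + sizeof(uint32_t));
    cudaMemcpy(p + n_floats, &kCanary, sizeof(kCanary), cudaMemcpyHostToDevice);
    return p;
}

// Read the canary back; a mismatch means something wrote beyond the buffer.
bool overflow_detected(const float* p, size_t n_floats) {
    uint32_t value = 0;
    cudaMemcpy(&value, p + n_floats, sizeof(value), cudaMemcpyDeviceToHost);
    return value != kCanary;
}
```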
... In MPS mode, kernels from different applications are not isolated, so it has potential security issues, e.g., side-channel attacks on GPUs [28]. API remoting is another approach to sharing GPUs among remote clients [12,29], but clients are limited by the APIs provided by the frameworks. ...
Preprint
Full-text available
Sharing GPUs in the cloud is cost effective and can facilitate the adoption of hardware accelerator enabled cloud. But sharing causes interference between co-located VMs and leads to performance degradation. In this paper, we proposed an interference-aware VM scheduler at the cluster level with the goal of minimizing interference. NVIDIA vGPU provides sharing capability and high performance, but it has unique performance characteristics, which have not been studied thoroughly before. Our study reveals several key observations. We leverage our observations to construct models based on machine learning techniques to predict interference between co-located VMs on the same GPU. We proposed a system architecture leveraging our models to schedule VMs to minimize the interference. The experiments show that our observations improves the model accuracy (by 15%˜40%) and the scheduler reduces application run-time overhead by 24.2% in simulated scenarios.
... For the analysis, we developed our own malware program and focused mainly on executables that open Internet connections (Le et al. 2008; Shi et al. 2011). All malware samples were run on the host, focusing on unknown executables that attempt Internet-based malicious activity; numerous API calls related to unwanted Internet connections were found inside the malware. ...
Article
Full-text available
Advanced cyber-attacks emphasize stealth and persistence: the longer they stay under the radar, the more they are able to move laterally, exfiltrate information, and cause harm. Attackers increasingly turn to cross-process injection to keep a strategic distance from identification. Cross-process injection helps attackers execute malicious code that takes on the appearance of a legitimate process. Code injection does not require attackers to use specific processes that can be quickly identified; instead, they incorporate malicious code into normal processes (e.g., explorer.exe, regsvr32.exe, svchost.exe…), allowing their operations a wider degree of secrecy and apparent innocence. For the purpose of detecting malware injection, this paper proposes a hypervisor injection attack detection method based on X-cross application programming interface calls (API-HI-attack), which raises awareness that malware can inject into the simulation tool with X-cross-language API calls. The experimental results of the proposed work show that antimalware protectors need to pay more attention to API call hooking for process-level injection by X-cross languages. The proposed method achieves a higher true positive rate (92.96%) and a lower false positive rate (0.07%) than existing methods.
... Similar to many other GPU sharing platforms, like vCUDA [28], rCUDA [13], GViM [17], and Pegasus [18], DCUDA adopts a frontend-backend architecture to ease the implementation while providing full compatibility with all CUDA applications. As shown in Fig. 3, the frontend of DCUDA is implemented as a CUDA wrapper library, which dynamically links to user applications and intercepts CUDA API calls. ...
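The wrapper-library interception that such frontends rely on can be sketched as follows: a preloaded shared library exports a CUDA runtime symbol, logs (or, in a real frontend, serializes) the call, and forwards it to the real runtime via dlsym(RTLD_NEXT, ...). This is a generic interposition sketch, not DCUDA's code.

```cpp
// Generic interposition sketch: build as a shared library and preload it
// (e.g. LD_PRELOAD=./libwrap.so ./app) so this definition of cudaLaunchKernel
// is found before the real CUDA runtime's.
#include <cstdio>
#include <dlfcn.h>
#include <cuda_runtime.h>

using LaunchFn = cudaError_t (*)(const void*, dim3, dim3, void**, size_t, cudaStream_t);

extern "C" cudaError_t cudaLaunchKernel(const void* func, dim3 grid, dim3 block,
                                        void** args, size_t shm, cudaStream_t stream) {
    // Look up the real implementation the first time we are called.
    static LaunchFn real = (LaunchFn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
    fprintf(stderr, "[wrapper] launch: grid=(%u,%u,%u) block=(%u,%u,%u)\n",
            grid.x, grid.y, grid.z, block.x, block.y, block.z);
    // A virtualization frontend would serialize the call and send it to its
    // backend here; this sketch simply forwards it to the real runtime.
    return real(func, grid, block, args, shm, stream);
}
```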
Conference Paper
In clouds and data centers, GPU servers consisting of multiple GPUs are widely deployed. Current state-of-the-art GPU scheduling algorithms are "static" in assigning applications to different GPUs. These algorithms usually ignore the dynamics of GPU utilization and are often inaccurate in estimating resource demand before assigning/running applications, so there is a large opportunity to further balance load and improve GPU utilization. Based on CUDA (Compute Unified Device Architecture), we develop a runtime system called DCUDA which supports "dynamic" scheduling of running applications between multiple GPUs. In particular, DCUDA provides a real-time and lightweight method to accurately monitor the resource demand of applications and GPU utilization. Furthermore, it provides a universal migration facility to migrate running applications between GPUs with negligible overhead. More importantly, DCUDA transparently supports all CUDA applications without changing their source code. Experiments with our prototype system show that DCUDA can reduce the overloaded time of GPUs by 78.3% on average. As a result, for different workloads consisting of a wide range of applications we studied, DCUDA can reduce the average execution time of applications by up to 42.1%. Furthermore, DCUDA also reduces energy by 13.3% in the light-load scenario.
... As HPC or DL tasks usually involve GPU resources in their solution, many attempts have been made to introduce virtualized GPUs into the virtualization environment. For example, approaches such as GViM [8], gVirtuS [9], and vCUDA [10] are based on creating copies of the CUDA API to virtualize the GPU and provide it to the virtualization environment [5], while the rCUDA solution [11] proposes remote GPU usage. However, the above approaches degrade GPU performance during the virtualization process [12]. ...
Article
Full-text available
Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads such as high performance computing (HPC) and deep learning (DL) are deployed in container-based environments. However, GPU resource management issues, especially the GPU memory over-subscription issue in container-based clusters, which brings substantial performance loss, are still challenging. This paper proposes an adaptive fair-share method to share GPU resources effectively in a container-based virtualization environment, as well as an execution rescheduling method to manage the execution order of each container for acquiring maximum performance gain. We also propose a checkpoint-based mechanism, especially for DL workloads running with TensorFlow, which can efficiently solve the GPU memory over-subscription problem. We demonstrate that our approach contributes to overall performance improvement as well as higher resource utilization compared to default and static fair-share methods with homogeneous and heterogeneous workloads. Compared to these two conditions, the results show that the proposed method reduces average execution time by 16.37% and 15.61% and boosts average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated our checkpoint-based mechanism by running multiple CNN workloads with TensorFlow at the same time, and the results show that our proposed mechanism can ensure each workload executes safely without out-of-memory (OOM) errors.
... Instead of the standard approach of using the central processing unit (CPU) for computation, the tendency is to use graphics processing units (GPUs) for this purpose. The reason is that modern GPUs contain a number of stream processors that offer massive computational power [3,4]. This approach requires resolving problems such as transferring the data necessary for the computation to the GPU [5]. ...
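The data-transfer step referred to above is, in its simplest form, an explicit copy to and from GPU memory; the sketch below shows the basic cudaMalloc/cudaMemcpy pattern, omitting error handling and refinements such as pinned memory or asynchronous copies.

```cpp
#include <vector>
#include <cuda_runtime.h>

// Basic offload pattern: copy input over the bus into GPU memory, run kernels
// there, and copy the result back.
void offload(const std::vector<float>& host_in, std::vector<float>& host_out) {
    const size_t bytes = host_in.size() * sizeof(float);
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, host_in.data(), bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on d_buf ...
    host_out.resize(host_in.size());
    cudaMemcpy(host_out.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```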
Article
Full-text available
The finite element method (FEM) has deservedly gained the reputation of the most powerful, highly efficient, and versatile numerical method in the field of structural analysis. Though typical application of FE programs implies the so-called “off-line” computations, the rapid pace of hardware development over the past couple of decades was the major impetus for numerous researchers to consider the possibility of real-time simulation based on FE models. Limitations of available hardware components in various phases of developments demanded remarkable innovativeness in the quest for suitable solutions to the challenge. Different approaches have been proposed depending on the demands of the specific field of application. Though it is still a relatively young field of work in global terms, an immense amount of work has already been done calling for a representative survey. This paper aims to provide such a survey, which of course cannot be exhaustive.
Article
GPUs are becoming attractive in multiple academic and industrial areas because of their massively parallel computing ability. However, there are still obstacles that GPU virtualization technologies must overcome to reach maturity. These obstacles mainly concern the resource allocation strategy needed to guarantee the highest possible yield. This shortcoming has already become an obvious barrier to practical GPU usage in the cloud for satisfying business and academic requirements. There is a large body of mature research on oversubscription in cloud computing to enhance economic efficiency, but the study of GPU oversubscription is almost blank because the use of GPUs in cloud computing has only just begun. This paper introduces gOver, an economy-oriented GPU resource oversubscription system based on a GPU virtualization platform. gOver is able to share and modulate GPU resources among workloads in an adaptive and dynamic manner while guaranteeing the QoS level. We evaluate the proposed gOver strategy with experiments designed around specific workload characteristics. The experimental results show that our dynamic GPU oversubscription solution improves economic efficiency by 20% over the traditional GPU sharing strategy and outperforms the static oversubscription method with much better stability in QoS control.
Article
GPGPU-powered supercomputers are vital for various science and engineering applications. On each cluster node, the GPU works as a coprocessor of the CPU, and the computing task runs alternately on CPU and GPU. Due to this characteristic, traditional task scheduling strategies tend to result in significant workload imbalance and underutilization of GPUs. We design an adaptive scheduling strategy to alleviate such imbalance and underutilization. Our strategy logically treats all GPUs in the cluster as a whole. Every cluster node maintains a local information table of all GPUs. Once a GPU call request is received, a node selects a GPU to run the task in an adaptive manner based on this table. In addition, our strategy does not rely on a global queue and thus avoids excessive inter-node communication overhead. Moreover, we encapsulate our strategy into an intermediate module between the cluster and users. Consequently, the underlying details of task scheduling are transparent to users, which enhances usability. We validate our strategy through experiments.
Article
Graphics processing unit (GPU) virtualization technology enables a single GPU to be shared among multiple virtual machines (VMs), thereby allowing multiple VMs to perform GPU operations simultaneously with a single GPU. Because GPUs exhibit lower resource scalability than central processing units (CPUs), memory, and storage, many VMs encounter resource shortages while running GPU operations concurrently, implying that the VM performing the GPU operation must wait to use the GPU. In this paper, we propose a partial migration technique for general‐purpose graphics processing unit (GPGPU) tasks to prevent the GPU resource shortage in a remote procedure call‐based GPU virtualization environment. The proposed method allows a GPGPU task to be migrated to another physical server's GPU based on the available resources of the target's GPU device, thereby reducing the wait time of the VM to use the GPU. With this approach, we prevent resource shortages and minimize performance degradation for GPGPU operations running on multiple VMs. Our proposed method can prevent GPU memory shortage, improve GPGPU task performance by up to 14%, and improve GPU computational performance by up to 82%. In addition, experiments show that the migration of GPGPU tasks minimizes the impact on other VMs.
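A scheme like this needs some notion of a target GPU's available resources before placing or migrating a task onto it. The sketch below is not the authors' migration mechanism; it only illustrates checking free device memory with cudaMemGetInfo to pick a GPU with enough headroom.

```cpp
#include <cuda_runtime.h>

// Pick a local GPU whose free memory can hold the task's working set; if none
// fits, return -1 and let the caller fall back to migrating the task elsewhere.
int pick_gpu_with_room(size_t required_bytes) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);        // cudaMemGetInfo reports on the current device
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        if (free_bytes >= required_bytes) return d;
    }
    return -1;   // no local GPU has enough headroom
}
```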
Article
Traditionally, High-Performance Computing (HPC) has been associated with large power requirements. The reason was that makers of the processors typically employed in HPC deployments have always focused on getting the highest performance from their designs, regardless of the energy their processors may consume. Actually, for many years heat dissipation was the only real barrier to achieving higher performance, at the cost of higher energy consumption. However, a new trend has recently appeared consisting of the use of low-power processors for HPC purposes. The MontBlanc and Isambard projects are good examples of this trend. These proposals, however, do not consider the use of GPUs. In this paper we propose to use GPUs in this kind of low-power-processor-based HPC deployment by making use of the remote GPU virtualization mechanism. To that end, we leverage the rCUDA middleware in a hybrid cluster composed of low-power Atom-based nodes and regular Xeon-based nodes equipped with GPUs. Our experiments show that, by making use of rCUDA, the execution time of applications belonging to the physics domain is noticeably reduced, achieving a speedup of up to 140x with just one remote NVIDIA V100 GPU with respect to the execution of the same applications using 8 Atom-based nodes. Additionally, a rough energy consumption estimation reports improvements in energy demands of up to 37x.
Article
Currently, the virtualization technologies for cloud computing infrastructures that support extra devices such as GPUs still require development and refinement. This requirement is especially evident in the direction of resource sharing and allocation under performance constraints such as quality of service (QoS) guarantees, considering the closed GPU platform. This deficiency significantly limits the applicability of cloud platforms that aim to support the efficient and fluent execution of business and academic workloads. This paper introduces gQoS, an adaptive virtualized GPU resource capacity sharing system under a QoS target, which is able to share and allocate virtualized GPU resources among workloads adaptively, guaranteeing the QoS level with stability and accuracy. We evaluate the workloads and compare our gQoS strategy with other allocation strategies. The experiments show that our strategy guarantees much better accuracy and stability in QoS control, and that total GPU resource usage under gQoS can be reduced by up to 25.85% compared with other strategies.
Article
This paper introduces gMig, an open-source and practical vGPU live migration solution for full virtualization. Taking advantage of the dirty pattern of GPU workloads, gMig presents the One-Shot Pre-Copy mechanism combined with the hashing-based Software Dirty Page technique to achieve efficient vGPU live migration. In particular, we propose three core techniques for gMig: 1) Dynamic Graphics Address Remapping, which parses and manipulates GPU commands to adjust the address mapping to adapt to a different environment after migration; 2) Software Dirty Page, which uses a hashing-based approach with sampling pre-filtering to detect page modification, overcomes the commodity GPU's hardware limitation, and speeds up the migration by sending only the dirtied pages; 3) Overlapped Migration Process, which significantly compresses the hanging overhead by overlapping dirty page verification and transmission. Our evaluation shows that gMig achieves GPU live migration with an average downtime of 302 ms on Windows and 119 ms on Linux. With the help of Software Dirty Page, the number of GPU pages transferred during the downtime is effectively reduced by up to 80.0%. The design of the sampling filter and overlapped processing brings further improvements of 30.0% and 10.0% in page processing, respectively.
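The hashing-based Software Dirty Page idea can be illustrated independently of gMig: hash each page at the pre-copy snapshot, re-hash later, and resend only the pages whose hash changed. The page size, FNV-1a hash, and function names below are illustrative choices, not gMig's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 4096;

// FNV-1a, used here only for brevity; any fast hash with few collisions works.
uint64_t page_hash(const uint8_t* p, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Compare current page hashes against the previous snapshot, return the indices
// of pages that changed, and refresh the stored hashes for the next round.
std::vector<size_t> dirty_pages(const uint8_t* mem, size_t bytes,
                                std::vector<uint64_t>& prev) {
    const size_t pages = bytes / kPageSize;
    prev.resize(pages, 0);
    std::vector<size_t> dirty;
    for (size_t i = 0; i < pages; ++i) {
        uint64_t h = page_hash(mem + i * kPageSize, kPageSize);
        if (h != prev[i]) { dirty.push_back(i); prev[i] = h; }
    }
    return dirty;
}
```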
Article
The multiple sensors and touch capabilities of mobile devices are defining new methods of computer interaction. However, the computing power of such devices is not currently sufficient for new applications that require compute-intensive processing. Using graphics processing units (GPUs) for general-purpose computing with GPU programming models such as the Compute Unified Device Architecture (CUDA) has been proven to accelerate simulations in supercomputers. Although CUDA-capable chips such as the Tegra K1, which have been released in tablets, can accelerate computer simulations, their absolute computing power and performance per watt are not comparable with those of ordinary GPUs. In this paper, we analyze a heterogeneous system composed of a tablet (client) and a notebook with a low-power GPU (server). Intensive computations on the tablet device are offloaded to the notebook GPU using the rCUDA middleware. Molecular dynamics (MD) simulations are performed using our test system, and the computing speed and performance per watt are reported. Implementing dynamic parallelism (DP) reduced the latency, doubling the total frames per second in some cases. Our system achieves better computational performance and higher performance per watt than a tablet powered by a CUDA-capable GPU. We achieved 21.7 Gflops/W by combining multiple client tablets and the server, compared with 21.3 Gflops/W from the server itself.
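Dynamic parallelism, as mentioned above, lets a kernel launch further kernels without returning to the host. The minimal sketch below is illustrative rather than the paper's MD code; it requires a device of compute capability 3.5 or higher and compilation with relocatable device code (nvcc -rdc=true).

```cpp
#include <cuda_runtime.h>

// Child kernel: one simple update per element.
__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Parent kernel: launches the child grid from the device itself, avoiding a
// round trip to the host between the two phases.
__global__ void parent(float* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(data, n);
    }
}
```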
Article
The use of Graphics Processing Units (GPUs) has become a very popular way to accelerate the execution of many applications. However, GPUs are not exempt from side effects. For instance, GPUs are expensive devices that additionally consume a non-negligible amount of energy even when they are not performing any computation. Furthermore, most applications present low GPU utilization. To address these concerns, the use of GPU virtualization has been proposed. In particular, remote GPU virtualization is a promising technology that allows applications to transparently leverage GPUs installed in any node of the cluster. In this paper, the remote GPU virtualization mechanism is comparatively analyzed across three different generations of GPUs. The first contribution of this study is an analysis of how the performance of the remote GPU virtualization technique is impacted by the underlying hardware. To that end, the Tesla K20, Tesla K40, and Tesla P100 GPUs along with FDR and EDR InfiniBand fabrics are used in the study. The analysis is performed in the context of the rCUDA middleware. It is clearly shown that the GPU virtualization middleware requires a comprehensive design of its communication layer, which should be perfectly adapted to every hardware generation in order to avoid a reduction in performance. This is precisely the second contribution of this work, i.e., redesigning the rCUDA communication layer in order to improve the management of the underlying hardware. Results show that it is possible to improve bandwidth by up to 29.43%, which translates into up to 4.81% lower average execution time for the analyzed applications.