Figure 3. Overall design of runtime.

Source publication
Conference Paper
Full-text available
Graphics Processing Units (GPUs) are increasingly becoming part of HPC clusters. Nevertheless, cloud computing services and resource management frameworks targeting heterogeneous clusters that include GPUs are still in their infancy. Further, GPU software stacks (e.g., CUDA driver and runtime) currently provide very limited support for concurrency. In...

Context in source publication

Context 1
... overall design of our runtime is illustrated in Figure 3. The basic components are: connection manager, dispatcher, virtual GPUs (vGPUs), and memory manager. ...
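These four components map naturally onto a small host-side skeleton. The sketch below is purely illustrative: the class names, the one-CUDA-stream-per-vGPU choice, and the round-robin dispatch policy are assumptions for exposition, not details taken from the paper.

```cpp
// Hypothetical structural sketch of the runtime in Figure 3 (names are
// illustrative, not the paper's implementation).
#include <cuda_runtime.h>
#include <cstddef>
#include <queue>
#include <unordered_map>
#include <vector>

struct KernelRequest {            // work item received from a client
    int client_id;
    void (*launch)(cudaStream_t); // host callback that launches the kernel
};

class MemoryManager {             // tracks per-client device allocations
public:
    void* allocate(int client, size_t bytes) {
        void* p = nullptr;
        if (cudaMalloc(&p, bytes) != cudaSuccess) return nullptr;
        owned_[client].push_back(p);
        return p;
    }
    void releaseAll(int client) { // reclaim everything when a client leaves
        for (void* p : owned_[client]) cudaFree(p);
        owned_.erase(client);
    }
private:
    std::unordered_map<int, std::vector<void*>> owned_;
};

struct VirtualGPU {               // one vGPU = one CUDA stream + its backlog
    cudaStream_t stream;
    std::queue<KernelRequest> pending;  // backlog a real dispatcher would drain
};

class Dispatcher {                // assigns incoming requests to vGPUs
public:
    explicit Dispatcher(int num_vgpus) : vgpus_(num_vgpus) {
        for (auto& v : vgpus_) cudaStreamCreate(&v.stream);
    }
    void dispatch(const KernelRequest& r) {
        VirtualGPU& v = vgpus_[next_++ % vgpus_.size()]; // simple round-robin
        r.launch(v.stream);
    }
private:
    std::vector<VirtualGPU> vgpus_;
    size_t next_ = 0;
};

// A ConnectionManager (not shown) would accept client connections and feed
// KernelRequest objects to Dispatcher::dispatch().
```

In this reading, the connection manager feeds client requests to the dispatcher, the dispatcher multiplexes them onto vGPUs, and the memory manager tracks per-client device allocations so they can be reclaimed when a client disconnects.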

Similar publications

Article
Full-text available
High-performance computing is increasingly based on heterogeneous architectures, and GPGPUs have become one of their main integrated blocks, such as the recently emerged Mali GPU in embedded systems or the NVIDIA GPUs in HPC servers. In both GPGPUs, programming can become a hurdle that limits their adoption, since the programmer has to lear...

Citations

... In contrast, our approach does not require scheduler changes, and the number of GTT copies is the same as in the predictive copy approach. Gdev [21], GDM [22], RSVM [23], and VMBR [24] address the problem of insufficient GPU memory. They use system memory as backup memory and copy data from GPU memory to system memory at runtime when applications need more memory to run. ...
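The "system memory as backup" idea that these systems share can be sketched in a few lines. The sketch below is only an illustration under assumed bookkeeping (the Region record, the LRU list, and allocWithSwap are invented names); it is not code from Gdev, GDM, RSVM, or VMBR:

```cpp
// Illustrative only: when a device allocation fails, an idle victim buffer
// is staged out to pinned host memory and freed, then the allocation retries.
#include <cuda_runtime.h>
#include <cstddef>
#include <list>

struct Region { void* dev; void* host; size_t bytes; bool resident; };

std::list<Region*> lru;   // resident regions, least-recently-used first

void* allocWithSwap(size_t bytes) {
    for (;;) {
        void* p = nullptr;
        if (cudaMalloc(&p, bytes) == cudaSuccess) return p;
        cudaGetLastError();                     // clear the OOM error code
        if (lru.empty()) return nullptr;        // nothing left to evict
        Region* victim = lru.front(); lru.pop_front();
        cudaMallocHost(&victim->host, victim->bytes);   // pinned host backup
        cudaMemcpy(victim->host, victim->dev, victim->bytes,
                   cudaMemcpyDeviceToHost);
        cudaFree(victim->dev);
        victim->resident = false;   // must be staged back in before next use
    }
}
```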
Article
Full-text available
The proliferation of GPU intensive workloads has created a new challenge for low-overhead and efficient GPU virtualization solutions over GPU clouds. gVirt is a full GPU virtualization solution for Intel's integrated GPUs, which share the system's on-board memory as graphics memory. In order to solve the inherent scalability limitation on the number of simultaneous virtual machines (VMs) in gVirt, gScale proposed a dynamic sharing scheme for global graphics memory among VMs by copying the entries in a private graphics translation table (GTT) to a physical GTT along with a GPU context switch. However, copying entries between the private GTT and the physical GTT often causes significant overhead, which becomes worse when the global graphics memory space shared by each VM is overlapped. This paper identifies the copy overhead caused by GPU context switches as one of the major bottlenecks in performance improvement and proposes a low-overhead dynamic memory management scheme called DymGPU. DymGPU provides two memory allocation algorithms: size-based and utilization-based. While the size-based algorithm allocates memory space based on the memory size required by each VM, the utilization-based algorithm considers the GPU utilization of each VM to allocate memory space. DymGPU is also dynamic in the sense that the global graphics memory space used by each VM is rearranged at runtime by periodically checking idle VMs and the GPU utilization of each runnable VM. We have implemented our proposed approach in gVirt and confirmed that the proposed scheme reduces GPU context switch time by up to 53% and improves the overall performance of various GPU applications by up to 39%.
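As a rough illustration of the utilization-based allocation idea (not DymGPU's implementation; the structure names and the proportional-split policy are assumptions), each VM's slice of the shared graphics-memory aperture could be sized in proportion to its recent GPU utilization, with non-overlapping offsets so fewer GTT entries need copying at a context switch:

```cpp
// Toy utilization-proportional split of a shared graphics-memory aperture.
#include <cstddef>
#include <vector>

struct VmShare { double utilization; size_t offset; size_t bytes; };

void partition(std::vector<VmShare>& vms, size_t aperture_bytes) {
    if (vms.empty()) return;
    double total = 0;
    for (const auto& v : vms) total += v.utilization;
    size_t cursor = 0;
    for (auto& v : vms) {
        v.bytes  = total > 0 ? static_cast<size_t>(aperture_bytes *
                                                   (v.utilization / total))
                             : aperture_bytes / vms.size();
        v.offset = cursor;                 // non-overlapping layout reduces
        cursor  += v.bytes;                // GTT entry copies at context switch
    }
}
```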
... There have been many studies on improving the performance of multiple applications by considering the dependencies arising from data copy operations. Jablin et al. [14] and Becchi et al. [2] proposed software runtime environments that handle data allocation and transfers on the fly by keeping track of dependencies across kernel executions. Using a similar technique, Sajjapongse et al. [20] distributed kernels to multiple GPUs to reduce waiting times on kernel dependencies. ...
... MP-Tomasulo [26] is a parallel execution engine that improved processor performance by reordering instructions. Our study is similar to these studies [2,14,20,26,28] in that it investigates congestion caused by data dependencies and reorders tasks to improve performance. In contrast, we focus on multiple applications without dependencies among them, which can therefore be executed concurrently. ...
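The dependency-tracking idea discussed in these excerpts can be illustrated with a small host-side sketch (illustrative only, not the cited runtimes' code): kernels that touch a buffer already in use are serialized on that buffer's stream, while independent kernels are spread across streams so they can overlap.

```cpp
// Toy dependency-aware stream assignment for kernel launches.
#include <cuda_runtime.h>
#include <cstddef>
#include <unordered_map>
#include <vector>

class DepScheduler {
public:
    explicit DepScheduler(int num_streams) : streams_(num_streams) {
        for (auto& s : streams_) cudaStreamCreate(&s);
    }
    // Pick a stream for a kernel based on the buffers it reads or writes.
    cudaStream_t streamFor(const std::vector<void*>& buffers) {
        for (void* b : buffers) {               // dependency found: reuse the
            auto it = last_user_.find(b);       // producer's stream so the
            if (it != last_user_.end()) {       // ordering is preserved
                remember(buffers, it->second);
                return it->second;
            }
        }
        cudaStream_t s = streams_[next_++ % streams_.size()];
        remember(buffers, s);                   // independent: round-robin
        return s;
    }
    // Note: if a kernel's buffers span several streams, a real runtime would
    // also insert cross-stream events; this sketch only honors the first hit.
private:
    void remember(const std::vector<void*>& buffers, cudaStream_t s) {
        for (void* b : buffers) last_user_[b] = s;
    }
    std::vector<cudaStream_t> streams_;
    std::unordered_map<void*, cudaStream_t> last_user_;
    size_t next_ = 0;
};
```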
Article
Full-text available
General-purpose graphics processing units (GPGPUs) have been widely adopted by industry due to the high parallelism of graphics processing units (GPUs) compared with central processing units (CPUs). In particular, GPGPU devices have been adopted for various scientific workloads that exhibit high parallelism. To handle the ever-increasing demand, multiple applications are often run simultaneously on multiple GPGPU devices. However, when multiple applications are running concurrently, the overall performance of GPGPU devices varies significantly due to the different characteristics of GPGPU applications. To improve efficiency, it is critical to anticipate the performance of applications and find an optimal scheduling policy. In this paper, we analyze various types of scientific applications and identify factors that impact the performance during the concurrent execution of the applications on GPGPU devices. Our analysis shows that each application has distinct characteristics. By considering these characteristics, certain combinations of applications deliver better performance than others when executed concurrently on multiple GPGPU devices. Based on the findings of our analysis, we propose a simulator that predicts the performance of GPGPU devices when multiple applications are running concurrently. Our simulator collects performance metrics during the execution of applications and predicts the performance of certain combinations using these metrics. The experimental results show that the best combination of applications can increase performance by 39.44% and 65.98% compared with the average of combinations and the worst case, respectively, when using a single GPGPU device. When utilizing multiple GPGPU devices, our results show that the performance improvement can be 24.78% and 39.32% compared with the average and the worst combinations, respectively.
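A toy version of the prediction step might look like the sketch below. The interference formula and metric names are assumptions for illustration; the paper's simulator derives its predictions from collected performance metrics rather than this ad hoc score.

```cpp
// Purely illustrative: score every pair of applications from standalone
// metrics and pick the pair predicted to co-run with the least contention.
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct AppMetrics {
    std::string name;
    double compute_util;   // 0..1, SM utilization when the app runs alone
    double mem_bw_util;    // 0..1, memory-bandwidth utilization when alone
};

// Lower score = less predicted contention when co-running the two apps.
double contentionScore(const AppMetrics& a, const AppMetrics& b) {
    return a.mem_bw_util * b.mem_bw_util * 2.0 +   // bandwidth contention
           a.compute_util * b.compute_util;        // SM contention
}

// Assumes apps.size() >= 2; unpicked apps would be paired in later rounds.
std::pair<size_t, size_t> bestPair(const std::vector<AppMetrics>& apps) {
    std::pair<size_t, size_t> best{0, 1};
    double best_score = 1e300;
    for (size_t i = 0; i < apps.size(); ++i)
        for (size_t j = i + 1; j < apps.size(); ++j) {
            double s = contentionScore(apps[i], apps[j]);
            if (s < best_score) { best_score = s; best = {i, j}; }
        }
    return best;
}
```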
... Therefore, in our current implementation, our frontend module simply throws out-of-memory exceptions when a container attempts to allocate more space than it requested. As discussed in related work, there are some existing approaches [4,19,32] to support memory over-commitment, and our work can be integrated with these solutions to support more flexible GPU memory sharing. ...
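The behaviour described in this excerpt, rejecting allocations beyond what the container requested, can be sketched as a thin wrapper around cudaMalloc. The names and structure are assumptions, not the authors' frontend module:

```cpp
// Hedged sketch: enforce a per-container GPU memory quota and surface an
// out-of-memory error when an allocation would exceed it.
#include <cuda_runtime.h>
#include <cstddef>

struct GpuQuota {
    size_t requested_bytes;   // memory the container asked for at start-up
    size_t used_bytes = 0;    // memory it has allocated so far
};

cudaError_t quotaMalloc(GpuQuota& q, void** ptr, size_t bytes) {
    if (q.used_bytes + bytes > q.requested_bytes)
        return cudaErrorMemoryAllocation;     // report OOM to the application
    cudaError_t err = cudaMalloc(ptr, bytes);
    if (err == cudaSuccess) q.used_bytes += bytes;
    return err;
}
```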
... Supporting memory over-commitment on GPUs is also an important topic for GPU sharing, especially for memory-intensive jobs. Several approaches [4,19,32] have been proposed based on the virtual memory mechanism, so that GPU memory contents can be swapped to host memory when the corresponding GPU kernel is not running. Although these techniques can increase the chances of GPU sharing, they also risk introducing additional performance overhead from memory swapping operations due to the limited memory bandwidth between host and device. ...
... Sengupta et al. [31] focus on virtualizing the GPU as a whole in a cloud with multiple GPUs. Becchi et al. [3] study a virtual memory system that isolates the memory spaces of concurrent kernels and allows kernels whose aggregate memory footprint exceeds the GPU's memory capacity to execute concurrently. By contrast, Pagoda virtualizes the compute resources of a single GPU at the granularity of a warp. ...
Article
Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize their hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This article presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 5.52X over PThreads running on a 20-core CPU, 1.76X over CUDA-HyperQ, and 1.44X over GeMTC, the state-of-the-art runtime GPU task scheduling system.
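The daemon-kernel idea can be illustrated with a much-simplified persistent kernel in which each warp polls for work published by the host. This is a sketch in the spirit of the description above, not Pagoda's MasterKernel; the queue layout, zero-copy polling, and shutdown handling are assumptions.

```cpp
// Illustrative persistent "daemon" kernel: each warp claims a task index with
// an atomic on a device counter and processes host-published work until a
// shutdown flag is raised. Launch with blockDim.x a multiple of 32.
#include <cuda_runtime.h>

__global__ void daemonKernel(const volatile int* published,  // host-mapped
                             const volatile int* shutdown,   // host-mapped
                             const float* tasks,             // host-mapped data
                             int* head,                       // device counter
                             float* results) {                // device output
    const int lane = threadIdx.x % 32;
    while (!*shutdown) {
        int task = -1;
        if (lane == 0) task = atomicAdd(head, 1);        // one claim per warp
        task = __shfl_sync(0xffffffffu, task, 0);
        while (task >= *published && !*shutdown) { }     // poll for the host
        if (*shutdown) return;  // a real runtime would drain remaining work
        results[task] = tasks[task] * 2.0f;              // toy "narrow task"
    }
}

// Host side (not shown): allocate published/shutdown/tasks with
// cudaHostAlloc(..., cudaHostAllocMapped), obtain device pointers via
// cudaHostGetDevicePointer, launch daemonKernel once with enough warps, then
// append tasks and bump *published as work arrives; set *shutdown to stop.
```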
... Kato et al. [35], Wang et al. [36], Ji et al. [37], and Becchi et al. [38] proposed technologies for solving the problem of insufficient GPU memory when compute unified device architecture (CUDA) applications run in the NVIDIA GPU environment. When the amount of GPU memory is insufficient, data in the GPU memory are moved to the system memory to free space in the GPU memory, which is then allocated to the applications. ...
Article
Full-text available
Advances in virtualization technology have enabled multiple virtual machines (VMs) to share resources in a physical machine (PM). With the widespread use of graphics-intensive applications, such as two-dimensional (2D) or 3D rendering, many graphics processing unit (GPU) virtualization solutions have been proposed to provide high-performance GPU services in a virtualized environment. Although elasticity is one of the major benefits in this environment, the allocation of GPU memory is still static in the sense that after the GPU memory is allocated to a VM, it is not possible to change the memory size at runtime. This causes underutilization of GPU memory or performance degradation of a GPU application due to the lack of GPU memory when an application requires a large amount of GPU memory. In this paper, we propose a GPU memory ballooning solution called gBalloon that dynamically adjusts the GPU memory size at runtime according to the GPU memory requirement of each VM and the GPU memory sharing overhead. The gBalloon extends the GPU memory size of a VM by detecting performance degradation due to the lack of GPU memory. The gBalloon also reduces the GPU memory size when the overcommitted or underutilized GPU memory of a VM creates additional overhead for the GPU context switch or the CPU load due to GPU memory sharing among the VMs. We implemented the gBalloon by modifying gVirt, a full GPU virtualization solution for Intel's integrated GPUs. Benchmarking results show that the gBalloon dynamically adjusts the GPU memory size at runtime, which improves the performance by up to 8% against the gVirt with 384 MB of high global graphics memory and 32% against the gVirt with 1024 MB of high global graphics memory.
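The ballooning policy described above can be illustrated with a toy rebalancing loop. This is an assumption-laden sketch (the thresholds, step size, and structure names are invented for exposition), not gBalloon's algorithm:

```cpp
// Toy ballooning policy: grow the GPU memory share of VMs showing memory
// pressure, shrink the share of idle or over-provisioned VMs, and keep the
// total within the physical budget.
#include <cstddef>
#include <vector>

struct VmState {
    size_t share_bytes;   // GPU memory currently assigned to the VM
    size_t used_bytes;    // GPU memory the VM is actually using
    double gpu_util;      // recent GPU utilization of the VM, 0..1
    bool   pressure;      // e.g., allocation failures or swap activity seen
};

void rebalance(std::vector<VmState>& vms, size_t step, size_t total_budget) {
    size_t assigned = 0;
    for (auto& vm : vms) {
        if (vm.pressure)                                 // lack of GPU memory:
            vm.share_bytes += step;                      // inflate the balloon
        else if (vm.gpu_util < 0.05 && vm.used_bytes + step < vm.share_bytes)
            vm.share_bytes -= step;                      // idle VM: deflate
        assigned += vm.share_bytes;
    }
    // Trim evenly if the sum of shares exceeds the physical budget.
    while (assigned > total_budget) {
        size_t before = assigned;
        for (auto& vm : vms) {
            if (assigned <= total_budget) break;
            if (vm.share_bytes >= vm.used_bytes + step) {
                vm.share_bytes -= step;
                assigned -= step;
            }
        }
        if (assigned == before) break;   // nothing left to trim safely
    }
}
```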
... Multiple techniques to effectively share the GPU [11,54,55] have been proposed. Sharing is driven by the observation that GPU devices are often under-utilized. ...
Article
Full-text available
Integrated GPU systems are a cost-effective and energy-efficient option for accelerating data-intensive applications. While these platforms have reduced overhead of offloading computation to the GPU and potential for fine-grained resource scheduling, there remain several open challenges: (1) the distinct execution models inherent in the heterogeneous devices present on such platforms drives the need to dynamically match workload characteristics to the underlying resources, (2) the complex architecture and programming models of such systems require substantial application knowledge to achieve high performance, and (3) as such systems become prevalent, there is a need to extend their utility from running known regular data-parallel applications to the broader set of input-dependent, irregular applications common in enterprise settings. The key contribution of our research is to enable runtime specialization on such integrated GPU platforms by matching application characteristics to the underlying heterogeneous resources for both regular and irregular workloads. Our approach enables profile-driven resource management and optimizations for such platforms, providing high application performance and system throughput. Toward this end, this work proposes two novel schedulers with distinct goals: (a) a device-affinity, contention-aware scheduler that incorporates instrumentation-driven optimizations to improve the throughput of running diverse applications on integrated CPU–GPU servers, and (b) a specialized, affinity-aware work-stealing scheduler that efficiently distributes work across all CPU and GPU cores for the same application, taking into account both application characteristics and architectural differences of the underlying devices.
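The second scheduler's idea, affinity-aware work stealing across CPU and GPU workers, can be sketched with two work deques. This is an illustration of the general technique under assumed names and a single-threaded simplification, not the paper's scheduler:

```cpp
// Toy affinity-aware work stealing: each chunk carries an affinity hint; an
// idle worker steals from the other worker's deque, preferring chunks that
// match its own device. A real scheduler would synchronize the deques.
#include <deque>
#include <iterator>
#include <optional>

enum class Device { CPU, GPU };

struct Chunk { int first, last; Device affinity; };

struct Worker {
    Device dev;
    std::deque<Chunk> local;     // this worker's own queue

    std::optional<Chunk> next(Worker& victim) {
        if (!local.empty()) {                        // own work first
            Chunk c = local.front(); local.pop_front(); return c;
        }
        // Steal from the back of the victim, preferring matching affinity.
        for (auto it = victim.local.rbegin(); it != victim.local.rend(); ++it)
            if (it->affinity == dev) {
                Chunk c = *it;
                victim.local.erase(std::next(it).base());
                return c;
            }
        if (!victim.local.empty()) {                 // otherwise take anything
            Chunk c = victim.local.back(); victim.local.pop_back(); return c;
        }
        return std::nullopt;                         // no work anywhere
    }
};
```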
... Besides software methods, architectural solutions have been studied as well. The fairness of concurrent GPU applications can be improved by augmenting the memory scheduling scheme [10,24], by providing a virtual memory system [4], or by making the GPU device preemptible [17,21,22]. Jog et al. [14,15,25] have also looked at concurrent kernel execution, but with the aim of making the GPU's memory system aware of it, as opposed to our approach, which focuses on finding beneficial kernel pairs. ...
Conference Paper
Executing multiple OpenCL kernels on the same GPU concurrently is a promising method for improving hardware utilisation and system performance. Scheduling schemes significantly impact the resulting performance by selecting different kernels to run together on the same GPU. Existing approaches use either execution time or relative speedup of kernels as a guide to group and map them to the device. However, these simple methods come at the cost of suboptimal performance. In this paper, we propose a graph-based algorithm to schedule co-run kernels in pairs to optimise system performance. Target workloads are represented by a graph in which vertices stand for distinct kernels, while an edge between two vertices indicates that co-executing the corresponding two kernels delivers better performance than running them one after another. Edges are weighted to quantify the performance gain from co-execution. Our algorithm finds the maximum weighted matching of the graph. By maximising the accumulated weights, our algorithm improves performance significantly compared to other approaches.
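The pairing step can be illustrated as follows. For brevity the sketch uses a greedy matching by descending edge weight, whereas the paper computes an exact maximum weighted matching of the co-run graph; the names and data layout are assumptions:

```cpp
// Greedy co-run pairing over a weighted co-execution-gain graph.
#include <algorithm>
#include <utility>
#include <vector>

struct Edge { int u, v; double gain; };   // gain of co-running kernels u and v

// Returns pairs of kernel ids to co-schedule; unmatched kernels run alone.
std::vector<std::pair<int, int>> pairKernels(int num_kernels,
                                             std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.gain > b.gain; });
    std::vector<bool> matched(num_kernels, false);
    std::vector<std::pair<int, int>> pairs;
    for (const Edge& e : edges)
        if (!matched[e.u] && !matched[e.v]) {
            matched[e.u] = matched[e.v] = true;
            pairs.emplace_back(e.u, e.v);
        }
    return pairs;
}
```

A production implementation would replace the greedy loop with a blossom-style maximum weight matching to obtain the optimal pairing the paper describes.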
... Also, in a data center, multiple applications or users are scheduled on the same physical hardware to increase utilization. Data center GPUs now include ways to share the same hardware between multiple tasks and users: current hardware can expose multiple virtual devices per physical GPU that share hardware with temporal partitioning [45], and other work has examined how to add memory virtualization and protection to GPUs [15]. ...
... 15: Normalized total GPU energy, including added instruction and memory accesses. The "No RF" entry is the upper bound for energy savings, which uses the baseline performance and a register file that consumes no energy. ...
Thesis
Graphics Processing Units (GPUs) are becoming common in data centers for tasks like neural network training and image processing due to their high performance and efficiency. GPUs maintain high throughput by running thousands of threads simultaneously, issuing instructions from ready threads to hide latency in others that are stalled. While this is effective for keeping the arithmetic units busy, the challenge in GPU design is moving the data for computation at the same high rate. Any inefficiency in data movement and storage will compromise the throughput and energy efficiency of the system. Since energy consumption and cooling make up a large part of the cost of provisioning and running a data center, making GPUs more suitable for this environment requires removing the bottlenecks and overheads that limit their efficiency. The performance of GPU workloads is often limited by the throughput of the memory resources inside each GPU core, and though many of the power-hungry structures in CPUs are not found in GPU designs, there is overhead for storing each thread's state. When sharing a GPU between workloads, contention for resources also causes interference and slowdown. This thesis develops techniques to manage and streamline the data movement and storage resources in GPUs in each of these places. The first part of this thesis resolves data movement restrictions inside each GPU core. The GPU memory system is optimized for sequential accesses, but many workloads load data in irregular or transposed patterns that cause a throughput bottleneck even when all loads are cache hits. This work identifies and leverages opportunities to merge requests across threads before sending them to the cache. While requests are waiting for merges, they can be reordered to achieve a higher cache hit rate. These methods yielded a 38% speedup for memory-throughput-limited workloads. Another opportunity for optimization is found in the register file. Since it must store the registers for thousands of active threads, it is the largest on-chip data storage structure on a GPU. The second work in this thesis replaces the register file with a smaller, more energy-efficient register buffer. Compiler directives allow the GPU to know ahead of time which registers will be accessed, allowing the hardware to store only the registers that will be imminently accessed in the buffer, with the rest moved to main memory. This technique reduced total GPU energy by 11%. Finally, in a data center, many different applications will be launching GPU jobs, and just as multiple processes can share the same CPU to increase its utilization, running multiple workloads on the same GPU can increase its overall throughput. However, co-runners interfere with each other in unpredictable ways, especially when sharing memory resources. The final part of this thesis controls this interference, allowing a GPU to be shared between two tiers of workloads: one tier with a high performance target and another suitable for batch jobs without deadlines. At a 90% performance target, this technique increased GPU throughput by 9.3%. GPUs' high efficiency and performance make them a valuable accelerator in the data center. The contributions in this thesis further increase their efficiency by removing data movement and storage overheads and unlock additional performance by enabling resources to be shared between workloads while controlling interference.
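The request-merging idea from the first part of the thesis can be illustrated with a small software model (assumed 128-byte cache lines; this is a model, not the proposed hardware): the addresses issued by a group of threads are collapsed to the distinct cache lines that actually need to be fetched.

```cpp
// Toy model of merging per-thread memory requests into unique cache lines.
#include <cstdint>
#include <unordered_set>
#include <vector>

constexpr std::uint64_t kLineBytes = 128;   // assumed cache-line size

std::vector<std::uint64_t> mergeRequests(
        const std::vector<std::uint64_t>& thread_addrs) {
    std::unordered_set<std::uint64_t> seen;
    std::vector<std::uint64_t> unique_lines;
    for (std::uint64_t a : thread_addrs) {
        std::uint64_t line = a / kLineBytes * kLineBytes;  // line base address
        if (seen.insert(line).second) unique_lines.push_back(line);
    }
    return unique_lines;   // e.g., many strided accesses collapse to few lines
}
```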
... These attacks offer a number of advantages that may make them attractive to attackers compared to CPU covert channels, including: (1) with GPU-accelerated computing available on major cloud platforms such as Google Cloud Platform, IBM Cloud, and Amazon Web Services [26], this threat is substantial [35]. The model of sharing GPUs in the cloud is evolving, but allowing sharing of remote GPUs is a possibility [3,8,31,34]. Therefore, GPU covert channels may provide attackers with additional opportunities to co-locate, which is a prerequisite for these types of attacks [35]; (2) GPGPUs operate as accelerators with separate resources that do not benefit from the protections offered by an operating system. ...
... We assume that the two applications can launch kernels on the same GPGPU; in a cloud setting, a first problem is to establish this ability. Since no standard sharing model for GPGPUs in the cloud has emerged, we do not focus on this problem [3,8,31,34]. In most existing settings, the GPGPU is shared among applications on the same physical node via I/O passthrough. ...
Conference Paper
Full-text available
General Purpose Graphics Processing Units (GPGPUs) are present in most modern computing platforms. They are also increasingly integrated as a computational resource on clusters, data centers, and cloud infrastructure, making them possible targets for attacks. We present a first study of covert channel attacks on GPGPUs. GPGPU attacks offer a number of attractive properties relative to CPU covert channels. These channels also have characteristics different from their counterparts on CPUs. To enable the attack, we first reverse engineer the hardware block scheduler as well as the warp to warp scheduler to characterize how co-location is established. We exploit this information to manipulate the scheduling algorithms to create co-residency between the trojan and the spy. We study contention on different resources including caches, functional units and memory, and construct operational covert channels on all these resources. We also investigate approaches to increase the bandwidth of the channel including: (1) using synchronization to reduce the communication cycle and increase robustness of the channel; (2) exploiting the available parallelism on the GPU to increase the bandwidth; and (3) exploiting the scheduling algorithms to create exclusive co-location to prevent interference from other possible applications. We demonstrate operational versions of all channels on three different Nvidia GPGPUs, obtaining error-free bandwidth of over 4 Mbps, making it the fastest known microarchitectural covert channel under realistic conditions.
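The contention-based signalling that underlies such channels can be illustrated with a deliberately simplified timing sketch. It is not the attack code from the paper; the buffer sizes, iteration counts, and thresholding step are assumptions, and a real channel needs the co-location and synchronization machinery the abstract describes.

```cpp
// Simplified illustration: the trojan encodes a bit by either generating
// memory traffic or staying idle; the spy times a fixed amount of its own
// memory traffic and decodes the bit by comparing against a calibrated
// threshold. Launch both kernels concurrently in separate streams.
#include <cuda_runtime.h>

__global__ void trojan(int bit, volatile int* buf, int iters) {
    if (!bit) return;                       // bit 0: stay quiet
    int idx = threadIdx.x;
    for (int i = 0; i < iters; ++i)         // bit 1: create memory contention
        buf[idx] = buf[idx] + 1;
}

// buf must hold at least 1024 + blockDim.x ints so spy and trojan use
// disjoint addresses while still sharing bandwidth.
__global__ void spy(volatile int* buf, int iters, long long* elapsed) {
    int idx = threadIdx.x;
    long long t0 = clock64();
    int sum = 0;
    for (int i = 0; i < iters; ++i)
        sum += buf[idx + 1024];             // traffic on the shared resource
    long long t1 = clock64();
    if (threadIdx.x == 0) {
        elapsed[0] = t1 - t0;               // host decodes: slow => bit 1
        elapsed[1] = sum;                   // keep 'sum' live
    }
}
```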
... System virtualization can be categorized into three major classes: full, para, and hardware-supported virtualization. Full virtualization completely emulates the CPU, memory, and I/O devices in order to provide a guest OS with an environment identical to the underlying hardware. Privileged instructions of a guest OS that modify the system state are trapped into the hypervisor by a binary translation technique that automatically inserts trapping operations in the binary code of the guest OS. [The excerpt interleaves a classification table from the survey: one group of works (Becchi et al. 2012; Duato et al. 2009; Duato et al. 2010a; Duato et al. 2010b; 2011; Giunta et al. 2010; Gupta et al. 2009; Hansen 2007; Humphreys et al. 2002; Jang et al. 2013; Kato et al. 2012; Kuzkin and Tormasov 2011; Laccetti et al. 2013; Lagar-Cavilla et al. 2007; Lama et al. 2013; Lee et al. 2016; Li et al. 2012; Liang and Chang 2011; Merritt et al. 2011; Montella et al. 2014; Montella et al. 2016a; Montella et al. 2016b; Niederauer et al. 2003; Oikawa et al. 2012; Peña et al. 2014; Pérez et al. 2016; Prades et al. 2016; Ravi et al. 2011; Reaño et al. 2012; Reaño et al. 2013; Reaño et al. 2015a; Reaño et al. 2015b; Rossbach et al. 2011; Sengupta et al. 2013; Sengupta et al. 2014; Shi et al. 2009; Shi et al. 2011; Shi et al. 2012; Tien and You 2014; Vinaya et al. 2012; Xiao et al. 2012; You et al. 2015; Zhang et al. 2016), a "Para & full virtualization" group (Dalton et al. 2009; Dong et al. 2015; Dowty and Sugerman 2009; Gottschlag et al. 2013; Huang et al. 2016; Shan et al. 2013; Song et al. 2014; Suzuki et al. 2014; Xue et al. 2016), and a "Hardware virtualization" group (Abramson et al. 2006; Amazon 2010; Expósito et al. 2013; Herrera 2014; Jo et al. 2013a; Jo et al. 2013b; Ou et al. 2012; Shainer et al. 2011; Shea and Liu 2013; Vu et al. 2014; Yang et al. 2012a; Yang et al. 2012b; Yeh et al. 2013); the first group's label is cut off in the excerpt.] ...
Article
Full-text available
The integration of graphics processing units (GPUs) on high-end compute nodes has established a new accelerator-based heterogeneous computing model, which now permeates high-performance computing. The same paradigm nevertheless has limited adoption in cloud computing or other large-scale distributed computing paradigms. Heterogeneous computing with GPUs can benefit the Cloud by reducing operational costs and improving resource and energy efficiency. However, such a paradigm shift would require effective methods for virtualizing GPUs, as well as other accelerators. In this survey article, we present an extensive and in-depth survey of GPU virtualization techniques and their scheduling methods. We review a wide range of virtualization techniques implemented at the GPU library, driver, and hardware levels. Furthermore, we review GPU scheduling methods that address performance and fairness issues between multiple virtual machines sharing GPUs. We believe that our survey delivers a perspective on the challenges and opportunities for virtualization of heterogeneous computing environments.