Article

VCUDA: GPU accelerated high performance computing in virtual machines

... VMGL [28] is a GPU virtualization technology for graphics processing that supports remote rendering to handle OpenGL-based tasks in virtual machines. The RPC-based GPU virtualization technologies used in this paper, such as vCUDA [29] and rCUDA [30], are API-forwarding approaches built on RPC communication: the user virtual machine uses a modified GPU API to request GPU operations from the VM or host system that owns the GPU. This forms a server-client structure in which the user VM sends the API, its parameters, and data to the GPU owner through internal RPC communication, and the GPU owner processes the requested task and returns the result to the user VM. ...
... vCUDA [29], rCUDA [30], and virtio-CL [26] are RPC-based GPU virtualization solutions for cloud environments. RPC-based GPU virtualization has a server-client structure composed of GPGPU task requesters and responders, and it redirects GPGPU commands to the actual GPU through a modified GPGPU API. ...
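As a rough illustration of this API-forwarding pattern, the sketch below shows a hypothetical guest-side stub that serializes a cudaMalloc request over a pre-established socket to the GPU-owning VM. The opcode values, wire format, and the remote_cudaMalloc name are illustrative assumptions, not the actual vCUDA or rCUDA protocol.

```cpp
// Hypothetical guest-side stub: the application links against this library
// instead of the real CUDA runtime, so a cudaMalloc() call is serialized and
// sent to the RPC server running in the GPU-owning VM.
#include <cstddef>
#include <cstdint>
#include <sys/socket.h>

enum class ApiId : uint32_t { CudaMalloc = 1, CudaFree = 2 };  // assumed opcodes

struct RpcHeader {
    ApiId    id;
    uint64_t payload_bytes;
};

// 'sock' is a connected socket to the GPU owner. The returned "device pointer"
// is only meaningful on the server side; the guest treats it as an opaque handle.
int remote_cudaMalloc(int sock, uint64_t* remote_dev_ptr, size_t bytes) {
    RpcHeader hdr{ApiId::CudaMalloc, sizeof(uint64_t)};
    uint64_t sz = bytes;
    send(sock, &hdr, sizeof(hdr), 0);
    send(sock, &sz, sizeof(sz), 0);

    // The server invokes the real cudaMalloc() and replies with (status, ptr).
    int32_t status = -1;
    recv(sock, &status, sizeof(status), MSG_WAITALL);
    recv(sock, remote_dev_ptr, sizeof(*remote_dev_ptr), MSG_WAITALL);
    return status;
}
```

A matching server-side loop would decode the header, call the real CUDA runtime, and send back the status and handle; kernel launches and memory copies would follow the same request/response pattern.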
Article
Full-text available
In remote procedure call (RPC)-based graphic processing unit (GPU) virtualization environments, GPU tasks requested by multiple user virtual machines (VMs) are delivered to the VM owning the GPU and are processed in a multi-process form. However, because the thread executing the computation on general GPUs cannot arbitrarily stop the task or trigger context switching, GPU monopoly may be prolonged owing to a long-running general-purpose computing on graphics processing unit (GPGPU) task. Furthermore, when scheduling tasks on the GPU, the time for which each user VM uses the GPU is not considered. Thus, in cloud environments that must provide fair use of computing resources, equal use of GPUs among user VMs cannot be guaranteed. We propose a GPGPU task scheduling scheme based on thread division processing that supports even GPU use by multiple VMs processing GPGPU tasks in an RPC-based GPU virtualization environment. Our method divides the threads of the GPGPU task into several groups and controls the execution time of each thread group to prevent a specific GPGPU task from monopolizing the GPU for a long time. The efficiency of the proposed technique is verified through an experiment in an environment where multiple VMs simultaneously perform GPGPU tasks.
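The thread-division idea can be pictured with a simple host-side sketch (an illustration, not the authors' implementation): the task's grid is launched in slices of thread blocks, and synchronizing after each slice bounds how long the task holds the GPU.

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Illustrative worker kernel: each thread handles one element, offset by the
// slice's starting block so that several smaller launches cover the full grid.
__global__ void work(float* data, int n, int block_offset) {
    int i = (blockIdx.x + block_offset) * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Launch the task as a sequence of grid slices ("thread groups"). Synchronizing
// after each slice bounds how long this task occupies the GPU, so a scheduler
// could interleave other VMs' work between slices.
void run_in_slices(float* d_data, int n, int blocks_per_slice) {
    const int threads = 256;
    const int total_blocks = (n + threads - 1) / threads;
    for (int off = 0; off < total_blocks; off += blocks_per_slice) {
        int blocks = std::min(blocks_per_slice, total_blocks - off);
        work<<<blocks, threads>>>(d_data, n, off);
        cudaDeviceSynchronize();   // end of this thread group's turn on the GPU
        // ...a fairness check or a short yield could be inserted here...
    }
}
```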
... GPUs are many-core architectures; using Nvidia's CUDA [3] platform, the raw data generated from gigabytes of well logs, coupled with parallel algorithms designed for GPU execution, aims to reduce the time and complexity needed to determine the hydrocarbon-bearing potential of a reservoir. Using the raw data is a viable alternative to using wireline logs or paper-based printouts of the generated waves. ...
... Resistivity is the formation's resistivity and is sampled by an induction-type resistivity tool [2]. The petrophysical evaluation of the log data with respect to the main properties, such as lithology, porosity, permeability, and water saturation, is essential for the evaluation of the reservoir formation [3]. Figure 1 shows a code snippet of the petrophysical properties under evaluation. ...
Conference Paper
Oil and gas companies keep exploring every new possible method to increase the likelihood of finding a commercial hydrocarbon-bearing prospect. Well logging generates gigabytes of data from various probes and sensors. After processing, a prospective reservoir will indicate areas of oil, gas, water, and reservoir rock. Incorporating High Performance Computing (HPC) methodologies allows thousands of potential wells to be assessed for their hydrocarbon-bearing potential. This study presents the use of Graphics Processing Unit (GPU) computing as another method of analyzing probable reserves. Raw well log data from the Kansas Geological Society (1999-2018) forms the basis of the data analysis. Parallel algorithms are developed that make use of Nvidia's Compute Unified Device Architecture (CUDA). The results highlight a 5 times speedup using an Nvidia GeForce GT 330M GPU compared to an Intel Core i7 740QM Central Processing Unit (CPU). The processed results display depth-wise areas of shale and rock formations as well as water, oil, and/or gas reserves.
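To give a flavor of the kind of per-sample petrophysical calculation that maps well onto CUDA, the sketch below applies Archie's water-saturation equation to each depth sample in parallel; the array names and constants are illustrative and not taken from the paper.

```cpp
#include <cuda_runtime.h>

// Illustrative per-depth-sample kernel: Archie's water-saturation equation,
//   Sw = ((a * Rw) / (phi^m * Rt))^(1/n),
// with the commonly used constants a = 1, m = 2, n = 2. Inputs are porosity
// (phi) and deep resistivity (Rt) per sample; Rw is the formation-water
// resistivity.
__global__ void water_saturation(const float* phi, const float* rt,
                                 float rw, float* sw, int n_samples) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_samples) return;
    const float a = 1.0f, m = 2.0f, n = 2.0f;
    float denom = powf(phi[i], m) * rt[i];
    sw[i] = (denom > 0.0f) ? powf((a * rw) / denom, 1.0f / n) : 1.0f;
}
```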
... Given the important role of CUDA in heterogeneous computing, many researchers are also committed to virtualizing GPUs based on CUDA and extending it to distributed systems. Examples include vCUDA [17], rCUDA [3], and the virtualized GPU computing platform [6] proposed by Yang Jingwei et al. The implementation approach is similar to that of dOpenCL, where the client side takes over the CUDA API calls and communicates with the server-side proxy. ...
Article
Full-text available
Heterogeneous computing has been developing continuously in the field of high-performance computing because of its high performance and energy efficiency. More and more accelerators have emerged, such as GPU, FPGA, DSP, AI accelerator, and so on. Usually, the accelerator is connected to the host CPU as a peripheral device to form a tightly coupled heterogeneous computing node, and then, a parallel system is constructed by multiple nodes. This organization is computationally efficient, but not flexible. When new accelerators appear, it is difficult to join the system that has been built. At the hardware level, we create an array of accelerators and connect them to the existing system through a high-speed network. At the software level, we dynamically organize computing resources from various arrays to build a virtual heterogeneous computing node. This approach also includes a standard programming environment. Therefore, it is a more flexible, elastic, and scalable heterogeneous computing organization. In this paper, a supernode OpenCL implementation is proposed for hybrid parallel computing systems, in which virtual supernodes can be dynamically constructed between different computing arrays, and a standard OpenCL environment is implemented based on RDMA communication of high-speed interconnection, which can be combined with the system-level MPI programming environment, thereby realizing the large-scale parallel computing of the hybrid array. SNCL is compatible with existing MPI/OpenCL programs without the need for additional modifications. Experiments show that the runtime overhead of the supernode OpenCL environment is very low, and it is suitable for deploying applications with high computing density and large data scale between different arrays to utilize their computing power without affecting scalability.
... vCUDA [14] is a GPGPU virtualization solution that enables applications running in virtual machine instances to take advantage of hardware acceleration by multiplexing the GPGPU. vCUDA uses API call interception and redirection, leveraging VMRPC, a specialized remote procedure call system for VMs. ...
... Below we detail the scheduling solutions designed to target these features. [127,168]. GPU sharing across different inference requests can significantly improve the resource utilization by better leveraging GPUs' parallel compute capability. However, it can also incur the risks of violating the latency requirements due to the uncertain running time. ...
Preprint
Full-text available
Deep learning (DL) shows its prosperity in a wide variety of fields. The development of a DL model is a time-consuming and resource-intensive procedure. Hence, dedicated GPU accelerators have been collectively constructed into a GPU datacenter. An efficient scheduler design for such GPU datacenter is crucially important to reduce the operational cost and improve resource utilization. However, traditional approaches designed for big data or high performance computing workloads can not support DL workloads to fully utilize the GPU resources. Recently, substantial schedulers are proposed to tailor for DL workloads in GPU datacenters. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features. Finally, we prospect several promising future research directions. More detailed summary with the surveyed paper and code links can be found at our project website: https://github.com/S-Lab-SystemGroup/Awesome-DL-Scheduling-Papers
... With several advancements in the electronic hardware industry, High Performance Computing (HPC) applications utilize HPC servers. These servers use several conventional processors and graphical processing units (GPUs) [1]- [3]. These advancements play a significant role in increasing the computational power which is very essential for many critical HPC applications. ...
Article
Full-text available
Molecular dynamics (MD) simulations involve computations of forces between atoms and the total energy of the chemical systems. The scientific community depends on high-end servers for such computations, which are generally sequential and highly power hungry, thereby restricting these computations from reaching experimentally relevant large systems. This work explores the concept of parallelizing the code and accelerating it using high-level synthesis (HLS) based Field Programmable Gate Arrays (FPGAs). This work proposes a hardware and software interface to implement parallel algorithms in an FPGA framework, and communication between the software and hardware interface is implemented. The forces of Au147 obtained through the ANN-based interatomic potentials in the proposed model show an acceleration of 1.5 times compared with an expensive server with several nodes. Taking this work forward can result in a lab-on-a-chip application, and this could potentially be applied to several large, experimentally relevant chemical systems.
... A high-end NVIDIA GPU device could cost as much as 2 to 5× that of a high-end Intel Xeon CPU, and, in data centers, a GPU VM (virtual machine) instance could be 10× more expensive than a regular one. These practical observations demonstrate the necessity of efficient mechanisms to share GPUs among different workloads [2][3][4][5][6][7][8], thereby increasing utilization, saving on energy consumption, and improving the cost-efficiency as well as throughput for these systems. ...
Preprint
Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and challenging problem. Although services like MPS support simultaneous execution of multiple co-operative kernels on a single device, they do not solve the above problem for uncooperative kernels, MPS being oblivious to the resource needs of each kernel. We propose a fully automated compiler-assisted scheduling framework. The compiler constructs GPU tasks by identifying kernel launches and their related GPU operations (e.g. memory allocations). For each GPU task, a probe is instrumented in the host-side code right before its launch point. At runtime, the probe conveys the information about the task's resource requirements (e.g. memory and compute cores) to a scheduler, such that the scheduler can place the task on an appropriate device based on the task's resource requirements and devices' load in a memory-safe, resource-aware manner. To demonstrate its advantages, we prototyped a throughput-oriented scheduler based on the framework, and evaluated it with the Rodinia benchmark suite and the Darknet neural network framework on NVIDIA GPUs. The results show that the proposed solution outperforms existing state-of-the-art solutions by leveraging its knowledge about applications' multiple resource requirements, which include memory as well as SMs. It improves throughput by up to 2.5x for Rodinia benchmarks, and up to 2.7x for Darknet neural networks. In addition, it improves job turnaround time by up to 4.9x, and limits individual kernel performance degradation to at most 2.5%.
... Shi et al. present a similar solution for GPU acceleration, termed vCUDA [63]. vCUDA encapsulates runtime APIs into RPC calls to achieve API interception and redirection. ...
Preprint
Full-text available
The Internet is responsible for accelerating growth in several fields such as digital media, healthcare, and the military. Furthermore, the Internet was founded on the principle of allowing clients to communicate with servers. Serverless computing, however, is one field that tries to break free from this paradigm. Event-driven compute services allow users to build more agile applications using capacity provisioning and a pay-for-value billing model. This paper provides a formal account of the research contributions in the field of serverless computing.
... vCUDA uses runtime API interception and redirection to provide GPU access to virtual machines [16]. Similar to the previous tools, vCUDA redirects API calls of CUDA applications in the virtual machine to a server process running on the host, which in turn forwards them to the CUDA driver. ...
Article
Full-text available
In high-performance computing and cloud computing, the introduction of heterogeneous computing resources such as GPU accelerators has led to a dramatic increase in performance and efficiency. While the benefits of virtualization features in these environments are well researched, GPUs do not offer virtualization support that enables fine-grained control, increased flexibility, and fault tolerance. In this article, we present Cricket: a transparent and low-overhead solution to GPU virtualization that, due to its open-source nature, enables future research into other virtualization techniques. Cricket supports remote execution and checkpoint/restart of CUDA applications. Both features enable GPU tasks to be distributed dynamically and flexibly across computing nodes and GPU resources to be used by multiple tenants, thereby improving flexibility and utilization for high-performance and cloud computing.
... As a prevalent high-performance architecture, GPU has been widely used in various areas for different purposes such as accelerating big-data processing [10][11][12][13] and assisting operating systems [14][15][16][17][18][19] as a buffer cache. Fig. 3 shows a typical NVIDIA CUDA GPU architecture composed of a set of streaming multiprocessors (SMs) and a GPU main memory (shared among all SMs). ...
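The per-device properties mentioned above, the SM count and the size of the shared global memory, can be queried directly with the CUDA runtime; the short device-query sketch below prints them for each installed GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print the number of streaming multiprocessors (SMs) and the size of the
// GPU main memory shared by all SMs, for every device in the system.
int main() {
    int dev_count = 0;
    cudaGetDeviceCount(&dev_count);
    for (int d = 0; d < dev_count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, %d SMs, %.1f GiB global memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```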
Article
A* search is a best-first search algorithm that is widely used in pathfinding and graph traversal. To meet the ever-increasing demand of performance, various high-performance architectures (e.g., multi-core CPU and GPU) have been explored to accelerate the A* search. However, the current GPU based A* search approaches are merely designed based on single-GPU architecture. Nowadays, the amount of data grows at an exponential rate, making it inefficient or even infeasible for the current A* to process the data sets entirely on a single GPU. In this paper, we propose DA*, a parallel A* search algorithm based on the multi-GPU architecture. DA* enables the efficient acceleration of the A* algorithm using multiple GPUs with effective graph partitioning and data communication strategies. To make the most of the parallelism of multi-GPU architecture, in the state extension phase, we adopt the method of multiple priority queues for the open list, which allows multiple states being calculated in parallel. In addition, we use the parallel hashing of replacement and frontier search mechanism to address node duplication detection and memory bottlenecks respectively. The evaluation shows that DA* is effective and efficient in accelerating A* based computational tasks on the multi-GPU system. Compared to the state-of-the-art A* search algorithm based on a single GPU, our algorithm can achieve up to 3× performance speedup with four GPUs.
... vCUDA [17] is another CUDA-based accelerator virtualization solution for HPC clusters. This solution achieves accelerator virtualization through RPC rather than a middleware layer. ...
Preprint
Full-text available
Edge computing offers the distinct advantage of harnessing compute capabilities on resources located at the edge of the network to run workloads of relatively weak user devices. This is achieved by offloading computationally intensive workloads, such as deep learning from user devices to the edge. Using the edge reduces the overall communication latency of applications as workloads can be processed closer to where data is generated on user devices rather than sending them to geographically distant clouds. Specialised hardware accelerators, such as Graphics Processing Units (GPUs) available in the cloud-edge network can enhance the performance of computationally intensive workloads that are offloaded from devices on to the edge. The underlying approach required to facilitate this is virtualization of GPUs. This paper therefore sets out to investigate the potential of GPU accelerator virtualization to improve the performance of deep learning workloads in a cloud-edge environment. The AVEC accelerator virtualization framework is proposed that incurs minimum overheads and requires no source-code modification of the workload. AVEC intercepts local calls to a GPU on a device and forwards them to an edge resource seamlessly. The feasibility of AVEC is demonstrated on a real-world application, namely OpenPose using the Caffe deep learning library. It is observed that on a lab-based experimental test-bed AVEC delivers up to 7.48x speedup despite communication overheads incurred due to data transfers.
... For example, GPUs can achieve significant speedup for graph processing applications [1], and GPU-assisted network traffic processing in software routers outperforms the multicore-based counterparts [2]. In addition, GPUs can also be used to accelerate big-data processing [3], [4], [5], [6], and assist operating systems [7], [8], [9], [10], [11], [12] as a buffer cache. Rich thread-level parallelism in GPU has motivated co-running GPU kernels on a single GPU. ...
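A minimal example of co-running kernels on one GPU uses CUDA streams: two independent kernels launched into separate streams may be scheduled concurrently when enough SM resources are free. The kernels below are illustrative.

```cpp
#include <cuda_runtime.h>

// Two independent kernels; when each one leaves SMs idle, launching them into
// separate streams lets the hardware overlap their execution on a single GPU.
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

__global__ void offset(float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

void co_run(float* d_x, float* d_y, int n) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    const int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads, 0, s1>>>(d_x, n);    // may run concurrently with...
    offset<<<blocks, threads, 0, s2>>>(d_y, n);   // ...this kernel
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}
```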
Article
Full-text available
Rich thread-level parallelism of GPU has motivated co-running GPU kernels on a single GPU. However, when GPU kernels co-run, it is possible that one kernel can leverage buffer overflow to attack another kernel running on the same GPU. Existing work has either large performance overhead or limited capability in detecting buffer overflow. In this paper, we introduce GMODx, a runtime software system that can detect GPU buffer overflow. GMODx performs always-on monitoring on allocated memory based on canary-based design. First, for fine-grained memory management, GMODx introduces a set of byte arrays to store buffer information for overflow detection. Techniques, including lock-free accesses to the byte arrays, delayed memory free, efficient memory reallocation, and garbage collection for the byte arrays, are proposed to achieve high performance. Second, for coarse-grained memory management, GMODx utilizes unified memory to delegate the always-on monitoring to the CPU. To reduce performance overhead, we propose several techniques, including customized list data structure and specific optimizations against the unified memory. Our experiments show that GMODx is capable of detecting buffer overflow for the fine-grained memory management without performance loss, and that it incurs small runtime overhead (4.2% on average and up to 9.7%) for the coarse-grained memory management.
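The canary-based design can be illustrated with a much simpler sketch than GMODx itself: a guard word placed just past an allocation is checked later, and a changed value signals an overflow. The function names and canary value below are assumptions for illustration, not the GMODx implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// A guard word is placed just past the user's buffer at allocation time; if a
// kernel writes past the end, the canary changes and the host detects it.
constexpr uint32_t kCanary = 0xDEADBEEF;

// Allocate n_floats elements plus one trailing canary word on the device.
float* guarded_alloc(size_t n_floats) {
    float* p = nullptr;
    cudaMalloc(&p, n_floats * sizeof(float) + sizeof(uint32_t));
    cudaMemcpy(p + n_floats, &kCanary, sizeof(kCanary), cudaMemcpyHostToDevice);
    return p;
}

// Read the canary back; a mismatch means something wrote beyond the buffer.
bool overflow_detected(const float* p, size_t n_floats) {
    uint32_t value = 0;
    cudaMemcpy(&value, p + n_floats, sizeof(value), cudaMemcpyDeviceToHost);
    return value != kCanary;
}
```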
... In MPS mode, kernels from different applications are not isolated, so it has potential security issues, e.g., side-channel attacks on GPUs [28]. API remoting is another approach to sharing GPUs among remote clients [12,29], but clients are limited by the APIs provided by the frameworks. ...
Preprint
Full-text available
Sharing GPUs in the cloud is cost effective and can facilitate the adoption of hardware accelerator enabled cloud. But sharing causes interference between co-located VMs and leads to performance degradation. In this paper, we proposed an interference-aware VM scheduler at the cluster level with the goal of minimizing interference. NVIDIA vGPU provides sharing capability and high performance, but it has unique performance characteristics, which have not been studied thoroughly before. Our study reveals several key observations. We leverage our observations to construct models based on machine learning techniques to predict interference between co-located VMs on the same GPU. We proposed a system architecture leveraging our models to schedule VMs to minimize the interference. The experiments show that our observations improves the model accuracy (by 15%˜40%) and the scheduler reduces application run-time overhead by 24.2% in simulated scenarios.
... For the analysis, we developed our own malware program and focused mainly on executables that open Internet connections (Le et al. 2008; Shi et al. 2011). All malware samples were run on the host, focusing on unknown executables that attempt Internet-based malicious activity; numerous API calls related to unwanted Internet connections were found inside the malware. ...
Article
Full-text available
Advanced cyber-attacks emphasize stealth and persistence: the longer they stay under the radar, the more they are able to move laterally, exfiltrate information, and cause harm. Attackers increasingly turn to cross-process injection to keep a strategic distance from identification. Cross-process injection helps attackers execute malicious code that takes on the appearance of a legitimate process. Code injection does not require attackers to use specific processes that can be quickly identified; instead, they incorporate malicious code into normal processes (e.g., explorer.exe, regsvr32.exe, svchost.exe…), allowing their operations a wider degree of secrecy and apparent innocence. For the purpose of detecting malware injection, this paper proposes a hypervisor injection attack detection method based on X-cross application programming interface calls (API-HI-attack), which raises awareness that malware can inject into the simulation tool with X-cross-language API calls. The experimental results of the proposed work show that antimalware protectors need to pay more attention to API call hooking for process-level injection by X-cross languages. The proposed method achieves a higher true positive rate (92.96%) and a lower false positive rate (0.07%) than existing methods.
... Similar to many other GPU sharing platforms, like vCUDA [28], rCUDA [13], GViM [17], and Pegasus [18], DCUDA adopts a frontend-backend architecture to ease the implementation while providing full compatibility with all CUDA applications. As shown in Fig. 3, the frontend of DCUDA is implemented as a CUDA wrapper library, which dynamically links to user applications and intercepts CUDA API calls. ...
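The wrapper-library interception that such frontends rely on can be sketched as follows: a preloaded shared library exports a CUDA runtime symbol, logs (or, in a real frontend, serializes) the call, and forwards it to the real runtime via dlsym(RTLD_NEXT, ...). This is a generic interposition sketch, not DCUDA's code.

```cpp
// Generic interposition sketch: build as a shared library and preload it
// (e.g. LD_PRELOAD=./libwrap.so ./app) so this definition of cudaLaunchKernel
// is found before the real CUDA runtime's.
#include <cstdio>
#include <dlfcn.h>
#include <cuda_runtime.h>

using LaunchFn = cudaError_t (*)(const void*, dim3, dim3, void**, size_t, cudaStream_t);

extern "C" cudaError_t cudaLaunchKernel(const void* func, dim3 grid, dim3 block,
                                        void** args, size_t shm, cudaStream_t stream) {
    // Look up the real implementation the first time we are called.
    static LaunchFn real = (LaunchFn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
    fprintf(stderr, "[wrapper] launch: grid=(%u,%u,%u) block=(%u,%u,%u)\n",
            grid.x, grid.y, grid.z, block.x, block.y, block.z);
    // A virtualization frontend would serialize the call and send it to its
    // backend here; this sketch simply forwards it to the real runtime.
    return real(func, grid, block, args, shm, stream);
}
```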
Conference Paper
In clouds and data centers, GPU servers consisting of multiple GPUs are widely deployed. Current state-of-the-art GPU scheduling algorithms are "static" in assigning applications to different GPUs. These algorithms usually ignore the dynamics of GPU utilization and are often inaccurate in estimating resource demand before assigning/running applications, so there is a large opportunity to further balance load and improve GPU utilization. Based on CUDA (Compute Unified Device Architecture), we develop a runtime system called DCUDA which supports "dynamic" scheduling of running applications between multiple GPUs. In particular, DCUDA provides a real-time and lightweight method to accurately monitor the resource demand of applications and GPU utilization. Furthermore, it provides a universal migration facility to migrate running applications between GPUs with negligible overhead. More importantly, DCUDA transparently supports all CUDA applications without changing their source code. Experiments with our prototype system show that DCUDA can reduce the overloaded time of GPUs by 78.3% on average. As a result, for different workloads consisting of a wide range of applications we studied, DCUDA can reduce the average execution time of applications by up to 42.1%. Furthermore, DCUDA also reduces energy by 13.3% in the light-load scenario.
... As HPC or DL tasks usually involve GPU resources in their solution, many attempts have been made to introduce virtualized GPUs into the virtualization environment. For example, approaches such as GViM [8], gVirtuS [9], and vCUDA [10] are based on creating copies of the CUDA API to virtualize the GPU and provide it to the virtualization environment [5], while the rCUDA solution [11] proposes remote GPU usage. However, the above approaches degrade GPU performance during the virtualization process [12]. ...
Article
Full-text available
Container-based virtualization is an innovative technology that accelerates software development by providing portability and maintainability of applications. Recently, a growing number of workloads such as high performance computing (HPC) and deep learning (DL) are deployed in container-based environments. However, GPU resource management issues, especially the GPU memory over-subscription issue in container-based clusters, which brings substantial performance loss, are still challenging. This paper proposes an adaptive fair-share method to share GPU resources effectively in a container-based virtualization environment, as well as an execution rescheduling method to manage the execution order of each container for acquiring maximum performance gain. We also propose a checkpoint-based mechanism, especially for DL workloads running with TensorFlow, which can efficiently solve the GPU memory over-subscription problem. We demonstrate that our approach contributes to overall performance improvement as well as higher resource utilization compared to default and static fair-share methods with homogeneous and heterogeneous workloads. Compared to these two conditions, the results show that the proposed method reduces average execution time by 16.37% and 15.61% and boosts average GPU memory utilization by approximately 52.46% and 10.3%, respectively. We also evaluated our checkpoint-based mechanism by running multiple CNN workloads with TensorFlow at the same time, and the results show that our proposed mechanism can ensure each workload executes safely without out-of-memory (OOM) errors.
... Instead of the standard approach of using the central processing unit (CPU) for computation, the tendency is to use graphics processing units (GPUs) for this purpose. The reason is that modern GPUs contain a number of stream processors that offer massive computational power [3,4]. This approach requires resolving problems such as transferring the data necessary for the computation to the GPU [5]. ...
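The data-transfer step referred to above is, in its simplest form, an explicit copy to and from GPU memory; the sketch below shows the basic cudaMalloc/cudaMemcpy pattern, omitting error handling and refinements such as pinned memory or asynchronous copies.

```cpp
#include <vector>
#include <cuda_runtime.h>

// Basic offload pattern: copy input over the bus into GPU memory, run kernels
// there, and copy the result back.
void offload(const std::vector<float>& host_in, std::vector<float>& host_out) {
    const size_t bytes = host_in.size() * sizeof(float);
    float* d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);
    cudaMemcpy(d_buf, host_in.data(), bytes, cudaMemcpyHostToDevice);
    // ... launch kernels that operate on d_buf ...
    host_out.resize(host_in.size());
    cudaMemcpy(host_out.data(), d_buf, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_buf);
}
```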
Article
Full-text available
The finite element method (FEM) has deservedly gained the reputation of the most powerful, highly efficient, and versatile numerical method in the field of structural analysis. Though typical application of FE programs implies the so-called “off-line” computations, the rapid pace of hardware development over the past couple of decades was the major impetus for numerous researchers to consider the possibility of real-time simulation based on FE models. Limitations of available hardware components in various phases of developments demanded remarkable innovativeness in the quest for suitable solutions to the challenge. Different approaches have been proposed depending on the demands of the specific field of application. Though it is still a relatively young field of work in global terms, an immense amount of work has already been done calling for a representative survey. This paper aims to provide such a survey, which of course cannot be exhaustive.
Article
GPUs are becoming attractive in multiple academic and industrial areas because of their massively parallel computing ability. However, there are still obstacles that GPU virtualization technologies must overcome to reach maturity. These obstacles mainly concern the resource allocation strategy needed to guarantee the highest possible yield. This shortcoming has already become an obvious barrier to practical GPU usage in the cloud for satisfying business and academic requirements. There is a large body of mature research on oversubscription in cloud computing to enhance economic efficiency, but the study of GPU oversubscription is almost blank because the use of GPUs in cloud computing has only just begun. This paper introduces gOver, an economy-oriented GPU resource oversubscription system based on a GPU virtualization platform. gOver is able to share and modulate GPU resources among workloads in an adaptive and dynamic manner while guaranteeing the QoS level. We evaluate the proposed gOver strategy with experiments designed around specific workload characteristics. The experimental results show that our dynamic GPU oversubscription solution improves economic efficiency by 20% over the traditional GPU sharing strategy and outperforms the static oversubscription method with much better stability in QoS control.
Article
GPGPU-powered supercomputers are vital for various science and engineering applications. On each cluster node, the GPU works as a coprocessor of the CPU, and the computing task runs alternately on CPU and GPU. Due to this characteristic, traditional task scheduling strategies tend to result in significant workload imbalance and underutilization of GPUs. We design an adaptive scheduling strategy to alleviate such imbalance and underutilization. Our strategy logically treats all GPUs in the cluster as a whole. Every cluster node maintains a local information table of all GPUs. Once a GPU call request is received, a node selects a GPU to run the task in an adaptive manner based on this table. In addition, our strategy does not rely on a global queue and thus avoids excessive inter-node communication overhead. Moreover, we encapsulate our strategy into an intermediate module between the cluster and users. Consequently, the underlying details of task scheduling are transparent to users, which enhances usability. We validate our strategy through experiments.
Article
Graphics processing unit (GPU) virtualization technology enables a single GPU to be shared among multiple virtual machines (VMs), thereby allowing multiple VMs to perform GPU operations simultaneously with a single GPU. Because GPUs exhibit lower resource scalability than central processing units (CPUs), memory, and storage, many VMs encounter resource shortages while running GPU operations concurrently, implying that the VM performing the GPU operation must wait to use the GPU. In this paper, we propose a partial migration technique for general‐purpose graphics processing unit (GPGPU) tasks to prevent the GPU resource shortage in a remote procedure call‐based GPU virtualization environment. The proposed method allows a GPGPU task to be migrated to another physical server's GPU based on the available resources of the target's GPU device, thereby reducing the wait time of the VM to use the GPU. With this approach, we prevent resource shortages and minimize performance degradation for GPGPU operations running on multiple VMs. Our proposed method can prevent GPU memory shortage, improve GPGPU task performance by up to 14%, and improve GPU computational performance by up to 82%. In addition, experiments show that the migration of GPGPU tasks minimizes the impact on other VMs.
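A scheme like this needs some notion of a target GPU's available resources before placing or migrating a task onto it. The sketch below is not the authors' migration mechanism; it only illustrates checking free device memory with cudaMemGetInfo to pick a GPU with enough headroom.

```cpp
#include <cuda_runtime.h>

// Pick a local GPU whose free memory can hold the task's working set; if none
// fits, return -1 and let the caller fall back to migrating the task elsewhere.
int pick_gpu_with_room(size_t required_bytes) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaSetDevice(d);        // cudaMemGetInfo reports on the current device
        size_t free_bytes = 0, total_bytes = 0;
        cudaMemGetInfo(&free_bytes, &total_bytes);
        if (free_bytes >= required_bytes) return d;
    }
    return -1;   // no local GPU has enough headroom
}
```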
Article
Traditionally, High-Performance Computing (HPC) has been associated with large power requirements. The reason was that makers of the processors typically employed in HPC deployments have always focused on getting the highest performance from their designs, regardless of the energy their processors may consume. Actually, for many years heat dissipation was the only real barrier to achieving higher performance, at the cost of higher energy consumption. However, a new trend has recently appeared consisting of the use of low-power processors for HPC purposes. The MontBlanc and Isambard projects are good examples of this trend. These proposals, however, do not consider the use of GPUs. In this paper we propose to use GPUs in this kind of low-power-processor-based HPC deployment by making use of the remote GPU virtualization mechanism. To that end, we leverage the rCUDA middleware in a hybrid cluster composed of low-power Atom-based nodes and regular Xeon-based nodes equipped with GPUs. Our experiments show that, by making use of rCUDA, the execution time of applications belonging to the physics domain is noticeably reduced, achieving a speedup of up to 140x with just one remote NVIDIA V100 GPU with respect to the execution of the same applications using 8 Atom-based nodes. Additionally, a rough energy consumption estimation reports improvements in energy demands of up to 37x.
Article
Currently, the virtualization technologies for cloud computing infrastructures that support extra devices such as GPUs still require development and refinement. This requirement is especially evident in the direction of resource sharing and allocation under performance constraints such as quality of service (QoS) guarantees, considering the closed GPU platform. This deficiency significantly limits the applicability of cloud platforms that aim to support the efficient and fluent execution of business and academic workloads. This paper introduces gQoS, an adaptive virtualized GPU resource capacity sharing system under a QoS target, which is able to share and allocate virtualized GPU resources among workloads adaptively, guaranteeing the QoS level with stability and accuracy. We evaluate the workloads and compare our gQoS strategy with other allocation strategies. The experiments show that our strategy guarantees much better accuracy and stability in QoS control, and that total GPU resource usage under gQoS can be reduced by up to 25.85% compared with other strategies.
Article
This paper introduces gMig, an open-source and practical vGPU live migration solution for full virtualization. Taking advantage of the dirty pattern of GPU workloads, gMig presents the One-Shot Pre-Copy mechanism combined with the hashing-based Software Dirty Page technique to achieve efficient vGPU live migration. In particular, we propose three core techniques for gMig: 1) Dynamic Graphics Address Remapping, which parses and manipulates GPU commands to adjust the address mapping to adapt to a different environment after migration; 2) Software Dirty Page, which uses a hashing-based approach with sampling pre-filtering to detect page modification, overcomes the commodity GPU's hardware limitation, and speeds up the migration by sending only the dirtied pages; 3) Overlapped Migration Process, which significantly compresses the hanging overhead by overlapping dirty page verification and transmission. Our evaluation shows that gMig achieves GPU live migration with an average downtime of 302 ms on Windows and 119 ms on Linux. With the help of Software Dirty Page, the number of GPU pages transferred during the downtime is effectively reduced by up to 80.0%. The design of the sampling filter and overlapped processing brings further improvements of 30.0% and 10.0% in page processing, respectively.
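The hashing-based Software Dirty Page idea can be illustrated independently of gMig: hash each page at the pre-copy snapshot, re-hash later, and resend only the pages whose hash changed. The page size, FNV-1a hash, and function names below are illustrative choices, not gMig's implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kPageSize = 4096;

// FNV-1a, used here only for brevity; any fast hash with few collisions works.
uint64_t page_hash(const uint8_t* p, size_t n) {
    uint64_t h = 1469598103934665603ull;
    for (size_t i = 0; i < n; ++i) { h ^= p[i]; h *= 1099511628211ull; }
    return h;
}

// Compare current page hashes against the previous snapshot, return the indices
// of pages that changed, and refresh the stored hashes for the next round.
std::vector<size_t> dirty_pages(const uint8_t* mem, size_t bytes,
                                std::vector<uint64_t>& prev) {
    const size_t pages = bytes / kPageSize;
    prev.resize(pages, 0);
    std::vector<size_t> dirty;
    for (size_t i = 0; i < pages; ++i) {
        uint64_t h = page_hash(mem + i * kPageSize, kPageSize);
        if (h != prev[i]) { dirty.push_back(i); prev[i] = h; }
    }
    return dirty;
}
```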
Article
The multiple sensors and touch capabilities of mobile devices are defining new methods of computer interaction. However, the computing power of such devices is not currently sufficient for new applications that require compute-intensive processing. Using graphics processing units (GPUs) for general-purpose computing with GPU programming models such as the Compute Unified Device Architecture (CUDA) has been proven to accelerate simulations in supercomputers. Although CUDA-capable chips such as the Tegra K1, which have been released in tablets, can accelerate computer simulations, their absolute computing power and performance per watt are not comparable with those of ordinary GPUs. In this paper, we analyze a heterogeneous system composed of a tablet (client) and a notebook with a low-power GPU (server). Intensive computations on the tablet device are offloaded to the notebook GPU using the rCUDA middleware. Molecular dynamics (MD) simulations are performed using our test system, and the computing speed and performance per watt are reported. Implementing dynamic parallelism (DP) reduced the latency, doubling the total frames per second in some cases. Our system achieves better computational performance and higher performance per watt than a tablet powered by a CUDA-capable GPU. We achieved 21.7 Gflops/W by combining multiple client tablets and the server, compared with 21.3 Gflops/W from the server itself.
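Dynamic parallelism, as mentioned above, lets a kernel launch further kernels without returning to the host. The minimal sketch below is illustrative rather than the paper's MD code; it requires a device of compute capability 3.5 or higher and compilation with relocatable device code (nvcc -rdc=true).

```cpp
#include <cuda_runtime.h>

// Child kernel: one simple update per element.
__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Parent kernel: launches the child grid from the device itself, avoiding a
// round trip to the host between the two phases.
__global__ void parent(float* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        const int threads = 256;
        const int blocks = (n + threads - 1) / threads;
        child<<<blocks, threads>>>(data, n);
    }
}
```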
Article
The use of Graphics Processing Units (GPUs) has become a very popular way to accelerate the execution of many applications. However, GPUs are not exempt from side effects. For instance, GPUs are expensive devices that additionally consume a non-negligible amount of energy even when they are not performing any computation. Furthermore, most applications present low GPU utilization. To address these concerns, the use of GPU virtualization has been proposed. In particular, remote GPU virtualization is a promising technology that allows applications to transparently leverage GPUs installed in any node of the cluster. In this paper, the remote GPU virtualization mechanism is comparatively analyzed across three different generations of GPUs. The first contribution of this study is an analysis of how the performance of the remote GPU virtualization technique is impacted by the underlying hardware. To that end, the Tesla K20, Tesla K40, and Tesla P100 GPUs along with FDR and EDR InfiniBand fabrics are used in the study. The analysis is performed in the context of the rCUDA middleware. It is clearly shown that the GPU virtualization middleware requires a comprehensive design of its communication layer, which should be perfectly adapted to every hardware generation in order to avoid a reduction in performance. This is precisely the second contribution of this work, i.e., redesigning the rCUDA communication layer in order to improve the management of the underlying hardware. Results show that it is possible to improve bandwidth by up to 29.43%, which translates into up to 4.81% lower average execution time for the analyzed applications.