A Virtual Memory Based Runtime to Support Multi-tenancy in Clusters with GPUs

Michela Becchi¹, Kittisak Sajjapongse¹, Ian Graves¹, Adam Procter¹, Vignesh Ravi², Srimat Chakradhar³
¹University of Missouri, ²Ohio State University, ³NEC Laboratories America
becchim@missouri.edu, ks5z9@mail.mizzou.edu, ilggdd@mail.mizzou.edu, proctera@missouri.edu, raviv@cse.ohio-state.edu, chak@nec-labs.com
ABSTRACT
Graphics Processing Units (GPUs) are increasingly becoming part
of HPC clusters. Nevertheless, cloud computing services and
resource management frameworks targeting heterogeneous
clusters including GPUs are still in their infancy. Further, GPU
software stacks (e.g., the CUDA driver and runtime) currently
provide very limited support for concurrency.
In this paper, we propose a runtime system that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. A central component of our runtime is a
memory manager that provides a virtual memory abstraction to
the applications. Our runtime is flexible in terms of scheduling
policies, and allows dynamic (as opposed to programmer-defined)
binding of applications to GPUs. In addition, our framework
supports dynamic load balancing, dynamic upgrade and
downgrade of GPUs, and is resilient to their failures. Our runtime
can be deployed in combination with VM-based cloud computing
services to allow virtualization of heterogeneous clusters, or in
combination with HPC cluster resource managers to form an
integrated resource management infrastructure for heterogeneous
clusters. Experiments conducted on a three-node cluster show that
our GPU sharing scheme allows up to a 28% and a 50%
performance improvement over serialized execution on short- and
long-running jobs, respectively. Further, dynamic inter-node load
balancing leads to an additional 18-20% performance benefit.
Categories and Subject Descriptors
C.1.4.1 [Computer Systems Organization]: Processor
Architectures - Parallel Architectures, Distributed Architectures.
General Terms
Performance, Design, Experimentation.
Keywords
Cluster computing, runtime systems, virtualization, GPU, CUDA.
1. INTRODUCTION
Many-core processors are increasingly becoming part of high
performance computing (HPC) clusters. Within the last two to
three years GPUs have emerged as a means to achieve extreme-
scale, cost-effective, and power-efficient high performance
computing. The peak single-precision performance of the latest
GPU from NVIDIA – the Tesla C2050/C2070/C2075 card - is
more than 1 Teraflop, resulting in a price to performance ratio of
$2-3 per Gigaflop. Meanwhile, Intel has announced the upcoming
release of the Many Integrated Core processor (Intel MIC), with
peak performance of 1.2 Teraflops. Early benchmarking results
on molecular dynamics and linear algebra applications have been
demonstrated at the International Supercomputing Conference,
Hamburg, Germany, in June 2011.
Today some of the fastest supercomputers are based on
NVIDIA GPUs, including three of the top five fastest
supercomputers in the world. For example, Tianhe-1A, the second
fastest system, is equipped with 7,168 NVIDIA Fermi GPUs and
14,336 CPUs. Almost 80% of the HPC clusters in the top-500 list
are currently powered with Intel multi-core processors. The next
challenge for Intel will be to successfully position its MIC
processor in the many-core market.
Given the availability of these heterogeneous computing
infrastructures, it is essential to make efficient use of them. One
classical way to schedule batch jobs on HPC clusters is via PBS
cluster resource managers such as TORQUE [25]. Another
practical way to manage large clusters intended for multi-user
environments involves using virtualization, treating clusters as
private clouds (e.g. Eucalyptus [24]). This model [18][19] has
several benefits. First, end-users do not need to be aware of the
characteristics of the underlying hardware: they see a service-
oriented infrastructure. Second, resource management and load
balancing can be performed in a centralized way by the private
cloud administrator. Finally, when the overall resource
requirements exceed the cluster’s availability, more hardware
resources can be externally rented using a hybrid cloud model, in
a way that is dynamic and fully transparent to the user [19]. The
convergence of heterogeneous HPC and the cloud computing
model is confirmed by Amazon EC2 Cluster GPU instances [26].
Other vendors, such as Nimbix [27] and Hoopoe [28], are also
offering cloud services for GPU computing.
The use of GPUs in cluster and cloud environments, however, is
still at an initial stage. Recent projects - GViM [1], vCUDA [2],
rCUDA [3] and gVirtuS [4] - have addressed the issue of allowing
applications running within virtual machines (VMs) to access
GPUs by intercepting and redirecting library calls from the guest
to the CUDA runtime on the host. These frameworks either rely
on the scheduling mechanisms provided by the CUDA runtime, or
allow applications to execute on GPU in sequence, possibly
leading to low resource utilization and consequent suboptimal
performance. Ravi et al. [6] have considered GPU sharing by
allowing concurrent execution of kernel functions invoked by
different applications. All these proposals have the following
limitations. First, they assume that the overall memory
requirements of the applications mapped onto the same GPU fit
the device memory capacity. As data sets become larger and
resource sharing increases, this assumption may not hold true.
Second, the proposed frameworks statically bind applications to
GPUs (that is, they do not allow runtime application-to-GPU re-
mapping). Not only can this lead to suboptimal scheduling, but it
also forces a complete application restart in case of GPU failure,
and prevents efficient load balancing if GPU devices are added to
or removed from the system.
One important problem to address when designing runtime
support for GPUs within cluster and cloud environments is the
following. In cluster settings, resource sharing and dynamic load
balancing are typical techniques aimed to increase the resource
utilization and optimize the aggregate performance of concurrent
applications. However, GPUs have been conceived to accelerate
single applications; as a consequence, software stacks for GPUs
currently include only very limited support for concurrency. As an
example, if multiple CUDA applications use the same GPU, the
CUDA driver and runtime will serve their requests in a first-
come-first-served fashion. In the presence of concurrency, the
CUDA runtime will fail in two scenarios: first, if the aggregate
memory requirements of the applications exceed the GPU
capacity; second, in case of too many concurrent applications. In
fact, the CUDA runtime associates a CUDA context on GPU to
each application thread, and reserves an initial memory allocation
to each CUDA context. Therefore, the creation of too many
contexts will lead to exceeding the GPU memory capacity. On a
NVIDIA Tesla C2050 device, for example, we experimentally
observed that the maximum number of application threads
supported by the CUDA runtime in the absence of conflicting
memory requirements is eight. This fact has the following
implication: existing GPU virtualization frameworks and resource
managers for heterogeneous clusters that rely on the CUDA
runtime serialize the execution of concurrent applications,
leading to GPU underutilization. Further, existing runtime
systems statically bind applications to GPU devices, preventing
dynamic scheduling and limiting the opportunities for load
balancing. We aim to overcome these limitations.
Current high-end GPUs have two important characteristics:
first, they have a device memory that is physically separated from
the host memory; second, they can be programmed using library
APIs. For example, NVIDIA GPUs can be programmed using
CUDA [22] or OpenCL. GPU library calls, which originate on the
CPU, come in at least three kinds: device memory allocations,
data transfers, and kernel launches (whereby kernels are GPU
implementations of user-defined functions). Since efficient GPU
memory accesses require regular access patterns, the use of nested
pointers is discouraged in GPU programming. As a consequence,
most GPU applications do not use pointer nesting. Moreover,
even though NVIDIA has recently introduced the possibility of
dynamically allocating memory within CUDA kernels, this feature
is not found in publicly available GPU benchmarks. In this work,
we focus on optimizing the handling of traditional GPU
applications. However, we also support pointer nesting by
requiring the programmer to register nested data structures using
our runtime API. Finally, we allow applications that perform
dynamic memory allocation within kernels to use our runtime
system, but we exclude them from our sharing and dynamic
scheduling mechanisms. Both pointer nesting and dynamic device
memory allocation can be detected by intercepting and parsing
the pseudo-assembly (PTX) representation of CUDA kernels sent
to the GPU devices.
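As a rough illustration of this detection step (not the paper's actual parser), the sketch below scans the PTX text of a kernel for calls to the device-side allocator; the symbol names checked and the function name kernel_allocates_device_memory are assumptions of this sketch.

#include <string>

/* Rough sketch: flag kernels whose PTX calls the device-side allocator so
 * they can be excluded from sharing and dynamic scheduling. A real parser
 * would inspect the call instructions properly; looking only for the
 * "malloc"/"free" symbols is an assumption of this illustration. */
bool kernel_allocates_device_memory(const std::string& ptx) {
  return ptx.find("call") != std::string::npos &&
         (ptx.find("malloc") != std::string::npos ||
          ptx.find("free")   != std::string::npos);
}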
Applications that use GPUs alternate CPU and GPU phases.
Because of this alternation, statically binding applications to
GPUs (that is, using the programmer-defined mapping of GPU
phases to GPU devices) may lead to inefficiencies. This holds
particularly for applications having multiple GPU phases (e.g.:
iterative solvers) and in the presence of multi-tenancy. We
observe that application programmers optimize their applications
assuming dedicated and well-known resources; our runtime aims
at providing dynamic load balancing in multi-tenant clusters,
where the availability and utilization of the underlying resources
are hidden from the users and not known a priori.
As an example, suppose we wish to schedule the two applications app1 and app2 illustrated in Figure 1 on a single GPU. Moreover, let us assume that the memory footprint of each application in isolation fits within the device memory, but their aggregate memory requirements exceed the GPU memory capacity. In this situation, if the two applications are run on the bare CUDA runtime, they must be serialized (otherwise the execution will fail with an out-of-memory error). However, serializing the two applications will lead to resource underutilization, in that the GPU will be idle during the CPU phases of both app1 and app2. A better schedule consists of time-sharing the GPU between app1 and app2, so that one application uses the GPU while the other is running a CPU phase. Such scheduling requires periodically unbinding and binding each application from/to the GPU. In turn, dynamic binding involves data transfers between CPU and GPU in order to restore the state of the application. Note that the runtime must determine: (i) when unbinding is advisable, and (ii) which data transfers must be performed. For example, app1 has no explicit data transfers between the kernel calls k11, k12 and k13: all necessary data transfers must be added by the runtime. On the other hand, a data transfer between k22 and k23 is already part of app2. In summary, providing dynamic binding implies designing a virtual memory capability for GPUs. In Section 2, we discuss other scenarios where dynamic binding of applications to GPUs is desirable.

Figure 1: Example of two applications that can effectively time-share a GPU. Light-grey blocks represent GPU phases (m = device memory allocations, cHD = host-to-device data transfers, kij = kernel executions, cDH = device-to-host data transfers, and f = device memory de-allocations). Black blocks represent CPU phases.
1.1 Our Contributions
In this work, we propose a runtime component that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. Our contributions can be summarized as
follows.
We propose dynamic (or runtime) application-to-GPU binding
as a mechanism to maximize device utilization and thereby
improve performance in the presence of multi-tenancy. In
particular, dynamic binding is suitable for applications with
multiple GPU phases, and in the presence of GPUs with
different capabilities.
We identify virtual memory for GPUs as an essential
mechanism to provide dynamic binding. Specifically, we
propose a virtual memory based runtime. As an added value, our design enables: (i) detecting badly written applications in the runtime, thereby avoiding overloading the GPU with erroneous calls, and (ii) optimizing memory transfers between the multi-core host and the GPU device.
We introduce two forms of memory swapping in our runtime:
intra-application and inter-application. The former enables
applications whose kernels fit the device memory to run on the
GPU, even if their overall memory requirements exceed the
device memory capacity. The latter allows concurrent
applications whose aggregate memory requirements exceed the
device memory capacity to time-share the GPU.
We provide a generic runtime component that easily supports
different scheduling mechanisms.
We include in our runtime support for load balancing in case of
GPU addition and removal, resilience to GPU failures, and
checkpoint-restart capabilities.
The remainder of this paper is organized as follows. In Section
2, we discuss in more detail the objectives of our design. In
Section 3, we describe our reference hardware and software
architecture. In Section 4, we present our design and prototype
implementation. In Section 5, we report results from our
experimental evaluation. In Section 6, we relate our work to the
state of the art. We conclude our discussion in Section 7.
2. OBJECTIVES
The overall goal of this work is to provide a runtime component
that allows multiple applications to run concurrently on a
heterogeneous cluster whose nodes comprise CPUs and GPUs.
We foresee the use of our runtime system in two scenarios (Figure
2): (i) in combination with VM-based cloud computing services
(e.g.: Eucalyptus [24]), and (ii) in combination with HPC cluster
resource managers (e.g.: TORQUE [25]). In both cases a cluster-
level scheduler assigns VMs or jobs to heterogeneous compute
nodes. Our runtime component is replicated on each node and
schedules library calls originated by applications on the available
GPUs so as to optimize the overall performance. Our framework
must allow integration with cluster-level schedulers intended for
both homogeneous and heterogeneous clusters (the former
oblivious of the presence of GPUs).
Note that heterogeneous clusters that include GPUs require
scheduling at two granularities: on one hand, jobs must be
mapped onto compute nodes (coarse-grained scheduling); on the
other, specific library calls must be mapped onto GPUs (fine-
grained scheduling). Existing cluster-level schedulers perform
coarse-grained scheduling, whereas our runtime performs fine-
grained scheduling. The two schedulers may interact in two ways.
First, the cluster-level scheduler may be completely oblivious of
the GPUs installed on each node. In case of overloaded GPUs, the
node-level runtime may offload the computation to other nodes.
To this end, the runtime system must include a node-to-node
communication mechanism enabling inter-node code and data
transfer. Alternatively the node-level runtime may expose some
information to the cluster-level scheduler (e.g.: number of GPUs,
load level, etc.), so as to guide the cluster-level scheduling
decisions. While the first form of interaction may lead to
suboptimal scheduling decisions, it allows a straightforward
integration with existing cluster resource managers and VM-based
cloud computing services targeting homogeneous clusters.
Until recently, GPUs could not be accessed from applications
executing within VMs. Several projects – GViM [1], vCUDA [2],
rCUDA [3] and gVirtuS [4] - have addressed this issue for
applications using the CUDA Runtime API to access GPUs. The
general approach is to use API remoting to bridge two different
OS spaces: the guest-OS where the applications run and the host-
OS where the GPUs reside. In particular, API remoting is
implemented by introducing an interposed front-end library in the
guest-OS space and a back-end daemon in the host-OS. The front-
end library, which overrides the CUDA Runtime API, intercepts
CUDA calls and redirects them to the back-end through a socket
interface. In turn, the back-end issues those calls to the CUDA
runtime. Note that this mechanism provides GPU visibility from
within VMs, but does not add any form of abstraction. In fact,
applications still use CUDA Runtime primitives to direct their
calls to specific GPUs residing on the host where the VMs are
deployed. Moreover, the bare use of the scheduling mechanisms
offered by the CUDA Runtime may not be optimal when multiple
or multi-threaded applications are mapped onto a single GPU.
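For concreteness, the following sketch shows the shape of such an interposed front-end call; it is not the actual gVirtuS or paper code, and the opcode value, BACKEND_FD, send_all and recv_all are invented for illustration.

#include <cstdint>
#include <cuda_runtime.h>

extern int BACKEND_FD;                                   /* connected socket (assumed) */
bool send_all(int fd, const void* buf, size_t len);      /* assumed helpers */
bool recv_all(int fd, void* buf, size_t len);

/* Front-end override of a CUDA Runtime entry point: marshal the call, let the
 * back-end daemon execute it (or record it), and return its result code. */
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
  uint32_t opcode = 1;                                   /* hypothetical MALLOC opcode */
  if (!send_all(BACKEND_FD, &opcode, sizeof(opcode)) ||
      !send_all(BACKEND_FD, &size, sizeof(size)))
    return cudaErrorUnknown;
  uint64_t ptr = 0;
  int32_t  rc  = 0;
  if (!recv_all(BACKEND_FD, &ptr, sizeof(ptr)) ||
      !recv_all(BACKEND_FD, &rc, sizeof(rc)))
    return cudaErrorUnknown;
  *devPtr = reinterpret_cast<void*>(static_cast<uintptr_t>(ptr));
  return static_cast<cudaError_t>(rc);
}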
In this work, we aim to design a runtime that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. In addition, the runtime must be flexible
in terms of scheduling policies, and allow dynamic binding of
applications to GPUs. Finally, the runtime must support dynamic
upgrade and
downgrade
of GPUs, and be resilient to GPU
failures. More detail on these objectives is provided below.
Abstraction - GPUs installed in the cluster need to be
abstracted (or hidden) from the user's direct access. GPU
programming APIs generally require the application
programmer to explicitly select the target GPU (for example,
using the CUDA runtime cudaSetDevice primitive). This
gives the application control of the number of GPU devices to
use. Our design masks the explicit procurement of GPUs, thus
allowing a transparent mapping of applications onto GPUs. As
a side effect, applications can be efficiently mapped onto a
number of devices different from that for which they have been
originally programmed. Note that this abstraction is coherent
with the traditional parallel programming model for general
purpose processors. When a user writes a multithreaded
program, for example, he targets a generic multi-core
processor. At runtime, the operating system distributes
processing threads onto the available cores.
Figure 2: Two deployment scenarios for our runtime: (a) VM-based cloud computing service and (b) HPC cluster resource manager.

GPU Sharing – As mentioned above, applications targeting heterogeneous nodes alternate general-purpose CPU code with library calls redirected and executed on GPUs. In the presence of multi-tenancy, assigning each application a dedicated GPU device for the entire lifetime of the application may not be
optimal, in that it may lead to resource underutilization. GPU
sharing is an obvious way to improve resource utilization.
However, sharing must be done judiciously: excessive sharing
may lead to high overhead and be counterproductive.
Isolation – In the presence of resource sharing, concurrent
applications must run in complete isolation from one another.
In other words, each application must have the illusion of
running on a dedicated device. State-of-the-art runtime support
for GPUs provides partial isolation of different process
contexts. In particular, each process is assigned its own process
space on the GPU; however, GPU sharing is possible only as
long as the cumulative memory requirements of different
applications do not exceed the physical capacity of the GPU.
Our work aims to handle such memory issues seamlessly,
allowing GPU sharing irrespective of the overall memory
requirements of the applications. In other words, we want to
extend the concept of virtual memory to GPUs.
Configurable Scheduling – The quality of a scheduling policy
depends on the objective function and assumptions about the
workload. A simple first-come-first-served scheduling
algorithm can be adequate in the absence of profiling
information. A credit-based scheduling algorithm may be more
suitable to settings that include fairness in the objective
function. Further, a scheduling algorithm that prioritizes short
running applications can be preferable if profiling information
is available. Yet another scheduling policy may be adopted in
the presence of expected quality of service requirements (e.g.:
execution deadlines). Our goal is to provide a runtime system
that can easily accommodate different scheduling algorithms.
Dynamic Binding – In existing runtime systems (including the
CUDA runtime) the mapping of GPU kernels to GPU devices
is static, or programmer-defined. A dynamic application-to-
GPU binding may be preferable in several scenarios. First, let
us consider the situation of a node having GPU devices with
different compute capabilities. Existing work in the context of
heterogeneous multi-core systems [21] has shown that
performance can be optimized by maximizing the overall
processor utilization while favoring the use of more powerful
cores. The application of this concept to nodes equipped with
different GPUs suggests that the system throughput can be
maximized by dynamically migrating application threads from
less to more powerful GPUs as they become idle. Second,
dynamic binding can help when GPUs are shared by
applications cumulatively exceeding the memory capacity. In
fact, dynamically migrating application threads to different
devices may minimize waiting times. Finally, resuming
application threads on different devices allows load balancing
when GPUs are added or removed from the system (dynamic
upgrade and downgrade), and is beneficial in case of GPU
failures (by preventing a whole application restart).
Checkpoint-Restart Capability – Along with dynamic binding,
our runtime provides a checkpoint-restart mechanism that
allows efficiently redirecting an application thread to a
different GPU. A checkpoint can be explicitly specified by the
user, or automatically triggered by the runtime. For example,
the runtime may monitor the execution time of particular
library calls (e.g. kernel functions) on a GPU. An automatic
checkpoint may be advisable after long-running kernel calls to
decrease the restart penalty in case of GPU failures. Note that
this kind of checkpoint is inserted dynamically at runtime.
3. REFERENCE ARCHITECTURE
The overall reference architecture is represented in Figure 2. The
underlying hardware platform consists of a cluster of
heterogeneous nodes. Each node has one or more multi-core
processors and a number of GPUs. The operating system performs
scheduling and resource management on the general-purpose
processors. Access to the GPUs is mediated by the CUDA driver
and runtime library API. Our runtime performs scheduling and
resource management on the available GPUs.
Each GPU has a device memory. Among others, the CUDA
runtime library contains functions to: (i) target a specific device
(cudaSetDevice), (ii) allocate and de-allocate device memory
(e.g., cudaMalloc/Free), (iii) perform data transfers between
the general purpose processor and the GPU devices (e.g.,
cudaMemcpy), (iv) transfer code onto the GPUs (the internal
functions __cudaRegisterFunction/FatBinary), and
(v) trigger the execution of user-written kernels
(cudaConfigureCall and cudaLaunch).
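For reference, a minimal CUDA program exercising these call categories is sketched below (the kernel, sizes, and device index are illustrative only); every call shown is among those our front-end library intercepts, while category (iv), code registration, happens implicitly in the startup code emitted by nvcc.

#include <cuda_runtime.h>

__global__ void scale(float* v, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* h = new float[n];
  for (int i = 0; i < n; ++i) h[i] = 1.0f;
  float* d = nullptr;
  cudaSetDevice(0);                                            /* (i)  device selection  */
  cudaMalloc(&d, n * sizeof(float));                           /* (ii) device allocation */
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* (iii) data transfer    */
  scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                 /* (v)  kernel launch     */
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);                                                 /* (ii) de-allocation     */
  delete[] h;
  return 0;
}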
In addition, the CUDA runtime offers some CPU multi-
threading support. For example, CUDA 3.2 associates a CUDA
context to each application thread. Several contexts can coexist on
the GPU. Each of them has a dedicated virtual address space;
contains references to textures, modules and other entities; and is
used for error handling. CUDA contexts allow different
application threads to time-share the GPU processing cores, and
space-share the GPU memory. In CUDA 4.0, the use of CUDA
contexts is slightly modified to allow data sharing and concurrent
kernel execution across threads belonging to the same application.
As mentioned in Section 1, with both versions of the CUDA
runtime, the number of parallel CUDA contexts that can be
supported at runtime is limited by the device memory capacity.
As shown in Figure 2, our runtime component interacts with a
cluster-level scheduler, operates at the node level and must be
installed on all the nodes of the cluster. The cluster-level
scheduler maps jobs onto compute nodes. During execution, the
GPU library calls issued by applications are intercepted by our
frontend library and redirected to our runtime daemon on the node
where the job has been scheduled. Since our runtime is a stand-
alone process, a mechanism for inter-process communication
between the job and our runtime daemon is needed. In our
prototype, we use the socket-based communication framework
provided as part of the open-source project gVirtuS [4][5]. This
framework relies on AF_UNIX sockets in a non-virtualized
environment and on proprietary VM-sockets in a virtualized one.
Figure 3: Overall design of the runtime.

Although the design of an optimal cluster-level scheduler for heterogeneous clusters is beyond the scope of this work, we want to be able to integrate our runtime with existing cloud computing services and cluster resource management frameworks that target homogeneous clusters. In this situation, the cluster-level scheduler
in use is unaware of both the GPU devices installed on the
compute nodes, and the fraction of execution time that each job
will spend on GPUs. Therefore, in the presence of nodes with
different hardware setups, simple cluster-level scheduling policies
may lead to queuing on nodes containing a lower number of
GPUs (or assigned a higher number of jobs targeting GPU). To
tackle this problem, we allow nodes to offload GPU library calls
to other nodes in the cluster. For this purpose, we introduce inter-
node communication between our runtime components. Note that
this mechanism operates at the granularity of GPU library calls,
and does not affect the portion of the job running on CPU.
4. DESIGN AND METHODOLOGY
In this section, we describe the design of our proposed runtime.
Our prototype implementation targets NVIDIA GPUs
programmed through the CUDA 3.2 runtime API. In Section 4.8,
we summarize the changes required to support CUDA 4.0.
4.1 Overall Design
The overall design of our runtime is illustrated in Figure 3. The
basic components are: connection manager, dispatcher, virtual-
GPUs (vGPUs), and memory manager. As mentioned before,
when applications execute on the CPU, library calls directed to
the CUDA runtime are intercepted by a frontend library and
redirected to our runtime. We say that each application establishes
a connection with the runtime, and uses the connection to issue a
sequence of CUDA calls and receive their return code. Multiple
applications establish concurrent connections. The connection
manager accepts and enqueues incoming connections. The
dispatcher dequeues pending connections and schedules their calls
on the available GPUs. If the devices on the node are overloaded,
the dispatcher may offload some connections to other nodes using
an inter-node communication mechanism. To allow controlled
GPU sharing, each GPU has an associated set of virtual-GPUs.
The dispatcher schedules applications onto GPUs by binding their
connections to the corresponding virtual-GPUs. Applications
bound to virtual-GPU vGPU
ik
share GPU
i
. Finally, the memory
manager provides a virtual memory abstraction to applications.
Dispatcher and virtual-GPUs interact with the memory manager
to enable: (i) GPU sharing in the presence of concurrent
applications with conflicting memory requirements, (ii) load
balancing in case of GPU with different capabilities, GPU
addition and removal, (iii) GPU fault tolerance, and (iv)
checkpoint-restart capabilities.
4.2 Connection Manager
When used natively, the CUDA 3.2 runtime spawns a CUDA
context on the GPU for each application thread. Different
application threads can be directed to different GPUs by using the
cudaSetDevice primitive. One of the goals of our runtime is
to preserve the CUDA semantics. To this end, our frontend library
opens a separate connection for each application thread. CUDA
calls belonging to different connections can therefore be served
independently either on the same or on distinct GPUs. The
connection manager enqueues connections generated by
concurrent application threads in a pending connections list.
4.3 Dispatcher
The primary function of the dispatcher is to schedule CUDA calls
issued by application threads onto GPUs. The dispatcher can be
configured to use different scheduling algorithms: first-come-
first-served, shortest-job-first, credit-based scheduling, etc. Some
scheduling algorithms (e.g. shortest-job-first) require the
dispatcher to make scheduling decisions based on the kernels
executed by the applications, their parameters, and their execution
configuration. Higher resource utilization and better performance
can be achieved by supporting dynamic binding of applications to
GPUs: the dispatcher must be able to modify the application-to-
GPU mapping between kernel calls, and to unbind applications
from GPUs during their CPU-phases. These scheduling actions
must be hidden from the users.
To enable informed scheduling decisions, the dispatcher must
be able to delay application-to-GPU binding until the first kernel
launch is invoked. Unfortunately, the very first CUDA calls
issued by a CUDA application are not kernel launches, but
synchronous internal routines used to register the GPU machine
code (__cudaRegisterFatBinary), kernel functions
(__cudaRegisterFunction), variables and textures
(__cudaRegisterVar, __cudaRegisterSharedVar,
__cudaRegisterShared and __cudaRegisterTexture) to
the CUDA runtime. Moreover, kernel launches are never the first
non-internal CUDA calls issued by application threads: at the
very least, they must be preceded by memory allocations and data
transfers. Before kernel launches can be invoked by the client, all
of these previous calls must be serviced.
Two observations help us overcome this problem. First,
registration functions are always issued to the runtime prior to
CUDA contexts’ creation on the GPU. Therefore, these internal
calls can be safely issued by the dispatcher well before the
corresponding applications are bound to virtual-GPUs. The same
holds for device management functions, some of which are
ignored by our runtime (e.g. cudaSetDevice) or overridden
(e.g. cudaGetDeviceCount will return the number of virtual,
not physical, GPUs). Second, it is possible to delay GPU memory
operations until the related data are accessed within kernel calls:
the runtime responds to memory allocation requests by returning
virtual addresses, and these virtual pointers are mapped to real
device pointers at a later stage.
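A minimal sketch of this idea follows, assuming a simple map-based allocator (the names VirtualAllocator and VirtualEntry are ours, not the paper's): cudaMalloc requests are answered with fabricated virtual pointers backed by host (swap) memory, and the device pointer is filled in only when a kernel launch forces allocation.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>

struct VirtualEntry {
  void*  swap_ptr;      /* host-side copy of the data */
  void*  device_ptr;    /* filled in lazily at kernel-launch time */
  size_t size;
};

class VirtualAllocator {
  uintptr_t next_ = 0x1000;                       /* arbitrary non-null start */
  std::map<uintptr_t, VirtualEntry> table_;
public:
  /* Called when the dispatcher sees a cudaMalloc: no CUDA call is issued. */
  void* allocate(size_t size) {
    uintptr_t v = next_;
    next_ += (size + 0xFFF) & ~uintptr_t(0xFFF);  /* keep virtual ranges disjoint */
    table_[v] = VirtualEntry{ std::malloc(size), nullptr, size };
    return reinterpret_cast<void*>(v);
  }
  /* Called when a kernel launch must translate its arguments. */
  VirtualEntry* lookup(void* virt) {
    auto it = table_.find(reinterpret_cast<uintptr_t>(virt));
    return it == table_.end() ? nullptr : &it->second;
  }
};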
In summary, the dispatcher dequeues application threads from
the list of pending connections, and handles them as follows.
First, it issues registration functions to the CUDA runtime.
Second, it services device management functions (and typically
overrides them so as to hide the hardware setup of the node from
the users). Third, it handles memory operations with the aid of the
memory manager. In particular, the dispatcher does not issue
memory operations directly to the CUDA runtime, but instead
operates entirely in terms of virtual addresses generated by the
memory manager. Fourth, if there are any free virtual-GPUs, the
dispatcher schedules application threads to virtual-GPUs (and
enqueues them in the list of assigned contexts). If all virtual-
GPUs are busy, application threads are enqueued in the list of
waiting contexts for later scheduling. In addition, any failure
during the execution of an application thread will cause it to be
enqueued in a list of failed contexts, which is used by the
dispatcher for recovery.
To prevent the dispatcher from being a bottleneck, its
implementation is multithreaded: each dispatcher thread processes
a different connection. All queues used within the runtime
(pending connections; waiting, assigned and failed contexts) are
accessed using mutexes.
4.4 Virtual GPUs
In order to allow time-sharing of GPUs, we spawn a configurable
number of virtual-GPUs for each GPU installed on the system. A
virtual-GPU is essentially a worker thread that issues calls
originated from within application threads to the CUDA runtime.
Figure 4: State diagram showing the transitions of the isAllocated/toCopy2Dev/toCopy2Swap flags.
Virtual-GPUs are statically bound to physical GPUs through a
cudaSetDevice invoked at system startup. Each virtual-GPU
can service one application thread at a time. A virtual-GPU is idle
when no application thread is bound to it, and is active otherwise.
Note that, since our runtime maps application threads onto
virtual-GPUs and the CUDA runtime spawns a CUDA context for
each virtual-GPU, this infrastructure preserves the semantics of
the CUDA runtime. We experimentally observed (see Section 5)
that the CUDA runtime cannot handle an arbitrary number of
concurrent threads. Therefore, limiting the number of virtual-
GPUs prevents our framework from overloading the CUDA
runtime, and allows proper operation even in the presence of a
large number of CUDA applications.
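The following sketch conveys the structure of such a worker (Call, CallQueue and execute_on_device are assumed placeholders, not the actual implementation): the thread binds once to its physical device and then serially drains the calls of whichever application thread is currently bound to it.

#include <cuda_runtime.h>

struct Call;                         /* one intercepted CUDA call (assumed) */
struct CallQueue {                   /* per-vGPU queue of calls (assumed)   */
  Call* pop_blocking();              /* returns nullptr on shutdown         */
};
void execute_on_device(Call* c);     /* issues the call to the CUDA runtime */

void vgpu_worker(int physical_device, CallQueue* queue) {
  /* Static binding of this worker (and its CUDA context) to one GPU. */
  cudaSetDevice(physical_device);
  while (Call* c = queue->pop_blocking()) {
    /* Calls of the bound application thread are issued in order, preserving
     * the CUDA runtime semantics seen by the application. */
    execute_on_device(c);
  }
}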
4.5 Memory Manager
The goal of the memory manager is to provide a virtual memory
abstraction for GPUs. Two ideas are at the basis of the design.
First, applications will not see device addresses returned by the
CUDA runtime, but they will see virtual addresses generated by
the runtime. Second, data resides in the host memory, and is
moved to the device only on demand. In this way, the host
memory represents a lower level in the memory hierarchy: when
some data must be moved to the device memory but the device
memory capacity is exceeded, the memory manager swaps data
from the device memory to the host memory. We allow memory
swapping in two situations: (i) within a single application, and (ii)
in the presence of multi-tenancy. The latter scenario is
characterized by the presence of concurrent applications, each of
whose memory footprints in isolation would fit within the device
memory, but whose aggregate memory requirements exceed the
GPU memory capacity. In addition, the swap functionality allows
an application to migrate from a less capable to a more capable
GPU when the latter becomes available.
The deferral of host-to-device data transfers must be done judiciously.
Data transfers preceding the first kernel call cannot overlap with
GPU computation, and can thus be deferred without incurring
performance losses. After the first kernel call, application-to-GPU
binding is known and our runtime can be configured to either
defer or not defer data transfers. Not deferring allows
computation-communication overlap at the expense of an
increased swap overhead; deferring has the opposite effect.
The memory manager has two components: a page table and a swap area. (Strictly speaking, "page table" is a misnomer, since allocation need not occur in multiples of any fixed "page size", but we retain the term to make the analogy with conventional virtual memory systems clear.) The page table stores the address translation, and the swap area contains GPU data that are either not yet allocated on the device or swapped out of it.
The main data structures used in the memory manager are the
following.
/* PAGE TABLE */
typedef struct {
void *virtual_ptr;
void *swap_ptr;
void *device_ptr;
size_t size;
bool isAllocated;
bool toCopy2Dev;
bool toCopy2Swap;
entry_t type;
void *params;
nesting_t nested;
} PageTableEntry;
map<Context*, list<PageTableEntry*> *> PageTable;
/* CAPACITY AND UTILIZATION of AVAILABLE GPUs */
int numGPUs;
uint64_t *CapacityList;
uint64_t *MemAvailList;
map<Context *, size_t> MemUsage;
Each page table entry (PTE), which is created upon a memory
allocation operation, contains three pointers: the virtual pointer
which is returned to the application (virtual_ptr), the pointer
of the data in the swap area (swap_ptr), and, if the data are
resident on the device, the device pointer (device_ptr). In
addition, each entry has a size, a type, and possible additional
parameters (params and nested). Finally, the flags
isAllocated, toCopy2Dev, and toCopy2Swap are used to guide device memory allocations, de-allocations and data transfers; they indicate whether the PTE has been allocated on the device, whether the actual data reside only on the host, and whether the actual data reside only on the device, respectively. The state transitions of the three flags, depending on the call invoked by the application, are illustrated in Figure 4. In particular, malloc represents any allocation operation (cudaMalloc, cudaMallocArray, etc.), whereas copyDH and copyHD represent any device-to-host and host-to-device data transfer function (cudaMemcpy, cudaMemcpy2D, etc.), respectively. Figure 4
assumes data transfer deferral and that all data referenced in a
kernel launch can be modified by the kernel execution: a more
fine-grained handling is possible if the information about read-
only and read-write parameters is available. The attributes type
and params allow distinguishing different kinds of memory
allocations and data transfers associated with the entry. The
nested attribute indicates whether the virtual address points to a
nested data structure, or whether it is a member of it. Nested data
structures must be declared to the runtime using a specific runtime
API call, and are associated with additional attributes describing their
structure. These attributes are used by the memory manager in
order to ensure consistency between virtual and device pointers
within nested structures.
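A compact sketch of these flag transitions is given below, assuming data-transfer deferral and that a launched kernel may modify every argument it receives; PTE is a reduced PageTableEntry and the issue_* helpers stand in for the corresponding CUDA runtime calls.

struct PTE {                          /* reduced PageTableEntry */
  bool isAllocated = false;           /* entry has device memory behind it */
  bool toCopy2Dev  = false;           /* up-to-date data live only in the swap area */
  bool toCopy2Swap = false;           /* up-to-date data live only on the device */
};
void issue_malloc_on_device(PTE&);    /* assumed wrappers around cudaMalloc, */
void issue_copy_swap_to_device(PTE&); /* cudaMemcpy and cudaFree             */
void issue_copy_device_to_swap(PTE&);
void issue_free_on_device(PTE&);

void on_copy_hd(PTE& e) {             /* application copies host data in */
  e.toCopy2Dev = true;                /* device copy (if any) is now stale */
  e.toCopy2Swap = false;
}
void on_launch(PTE& e) {              /* entry referenced by a kernel launch */
  if (!e.isAllocated) { issue_malloc_on_device(e); e.isAllocated = true; }
  if (e.toCopy2Dev)   { issue_copy_swap_to_device(e); e.toCopy2Dev = false; }
  /* after the kernel runs, conservatively assume the device copy changed */
  e.toCopy2Swap = true;
}
void on_copy_dh(PTE& e) {             /* application reads data back */
  if (e.toCopy2Swap) { issue_copy_device_to_swap(e); e.toCopy2Swap = false; }
}
void on_swap(PTE& e) {                /* internal eviction from the device */
  if (e.toCopy2Swap) { issue_copy_device_to_swap(e); e.toCopy2Swap = false; }
  if (e.isAllocated) { issue_free_on_device(e); e.isAllocated = false; }
  e.toCopy2Dev = true;                /* must be re-copied before the next launch */
}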
Each application thread (or context) has an associated list of
PTEs: the page table contains all the PTEs for all the active and
pending contexts in the node. In addition, the memory manager
keeps track of the capacity and the memory availability of each
GPU (CapacityList and MemAvailList) and of the memory usage of each context (MemUsage). This information is used to determine whether binding an application thread to a GPU can potentially lead to exceeding its memory capacity.

Table 1: For each application call, the actions performed by our runtime and the possible errors returned. Actions listed with no explicit error forward any error generated by the CUDA runtime (i.e., result codes other than cudaSuccess). PTE = page table entry.

Malloc: create PTE (error: a virtual address cannot be assigned); allocate swap (error: swap memory cannot be allocated).
CopyHD: check valid PTE (error: no valid PTE); move data to swap (error: swap-data size mismatch).
CopyDH: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH.
Free: check valid PTE (error: no valid PTE); de-allocate swap (error: cannot de-allocate swap); if (PTE.isAllocated) cudaFree.
Launch: check valid PTE (error: no valid PTE); if (!PTE.isAllocated) cudaMalloc; if (PTE.toCopy2Dev) cudaMemcpyHD; cudaLaunch.
Swap: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH; if (PTE.isAllocated) cudaFree.
Table 1 shows the actions performed by the runtime for each
memory-related call invoked by the application. For simplicity,
we show the data transfer deferral configuration. Note that, in this
case, malloc and copyHD (data copy from host to device) do not
trigger any CUDA runtime actions. Swap is an internal function
which is triggered by the runtime when some data must be
swapped from device to host memory to make room for data on
the GPU. Like malloc, swap operates on a single page table entry.
Two scenarios are possible: intra-application swap and inter-
application swap. Independent of the kind, the swap operation can
be triggered by the runtime while trying to allocate device
memory to execute a kernel launch. Memory operations on nested
structures will be extended also to their PTE members.
Intra-application swap – Consider the following sequence of
calls coming from the same application app, where matmul is a
matrix multiplication kernel for square matrices.
1. malloc(&A_d, size);
2. malloc(&B_d, size);
3. malloc(&C_d, size);
4. copyHD(A_d, A_h, size);
5. matmul(A_d, A_d, B_d); // B_d = A_d * A_d
6. matmul(B_d, B_d, C_d); // C_d = B_d * B_d
7. copyDH(B_h, B_d, size);
8. copyDH(C_h, C_d, size);
If the above application is run on the bare CUDA runtime and the
data sizes are such that only two matrices fit the device memory,
the execution will fail on the third instruction (that is, when trying
to allocate the third matrix). On the other hand, when our runtime
is used, no memory allocation is performed until the first kernel
launch (instruction 5). Previous instructions update only page
table and swap memory. Instruction 5 will cause the allocation of
matrices A_d and B_d and the data transfer of A_h to A_d, and
will execute properly. During execution of instruction 6, the
runtime will detect the need for freeing device memory. Before
trying to swap and unbind other applications from the GPU, the
runtime will analyze the page table of app and detect that data
A_d, not required by instruction 6, can be swapped to host. This
will allow the application to complete with no error. In summary,
intra-application swap enables the execution of applications that
would fail on the CUDA runtime even if run in isolation. In other
words, the maximum memory footprint of the “larger” kernel
(rather than the overall memory footprint of the application) will
determine whether the application can correctly run on the device.
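The victim-selection step underlying intra-application swap can be sketched as follows (Entry, swap_out and the function name are illustrative placeholders): evict only entries of the same context that are resident on the device and not referenced by the pending kernel, until enough memory has been reclaimed.

#include <cstddef>
#include <list>
#include <set>

struct Entry {                 /* reduced view of a PageTableEntry */
  bool   isAllocated;          /* currently resident on the device */
  void*  virtual_ptr;          /* virtual address handed to the application */
  size_t size;
};
void swap_out(Entry& e);       /* device -> swap copy followed by cudaFree (assumed) */

/* Returns true if enough device memory could be reclaimed from this context. */
bool intra_app_swap(std::list<Entry*>& ptes_of_context,
                    const std::set<void*>& args_of_pending_kernel,
                    size_t bytes_needed) {
  size_t reclaimed = 0;
  for (Entry* e : ptes_of_context) {
    if (reclaimed >= bytes_needed) break;
    /* Only evict data that is on the device and not used by the pending kernel. */
    if (e->isAllocated && args_of_pending_kernel.count(e->virtual_ptr) == 0) {
      swap_out(*e);
      reclaimed += e->size;
    }
  }
  return reclaimed >= bytes_needed;
}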
Inter-application swap – This kind of swap may take place
when concurrent applications mapped onto the same device have
conflicting memory requirements. In particular, if device
memory cannot be allocated and intra-application swap is not
possible, the memory manager will be queried for applications
running on the same GPU and using the amount of memory
required. If such an application exists, it will be asked to swap.
The application may or may not accept the request: for instance,
an application running in a CPU phase with no pending requests
may swap, but an application in the middle of a kernel call may
not. If no application honors the swap request, the calling
application will unbind from the virtual-GPU and retry later.
Otherwise, all the page table entries belonging to the application
that accepts the request
will be swapped, and such application will
be temporarily unbound from the GPU. There may be situations
where multiple applications must swap for the required memory
to be freed. To reduce complexity and avoid inefficiencies, we do
not trigger the swap in these situations. Note that inter-application
swap implies coordination among virtual-GPUs and, as a
consequence, has a higher overhead than intra-application swap.
To avoid deadlocks, synchronization is required while accessing
the page table. Finally, note that enabling swaps only during CPU
phases allows GPU intensive applications to make full use of the
GPU.
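A sketch of this negotiation is shown below; find_candidate_on_gpu, request_swap and the BindAction outcome are illustrative names, and, as discussed above, only a single victim is considered.

#include <cstddef>

struct Context;
Context* find_candidate_on_gpu(int gpu, size_t bytes_needed);  /* assumed: one context
                                                                  holding enough memory */
bool request_swap(Context* victim);  /* assumed: victim accepts only if it is in a CPU
                                        phase with no pending GPU work */

enum class BindAction { Proceed, UnbindAndRetry };

BindAction resolve_memory_conflict(int gpu, size_t bytes_needed) {
  Context* victim = find_candidate_on_gpu(gpu, bytes_needed);
  if (victim != nullptr && request_swap(victim))
    return BindAction::Proceed;        /* victim swapped out and unbound */
  return BindAction::UnbindAndRetry;   /* no single victim, or request declined */
}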
To determine whether a memory allocation can be serviced, the
runtime will first use the memory utilization data in the memory
manager (CapacityList, MemAvailList, and MemUsage).
However, because of possible memory fragmentation on GPU, the
runtime may need to use the return code of the GPU memory
allocation function to ensure that the request can be honored.
Moreover, there may be cases where only some GPUs have the
required memory capacity.
Finally, we point out two additional benefits of our design.
First, bad memory operations (for instance, data transfers beyond
the boundary of an allocated area) can be detected by the memory
manager without overloading the CUDA runtime with calls that
would fail. Second, multiple data copy operations within the same
allocated area (i.e., the same page table entry) will trigger a
single, bulk memory transfer to the device memory.
4.6 Fault Tolerance & Checkpoint-Restart
The memory manager provides an implicit checkpoint capability that enables load balancing when more powerful GPUs become idle or when
GPUs are dynamically added and removed from the system, and
recovery in case of GPU failures. For each application thread, the
page table and the swap memory contain the state of the device
memory. In addition, an internal data structure (called Context)
contains other state information, such as: a link to the connection
object, the information about the last device call performed, and,
if the application thread fails, the error code. With this state
information, dynamic binding allows redirecting contexts to
different GPUs and resuming their operation. The dispatcher will
monitor the availability of the devices and schedule contexts from
the failed contexts list (in case of GPU failure or removal) and
unbind and reschedule applications from the assigned contexts list
in case of GPU addition. Our mechanism can be combined with
BLCR [29] in order to enable these mechanisms also after a full
restart of a node. Finally, our runtime has an internal check-
pointing primitive that can be dynamically triggered after long
running kernels, to allow fast recovery in case of failures.
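As an illustration of the automatic trigger (the helper names and the 10-second threshold are assumptions, not values from the paper), the vGPU can time each completed kernel call and snapshot the context's device state when the call runs long.

#include <chrono>

struct Context;
void launch_and_wait(Context* ctx);     /* issues the kernel launch and waits for it (assumed) */
void checkpoint_to_swap(Context* ctx);  /* copies all resident entries device -> swap (assumed) */

void run_kernel_with_auto_checkpoint(Context* ctx) {
  auto start = std::chrono::steady_clock::now();
  launch_and_wait(ctx);
  auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                  std::chrono::steady_clock::now() - start).count();
  if (secs > 10)                        /* assumed threshold for "long-running" */
    checkpoint_to_swap(ctx);            /* cheap restart point if the GPU later fails */
}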
4.7 Inter-node Offloading
If the GPUs installed on a node are overloaded, our runtime can
offload some application threads to other nodes. Note that this
mechanism allows transferring only the CUDA calls originating
within an application, and not its CPU phases. In particular, the
runtime redirects application threads in the list of pending
connections to other nodes using a TCP socket interface. A
measure of the load on the system is provided by the size of the
list of pending connections. We allow the dispatcher to process
pending connections only if the number of pending contexts is
below a given threshold.
4.8 CUDA 4 Support
We briefly discuss the modifications required by our runtime in
order to support CUDA 4.0. The most significant changes in
CUDA 4.0 are the following: (i) all threads belonging to the same
application are mapped onto the same CUDA context, and (ii)
each application thread can use multiple devices by issuing
multiple cudaSetDevice calls. The first change has been
introduced to enable application threads to share data on GPU.
Our current implementation does not differentiate threads
belonging to the same application from threads belonging to
different ones. Moreover, to avoid explicit assignment of threads
to GPUs, our runtime ignores all cudaSetDevice calls issued
by applications. Compatibility with CUDA 4.0 requires the
following changes. First, each thread connection should carry the
information about the corresponding application identifier. This
information will be used to ensure that application threads sharing
data are mapped onto the same device. Second,
cudaSetDevice calls can be used to identify groups of
CUDA calls that can potentially be assigned to different GPUs.
Note that, because of the dynamic binding capability of our
runtime, the latter modification is not strictly required. However,
its introduction can help make efficient scheduling decisions
with minimal overhead. Finally, CUDA 4.0 allows a more
efficient and direct GPU-to-GPU data transfer. Our runtime can
take advantage of this mechanism to provide faster thread-to-GPU
remapping.
5. EXPERIMENTAL RESULTS
The experiments were conducted in two environments: on a single
node and on a three-node cluster. Unless otherwise indicated, the
metric reported in all experiments is the overall execution time for
a batch of concurrent jobs (that is, the time elapsed from when the
first job starts to when the last job finishes processing). We observed
analogous trends when considering the average execution time
across the jobs in the batch.
In all experiments, we adopted a first-come-first-served
scheduling policy that assigns jobs to physical GPUs in a round-
robin fashion and attempts to perform load balancing (by keeping
the number of active vGPUs uniform across all available GPUs).
The runtime is configured to defer all data transfers. All data
reported in experiments using our runtime include all the
overheads introduced by our framework: call interception,
queuing delays, scheduling, memory management, and, whenever
performed, swap operations and the related synchronization. Given
the parallel nature of the system, such overheads are not additive.
5.1 Hardware Setup
The system used in our node-level experiments includes eight
Intel Xeon E5620 processors running at 2.40 GHz and is equipped
with 48 GB of main memory and three NVIDIA Fermi GPUs
(two Tesla C2050s and one Tesla C1060). Each Tesla C2050 has
14 streaming multiprocessors (SMs) with 32 cores per SM, each
running at 1.15 GHz, and 3 GB of device memory. The Tesla
C1060 has 30 SMs with 8 cores per SM, and 4 GB of device
memory. In one experiment, we replaced the Tesla C1060 with
the less powerful NVIDIA Quadro 2000 GPU, equipped with four
48-core SMs and 1 GB device memory. In our cluster-level
experiments we used an additional node with the same CPU
configuration but equipped with a single Tesla C1060 GPU card.
5.2 Benchmarks
The benchmark applications used in our experiments are listed in Table 2. These applications, obtained from the Rodinia Benchmark Suite [30] and NVIDIA's CUDA SDK, cover several application domains, and differ in their memory occupancy, their GPU intensity and their interleaving of computation between CPU and GPU. We divide the workload into two categories: short-running and long-running applications. When using a Tesla C2050 GPU, the former report a running time between 3 and 5 seconds each, and the latter between 30 and 90 seconds (depending on the CPU phase injected; see Section 5.3.3). In Table 2, we also report the number of kernel calls performed by each application. All short-running applications and BS-L are GPU intensive and have memory requirements well below the capacity of the GPUs in use. MM-S and MM-L are injected with CPU phases of different lengths; MM-L has high memory requirements.

Table 2: Benchmark programs (the number of kernel calls per program is given in parentheses).

Short-running applications:
Back Propagation (BP): training of 20 neural networks with 64K nodes per input layer (40).
Breadth-First Search (BFS): traversal of a graph with 1M nodes (24).
HotSpot (HS): thermal simulation of 1M grids (1).
Needleman-Wunsch (NW): DNA sequence alignment of 2K potential pairs of sequences (256).
Scalar Product (SP): scalar product of vector pairs (512 vector pairs of 1M elements) (1).
Matrix Transpose (MT): transpose of a 384x384 matrix (816).
Parallel Reduction (PR): parallel reduction of 4M elements (801).
Scan (SC): parallel prefix sum of 260K elements (3,300).
Black Scholes - small (BS-S): processing of 4M financial options (256).
Vector Addition (VA): 100M-element vector addition (1).

Long-running applications:
Small Matrix Multiplication (MM-S): 200 matrix multiplications of 2Kx2K square matrices with variable CPU phases (200).
Large Matrix Multiplication (MM-L): 10 matrix multiplications of 10Kx10K square matrices with variable CPU phases (10).
Black Scholes - large (BS-L): processing of 40M financial options (256).

Figure 5: Execution time reported with a variable number of short-running jobs on a node with 1 GPU. The bare CUDA runtime is compared with our runtime.

Figure 6: Execution time reported with a variable number of short-running jobs on a node with 3 GPUs. The bare CUDA runtime cannot handle more than 8 concurrent jobs.
5.3 Node-level Experiments
5.3.1 Overhead Evaluation
First, we measured the overhead of our framework with respect to
the CUDA runtime. We allowed our runtime to use only one
physical GPU, and varied the number of virtual GPUs (vGPUs).
The execution time of the bare CUDA runtime gives a lower
bound that allows us to quantify the overhead associated with our
framework. The data in Figure 5 were obtained by randomly
drawing jobs from the pool of short-running applications in Table
2, and averaging the results over ten runs. To ensure an apples-to-apples
comparison, we ran each randomly drawn
combination of jobs on all five reported configurations (bare CUDA
runtime and our runtime using 1, 2, 4 and 8 vGPUs).
Since our experiments showed that the CUDA runtime cannot
handle more than eight concurrent CUDA contexts, we limited the
number of jobs to eight. As can be seen in Figure 5, the total
execution time of our runtime approaches the lower limit (CUDA
runtime) as we increase the number of vGPUs. Increasing the
number of vGPUs means increasing the sharing of the physical
GPU, thus amortizing the overhead of the framework (which, in
the worst case, accounts for about 10% of the execution time).
Note that the percentage overhead would decrease on long-
running applications.
5.3.2 Benefits of GPU Sharing
In our second set of experiments we evaluated the effect of GPU
sharing in the presence of more (three) physical GPUs. We used
the same workload as in Section 5.3.1, and again varied the
number of vGPUs per device. We recall that the number of
vGPUs represents the number of jobs that can time-share a GPU.
As mentioned in the previous section, we found that the CUDA
runtime does not currently support more than eight concurrent
jobs stably. Therefore, we do not report results using the bare
CUDA runtime beyond eight jobs. Figure 6 shows that, when
using four vGPUs per device, our runtime reports some
performance gain compared to the bare CUDA runtime. In fact,
the overhead of our framework is compensated by its ability to
load balance jobs on different physical GPUs. When running a
higher number of concurrent jobs, our results confirm our
previous finding that increasing the amount of GPU sharing
positively impacts performance. However, we do not observe
significant performance improvements when more than four
vGPUs are employed. We believe that four vGPUs per device
provide a good compromise between resource sharing and
runtime overhead, and we use this setting in the rest of our
experiments.
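To make the notion of a vGPU concrete, the sketch below models each physical GPU as a pool of vGPU slots: a job must claim a slot before issuing work to the device, so at most N jobs time-share the GPU at any time. This is our own minimal illustration of the admission policy, not the runtime's actual code, and the names (VirtualGpuPool, acquire, release) are hypothetical.

#include <mutex>
#include <condition_variable>

// Hypothetical sketch: each physical GPU exposes a fixed number of vGPU
// slots; a job must hold a slot while it issues work to the device.
class VirtualGpuPool {
public:
    explicit VirtualGpuPool(int num_vgpus) : free_slots_(num_vgpus) {}

    // Block until a vGPU slot is available, then claim it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return free_slots_ > 0; });
        --free_slots_;
    }

    // Return the slot so another pending job can be admitted.
    void release() {
        { std::lock_guard<std::mutex> lock(mutex_); ++free_slots_; }
        cv_.notify_one();
    }

private:
    int free_slots_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

int main() {
    VirtualGpuPool gpu0(4);  // four vGPUs per device, as used in our experiments
    gpu0.acquire();          // claimed before a job's first CUDA call on the device
    gpu0.release();          // released after the job's last CUDA call
    return 0;
}

With four slots per device, at most four jobs time-share each physical GPU, which matches the setting used in the rest of the experiments.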
5.3.3 Conflicting Memory Needs: Effect of Swapping
The effect of swapping can be evaluated by using memory-hungry
applications. To this end, we considered large matrix
multiplication (MM-L). This benchmark program performs ten
square matrix multiplications on randomly generated matrices.
We set the data set size so as to create conflicting memory requirements when more than two jobs are mapped onto the same GPU. In addition, we injected CPU phases of various sizes into the matrix multiplication benchmark. CPU phases are interleaved with kernel calls, and simulate different levels of post-processing on the product of the matrix multiplication.
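For concreteness, the sketch below shows the overall structure of such a benchmark under our own assumptions: the matrix size, iteration count, and CPU-phase length are placeholder constants, and the matrices are filled with constants rather than random values. Each iteration launches a matrix-multiplication kernel and then spins in a host-side loop that emulates post-processing.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Naive square matrix multiply: C = A * B (N x N, row-major).
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Busy-wait for 'ms' milliseconds to emulate a CPU post-processing phase.
static void cpu_phase(int ms) {
    auto end = std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}

int main() {
    const int N = 2048, iterations = 10, cpu_ms = 500;  // placeholder sizes
    const size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(size_t(N) * N, 1.0f), hB(size_t(N) * N, 2.0f);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    for (int i = 0; i < iterations; ++i) {
        matmul<<<grid, block>>>(dA, dB, dC, N);  // GPU phase
        cudaDeviceSynchronize();
        cpu_phase(cpu_ms);                       // injected CPU phase
    }
    printf("done\n");
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Increasing cpu_ms corresponds to increasing the fraction of CPU work discussed below.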
The effect of swapping is evaluated by running 36 MM-L jobs
concurrently. In order to compare the swapping and no-swapping
cases, we conducted experiments with one vGPU (no swapping
required) and four vGPUs (swapping required). We recall that, in
the one vGPU case, jobs run sequentially on a physical GPU, and
therefore there is no memory contention. In the experiment, the
fraction of CPU work is varied while maintaining the level of
GPU work. Figure 7 shows that the total execution time grows
linearly with the fraction of CPU work in the case of serialized
execution (1 vGPU). In the case of GPU sharing (4 vGPUs), the
overall execution time remains roughly constant even as the amount of work in each job increases. In fact, swapping can effectively reduce the total execution time by hiding the CPU-driven latency. In the chart, the number on top of each bar indicates the number of swap operations that occurred during execution. This experiment
demonstrates that our swapping mechanism can effectively
resolve resource conflicts among the concurrently running
applications. In addition, despite its overhead, this mechanism
provides performance improvement to applications with a
considerable fraction of CPU work.
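To give an intuition of what a swap operation involves, the following sketch evicts a device buffer to host memory and later restores it. This is a simplified illustration we provide, not the memory manager's actual implementation; the SwappableBuffer descriptor and function names are hypothetical.

#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical descriptor for one virtual allocation tracked by the runtime.
struct SwappableBuffer {
    void*  dev_ptr;    // current device copy (nullptr while swapped out)
    void*  host_copy;  // backing store in host memory while swapped out
    size_t size;
};

// Evict the buffer to host memory, freeing device memory for other jobs.
void swap_out(SwappableBuffer& b) {
    if (!b.dev_ptr) return;
    b.host_copy = std::malloc(b.size);
    cudaMemcpy(b.host_copy, b.dev_ptr, b.size, cudaMemcpyDeviceToHost);
    cudaFree(b.dev_ptr);
    b.dev_ptr = nullptr;
}

// Bring the buffer back to the device before the next kernel needs it.
void swap_in(SwappableBuffer& b) {
    if (b.dev_ptr) return;
    cudaMalloc(&b.dev_ptr, b.size);
    cudaMemcpy(b.dev_ptr, b.host_copy, b.size, cudaMemcpyHostToDevice);
    std::free(b.host_copy);
    b.host_copy = nullptr;
}

int main() {
    SwappableBuffer buf{nullptr, nullptr, 64u << 20};  // 64 MB example buffer
    cudaMalloc(&buf.dev_ptr, buf.size);
    swap_out(buf);  // e.g., while the owning job runs a long CPU phase
    swap_in(buf);   // restored before the job's next kernel call
    cudaFree(buf.dev_ptr);
    return 0;
}

Because the application addresses its data through the virtual memory abstraction, the device pointer obtained after swap_in may differ from the original one without the application noticing.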
Figure 7: 36 MM-L jobs (with conflicting memory requirements) are run on a node with 3 GPUs. The fraction of CPU code in the workload is varied. We indicate the number of swap operations that occurred on top of each bar.
Figure 8: 36 jobs (BS-L and MM-L) are run on a node with 3 GPUs. The workload composition (fraction BlackScholes/Matmul) is varied. We indicate the number of swap operations that occurred on top of each bar.
We next investigated the performance of our runtime when
combining applications with different amounts of CPU work. In particular, we mixed BS-L with MM-L at different ratios (Figure 8). BS-L is a GPU-intensive application with very short CPU phases, whereas MM-L was set to have a fraction of CPU work equal to 1. The memory requirements of BS-L are below those of MM-L. Again, we ran 36 jobs concurrently. The results of these experiments are shown in Figure 8. Again, the number on top of each bar indicates the number of swap operations that occurred during execution. As one might expect, the performance gain
from GPU sharing increases as MM-L becomes dominant.
Because BS-L is a GPU-intensive application and swapping adds
additional overhead, this results in a longer execution time for
four vGPUs at a 75/25 mix of BS-L and MM-L.
5.3.4 Benefits of Dynamic Load Balancing
In Figure 9, we show the results of experiments performed on an
unbalanced node that contains two fast GPUs and one slow GPU: two Tesla C2050s and one Quadro 2000, respectively. In one setting,
our runtime performs load balancing as follows. The dispatcher
keeps track of fast GPUs becoming idle, and, in the absence of
pending jobs, it migrates running jobs from slow to fast GPUs.
The experiments are conducted on MM-S jobs with varying
CPU fraction, and using 4 vGPUs per device. The number of jobs
migrated is reported on top of each bar. As can be seen, despite
the overhead due to job migration, load balancing through
dynamic binding of jobs to GPUs is an effective way to improve the performance of an unbalanced system. This holds especially in the presence of small batches of jobs and of applications alternating CPU and GPU phases. As the number of concurrent jobs increases, the system performs load balancing by scheduling pending jobs on fast GPUs, rather than by migrating jobs already running on slow devices.
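The migration policy just described can be summarized by the sketch below, which is our reconstruction of the behavior with hypothetical names: when a fast GPU becomes idle and no jobs are pending, the dispatcher selects a job currently bound to a slow GPU as a migration candidate.

#include <deque>
#include <vector>

struct Gpu { bool fast; bool idle; };
struct Job { int bound_gpu; };  // index of the GPU the job currently runs on

// Hypothetical dispatcher decision, invoked whenever a GPU becomes idle.
// Returns the index of a running job to migrate to 'idle_gpu', or -1 if none.
int pick_job_to_migrate(const std::vector<Gpu>& gpus,
                        const std::vector<Job>& running,
                        const std::deque<Job>& pending,
                        int idle_gpu) {
    if (!gpus[idle_gpu].fast) return -1;  // only migrate toward fast GPUs
    if (!pending.empty()) return -1;      // prefer scheduling pending jobs first
    for (int j = 0; j < (int)running.size(); ++j) {
        if (!gpus[running[j].bound_gpu].fast)  // job currently on a slow GPU
            return j;                          // candidate for migration
    }
    return -1;
}

int main() {
    std::vector<Gpu> gpus = {{true, true}, {false, false}};  // fast idle, slow busy
    std::vector<Job> running = {{1}};                        // one job on the slow GPU
    std::deque<Job> pending;                                 // no pending jobs
    return pick_job_to_migrate(gpus, running, pending, 0) == 0 ? 0 : 1;
}

Re-binding a running job is feasible because its device state is accessed through the virtual memory abstraction described earlier, so the required data can be recreated on the destination GPU before the job's next kernel call.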
5.4 Cluster-level Experiments
We have integrated our runtime with TORQUE, a cluster-level
scheduler that can be used to run GPU jobs on heterogeneous
clusters. In this section, we show experiments performed on a
cluster of three nodes. The jobs are submitted at a head node and
executed on two compute nodes. The hardware configuration of
the compute nodes is described in Section 5.1. Having a three-
and a single-GPU compute node, our cluster is unbalanced.
When TORQUE is used on a cluster equipped with GPUs, it
relies on the CUDA runtime to execute GPU calls. Since the
CUDA runtime does not provide adequate support for concurrency, TORQUE does not allow any form of GPU sharing
across jobs. Therefore, when configured to use compute nodes
equipped with GPUs, TORQUE serializes the execution of
concurrent jobs by enqueuing them on the head node and
submitting them to the compute nodes only when a GPU becomes
available. By coupling TORQUE with our runtime system, we are
able to provide GPU sharing to concurrent jobs.
When coupling TORQUE with our runtime, we conducted
experiments with three settings. In all cases, to force TORQUE to
submit more jobs to the compute nodes than available GPUs, we hid the presence of GPUs from TORQUE, and handled GPU management only within our runtime. In the first setting, our runtime was
configured to use only one vGPU per device, and therefore to
serialize the execution of concurrent jobs. In the second setting,
we allowed GPU sharing by using four vGPUs per device. In the
third setting, we additionally enabled load balancing across
compute nodes by allowing inter-node communication and
offloading. We also performed experiments using TORQUE
natively on the bare CUDA runtime. However, the results
reported using this configuration are far worse than those reported
using TORQUE in combination with our runtime. Therefore, we
show the use of our runtime with one vGPU per device as an example of no GPU sharing.
Figure 9: Unbalanced node with 2 Tesla C2050s and 1 Quadro 2000: effect of load balancing through dynamic binding. The number of MM-S jobs migrated to fast GPUs is reported on top of each bar (bars are grouped by CPU fraction, 0 and 1).
Figure 11: Two-node cluster using TORQUE: effect of GPU sharing and load balancing via inter-node offloading in the presence of long-running jobs and conflicting memory requirements.
In Figure 10, we show experiments conducted using a variable number of short-running jobs drawn from the applications in Table 2. In this set of experiments, jobs do not exhibit conflicting
memory requirements. Again, we average the results reported
over ten runs. As can be seen, GPU sharing allows up to a 28%
performance improvement over serialized execution. However, TORQUE, which relies on our runtime and is unaware of the
number and location of the GPUs in the cluster, divides the
workload equally between the two nodes. Thus, the node with
only one GPU is overloaded compared to the other node with
three GPUs. When, in addition to GPU sharing, we allow load
balancing through our inter-node offloading technique, the overall
throughput is further improved by up to 18%.
Finally, we want to show the benefits of our runtime system in a
cluster in the presence of jobs with conflicting memory
requirements. To this end, we run 16, 32 and 48 BS-L and MM-L
jobs (25/75 distribution). We recall that these two applications
have long runtimes. The results of this experiment are shown in
Figure 11. Again, serialized execution avoids memory
conflicts. From the figure, it is clear that allowing jobs to share
GPUs increases the throughput significantly (up to 50%), despite
the overhead due to the need for swap operations. Moreover, in
the presence of load imbalances, the execution is further
accelerated by allowing the overloaded node to offload the excess
jobs remotely.
6. RELATED WORK
Our proposal is closely related to two categories of work: runtime
systems to enable GPU virtualization [1][2][3][4][6], and
memory-aware runtimes for heterogeneous nodes including CPUs
and GPUs [8][9]. As mentioned previously, GViM [1], vCUDA
[2], rCUDA [3] and gVirtuS [4] use API remoting to provide GPU
visibility from within virtual machines. GViM and vCUDA
leverage the multiplexing mechanism provided by the CUDA
runtime in order to allow GPU sharing among different
applications. In addition, GViM uses a Working Queue per GPU
to evenly distribute the load across different GPUs. However, as
discussed, the CUDA runtime cannot properly handle a large
number of concurrent applications, nor concurrent applications
whose aggregate memory requirements exceed the memory
capacity of the underlying GPU device. Our work addresses both
these issues, and allows dynamic scheduling of jobs on GPUs.
As an additional feature, GViM provides a mechanism to minimize the overhead of memory transfers when GPUs are used within virtualized environments. In particular, its authors propose using the mmap Unix system call to avoid the data copy between the guest OS and the host OS. Whenever possible, they also propose using page-locked memory (along with the cudaMallocHost primitive) in order to avoid the additional data copy between host
OS and GPU memory. Memory transfer optimization is
orthogonal to the objectives of this work. In the future, we plan to
include these optimizations in our runtime.
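For reference, the page-locked pattern mentioned above amounts to allocating the host staging buffer with cudaMallocHost instead of a pageable allocation. The snippet below is generic CUDA usage shown for illustration, not GViM code, and the transfer size is an arbitrary example.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64u << 20;  // 64 MB transfer
    float *pinned_host = nullptr, *dev = nullptr;

    // Page-locked host memory: the GPU can DMA from it directly,
    // avoiding the extra copy through a pageable staging buffer.
    cudaMallocHost((void**)&pinned_host, bytes);
    cudaMalloc((void**)&dev, bytes);

    cudaMemcpy(dev, pinned_host, bytes, cudaMemcpyHostToDevice);

    cudaFree(dev);
    cudaFreeHost(pinned_host);
    printf("transfer done\n");
    return 0;
}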
Guevara et al. [7] propose kernel consolidation as a way to share GPUs. They show that this mechanism is particularly effective in the presence of kernels with complementary resource requirements (e.g., compute-intensive and memory-intensive kernels). The concept of kernel consolidation has been reconsidered and explored in the context of GPU virtualization by Ravi et al. [6]. Differently from us, Ravi et al. assume that the overall memory footprint of the consolidated applications fits in the device memory, and statically bind applications to GPUs. Our
proposal is in a way orthogonal to [6]. In fact, the delayed
application-to-GPU binding and the deferral of memory
operations offered by our runtime should allow easy and efficient
integration of kernel consolidation.
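As background, the hardware mechanism that kernel consolidation builds on can be illustrated with CUDA streams: kernels launched into different non-default streams may execute concurrently on devices that support it. The example below is a generic illustration we provide, with made-up kernels standing in for compute-intensive and memory-intensive work; it is not the consolidation framework of [6] or [7], which operates across separate applications.

#include <cuda_runtime.h>

// Two small kernels with complementary behavior: one compute-bound,
// one memory-bound, in the spirit of the workloads considered in [7].
__global__ void compute_heavy(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (float)i;
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        out[i] = x;
    }
}
__global__ void memory_heavy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));
    cudaMalloc((void**)&c, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launched into distinct streams, the two kernels may overlap on the GPU.
    compute_heavy<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    memory_heavy<<<(n + 255) / 256, 256, 0, s2>>>(b, c, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}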
Gelado et al. [8] and Becchi et al. [9] propose two similar
memory-management frameworks for nodes including CPUs and
GPUs. The primary goal of these proposals is to hide the
underlying distributed memory system from the programmer, and
automatically move data between CPU and GPU as they are
needed. By doing so, these frameworks eliminate some unnecessary memory transfers between CPU and GPU. These
proposals, however, do not target multi-tenancy and conflicting
memory requirements among concurrent applications, which are
the main focus of our work. On one hand, the memory module we
design has some similarities with these two proposals (tracking of
mapping between CPU and GPU address spaces, memory transfer
deferral and optimization); on the other hand, it extends these
frameworks and focuses on the multi-tenancy scenario.
NVCR [15] provides a checkpoint-restart mechanism for
CUDA applications written using the CUDA driver and runtime
APIs. The framework is intended to be integrated with BLCR [29]. Checkpoints are inserted at each memory and kernel operation
using an intercept library. To ensure memory consistency and
reconstruct the device pointer information, NVCR requires
replaying all memory allocations performed by the application
after every restart, leading to a potentially high overhead. Our
virtual memory abstraction allows us to replay only memory
operations required by not-yet-executed kernel calls.
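As a rough illustration of how such an intercept library operates, the sketch below shadows cudaMalloc via dynamic linking, logs the request, and forwards it to the real CUDA runtime. This is a generic LD_PRELOAD-style interposer we provide for illustration; it is neither NVCR's code nor our runtime's, and a real system would record the allocation in its bookkeeping structures rather than just print it.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <cuda_runtime.h>
#include <dlfcn.h>
#include <cstdio>

// Interposed cudaMalloc: build this file as a shared library (e.g., with
// -shared -fPIC, linked against libdl) and activate it with LD_PRELOAD.
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
    using real_fn = cudaError_t (*)(void**, size_t);
    static real_fn real_cudaMalloc =
        (real_fn)dlsym(RTLD_NEXT, "cudaMalloc");  // next definition in link order

    fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);
    // A checkpoint or virtual-memory runtime would record (devPtr, size) here
    // so that the allocation can be replayed or remapped later.
    return real_cudaMalloc(devPtr, size);
}

Preloading such a library lets a runtime observe and virtualize the GPU operations of unmodified CUDA applications.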
Figure 10: Two-node cluster using TORQUE: effect of GPU sharing and load balancing via inter-node offloading in the presence of short-running jobs and in the absence of conflicting memory requirements.
Our runtime assumes that the memory footprint of each application fits in the memory of the most capable GPU in the system. Under this assumption, we allow concurrency among applications with conflicting memory requirements. In addition, our intra-application swap capability allows relaxing the memory requirements for applications to run on the GPU. Related work
[16][17] has considered the problem of limiting the memory
requirements of single applications by reorganizing their memory
access patterns and splitting operators.
Finally, the interest in using GPUs for general purpose
computing is confirmed by recent work on automatic generation
of CUDA code [13][14], and on programming models and
runtime systems for heterogeneous nodes [10][11][12]. These
proposals are orthogonal to the work presented in this paper.
7. CONCLUSIONS AND FUTURE WORK
In conclusion, we have proposed a runtime system that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. Two fundamental features of our runtime
are: (i) dynamic application-to-GPU binding and (ii) virtual
memory for GPUs. In particular, dynamic binding maximizes
device utilization and improves performances in the presence of
concurrent applications with multiple GPU phases and of GPUs
with different compute capabilities. Besides dynamic binding, the
virtual memory abstraction enables the following features: (i) load
balancing in case of GPU addition and removal, (ii) resilience to
GPU failures, and (iii) checkpoint-restart capabilities.
Our prototype implementation targets NVIDIA GPUs. In the
future, we intend to extend our runtime to support other many-
core devices, such as the Intel MIC. In addition, we intend to
evaluate our runtime on larger clusters and on multi-node
applications. Finally, we plan to explore alternative mapping and
scheduling algorithms, as well as security concerns related to
heterogeneous cluster and cloud computing infrastructures.
8. ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for the feedback that
helped improve the paper. This work has been supported by NEC
Research Laboratories. Adam Procter has been supported by U.S.
Department of Education GAANN grant no. P200A100053.
9. REFERENCES
[1] V. Gupta et al. 2009. GViM: GPU-accelerated virtual machines. In
Proc. of HPCVirt '09. ACM, New York, NY, USA, pp. 17-24.
[2] L. Shi, H. Chen, and J. Sun. 2009. vCUDA: GPU accelerated high
performance computing in virtual machines. In Proc. of IPDPS '09,
Washington, DC, USA, pp. 1-11.
[3] J. Duato et al. 2010. rCUDA: Reducing the number of GPU-based
accelerators in high performance clusters. In Proc. of HPCS’10, pp.
224–231.
[4] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. 2010. A
GPGPU transparent virtualization component for high performance
computing clouds. In Proc. of Euro-Par 2010, Heidelberg, 2010.
[5] gVirtuS: http://osl.uniparthenope.it/projects/gvirtus
[6] V. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. 2011.
Supporting GPU sharing in cloud environments with a transparent
runtime consolidation framework. In Proc. of HPDC '11. ACM, New
York, NY, USA, pp. 217-228.
[7] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. 2009.
Enabling Task Parallelism in the CUDA Scheduler. In Workshop on
Programming Models for Emerging Architectures, Sep. 2009.
[8] I. Gelado et al. 2010. An asymmetric distributed shared memory
model for heterogeneous parallel systems. In Proc. of ASPLOS '10.
ACM, New York, NY, USA, pp. 347-358.
[9] M. Becchi, S. Byna, S. Cadambi, and S. Chakradhar. 2010. Data-
aware scheduling of legacy kernels on heterogeneous platforms with
distributed memory. In Proc. of SPAA '10. ACM, New York, NY,
USA, pp. 82-91.
[10] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. 2008.
Merge: a programming model for heterogeneous multi-core systems.
In Proc. of ASPLOS’08. ACM, New York, NY, USA, pp. 287-296.
[11] B. Saha et al. 2009. Programming model for a heterogeneous x86
platform. In Proc. of PLDI '09. New York, NY, USA, pp. 431-440.
[12] C.-K. Luk, S. Hong, and H. Kim. 2009. Qilin: exploiting parallelism
on heterogeneous multiprocessors with adaptive mapping. In Proc.
of MICRO’09. ACM, New York, NY, USA, pp. 45-55.
[13] S.-Z. Ueng, M. Lathara, S. Baghsorkhi, and W.-M. Hwu. 2008.
CUDA-Lite: Reducing GPU Programming Complexity. In
Languages and Compilers for Parallel Computing, Lecture Notes in
Comp. Sc., Vol. 5335. Springer-Verlag, Berlin, Heidelberg pp. 1-15.
[14] S. Lee and R. Eigenmann. 2010. OpenMPC: Extended OpenMP
Programming and Tuning for GPUs. In Proc. of SC '10. Washington,
DC, USA, pp. 1-11. Nov 2010.
[15] A. Nukada, H. Takizawa, and S. Matsuoka, 2011. NVCR: A
Transparent Checkpoint-Restart Library for NVIDIA CUDA. In
Proc. of IPDPSW’11, Shanghai, China, pp. 104-113, Sep 2011.
[16] N. Sundaram, A. Raghunathan, and S. Chakradhar. 2009. A
framework for efficient and scalable execution of domain-specific
templates on GPUs. In Proc. of IPDPS '09. IEEE Computer Society,
Washington, DC, USA, pp. 1-12.
[17] J. Kim, H. Kim, J. Hwan Lee, and J. Lee. 2011. Achieving a single
compute device image in OpenCL for multiple GPUs. In Proc. of
PPoPP '11. ACM, New York, NY, USA, pp. 277-288.
[18] H. Lim, S. Babu, J. Chase, and S. Parekh. 2009. Automated control
in cloud computing: challenges and opportunities. In Proc. of ACDC
'09. ACM, New York, NY, USA, pp. 13-18.
[19] P. Marshall, K. Keahey, and T. Freeman. 2010. Elastic Site: Using
Clouds to Elastically Extend Site Resources. In Proc. of CCGrid
2010, pp. 43-52, May 2010.
[20] P. Padala et al. 2009. Automated control of multiple virtualized
resources. In Proc. of EuroSys '09. New York, NY, USA, pp. 13-26.
[21] M. Becchi and P. Crowley. 2006. Dynamic thread assignment on
heterogeneous multiprocessor architectures. In Proc. of CF '06.
ACM, New York, NY, USA, pp. 29-40.
[22] J. Nickolls, I. Buck, M. Garland, K. Skadron. 2008. Scalable Parallel
Programming with CUDA. In ACM Queue. April 2008.
[23] G. Teodoro et al. 2009. Coordinating the use of GPU and CPU for
improving performance of compute intensive applications. In Proc.
of CLUSTER, pp. 1–10, 2009.
[24] Eucalyptus: http://www.eucalyptus.com
[25] TORQUE Resource Manager: http://www.clusterresources.com/products/TORQUE-resource-manager.php
[26] Amazon EC2 Instances: http://aws.amazon.com/ec2/
[27] Nimbix Informatics Xcelerated: http://www.nimbix.net
[28] Hoopoe: http://www.hoopoe-cloud.com
[29] BLCR: https://ftg.lbl.gov/projects/CheckpointRestart
[30] Rodinia: https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page