A Virtual Memory Based Runtime to Support Multi-tenancy in Clusters with GPUs

Michela Becchi¹, Kittisak Sajjapongse¹, Ian Graves¹, Adam Procter¹, Vignesh Ravi², Srimat Chakradhar³
¹University of Missouri, ²Ohio State University, ³NEC Laboratories America
becchim@missouri.edu, ks5z9@mail.mizzou.edu, ilggdd@mail.mizzou.edu, proctera@missouri.edu, raviv@cse.ohio-state.edu, chak@nec-labs.com
ABSTRACT
Graphics Processing Units (GPUs) are increasingly becoming part
of HPC clusters. Nevertheless, cloud computing services and
resource management frameworks targeting heterogeneous
clusters including GPUs are still in their infancy. Further, GPU
software stacks (e.g., the CUDA driver and runtime) currently
provide very limited support for concurrency.
In this paper, we propose a runtime system that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. A central component of our runtime is a
memory manager that provides a virtual memory abstraction to
the applications. Our runtime is flexible in terms of scheduling
policies, and allows dynamic (as opposed to programmer-defined)
binding of applications to GPUs. In addition, our framework
supports dynamic load balancing, dynamic upgrade and
downgrade of GPUs, and is resilient to their failures. Our runtime
can be deployed in combination with VM-based cloud computing
services to allow virtualization of heterogeneous clusters, or in
combination with HPC cluster resource managers to form an
integrated resource management infrastructure for heterogeneous
clusters. Experiments conducted on a three-node cluster show that
our GPU sharing scheme allows up to a 28% and a 50%
performance improvement over serialized execution on short- and
long-running jobs, respectively. Further, dynamic inter-node load
balancing leads to an additional 18-20% performance benefit.
Categories and Subject Descriptors
C.1.4.1 [Computer Systems Organization]: Processor
Architectures - Parallel Architectures, Distributed Architectures.
General Terms
Performance, Design, Experimentation.
Keywords
Cluster computing, runtime systems, virtualization, GPU, CUDA.
1. INTRODUCTION
Many-core processors are increasingly becoming part of high
performance computing (HPC) clusters. Within the last two to
three years GPUs have emerged as a means to achieve extreme-
scale, cost-effective, and power-efficient high performance
computing. The peak single-precision performance of the latest
GPU from NVIDIA – the Tesla C2050/C2070/C2075 card - is
more than 1 Teraflop, resulting in a price to performance ratio of
$2-3 per Gigaflop. Meanwhile, Intel has announced the upcoming
release of the Many Integrated Core processor (Intel MIC), with
peak performance of 1.2 Teraflops. Early benchmarking results
on molecular dynamics and linear algebra applications have been
demonstrated at the International Supercomputing Conference,
Hamburg, Germany, in June 2011.
Today some of the fastest supercomputers are based on
NVIDIA GPUs, including three of the top five fastest
supercomputers in the world. For example, Tianhe-1A, the second
fastest system, is equipped with 7,168 NVIDIA Fermi GPUs and
14,336 CPUs. Almost 80% of the HPC clusters in the top-500 list
are currently powered with Intel multi-core processors. The next
challenge for Intel will be to successfully position its MIC
processor in the many-core market.
Given the availability of these heterogeneous computing
infrastructures, it is essential to make efficient use of them. One
classical way to schedule batch jobs on HPC clusters is via PBS
cluster resource managers such as TORQUE [25]. Another
practical way to manage large clusters intended for multi-user
environments involves using virtualization, treating clusters as
private clouds (e.g. Eucalyptus [24]). This model [18][19] has
several benefits. First, end-users do not need to be aware of the
characteristics of the underlying hardware: they see a service-
oriented infrastructure. Second, resource management and load
balancing can be performed in a centralized way by the private
cloud administrator. Finally, when the overall resource
requirements exceed the cluster’s availability, more hardware
resources can be externally rented using a hybrid cloud model, in
a way that is dynamic and fully transparent to the user [19]. The
convergence of heterogeneous HPC and the cloud computing
model is confirmed by Amazon EC2 Cluster GPU instances [26].
Other vendors, such as Nimbix [27] and Hoopoe [28], are also
offering cloud services for GPU computing.
The use of GPUs in cluster and cloud environments, however, is
still at an initial stage. Recent projects - GViM [1], vCUDA [2],
rCUDA [3] and gVirtuS [4] - have addressed the issue of allowing
applications running within virtual machines (VMs) to access
GPUs by intercepting and redirecting library calls from the guest
to the CUDA runtime on the host. These frameworks either rely
on the scheduling mechanisms provided by the CUDA runtime, or
allow applications to execute on GPU in sequence, possibly
leading to low resource utilization and consequent suboptimal
performance. Ravi et al. [6] have considered GPU sharing by
allowing concurrent execution of kernel functions invoked by
different applications. All these proposals have the following
limitations. First, they assume that the overall memory
requirements of the applications mapped onto the same GPU fit
the device memory capacity. As data sets become larger and
resource sharing increases, this assumption may not hold true.
Second, the proposed frameworks statically bind applications to
GPUs (that is, they do not allow runtime application-to-GPU re-
mapping). Not only can this lead to suboptimal scheduling, but it
also forces a complete application restart in case of GPU failure,
and prevents efficient load balancing if GPU devices are added to
or removed from the system.
One important problem to address when designing runtime
support for GPUs within cluster and cloud environments is the
following. In cluster settings, resource sharing and dynamic load
balancing are typical techniques aimed to increase the resource
utilization and optimize the aggregate performance of concurrent
applications. However, GPUs have been conceived to accelerate
single applications; as a consequence, software stacks for GPUs
currently include only very limited support for concurrency. As an
example, if multiple CUDA applications use the same GPU, the
CUDA driver and runtime will serve their requests in a first-
come-first-served fashion. In the presence of concurrency, the
CUDA runtime will fail in two scenarios: first, if the aggregate
memory requirements of the applications exceed the GPU
capacity; second, in case of too many concurrent applications. In
fact, the CUDA runtime associates a CUDA context on GPU to
each application thread, and reserves an initial memory allocation
to each CUDA context. Therefore, the creation of too many
contexts will lead to exceeding the GPU memory capacity. On a
NVIDIA Tesla C2050 device, for example, we experimentally
observed that the maximum number of application threads
supported by the CUDA runtime in the absence of conflicting
memory requirements is eight. This fact has the following
implication: existing GPU virtualization frameworks and resource
managers for heterogeneous clusters that rely on the CUDA
runtime serialize the execution of concurrent applications,
leading to GPU underutilization. Further, existing runtime
systems statically bind applications to GPU devices, preventing
dynamic scheduling and limiting the opportunities for load
balancing. We aim to overcome these limitations.
Current high-end GPUs have two important characteristics:
first, they have a device memory that is physically separated from
the host memory; second, they can be programmed using library
APIs. For example, NVIDIA GPUs can be programmed using
CUDA [22] or OpenCL. GPU library calls, which originate on the
CPU, come in at least three kinds: device memory allocations,
data transfers, and kernel launches (whereby kernels are GPU
implementations of user-defined functions). Since efficient GPU
memory accesses require regular access patterns, the use of nested
pointers is discouraged in GPU programming. As a consequence,
most GPU applications do not use pointer nesting. Moreover,
even though NVIDIA has recently introduced the possibility of
dynamically allocating memory within CUDA kernels, this feature
is not found in publicly available GPU benchmarks. In this work,
we focus on optimizing the handling of traditional GPU
applications. However, we also support pointer nesting by
requiring the programmer to register nested data structures using
our runtime API. Finally, we allow applications that perform
dynamic memory allocation within kernels to use our runtime
system, but we exclude them from our sharing and dynamic
scheduling mechanisms. Both pointer nesting and dynamic device
memory allocation can be detected by intercepting and parsing
the pseudo-assembly (PTX) representation of CUDA kernels sent
to the GPU devices.
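As a rough illustration of this detection step (not the paper's actual parser), the sketch below scans the PTX text of a kernel for calls to the device-side allocator; the symbol names checked and the function name kernel_allocates_device_memory are assumptions of this sketch.

#include <string>

/* Rough sketch: flag kernels whose PTX calls the device-side allocator so
 * they can be excluded from sharing and dynamic scheduling. A real parser
 * would inspect the call instructions properly; looking only for the
 * "malloc"/"free" symbols is an assumption of this illustration. */
bool kernel_allocates_device_memory(const std::string& ptx) {
  return ptx.find("call") != std::string::npos &&
         (ptx.find("malloc") != std::string::npos ||
          ptx.find("free")   != std::string::npos);
}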
Applications that use GPUs alternate CPU and GPU phases.
Because of this alternation, statically binding applications to
GPUs (that is, using the programmer-defined mapping of GPU
phases to GPU devices) may lead to inefficiencies. This holds
particularly for applications having multiple GPU phases (e.g.:
iterative solvers) and in the presence of multi-tenancy. We
observe that application programmers optimize their applications
assuming dedicated and well-known resources; our runtime aims
at providing dynamic load balancing in multi-tenant clusters,
where the availability and utilization of the underlying resources
are hidden from the users and not known a priori.
As an example, suppose we wish to schedule the two applications app1 and app2 illustrated in Figure 1 on a single GPU. Moreover, let us assume that the memory footprint of each application in isolation fits within the device memory, but their aggregate memory requirements exceed the GPU memory capacity. In this situation, if the two applications are run on the bare CUDA runtime, they must be serialized (otherwise the execution will fail with an out-of-memory error). However, serializing the two applications will lead to resource underutilization, in that the GPU will be idle during the CPU phases of both app1 and app2. A better schedule consists of time-sharing the GPU between app1 and app2, so that one application uses the GPU while the other is running a CPU phase. Such scheduling requires periodically unbinding and binding each application from/to the GPU. In turn, dynamic binding involves data transfers between CPU and GPU in order to restore the state of the application. Note that the runtime must determine: (i) when unbinding is advisable, and (ii) which data transfers must be performed. For example, app1 has no explicit data transfers between the kernel calls k11, k12 and k13: all necessary data transfers must be added by the runtime. On the other hand, a data transfer between k22 and k23 is already part of app2. In summary, providing dynamic binding implies designing a virtual memory capability for GPUs. In Section 2, we discuss other scenarios where dynamic binding of applications to GPUs is desirable.

Figure 1: Example of two applications that can effectively time-share a GPU. Light-grey blocks represent GPU phases (m = device memory allocations, cHD = host-to-device data transfers, kij = kernel executions, cDH = device-to-host data transfers, and f = device memory de-allocations). Black blocks represent CPU phases.
1.1 Our Contributions
In this work, we propose a runtime component that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. Our contributions can be summarized as
follows.
We propose dynamic (or runtime) application-to-GPU binding
as a mechanism to maximize device utilization and thereby
improve performance in the presence of multi-tenancy. In
particular, dynamic binding is suitable for applications with
multiple GPU phases, and in the presence of GPUs with
different capabilities.
We identify virtual memory for GPUs as an essential
mechanism to provide dynamic binding. Specifically, we
propose a virtual memory based runtime. As an added value, our design enables: (i) detecting badly written applications in the runtime, thereby avoiding overloading the GPU with erroneous calls, and (ii) optimizing memory transfers between the multi-core host and the GPU device.
We introduce two forms of memory swapping in our runtime:
intra-application and inter-application. The former enables
applications whose kernels fit the device memory to run on the
GPU, even if their overall memory requirements exceed the
device memory capacity. The latter allows concurrent
applications whose aggregate memory requirements exceed the
device memory capacity to time-share the GPU.
We provide a generic runtime component that easily supports
different scheduling mechanisms.
We include in our runtime support for load balancing in case of
GPU addition and removal, resilience to GPU failures, and
checkpoint-restart capabilities.
The remainder of this paper is organized as follows. In Section
2, we discuss in more detail the objectives of our design. In
Section 3, we describe our reference hardware and software
architecture. In Section 4, we present our design and prototype
implementation. In Section 5, we report results from our
experimental evaluation. In Section 6, we relate our work to the
state of the art. We conclude our discussion in Section 7.
2. OBJECTIVES
The overall goal of this work is to provide a runtime component
that allows multiple applications to run concurrently on a
heterogeneous cluster whose nodes comprise CPUs and GPUs.
We foresee the use of our runtime system in two scenarios (Figure
2): (i) in combination with VM-based cloud computing services
(e.g.: Eucalyptus [24]), and (ii) in combination with HPC cluster
resource managers (e.g.: TORQUE [25]). In both cases a cluster-
level scheduler assigns VMs or jobs to heterogeneous compute
nodes. Our runtime component is replicated on each node and
schedules library calls originated by applications on the available
GPUs so as to optimize the overall performance. Our framework
must allow integration with cluster-level schedulers intended for
both homogeneous and heterogeneous clusters (the former
oblivious of the presence of GPUs).
Note that heterogeneous clusters that include GPUs require
scheduling at two granularities: on one hand, jobs must be
mapped onto compute nodes (coarse-grained scheduling); on the
other, specific library calls must be mapped onto GPUs (fine-
grained scheduling). Existing cluster-level schedulers perform
coarse-grained scheduling, whereas our runtime performs fine-
grained scheduling. The two schedulers may interact in two ways.
First, the cluster-level scheduler may be completely oblivious of
the GPUs installed on each node. In case of overloaded GPUs, the
node-level runtime may offload the computation to other nodes.
To this end, the runtime system must include a node-to-node
communication mechanism enabling inter-node code and data
transfer. Alternatively the node-level runtime may expose some
information to the cluster-level scheduler (e.g.: number of GPUs,
load level, etc.), so as to guide the cluster-level scheduling
decisions. While the first form of interaction may lead to
suboptimal scheduling decisions, it allows a straightforward
integration with existing cluster resource managers and VM-based
cloud computing services targeting homogeneous clusters.
Until recently, GPUs could not be accessed from applications
executing within VMs. Several projects – GViM [1], vCUDA [2],
rCUDA [3] and gVirtuS [4] - have addressed this issue for
applications using the CUDA Runtime API to access GPUs. The
general approach is to use API remoting to bridge two different
OS spaces: the guest-OS where the applications run and the host-
OS where the GPUs reside. In particular, API remoting is
implemented by introducing an interposed front-end library in the
guest-OS space and a back-end daemon in the host-OS. The front-
end library, which overrides the CUDA Runtime API, intercepts
CUDA calls and redirects them to the back-end through a socket
interface. In turn, the back-end issues those calls to the CUDA
runtime. Note that this mechanism provides GPU visibility from
within VMs, but does not add any form of abstraction. In fact,
applications still use CUDA Runtime primitives to direct their
calls to specific GPUs residing on the host where the VMs are
deployed. Moreover, the bare use of the scheduling mechanisms
offered by the CUDA Runtime may not be optimal when multiple
or multi-threaded applications are mapped onto a single GPU.
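For concreteness, the following sketch shows the shape of such an interposed front-end call; it is not the actual gVirtuS or paper code, and the opcode value, BACKEND_FD, send_all and recv_all are invented for illustration.

#include <cstdint>
#include <cuda_runtime.h>

extern int BACKEND_FD;                                   /* connected socket (assumed) */
bool send_all(int fd, const void* buf, size_t len);      /* assumed helpers */
bool recv_all(int fd, void* buf, size_t len);

/* Front-end override of a CUDA Runtime entry point: marshal the call, let the
 * back-end daemon execute it (or record it), and return its result code. */
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
  uint32_t opcode = 1;                                   /* hypothetical MALLOC opcode */
  if (!send_all(BACKEND_FD, &opcode, sizeof(opcode)) ||
      !send_all(BACKEND_FD, &size, sizeof(size)))
    return cudaErrorUnknown;
  uint64_t ptr = 0;
  int32_t  rc  = 0;
  if (!recv_all(BACKEND_FD, &ptr, sizeof(ptr)) ||
      !recv_all(BACKEND_FD, &rc, sizeof(rc)))
    return cudaErrorUnknown;
  *devPtr = reinterpret_cast<void*>(static_cast<uintptr_t>(ptr));
  return static_cast<cudaError_t>(rc);
}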
In this work, we aim to design a runtime that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. In addition, the runtime must be flexible
in terms of scheduling policies, and allow dynamic binding of
applications to GPUs. Finally, the runtime must support dynamic
upgrade and
downgrade
of GPUs, and be resilient to GPU
failures. More detail on these objectives is provided below.
Abstraction - GPUs installed in the cluster need to be
abstracted (or hidden) from the user's direct access. GPU
programming APIs generally require the application
programmer to explicitly select the target GPU (for example,
using the CUDA runtime cudaSetDevice primitive). This
gives the application control of the number of GPU devices to
use. Our design masks the explicit procurement of GPUs, thus
allowing a transparent mapping of applications onto GPUs. As
a side effect, applications can be efficiently mapped onto a
number of devices different from that for which they have been
originally programmed. Note that this abstraction is coherent
with the traditional parallel programming model for general
purpose processors. When a user writes a multithreaded
program, for example, he targets a generic multi-core
processor. At runtime, the operating system distributes
processing threads onto the available cores.
Figure 2: Two deployment scenarios for our runtime: (a) VM-based cloud computing service and (b) HPC cluster resource manager.

GPU Sharing – As mentioned above, applications targeting heterogeneous nodes alternate general-purpose CPU code with library calls redirected and executed on GPUs. In the presence of multi-tenancy, assigning each application a dedicated GPU device for the entire lifetime of the application may not be
optimal, in that it may lead to resource underutilization. GPU
sharing is an obvious way to improve resource utilization.
However, sharing must be done judiciously: excessive sharing
may lead to high overhead and be counterproductive.
Isolation – In the presence of resource sharing, concurrent
applications must run in complete isolation from one another.
In other words, each application must have the illusion of
running on a dedicated device. State-of-the-art runtime support
for GPUs provides partial isolation of different process
contexts. In particular, each process is assigned its own process
space on the GPU; however, GPU sharing is possible only as
long as the cumulative memory requirements of different
applications do not exceed the physical capacity of the GPU.
Our work aims to handle such memory issues seamlessly,
allowing GPU sharing irrespective of the overall memory
requirements of the applications. In other words, we want to
extend the concept of virtual memory to GPUs.
Configurable Scheduling – The quality of a scheduling policy
depends on the objective function and assumptions about the
workload. A simple first-come-first-served scheduling
algorithm can be adequate in the absence of profiling
information. A credit-based scheduling algorithm may be more
suitable to settings that include fairness in the objective
function. Further, a scheduling algorithm that prioritizes short
running applications can be preferable if profiling information
is available. Yet another scheduling policy may be adopted in
the presence of expected quality of service requirements (e.g.:
execution deadlines). Our goal is to provide a runtime system
that can easily accommodate different scheduling algorithms.
Dynamic Binding – In existing runtime systems (including the
CUDA runtime) the mapping of GPU kernels to GPU devices
is static, or programmer-defined. A dynamic application-to-
GPU binding may be preferable in several scenarios. First, let
us consider the situation of a node having GPU devices with
different compute capabilities. Existing work in the context of
heterogeneous multi-core systems [21] has shown that
performance can be optimized by maximizing the overall
processor utilization while favoring the use of more powerful
cores. The application of this concept to nodes equipped with
different GPUs suggests that the system throughput can be
maximized by dynamically migrating application threads from
less to more powerful GPUs as they become idle. Second,
dynamic binding can help when GPUs are shared by
applications cumulatively exceeding the memory capacity. In
fact, dynamically migrating application threads to different
devices may minimize waiting times. Finally, resuming
application threads on different devices allows load balancing
when GPUs are added or removed from the system (dynamic
upgrade and downgrade), and is beneficial in case of GPU
failures (by preventing a whole application restart).
Checkpoint-Restart Capability – Along with dynamic binding,
our runtime provides a checkpoint-restart mechanism that
allows efficiently redirecting an application thread to a
different GPU. A checkpoint can be explicitly specified by the
user, or automatically triggered by the runtime. For example,
the runtime may monitor the execution time of particular
library calls (e.g. kernel functions) on a GPU. An automatic
checkpoint may be advisable after long-running kernel calls to
decrease the restart penalty in case of GPU failures. Note that
this kind of checkpoint is inserted dynamically at runtime.
3. REFERENCE ARCHITECTURE
The overall reference architecture is represented in Figure 2. The
underlying hardware platform consists of a cluster of
heterogeneous nodes. Each node has one or more multi-core
processors and a number of GPUs. The operating system performs
scheduling and resource management on the general-purpose
processors. Access to the GPUs is mediated by the CUDA driver
and runtime library API. Our runtime performs scheduling and
resource management on the available GPUs.
Each GPU has a device memory. Among others, the CUDA
runtime library contains functions to: (i) target a specific device
(cudaSetDevice), (ii) allocate and de-allocate device memory
(e.g., cudaMalloc/Free), (iii) perform data transfers between
the general purpose processor and the GPU devices (e.g.,
cudaMemcpy), (iv) transfer code onto the GPUs (the internal
functions __cudaRegisterFunction/FatBinary), and
(v) trigger the execution of user-written kernels
(cudaConfigureCall and cudaLaunch).
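For reference, a minimal CUDA program exercising these call categories is sketched below (the kernel, sizes, and device index are illustrative only); every call shown is among those our front-end library intercepts, while category (iv), code registration, happens implicitly in the startup code emitted by nvcc.

#include <cuda_runtime.h>

__global__ void scale(float* v, float a, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) v[i] *= a;
}

int main() {
  const int n = 1 << 20;
  float* h = new float[n];
  for (int i = 0; i < n; ++i) h[i] = 1.0f;
  float* d = nullptr;
  cudaSetDevice(0);                                            /* (i)  device selection  */
  cudaMalloc(&d, n * sizeof(float));                           /* (ii) device allocation */
  cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice); /* (iii) data transfer    */
  scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);                 /* (v)  kernel launch     */
  cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);                                                 /* (ii) de-allocation     */
  delete[] h;
  return 0;
}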
In addition, the CUDA runtime offers some CPU multi-
threading support. For example, CUDA 3.2 associates a CUDA
context to each application thread. Several contexts can coexist on
the GPU. Each of them has a dedicated virtual address space;
contains references to textures, modules and other entities; and is
used for error handling. CUDA contexts allow different
application threads to time-share the GPU processing cores, and
space-share the GPU memory. In CUDA 4.0, the use of CUDA
contexts is slightly modified to allow data sharing and concurrent
kernel execution across threads belonging to the same application.
As mentioned in Section 1, with both versions of the CUDA
runtime, the number of parallel CUDA contexts that can be
supported at runtime is limited by the device memory capacity.
As shown in Figure 2, our runtime component interacts with a
cluster-level scheduler, operates at the node level and must be
installed on all the nodes of the cluster. The cluster-level
scheduler maps jobs onto compute nodes. During execution, the
GPU library calls issued by applications are intercepted by our
frontend library and redirected to our runtime daemon on the node
where the job has been scheduled. Since our runtime is a stand-
alone process, a mechanism for inter-process communication
between the job and our runtime daemon is needed. In our
prototype, we use the socket-based communication framework
provided as part of the open-source project gVirtuS [4][5]. This
framework relies on AF_UNIX sockets in a non-virtualized
environment and on proprietary VM-sockets in a virtualized one.
Figure 3: Overall design of the runtime.

Although the design of an optimal cluster-level scheduler for heterogeneous clusters is beyond the scope of this work, we want to be able to integrate our runtime with existing cloud computing services and cluster resource management frameworks that target homogeneous clusters. In this situation, the cluster-level scheduler
in use is unaware of both the GPU devices installed on the
compute nodes, and the fraction of execution time that each job
will spend on GPUs. Therefore, in the presence of nodes with
different hardware setups, simple cluster-level scheduling policies
may lead to queuing on nodes containing a lower number of
GPUs (or assigned a higher number of jobs targeting GPU). To
tackle this problem, we allow nodes to offload GPU library calls
to other nodes in the cluster. For this purpose, we introduce inter-
node communication between our runtime components. Note that
this mechanism operates at the granularity of GPU library calls,
and does not affect the portion of the job running on CPU.
4. DESIGN AND METHODOLOGY
In this section, we describe the design of our proposed runtime.
Our prototype implementation targets NVIDIA GPUs
programmed through the CUDA 3.2 runtime API. In Section 4.8,
we summarize the changes required to support CUDA 4.0.
4.1 Overall Design
The overall design of our runtime is illustrated in Figure 3. The
basic components are: connection manager, dispatcher, virtual-
GPUs (vGPUs), and memory manager. As mentioned before,
when applications execute on the CPU, library calls directed to
the CUDA runtime are intercepted by a frontend library and
redirected to our runtime. We say that each application establishes
a connection with the runtime, and uses the connection to issue a
sequence of CUDA calls and receive their return code. Multiple
applications establish concurrent connections. The connection
manager accepts and enqueues incoming connections. The
dispatcher dequeues pending connections and schedules their calls
on the available GPUs. If the devices on the node are overloaded,
the dispatcher may offload some connections to other nodes using
an inter-node communication mechanism. To allow controlled
GPU sharing, each GPU has an associated set of virtual-GPUs.
The dispatcher schedules applications onto GPUs by binding their
connections to the corresponding virtual-GPUs. Applications
bound to virtual-GPU vGPU
ik
share GPU
i
. Finally, the memory
manager provides a virtual memory abstraction to applications.
Dispatcher and virtual-GPUs interact with the memory manager
to enable: (i) GPU sharing in the presence of concurrent
applications with conflicting memory requirements, (ii) load
balancing in case of GPU with different capabilities, GPU
addition and removal, (iii) GPU fault tolerance, and (iv)
checkpoint-restart capabilities.
4.2 Connection Manager
When used natively, the CUDA 3.2 runtime spawns a CUDA
context on the GPU for each application thread. Different
application threads can be directed to different GPUs by using the
cudaSetDevice primitive. One of the goals of our runtime is
to preserve the CUDA semantics. To this end, our frontend library
opens a separate connection for each application thread. CUDA
calls belonging to different connections can therefore be served
independently either on the same or on distinct GPUs. The
connection manager enqueues connections generated by
concurrent application threads in a pending connections list.
4.3 Dispatcher
The primary function of the dispatcher is to schedule CUDA calls
issued by application threads onto GPUs. The dispatcher can be
configured to use different scheduling algorithms: first-come-
first-served, shortest-job-first, credit-based scheduling, etc. Some
scheduling algorithms (e.g. shortest-job-first) require the
dispatcher to make scheduling decisions based on the kernels
executed by the applications, their parameters, and their execution
configuration. Higher resource utilization and better performance
can be achieved by supporting dynamic binding of applications to
GPUs: the dispatcher must be able to modify the application-to-
GPU mapping between kernel calls, and to unbind applications
from GPUs during their CPU-phases. These scheduling actions
must be hidden from the users.
To enable informed scheduling decisions, the dispatcher must
be able to delay application-to-GPU binding until the first kernel
launch is invoked. Unfortunately, the very first CUDA calls
issued by a CUDA application are not kernel launches, but
synchronous internal routines used to register the GPU machine
code (__cudaRegisterFatBinary), kernel functions
(__cudaRegisterFunction), variables and textures
(__cudaRegisterVar, __cudaRegisterSharedVar,
__cudaRegisterShared and __cudaRegisterTexture) to
the CUDA runtime. Moreover, kernel launches are never the first
non-internal CUDA calls issued by application threads: at the
very least, they must be preceded by memory allocations and data
transfers. Before kernel launches can be invoked by the client, all
of these previous calls must be serviced.
Two observations help us overcome this problem. First,
registration functions are always issued to the runtime prior to
CUDA contexts’ creation on the GPU. Therefore, these internal
calls can be safely issued by the dispatcher well before the
corresponding applications are bound to virtual-GPUs. The same
holds for device management functions, some of which are
ignored by our runtime (e.g. cudaSetDevice) or overridden
(e.g. cudaGetDeviceCount will return the number of virtual,
not physical, GPUs). Second, it is possible to delay GPU memory
operations until the related data are accessed within kernel calls:
the runtime responds to memory allocation requests by returning
virtual addresses, and these virtual pointers are mapped to real
device pointers at a later stage.
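A minimal sketch of this idea follows, assuming a simple map-based allocator (the names VirtualAllocator and VirtualEntry are ours, not the paper's): cudaMalloc requests are answered with fabricated virtual pointers backed by host (swap) memory, and the device pointer is filled in only when a kernel launch forces allocation.

#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <map>

struct VirtualEntry {
  void*  swap_ptr;      /* host-side copy of the data */
  void*  device_ptr;    /* filled in lazily at kernel-launch time */
  size_t size;
};

class VirtualAllocator {
  uintptr_t next_ = 0x1000;                       /* arbitrary non-null start */
  std::map<uintptr_t, VirtualEntry> table_;
public:
  /* Called when the dispatcher sees a cudaMalloc: no CUDA call is issued. */
  void* allocate(size_t size) {
    uintptr_t v = next_;
    next_ += (size + 0xFFF) & ~uintptr_t(0xFFF);  /* keep virtual ranges disjoint */
    table_[v] = VirtualEntry{ std::malloc(size), nullptr, size };
    return reinterpret_cast<void*>(v);
  }
  /* Called when a kernel launch must translate its arguments. */
  VirtualEntry* lookup(void* virt) {
    auto it = table_.find(reinterpret_cast<uintptr_t>(virt));
    return it == table_.end() ? nullptr : &it->second;
  }
};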
In summary, the dispatcher dequeues application threads from
the list of pending connections, and handles them as follows.
First, it issues registration functions to the CUDA runtime.
Second, it services device management functions (and typically
overrides them so as to hide the hardware setup of the node from
the users). Third, it handles memory operations with the aid of the
memory manager. In particular, the dispatcher does not issue
memory operations directly to the CUDA runtime, but instead
operates entirely in terms of virtual addresses generated by the
memory manager. Fourth, if there are any free virtual-GPUs, the
dispatcher schedules application threads to virtual-GPUs (and
enqueues them in the list of assigned contexts). If all virtual-
GPUs are busy, application threads are enqueued in the list of
waiting contexts for later scheduling. In addition, any failure
during the execution of an application thread will cause it to be
enqueued in a list of failed contexts, which is used by the
dispatcher for recovery.
To prevent the dispatcher from being a bottleneck, its
implementation is multithreaded: each dispatcher thread processes
a different connection. All queues used within the runtime
(pending connections; waiting, assigned and failed contexts) are
accessed using mutexes.
4.4 Virtual GPUs
In order to allow time-sharing of GPUs, we spawn a configurable
number of virtual-GPUs for each GPU installed on the system. A
virtual-GPU is essentially a worker thread that issues calls
originated from within application threads to the CUDA runtime.
Figure 4: State diagram showing the transitions of the isAllocated/toCopy2Dev/toCopy2Swap flags.
Virtual-GPUs are statically bound to physical GPUs through a
cudaSetDevice invoked at system startup. Each virtual-GPU
can service one application thread at a time. A virtual-GPU is idle
when no application thread is bound to it, and is active otherwise.
Note that, since our runtime maps application threads onto
virtual-GPUs and the CUDA runtime spawns a CUDA context for
each virtual-GPU, this infrastructure preserves the semantics of
the CUDA runtime. We experimentally observed (see Section 5)
that the CUDA runtime cannot handle an arbitrary number of
concurrent threads. Therefore, limiting the number of virtual-
GPUs prevents our framework from overloading the CUDA
runtime, and allows proper operation even in the presence of a
large number of CUDA applications.
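The following sketch conveys the structure of such a worker (Call, CallQueue and execute_on_device are assumed placeholders, not the actual implementation): the thread binds once to its physical device and then serially drains the calls of whichever application thread is currently bound to it.

#include <cuda_runtime.h>

struct Call;                         /* one intercepted CUDA call (assumed) */
struct CallQueue {                   /* per-vGPU queue of calls (assumed)   */
  Call* pop_blocking();              /* returns nullptr on shutdown         */
};
void execute_on_device(Call* c);     /* issues the call to the CUDA runtime */

void vgpu_worker(int physical_device, CallQueue* queue) {
  /* Static binding of this worker (and its CUDA context) to one GPU. */
  cudaSetDevice(physical_device);
  while (Call* c = queue->pop_blocking()) {
    /* Calls of the bound application thread are issued in order, preserving
     * the CUDA runtime semantics seen by the application. */
    execute_on_device(c);
  }
}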
4.5 Memory Manager
The goal of the memory manager is to provide a virtual memory
abstraction for GPUs. Two ideas are at the basis of the design.
First, applications will not see device addresses returned by the
CUDA runtime, but they will see virtual addresses generated by
the runtime. Second, data resides in the host memory, and is
moved to the device only on demand. In this way, the host
memory represents a lower level in the memory hierarchy: when
some data must be moved to the device memory but the device
memory capacity is exceeded, the memory manager swaps data
from the device memory to the host memory. We allow memory
swapping in two situations: (i) within a single application, and (ii)
in the presence of multi-tenancy. The latter scenario is
characterized by the presence of concurrent applications, each of
whose memory footprints in isolation would fit within the device
memory, but whose aggregate memory requirements exceed the
GPU memory capacity. In addition, the swap functionality allows
an application to migrate from a less capable to a more capable
GPU when the latter becomes available.
The deferral of host-to-device data transfers must be done judiciously.
Data transfers preceding the first kernel call cannot overlap with
GPU computation, and can thus be deferred without incurring
performance losses. After the first kernel call, application-to-GPU
binding is known and our runtime can be configured to either
defer or not defer data transfers. Not deferring allows
computation-communication overlap at the expense of an
increased swap overhead; deferring has the opposite effect.
The memory manager has two components: a page table and a swap area. (Strictly speaking, "page table" is a misnomer, since allocation need not occur in multiples of any fixed "page size", but we retain the term to make the analogy with conventional virtual memory systems clear.) The page table stores the address translation, and the swap area contains GPU data that are either not yet allocated on the device or swapped out of it.
The main data structures used in the memory manager are the
following.
/* PAGE TABLE */
typedef struct {
void *virtual_ptr;
void *swap_ptr;
void *device_ptr;
size_t size;
bool isAllocated;
bool toCopy2Dev;
bool toCopy2Swap;
entry_t type;
void *params;
nesting_t nested;
} PageTableEntry;
map<Context*, list<PageTableEntry*> *> PageTable;
/* CAPACITY AND UTILIZATION of AVAILABLE GPUs */
int numGPUs;
uint64_t *CapacityList;
uint64_t *MemAvailList;
map<Context *, size_t> MemUsage;
Each page table entry (PTE), which is created upon a memory
allocation operation, contains three pointers: the virtual pointer
which is returned to the application (virtual_ptr), the pointer
of the data in the swap area (swap_ptr), and, if the data are
resident on the device, the device pointer (device_ptr). In
addition, each entry has a size, a type, and possible additional
parameters (params and nested). Finally, the flags
isAllocated, toCopy2Dev, and toCopy2Swap are used to guide device memory allocations, de-allocations and data transfers; they indicate whether the PTE has been allocated on the device, whether the actual data reside only on the host, and whether the actual data reside only on the device, respectively. The state transitions of the three flags, depending on the call invoked by the application, are illustrated in Figure 4. In particular, malloc represents any allocation operation (cudaMalloc, cudaMallocArray, etc.), whereas copyDH and copyHD represent any device-to-host and host-to-device data transfer function (cudaMemcpy, cudaMemcpy2D, etc.), respectively. Figure 4
assumes data transfer deferral and that all data referenced in a
kernel launch can be modified by the kernel execution: a more
fine-grained handling is possible if the information about read-
only and read-write parameters is available. The attributes type
and params allow distinguishing different kinds of memory
allocations and data transfers associated with the entry. The
nested attribute indicates whether the virtual address points to a
nested data structure, or whether it is a member of it. Nested data
structures must be declared to the runtime using a specific runtime
API call, and are associated with additional attributes describing their
structure. These attributes are used by the memory manager in
order to ensure consistency between virtual and device pointers
within nested structures.
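A compact sketch of these flag transitions is given below, assuming data-transfer deferral and that a launched kernel may modify every argument it receives; PTE is a reduced PageTableEntry and the issue_* helpers stand in for the corresponding CUDA runtime calls.

struct PTE {                          /* reduced PageTableEntry */
  bool isAllocated = false;           /* entry has device memory behind it */
  bool toCopy2Dev  = false;           /* up-to-date data live only in the swap area */
  bool toCopy2Swap = false;           /* up-to-date data live only on the device */
};
void issue_malloc_on_device(PTE&);    /* assumed wrappers around cudaMalloc, */
void issue_copy_swap_to_device(PTE&); /* cudaMemcpy and cudaFree             */
void issue_copy_device_to_swap(PTE&);
void issue_free_on_device(PTE&);

void on_copy_hd(PTE& e) {             /* application copies host data in */
  e.toCopy2Dev = true;                /* device copy (if any) is now stale */
  e.toCopy2Swap = false;
}
void on_launch(PTE& e) {              /* entry referenced by a kernel launch */
  if (!e.isAllocated) { issue_malloc_on_device(e); e.isAllocated = true; }
  if (e.toCopy2Dev)   { issue_copy_swap_to_device(e); e.toCopy2Dev = false; }
  /* after the kernel runs, conservatively assume the device copy changed */
  e.toCopy2Swap = true;
}
void on_copy_dh(PTE& e) {             /* application reads data back */
  if (e.toCopy2Swap) { issue_copy_device_to_swap(e); e.toCopy2Swap = false; }
}
void on_swap(PTE& e) {                /* internal eviction from the device */
  if (e.toCopy2Swap) { issue_copy_device_to_swap(e); e.toCopy2Swap = false; }
  if (e.isAllocated) { issue_free_on_device(e); e.isAllocated = false; }
  e.toCopy2Dev = true;                /* must be re-copied before the next launch */
}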
Each application thread (or context) has an associated list of
PTEs: the page table contains all the PTEs for all the active and
pending contexts in the node. In addition, the memory manager
keeps track of the capacity and the memory availability of each
GPU (CapacityList and MemAvailList) and of the memory usage of each context (MemUsage). This information is used to determine whether binding an application thread to a GPU can potentially lead to exceeding its memory capacity.

Table 1: For each application call, the actions performed by our runtime and the possible errors returned. Actions listed with no explicit error forward any error generated by the CUDA runtime (i.e., result codes other than cudaSuccess). PTE = page table entry.

Malloc: create PTE (error: a virtual address cannot be assigned); allocate swap (error: swap memory cannot be allocated).
CopyHD: check valid PTE (error: no valid PTE); move data to swap (error: swap-data size mismatch).
CopyDH: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH.
Free: check valid PTE (error: no valid PTE); de-allocate swap (error: cannot de-allocate swap); if (PTE.isAllocated) cudaFree.
Launch: check valid PTE (error: no valid PTE); if (!PTE.isAllocated) cudaMalloc; if (PTE.toCopy2Dev) cudaMemcpyHD; cudaLaunch.
Swap: check valid PTE (error: no valid PTE); if (PTE.toCopy2Swap) cudaMemcpyDH; if (PTE.isAllocated) cudaFree.
Table 1 shows the actions performed by the runtime for each
memory-related call invoked by the application. For simplicity,
we show the data transfer deferral configuration. Note that, in this
case, malloc and copyHD (data copy from host to device) do not
trigger any CUDA runtime actions. Swap is an internal function
which is triggered by the runtime when some data must be
swapped from device to host memory to make room for data on
the GPU. Like malloc, swap operates on a single page table entry.
Two scenarios are possible: intra-application swap and inter-
application swap. Independent of the kind, the swap operation can
be triggered by the runtime while trying to allocate device
memory to execute a kernel launch. Memory operations on nested
structures will be extended also to their PTE members.
Intra-application swap – Consider the following sequence of
calls coming from the same application app, where matmul is a
matrix multiplication kernel for square matrices.
1. malloc(&A_d, size);
2. malloc(&B_d, size);
3. malloc(&C_d, size);
4. copyHD(A_d, A_h, size);
5. matmul(A_d, A_d, B_d); // B_d = A_d * A_d
6. matmul(B_d, B_d, C_d); // C_d = B_d * B_d
7. copyDH(B_h, B_d, size);
8. copyDH(C_h, C_d, size);
If the above application is run on the bare CUDA runtime and the
data sizes are such that only two matrices fit the device memory,
the execution will fail on the third instruction (that is, when trying
to allocate the third matrix). On the other hand, when our runtime
is used, no memory allocation is performed until the first kernel
launch (instruction 5). Previous instructions update only page
table and swap memory. Instruction 5 will cause the allocation of
matrices A_d and B_d and the data transfer of A_h to A_d, and
will execute properly. During execution of instruction 6, the
runtime will detect the need for freeing device memory. Before
trying to swap and unbind other applications from the GPU, the
runtime will analyze the page table of app and detect that data
A_d, not required by instruction 6, can be swapped to host. This
will allow the application to complete with no error. In summary,
intra-application swap enables the execution of applications that
would fail on the CUDA runtime even if run in isolation. In other
words, the maximum memory footprint of the “larger” kernel
(rather than the overall memory footprint of the application) will
determine whether the application can correctly run on the device.
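The victim-selection step underlying intra-application swap can be sketched as follows (Entry, swap_out and the function name are illustrative placeholders): evict only entries of the same context that are resident on the device and not referenced by the pending kernel, until enough memory has been reclaimed.

#include <cstddef>
#include <list>
#include <set>

struct Entry {                 /* reduced view of a PageTableEntry */
  bool   isAllocated;          /* currently resident on the device */
  void*  virtual_ptr;          /* virtual address handed to the application */
  size_t size;
};
void swap_out(Entry& e);       /* device -> swap copy followed by cudaFree (assumed) */

/* Returns true if enough device memory could be reclaimed from this context. */
bool intra_app_swap(std::list<Entry*>& ptes_of_context,
                    const std::set<void*>& args_of_pending_kernel,
                    size_t bytes_needed) {
  size_t reclaimed = 0;
  for (Entry* e : ptes_of_context) {
    if (reclaimed >= bytes_needed) break;
    /* Only evict data that is on the device and not used by the pending kernel. */
    if (e->isAllocated && args_of_pending_kernel.count(e->virtual_ptr) == 0) {
      swap_out(*e);
      reclaimed += e->size;
    }
  }
  return reclaimed >= bytes_needed;
}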
Inter-application swap – This kind of swap may take place
when concurrent applications mapped onto the same device have
conflicting memory requirements. In particular, if device
memory cannot be allocated and intra-application swap is not
possible, the memory manager will be queried for applications
running on the same GPU and using the amount of memory
required. If such an application exists, it will be asked to swap.
The application may or may not accept the request: for instance,
an application running in a CPU phase with no pending requests
may swap, but an application in the middle of a kernel call may
not. If no application honors the swap request, the calling
application will unbind from the virtual-GPU and retry later.
Otherwise, all the page table entries belonging to the application
that accepts the request
will be swapped, and such application will
be temporarily unbound from the GPU. There may be situations
where multiple applications must swap for the required memory
to be freed. To reduce complexity and avoid inefficiencies, we do
not trigger the swap in these situations. Note that inter-application
swap implies coordination among virtual-GPUs and, as a
consequence, has a higher overhead than intra-application swap.
To avoid deadlocks, synchronization is required while accessing
the page table. Finally, note that enabling swaps only during CPU
phases allows GPU intensive applications to make full use of the
GPU.
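A sketch of this negotiation is shown below; find_candidate_on_gpu, request_swap and the BindAction outcome are illustrative names, and, as discussed above, only a single victim is considered.

#include <cstddef>

struct Context;
Context* find_candidate_on_gpu(int gpu, size_t bytes_needed);  /* assumed: one context
                                                                  holding enough memory */
bool request_swap(Context* victim);  /* assumed: victim accepts only if it is in a CPU
                                        phase with no pending GPU work */

enum class BindAction { Proceed, UnbindAndRetry };

BindAction resolve_memory_conflict(int gpu, size_t bytes_needed) {
  Context* victim = find_candidate_on_gpu(gpu, bytes_needed);
  if (victim != nullptr && request_swap(victim))
    return BindAction::Proceed;        /* victim swapped out and unbound */
  return BindAction::UnbindAndRetry;   /* no single victim, or request declined */
}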
To determine whether a memory allocation can be serviced, the
runtime will first use the memory utilization data in the memory
manager (CapacityList, MemAvailList, and MemUsage).
However, because of possible memory fragmentation on GPU, the
runtime may need to use the return code of the GPU memory
allocation function to ensure that the request can be honored.
Moreover, there may be cases where only some GPUs have the
required memory capacity.
Finally, we point out two additional benefits of our design.
First, bad memory operations (for instance, data transfers beyond
the boundary of an allocated area) can be detected by the memory
manager without overloading the CUDA runtime with calls that
would fail. Second, multiple data copy operations within the same
allocated area (i.e., the same page table entry) will trigger a
single, bulk memory transfer to the device memory.
4.6 Fault Tolerance & Checkpoint-Restart
The memory manager provides an implicit checkpoint capability that enables load balancing when more powerful GPUs become idle or when
GPUs are dynamically added and removed from the system, and
recovery in case of GPU failures. For each application thread, the
page table and the swap memory contain the state of the device
memory. In addition, an internal data structure (called Context)
contains other state information, such as: a link to the connection
object, the information about the last device call performed, and,
if the application thread fails, the error code. With this state
information, dynamic binding allows redirecting contexts to
different GPUs and resuming their operation. The dispatcher will
monitor the availability of the devices and schedule contexts from
the failed contexts list (in case of GPU failure or removal) and
unbind and reschedule applications from the assigned contexts list
in case of GPU addition. Our mechanism can be combined with
BLCR [29] in order to enable these mechanisms also after a full
restart of a node. Finally, our runtime has an internal check-
pointing primitive that can be dynamically triggered after long
running kernels, to allow fast recovery in case of failures.
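As an illustration of the automatic trigger (the helper names and the 10-second threshold are assumptions, not values from the paper), the vGPU can time each completed kernel call and snapshot the context's device state when the call runs long.

#include <chrono>

struct Context;
void launch_and_wait(Context* ctx);     /* issues the kernel launch and waits for it (assumed) */
void checkpoint_to_swap(Context* ctx);  /* copies all resident entries device -> swap (assumed) */

void run_kernel_with_auto_checkpoint(Context* ctx) {
  auto start = std::chrono::steady_clock::now();
  launch_and_wait(ctx);
  auto secs = std::chrono::duration_cast<std::chrono::seconds>(
                  std::chrono::steady_clock::now() - start).count();
  if (secs > 10)                        /* assumed threshold for "long-running" */
    checkpoint_to_swap(ctx);            /* cheap restart point if the GPU later fails */
}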
4.7 Inter-node Offloading
If the GPUs installed on a node are overloaded, our runtime can
offload some application threads to other nodes. Note that this
mechanism allows transferring only the CUDA calls originating
within an application, and not its CPU phases. In particular, the
runtime redirects application threads in the list of pending
connections to other nodes using a TCP socket interface. A
measure of the load on the system is provided by the size of the
list of pending connections. We allow the dispatcher to process
pending connections only if the number of pending contexts is
below a given threshold.
4.8 CUDA 4 Support
We briefly discuss the modifications required by our runtime in
order to support CUDA 4.0. The most significant changes in
CUDA 4.0 are the following: (i) all threads belonging to the same
application are mapped onto the same CUDA context, and (ii)
each application thread can use multiple devices by issuing
multiple cudaSetDevice calls. The first change has been
introduced to enable application threads to share data on GPU.
Our current implementation does not differentiate threads
belonging to the same application from threads belonging to
different ones. Moreover, to avoid explicit assignment of threads
to GPUs, our runtime ignores all cudaSetDevice calls issued
by applications. Compatibility with CUDA 4.0 requires the
following changes. First, each thread connection should carry the
information about the corresponding application identifier. This
information will be used to ensure that application threads sharing
data are mapped onto the same device. Second,
cudaSetDevice calls can be used to identify groups of
CUDA calls that can potentially be assigned to different GPUs.
Note that, because of the dynamic binding capability of our
runtime, the latter modification is not strictly required. However,
its introduction can help make efficient scheduling decisions
with minimal overhead. Finally, CUDA 4.0 allows a more
efficient and direct GPU-to-GPU data transfer. Our runtime can
take advantage of this mechanism to provide faster thread-to-GPU
remapping.
5. EXPERIMENTAL RESULTS
The experiments were conducted in two environments: on a single
node and on a three-node cluster. Unless otherwise indicated, the
metric reported in all experiments is the overall execution time for
a batch of concurrent jobs (that is, the time elapsed from when the
first job starts to when the last job finishes processing). We observed
analogous trends when considering the average execution time
across the jobs in the batch.
In all experiments, we adopted a first-come-first-served
scheduling policy that assigns jobs to physical GPUs in a round-
robin fashion and attempts to perform load balancing (by keeping
the number of active vGPUs uniform across all available GPUs).
The runtime is configured to defer all data transfers. All data
reported in experiments using our runtime include all the
overheads introduced by our framework: call interception,
queuing delays, scheduling, memory management, and, whenever
performed, swap operations and the related synchronization. Given
the parallel nature of the system, such overheads are not additive.
5.1 Hardware Setup
The system used in our node-level experiments includes eight
Intel Xeon E5620 processors running at 2.40 GHz and is equipped
with 48 GB of main memory and three NVIDIA Fermi GPUs
(two Tesla C2050s and one Tesla C1060). Each Tesla C2050 has
14 streaming multiprocessors (SMs) with 32 cores per SM, each
running at 1.15 GHz, and 3 GB of device memory. The Tesla
C1060 has 30 SMs with 8 cores per SM, and 4 GB of device
memory. In one experiment, we replaced the Tesla C1060 with
the less powerful NVIDIA Quadro 2000 GPU, equipped with four
48-core SMs and 1 GB device memory. In our cluster-level
experiments we used an additional node with the same CPU
configuration but equipped with a single Tesla C1060 GPU card.
5.2 Benchmarks
The benchmark applications used in our experiments are listed in Table 2. These applications, obtained from the Rodinia Benchmark Suite [30] and NVIDIA's CUDA SDK, cover several application domains, and differ in their memory occupancy, their GPU intensity and their interleaving of computation between CPU and GPU. We divide the workload into two categories: short-running and long-running applications. When using a Tesla C2050 GPU, the former report a running time between 3 and 5 seconds each, and the latter between 30 and 90 seconds (depending on the CPU phase injected; see Section 5.3.3). In Table 2, we also report the number of kernel calls performed by each application. All short-running applications and BS-L are GPU intensive and have memory requirements well below the capacity of the GPUs in use. MM-S and MM-L are injected with CPU phases of different lengths; MM-L has high memory requirements.

Table 2: Benchmark programs (the number of kernel calls per program is given in parentheses).

Short-running applications:
Back Propagation (BP): training of 20 neural networks with 64K nodes per input layer (40).
Breadth-First Search (BFS): traversal of a graph with 1M nodes (24).
HotSpot (HS): thermal simulation of 1M grids (1).
Needleman-Wunsch (NW): DNA sequence alignment of 2K potential pairs of sequences (256).
Scalar Product (SP): scalar product of vector pairs (512 vector pairs of 1M elements) (1).
Matrix Transpose (MT): transpose of a 384x384 matrix (816).
Parallel Reduction (PR): parallel reduction of 4M elements (801).
Scan (SC): parallel prefix sum of 260K elements (3,300).
Black Scholes - small (BS-S): processing of 4M financial options (256).
Vector Addition (VA): 100M-element vector addition (1).

Long-running applications:
Small Matrix Multiplication (MM-S): 200 matrix multiplications of 2Kx2K square matrices with variable CPU phases (200).
Large Matrix Multiplication (MM-L): 10 matrix multiplications of 10Kx10K square matrices with variable CPU phases (10).
Black Scholes - large (BS-L): processing of 40M financial options (256).

Figure 5: Execution time reported with a variable number of short-running jobs on a node with 1 GPU. The bare CUDA runtime is compared with our runtime.

Figure 6: Execution time reported with a variable number of short-running jobs on a node with 3 GPUs. The bare CUDA runtime cannot handle more than 8 concurrent jobs.
5.3 Node-level Experiments
5.3.1 Overhead Evaluation
First, we measured the overhead of our framework with respect to
the CUDA runtime. We allowed our runtime to use only one
physical GPU, and varied the number of virtual GPUs (vGPUs).
The execution time of the bare CUDA runtime gives a lower
bound that allows us to quantify the overhead associated with our
framework. The data in Figure 5 were obtained by randomly
drawing jobs from the pool of short-running applications in Table
2, and averaging the results over ten runs. To ensure an apples-to-apples
comparison, we ran each randomly drawn
combination of jobs on all five reported configurations (bare CUDA
runtime and our runtime using 1, 2, 4 and 8 vGPUs).
Since our experiments showed that the CUDA runtime cannot
handle more than eight concurrent CUDA contexts, we limited the
number of jobs to eight. As can be seen in Figure 5, the total
execution time of our runtime approaches the lower limit (CUDA
runtime) as we increase the number of vGPUs. Increasing the
number of vGPUs means increasing the sharing of the physical
GPU, thus amortizing the overhead of the framework (which, in
the worst case, accounts for about 10% of the execution time).
Note that the percentage overhead would decrease on long-
running applications.
5.3.2 Benefits of GPU Sharing
In our second set of experiments we evaluated the effect of GPU
sharing in the presence of more (three) physical GPUs. We used
the same workload as in Section 5.3.1, and again varied the
number of vGPUs per device. We recall that the number of
vGPUs represents the number of jobs that can time-share a GPU.
As mentioned in the previous section, we found that the CUDA
runtime does not currently support more than eight concurrent
jobs stably. Therefore, we do not report results using the bare
CUDA runtime beyond eight jobs. Figure 6 shows that, when
using four vGPUs per device, our runtime reports some
performance gain compared to the bare CUDA runtime. In fact,
the overhead of our framework is compensated by its ability to
load balance jobs on different physical GPUs. When running a
higher number of concurrent jobs, our results confirm our
previous finding that increasing the amount of GPU sharing
positively impacts performance. However, we do not observe
significant performance improvements when more than four
vGPUs are employed. We believe that four vGPUs per device
provide a good compromise between resource sharing and
runtime overhead, and we use this setting in the rest of our
experiments.
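To make the notion of a vGPU concrete, the sketch below models each physical GPU as a pool of vGPU slots: a job must claim a slot before issuing work to the device, so at most N jobs time-share the GPU at any time. This is our own minimal illustration of the admission policy, not the runtime's actual code, and the names (VirtualGpuPool, acquire, release) are hypothetical.

#include <mutex>
#include <condition_variable>

// Hypothetical sketch: each physical GPU exposes a fixed number of vGPU
// slots; a job must hold a slot while it issues work to the device.
class VirtualGpuPool {
public:
    explicit VirtualGpuPool(int num_vgpus) : free_slots_(num_vgpus) {}

    // Block until a vGPU slot is available, then claim it.
    void acquire() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return free_slots_ > 0; });
        --free_slots_;
    }

    // Return the slot so another pending job can be admitted.
    void release() {
        { std::lock_guard<std::mutex> lock(mutex_); ++free_slots_; }
        cv_.notify_one();
    }

private:
    int free_slots_;
    std::mutex mutex_;
    std::condition_variable cv_;
};

int main() {
    VirtualGpuPool gpu0(4);  // four vGPUs per device, as used in our experiments
    gpu0.acquire();          // claimed before a job's first CUDA call on the device
    gpu0.release();          // released after the job's last CUDA call
    return 0;
}

With four slots per device, at most four jobs time-share each physical GPU, which matches the setting used in the rest of the experiments.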
5.3.3 Conflicting Memory Needs: Effect of Swapping
The effect of swapping can be evaluated by using memory-hungry
applications. To this end, we considered large matrix
multiplication (MM-L). This benchmark program performs ten
square matrix multiplications on randomly generated matrices.
We set the data set size so as to create conflicting memory requirements when more than two jobs are mapped onto the same GPU. In addition, we injected CPU phases of various sizes into the matrix multiplication benchmark. CPU phases are interleaved with kernel calls, and simulate different levels of post-processing on the product of the matrix multiplication.
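For concreteness, the sketch below shows the overall structure of such a benchmark under our own assumptions: the matrix size, iteration count, and CPU-phase length are placeholder constants, and the matrices are filled with constants rather than random values. Each iteration launches a matrix-multiplication kernel and then spins in a host-side loop that emulates post-processing.

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <vector>

// Naive square matrix multiply: C = A * B (N x N, row-major).
__global__ void matmul(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Busy-wait for 'ms' milliseconds to emulate a CPU post-processing phase.
static void cpu_phase(int ms) {
    auto end = std::chrono::steady_clock::now() + std::chrono::milliseconds(ms);
    while (std::chrono::steady_clock::now() < end) { /* spin */ }
}

int main() {
    const int N = 2048, iterations = 10, cpu_ms = 500;  // placeholder sizes
    const size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(size_t(N) * N, 1.0f), hB(size_t(N) * N, 2.0f);

    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((N + 15) / 16, (N + 15) / 16);
    for (int i = 0; i < iterations; ++i) {
        matmul<<<grid, block>>>(dA, dB, dC, N);  // GPU phase
        cudaDeviceSynchronize();
        cpu_phase(cpu_ms);                       // injected CPU phase
    }
    printf("done\n");
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

Increasing cpu_ms corresponds to increasing the fraction of CPU work discussed below.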
The effect of swapping is evaluated by running 36 MM-L jobs
concurrently. In order to compare the swapping and no-swapping
cases, we conducted experiments with one vGPU (no swapping
required) and four vGPUs (swapping required). We recall that, in
the one vGPU case, jobs run sequentially on a physical GPU, and
therefore there is no memory contention. In the experiment, the
fraction of CPU work is varied while maintaining the level of
GPU work. Figure 7 shows that the total execution time grows
linearly with the fraction of CPU work in the case of serialized
execution (1 vGPU). In the case of GPU sharing (4 vGPUs), the
overall execution time remains roughly constant even as the amount of work in each job increases. In fact, swapping can effectively reduce the total execution time by hiding the CPU-driven latency. In the chart, the number on top of each bar indicates the number of swap operations that occurred during execution. This experiment
demonstrates that our swapping mechanism can effectively
resolve resource conflicts among the concurrently running
applications. In addition, despite its overhead, this mechanism
provides performance improvement to applications with a
considerable fraction of CPU work.
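To give an intuition of what a swap operation involves, the following sketch evicts a device buffer to host memory and later restores it. This is a simplified illustration we provide, not the memory manager's actual implementation; the SwappableBuffer descriptor and function names are hypothetical.

#include <cuda_runtime.h>
#include <cstdlib>

// Hypothetical descriptor for one virtual allocation tracked by the runtime.
struct SwappableBuffer {
    void*  dev_ptr;    // current device copy (nullptr while swapped out)
    void*  host_copy;  // backing store in host memory while swapped out
    size_t size;
};

// Evict the buffer to host memory, freeing device memory for other jobs.
void swap_out(SwappableBuffer& b) {
    if (!b.dev_ptr) return;
    b.host_copy = std::malloc(b.size);
    cudaMemcpy(b.host_copy, b.dev_ptr, b.size, cudaMemcpyDeviceToHost);
    cudaFree(b.dev_ptr);
    b.dev_ptr = nullptr;
}

// Bring the buffer back to the device before the next kernel needs it.
void swap_in(SwappableBuffer& b) {
    if (b.dev_ptr) return;
    cudaMalloc(&b.dev_ptr, b.size);
    cudaMemcpy(b.dev_ptr, b.host_copy, b.size, cudaMemcpyHostToDevice);
    std::free(b.host_copy);
    b.host_copy = nullptr;
}

int main() {
    SwappableBuffer buf{nullptr, nullptr, 64u << 20};  // 64 MB example buffer
    cudaMalloc(&buf.dev_ptr, buf.size);
    swap_out(buf);  // e.g., while the owning job runs a long CPU phase
    swap_in(buf);   // restored before the job's next kernel call
    cudaFree(buf.dev_ptr);
    return 0;
}

Because the application addresses its data through the virtual memory abstraction, the device pointer obtained after swap_in may differ from the original one without the application noticing.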
Figure 7: 36 MM-L jobs (with conflicting memory requirements) are run on a node with 3 GPUs. The fraction of CPU code in the workload is varied. We indicate the number of swap operations that occurred on top of each bar.
Figure 8: 36 jobs (BS-L and MM-L) are run on a node with 3 GPUs. The workload composition (fraction BlackScholes/Matmul) is varied. We indicate the number of swap operations that occurred on top of each bar.
We next investigated the performance of our runtime when
combining applications with different amounts of CPU work. In particular, we mixed BS-L with MM-L at different ratios (Figure 8). BS-L is a GPU-intensive application with very short CPU phases, whereas MM-L was set to have a fraction of CPU work equal to 1. The memory requirements of BS-L are below those of MM-L. Again, we ran 36 jobs concurrently. The results of these experiments are shown in Figure 8. Again, the number on top of each bar indicates the number of swap operations that occurred during execution. As one might expect, the performance gain
from GPU sharing increases as MM-L becomes dominant.
Because BS-L is a GPU-intensive application and swapping adds
additional overhead, this results in a longer execution time for
four vGPUs at a 75/25 mix of BS-L and MM-L.
5.3.4 Benefits of Dynamic Load Balancing
In Figure 9, we show the results of experiments performed on an
unbalanced node that contains two fast GPUs and one slow GPU: two Tesla C2050s and one Quadro 2000, respectively. In one setting,
our runtime performs load balancing as follows. The dispatcher
keeps track of fast GPUs becoming idle, and, in the absence of
pending jobs, it migrates running jobs from slow to fast GPUs.
The experiments are conducted on MM-S jobs with varying
CPU fraction, and using 4 vGPUs per device. The number of jobs
migrated is reported on top of each bar. As can be seen, despite
the overhead due to job migration, load balancing through
dynamic binding of jobs to GPUs is an effective way to improve the performance of an unbalanced system. This holds especially in the presence of small batches of jobs and of applications alternating CPU and GPU phases. As the number of concurrent jobs increases, the system performs load balancing by scheduling pending jobs on fast GPUs, rather than by migrating jobs already running on slow devices.
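The migration policy just described can be summarized by the sketch below, which is our reconstruction of the behavior with hypothetical names: when a fast GPU becomes idle and no jobs are pending, the dispatcher selects a job currently bound to a slow GPU as a migration candidate.

#include <deque>
#include <vector>

struct Gpu { bool fast; bool idle; };
struct Job { int bound_gpu; };  // index of the GPU the job currently runs on

// Hypothetical dispatcher decision, invoked whenever a GPU becomes idle.
// Returns the index of a running job to migrate to 'idle_gpu', or -1 if none.
int pick_job_to_migrate(const std::vector<Gpu>& gpus,
                        const std::vector<Job>& running,
                        const std::deque<Job>& pending,
                        int idle_gpu) {
    if (!gpus[idle_gpu].fast) return -1;  // only migrate toward fast GPUs
    if (!pending.empty()) return -1;      // prefer scheduling pending jobs first
    for (int j = 0; j < (int)running.size(); ++j) {
        if (!gpus[running[j].bound_gpu].fast)  // job currently on a slow GPU
            return j;                          // candidate for migration
    }
    return -1;
}

int main() {
    std::vector<Gpu> gpus = {{true, true}, {false, false}};  // fast idle, slow busy
    std::vector<Job> running = {{1}};                        // one job on the slow GPU
    std::deque<Job> pending;                                 // no pending jobs
    return pick_job_to_migrate(gpus, running, pending, 0) == 0 ? 0 : 1;
}

Re-binding a running job is feasible because its device state is accessed through the virtual memory abstraction described earlier, so the required data can be recreated on the destination GPU before the job's next kernel call.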
5.4 Cluster-level Experiments
We have integrated our runtime with TORQUE, a cluster-level
scheduler that can be used to run GPU jobs on heterogeneous
clusters. In this section, we show experiments performed on a
cluster of three nodes. The jobs are submitted at a head node and
executed on two compute nodes. The hardware configuration of
the compute nodes is described in Section 5.1. Having a three-
and a single-GPU compute node, our cluster is unbalanced.
When TORQUE is used on a cluster equipped with GPUs, it
relies on the CUDA runtime to execute GPU calls. Since the
CUDA runtime does not provide adequate support for concurrency, TORQUE does not allow any form of GPU sharing
across jobs. Therefore, when configured to use compute nodes
equipped with GPUs, TORQUE serializes the execution of
concurrent jobs by enqueuing them on the head node and
submitting them to the compute nodes only when a GPU becomes
available. By coupling TORQUE with our runtime system, we are
able to provide GPU sharing to concurrent jobs.
When coupling TORQUE with our runtime, we conducted
experiments with three settings. In all cases, to force TORQUE to
submit more jobs to the compute nodes than available GPUs, we hid the presence of GPUs from TORQUE, and handled GPU management only within our runtime. In the first setting, our runtime was
configured to use only one vGPU per device, and therefore to
serialize the execution of concurrent jobs. In the second setting,
we allowed GPU sharing by using four vGPUs per device. In the
third setting, we additionally enabled load balancing across
compute nodes by allowing inter-node communication and
offloading. We also performed experiments using TORQUE
natively on the bare CUDA runtime. However, the results
reported using this configuration are far worse than those reported
using TORQUE in combination with our runtime. Therefore, we
show the use of our runtime with one vGPU per device as an example of no GPU sharing.
Figure 9: Unbalanced node with 2 Tesla C2050s and 1 Quadro 2000: effect of load balancing through dynamic binding. The number of MM-S jobs migrated to fast GPUs is reported on top of each bar (bars are grouped by CPU fraction, 0 and 1).
Figure 11: Two-node cluster using TORQUE: effect of GPU sharing and load balancing via inter-node offloading in the presence of long-running jobs and conflicting memory requirements.
In Figure 10, we show experiments conducted using a variable number of short-running jobs drawn from the applications in Table 2. In this set of experiments, jobs do not exhibit conflicting
memory requirements. Again, we average the results reported
over ten runs. As can be seen, GPU sharing allows up to a 28%
performance improvement over serialized execution. However, TORQUE, which relies on our runtime and is unaware of the
number and location of the GPUs in the cluster, divides the
workload equally between the two nodes. Thus, the node with
only one GPU is overloaded compared to the other node with
three GPUs. When, in addition to GPU sharing, we allow load
balancing through our inter-node offloading technique, the overall
throughput is further improved by up to 18%.
Finally, we want to show the benefits of our runtime system in a
cluster in the presence of jobs with conflicting memory
requirements. To this end, we run 16, 32 and 48 BS-L and MM-L
jobs (25/75 distribution). We recall that these two applications
have long runtimes. The results of this experiment are shown in
Figure 11. Again, serialized execution avoids memory
conflicts. From the figure, it is clear that allowing jobs to share
GPUs increases the throughput significantly (up to 50%), despite
the overhead due to the need for swap operations. Moreover, in
the presence of load imbalances, the execution is further
accelerated by allowing the overloaded node to offload the excess
jobs remotely.
6. RELATED WORK
Our proposal is closely related to two categories of work: runtime
systems to enable GPU virtualization [1][2][3][4][6], and
memory-aware runtimes for heterogeneous nodes including CPUs
and GPUs [8][9]. As mentioned previously, GViM [1], vCUDA
[2], rCUDA [3] and gVirtuS [4] use API remoting to provide GPU
visibility from within virtual machines. GViM and vCUDA
leverage the multiplexing mechanism provided by the CUDA
runtime in order to allow GPU sharing among different
applications. In addition, GViM uses a Working Queue per GPU
to evenly distribute the load across different GPUs. However, as
discussed, the CUDA runtime cannot properly handle a large
number of concurrent applications, nor concurrent applications
whose aggregate memory requirements exceed the memory
capacity of the underlying GPU device. Our work addresses both
these issues, and allows dynamic scheduling of jobs on GPUs.
As an additional feature, GViM provides a mechanism to minimize the overhead of memory transfers when GPUs are used within virtualized environments. In particular, its authors propose using the mmap Unix system call to avoid the data copy between the guest OS and the host OS. Whenever possible, they also propose using page-locked memory (along with the cudaMallocHost primitive) in order to avoid the additional data copy between host
OS and GPU memory. Memory transfer optimization is
orthogonal to the objectives of this work. In the future, we plan to
include these optimizations in our runtime.
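For reference, the page-locked pattern mentioned above amounts to allocating the host staging buffer with cudaMallocHost instead of a pageable allocation. The snippet below is generic CUDA usage shown for illustration, not GViM code, and the transfer size is an arbitrary example.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 64u << 20;  // 64 MB transfer
    float *pinned_host = nullptr, *dev = nullptr;

    // Page-locked host memory: the GPU can DMA from it directly,
    // avoiding the extra copy through a pageable staging buffer.
    cudaMallocHost((void**)&pinned_host, bytes);
    cudaMalloc((void**)&dev, bytes);

    cudaMemcpy(dev, pinned_host, bytes, cudaMemcpyHostToDevice);

    cudaFree(dev);
    cudaFreeHost(pinned_host);
    printf("transfer done\n");
    return 0;
}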
Guevara et al. [7] propose kernel consolidation as a way to share GPUs. They show that this mechanism is particularly effective in the presence of kernels with complementary resource requirements (e.g., compute-intensive and memory-intensive kernels). The concept of kernel consolidation has been reconsidered and explored in the context of GPU virtualization by Ravi et al. [6]. Differently from us, Ravi et al. assume that the overall memory footprint of the consolidated applications fits in the device memory, and statically bind applications to GPUs. Our
proposal is in a way orthogonal to [6]. In fact, the delayed
application-to-GPU binding and the deferral of memory
operations offered by our runtime should allow easy and efficient
integration of kernel consolidation.
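As background, the hardware mechanism that kernel consolidation builds on can be illustrated with CUDA streams: kernels launched into different non-default streams may execute concurrently on devices that support it. The example below is a generic illustration we provide, with made-up kernels standing in for compute-intensive and memory-intensive work; it is not the consolidation framework of [6] or [7], which operates across separate applications.

#include <cuda_runtime.h>

// Two small kernels with complementary behavior: one compute-bound,
// one memory-bound, in the spirit of the workloads considered in [7].
__global__ void compute_heavy(float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = (float)i;
        for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
        out[i] = x;
    }
}
__global__ void memory_heavy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));
    cudaMalloc((void**)&c, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Launched into distinct streams, the two kernels may overlap on the GPU.
    compute_heavy<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    memory_heavy<<<(n + 255) / 256, 256, 0, s2>>>(b, c, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}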
Gelado et al. [8] and Becchi et al. [9] propose two similar
memory-management frameworks for nodes including CPUs and
GPUs. The primary goal of these proposals is to hide the
underlying distributed memory system from the programmer, and
automatically move data between CPU and GPU as they are
needed. By doing so, these frameworks eliminate some unnecessary memory transfers between CPU and GPU. These
proposals, however, do not target multi-tenancy and conflicting
memory requirements among concurrent applications, which are
the main focus of our work. On one hand, the memory module we
design has some similarities with these two proposals (tracking of
mapping between CPU and GPU address spaces, memory transfer
deferral and optimization); on the other hand, it extends these
frameworks and focuses on the multi-tenancy scenario.
NVCR [15] provides a checkpoint-restart mechanism for
CUDA applications written using the CUDA driver and runtime
APIs. The framework is intended to be integrated with BLCR [29]. Checkpoints are inserted at each memory and kernel operation
using an intercept library. To ensure memory consistency and
reconstruct the device pointer information, NVCR requires
replaying all memory allocations performed by the application
after every restart, leading to a potentially high overhead. Our
virtual memory abstraction allows us to replay only memory
operations required by not-yet-executed kernel calls.
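As a rough illustration of how such an intercept library operates, the sketch below shadows cudaMalloc via dynamic linking, logs the request, and forwards it to the real CUDA runtime. This is a generic LD_PRELOAD-style interposer we provide for illustration; it is neither NVCR's code nor our runtime's, and a real system would record the allocation in its bookkeeping structures rather than just print it.

#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // for RTLD_NEXT
#endif
#include <cuda_runtime.h>
#include <dlfcn.h>
#include <cstdio>

// Interposed cudaMalloc: build this file as a shared library (e.g., with
// -shared -fPIC, linked against libdl) and activate it with LD_PRELOAD.
extern "C" cudaError_t cudaMalloc(void** devPtr, size_t size) {
    using real_fn = cudaError_t (*)(void**, size_t);
    static real_fn real_cudaMalloc =
        (real_fn)dlsym(RTLD_NEXT, "cudaMalloc");  // next definition in link order

    fprintf(stderr, "[intercept] cudaMalloc(%zu bytes)\n", size);
    // A checkpoint or virtual-memory runtime would record (devPtr, size) here
    // so that the allocation can be replayed or remapped later.
    return real_cudaMalloc(devPtr, size);
}

Preloading such a library lets a runtime observe and virtualize the GPU operations of unmodified CUDA applications.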
Figure 10: Two-node cluster using TORQUE: effect of GPU sharing and load balancing via inter-node offloading in the presence of short-running jobs and in the absence of conflicting memory requirements.
Our runtime assumes that the memory footprint of each application fits in the memory of the most capable GPU in the system. Under this assumption, we allow concurrency among applications with conflicting memory requirements. In addition, our intra-application swap capability allows relaxing the memory requirements for applications to run on the GPU. Related work
[16][17] has considered the problem of limiting the memory
requirements of single applications by reorganizing their memory
access patterns and splitting operators.
Finally, the interest in using GPUs for general purpose
computing is confirmed by recent work on automatic generation
of CUDA code [13][14], and on programming models and
runtime systems for heterogeneous nodes [10][11][12]. These
proposals are orthogonal to the work presented in this paper.
7. CONCLUSIONS AND FUTURE WORK
In conclusion, we have proposed a runtime system that provides
abstraction and sharing of GPUs, while allowing isolation of
concurrent applications. Two fundamental features of our runtime
are: (i) dynamic application-to-GPU binding and (ii) virtual
memory for GPUs. In particular, dynamic binding maximizes
device utilization and improves performances in the presence of
concurrent applications with multiple GPU phases and of GPUs
with different compute capabilities. Besides dynamic binding, the
virtual memory abstraction enables the following features: (i) load
balancing in case of GPU addition and removal, (ii) resilience to
GPU failures, and (iii) checkpoint-restart capabilities.
Our prototype implementation targets NVIDIA GPUs. In the
future, we intend to extend our runtime to support other many-
core devices, such as the Intel MIC. In addition, we intend to
evaluate our runtime on larger clusters and on multi-node
applications. Finally, we plan to explore alternative mapping and
scheduling algorithms, as well as security concerns related to
heterogeneous cluster and cloud computing infrastructures.
8. ACKNOWLEDGEMENTS
The authors thank the anonymous reviewers for the feedback that
helped improve the paper. This work has been supported by NEC
Research Laboratories. Adam Procter has been supported by U.S.
Department of Education GAANN grant no. P200A100053.
9. REFERENCES
[1] V. Gupta et al. 2009. GViM: GPU-accelerated virtual machines. In
Proc. of HPCVirt '09. ACM, New York, NY, USA, pp. 17-24.
[2] L. Shi, H. Chen, and J. Sun. 2009. vCUDA: GPU accelerated high
performance computing in virtual machines. In Proc. of IPDPS '09,
Washington, DC, USA, pp. 1-11.
[3] J. Duato et al. 2010. rCUDA: Reducing the number of GPU-based
accelerators in high performance clusters. In Proc. of HPCS’10, pp.
224–231.
[4] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. 2010. A
GPGPU transparent virtualization component for high performance
computing clouds. In Proc. of Euro-Par 2010, Heidelberg, 2010.
[5] gVirtuS: http://osl.uniparthenope.it/projects/gvirtus
[6] V. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. 2011.
Supporting GPU sharing in cloud environments with a transparent
runtime consolidation framework. In Proc. of HPDC '11. ACM, New
York, NY, USA, pp. 217-228.
[7] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. 2009.
Enabling Task Parallelism in the CUDA Scheduler. In Workshop on
Programming Models for Emerging Architectures, Sep. 2009.
[8] I. Gelado et al. 2010. An asymmetric distributed shared memory
model for heterogeneous parallel systems. In Proc. of ASPLOS '10.
ACM, New York, NY, USA, pp. 347-358.
[9] M. Becchi, S. Byna, S. Cadambi, and S. Chakradhar. 2010. Data-
aware scheduling of legacy kernels on heterogeneous platforms with
distributed memory. In Proc. of SPAA '10. ACM, New York, NY,
USA, pp. 82-91.
[10] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. 2008.
Merge: a programming model for heterogeneous multi-core systems.
In Proc. of ASPLOS’08. ACM, New York, NY, USA, pp. 287-296.
[11] B. Saha et al. 2009. Programming model for a heterogeneous x86
platform. In Proc. of PLDI '09. New York, NY, USA, pp. 431-440.
[12] C.-K. Luk, S. Hong, and H. Kim. 2009. Qilin: exploiting parallelism
on heterogeneous multiprocessors with adaptive mapping. In Proc.
of MICRO’09. ACM, New York, NY, USA, pp. 45-55.
[13] S.-Z. Ueng, M. Lathara, S. Baghsorkhi, and W.-M. Hwu. 2008.
CUDA-Lite: Reducing GPU Programming Complexity. In
Languages and Compilers for Parallel Computing, Lecture Notes in
Comp. Sc., Vol. 5335. Springer-Verlag, Berlin, Heidelberg pp. 1-15.
[14] S. Lee and R. Eigenmann. 2010. OpenMPC: Extended OpenMP
Programming and Tuning for GPUs. In Proc. of SC '10. Washington,
DC, USA, pp. 1-11. Nov 2010.
[15] A. Nukada, H. Takizawa, and S. Matsuoka, 2011. NVCR: A
Transparent Checkpoint-Restart Library for NVIDIA CUDA. In
Proc. of IPDPSW’11, Shanghai, China, pp. 104-113, Sep 2011.
[16] N. Sundaram, A. Raghunathan, and S. Chakradhar. 2009. A
framework for efficient and scalable execution of domain-specific
templates on GPUs. In Proc. of IPDPS '09. IEEE Computer Society,
Washington, DC, USA, pp. 1-12.
[17] J. Kim, H. Kim, J. Hwan Lee, and J. Lee. 2011. Achieving a single
compute device image in OpenCL for multiple GPUs. In Proc. of
PPoPP '11. ACM, New York, NY, USA, pp. 277-288.
[18] H. Lim, S. Babu, J. Chase, and S. Parekh. 2009. Automated control
in cloud computing: challenges and opportunities. In Proc. of ACDC
'09. ACM, New York, NY, USA, pp. 13-18.
[19] P. Marshall, K. Keahey, and T. Freeman. 2010. Elastic Site: Using
Clouds to Elastically Extend Site Resources. In Proc. of CCGrid
2010, pp. 43-52, May 2010.
[20] P. Padala et al. 2009. Automated control of multiple virtualized
resources. In Proc. of EuroSys '09. New York, NY, USA, pp. 13-26.
[21] M. Becchi and P. Crowley. 2006. Dynamic thread assignment on
heterogeneous multiprocessor architectures. In Proc. of CF '06.
ACM, New York, NY, USA, pp. 29-40.
[22] J. Nickolls, I. Buck, M. Garland, K. Skadron. 2008. Scalable Parallel
Programming with CUDA. In ACM Queue. April 2008.
[23] G. Teodoro et al. 2009. Coordinating the use of GPU and CPU for
improving performance of compute intensive applications. In Proc.
of CLUSTER, pp. 1–10, 2009.
[24] Eucalyptus: http://www.eucalyptus.com
[25] TORQUE Resource Manager: http://www.clusterresources.com/products/TORQUE-resource-manager.php
[26] Amazon EC2 Instances: http://aws.amazon.com/ec2/
[27] Nimbix Informatics Xcelerated: http://www.nimbix.net
[28] Hoopoe: http://www.hoopoe-cloud.com
[29] BLCR: https://ftg.lbl.gov/projects/CheckpointRestart
[30] Rodinia: https://www.cs.virginia.edu/~skadron/wiki/rodinia/index.php/Main_Page