Enabling FPGAs in Hyperscale Data Centers
Jagath Weerasinghe, Francois Abel, Christoph Hagleitner
IBM Research - Zurich
Säumerstrasse 4
8803 Rüschlikon, Switzerland
Email: {wee,fab,hle}@zurich.ibm.com
Andreas Herkersdorf
Institute for Integrated Systems
Technical University of Munich,
Munich, Germany
Email: herkersdorf@tum.de
Abstract—FPGAs (Field Programmable Gate Arrays) are mak-
ing their way into data centers (DCs) and are used to offload
and accelerate specific services, but they are not yet available to
cloud users. This puts the cloud deployment of compute-intensive
workloads at a disadvantage compared with on-site infrastructure
installations, where the performance and energy efficiency of
FPGAs are increasingly being exploited for application-specific
accelerators and heterogeneous computing.
The cloud is housed in DCs, and DCs are based on ever
shrinking servers. Today, we observe the emergence of hyperscale
data centers, which are based on densely packaged servers. The
shrinking form factor brings the potential to deploy FPGAs on
a large scale in such DCs. These FPGAs must be deployed as
independent DC resources, and they must be accessible to the
cloud users. Therefore, we propose to change the traditional
paradigm of the CPU-FPGA interface by decoupling the FPGA
from the CPU and connecting the FPGA as a standalone resource
to the DC network.
This allows cloud vendors to offer an FPGA to users in a
similar way as a standard server. As existing infrastructure-as-
a-service (IaaS) mechanisms are not suitable, we propose a new
OpenStack (open source cloud computing software) service to
integrate FPGAs in the cloud. This proposal is complemented
by a framework that enables cloud users to combine multiple
FPGAs into a programmable fabric. The proposed architecture
and framework address the scalability problem that makes it
difficult to provision large numbers of FPGAs. Together, they
offer a novel solution for processing large and heterogeneous
data sets in the cloud.
Keywords—FPGA; hyperscale data centers; cloud computing.
I. INTRODUCTION
The use of FPGAs in application-specific processing and
heterogeneous computing domains has been popular for 20
years or so. In contrast, in cloud DCs FPGAs have only
just started to be used for offloading and accelerating specific
application workloads [1]. In both cases, FPGAs are usually
treated as co-processor workers under the control of a CPU
master and are attached over a high-speed point-to-point
interconnect such as the PCIe or the HyperTransport buses [2].
To provide such an FPGA as an independent resource that can
be consumed by users in the cloud, it must be abstracted from
its attached CPU, which is not straightforward to achieve in
such a master-worker programming model [3]. As a result,
FPGAs have not made significant inroads into the cloud yet.
To accelerate and enable the large-scale deployment of
FPGAs in future DCs, we advocate for a change of paradigm
in the CPU-FPGA and FPGA-FPGA interfaces. We propose
an architecture that sets the FPGA free from the CPU and its
PCIe-bus by connecting the FPGA directly to the DC network
as a standalone resource. Cloud vendors can then provision
these FPGA resources in a similar manner to CPU, memory
and storage resources.
Enabling FPGAs on a large scale opens new opportuni-
ties for both cloud customers and cloud vendors. From the
customers’ perspective, FPGAs can be rented, used and re-
leased, similar to cloud infrastructure resources such as virtual
machines (VMs) and storage. For example, IaaS users can
rent FPGAs for education (e.g., university classes), research
(e.g., building HPC systems) and testing (e.g., evaluation
prior to deployment in real environments) purposes. From the
Platform as a Service (PaaS) vendors’ perspective, FPGAs can
be used to offer acceleration as a service to the application
developers on cloud platforms. For example, PaaS vendors
can provide FPGA-accelerated application interfaces to PaaS
users. A Software as a Service (SaaS) vendor can use FPGAs
to provide acceleration as a service and also to improve user
experience. The acceleration of the Bing web search service is one
such example [1]. From the FPGA vendors’ perspective, the cloud
expands the FPGA user base and also opens new paths for
them to market their own products. For example, new products
can be placed on the cloud so that users can try them out
before actually purchasing them. In summary, the deployment
of FPGAs in DCs will benefit both the users and the various
cloud service providers and operators.
Meanwhile, the servers, which make up a cloud DC, are
continuously shrinking in terms of the form factor. This
leads to the emergence of a new class of hyperscale data
centers (HSDC) based on small and dense server packaging.
This miniaturization of the DC servers is a game-changing
requirement that will transform the traditional way of instan-
tiating and operating an FPGA in a DC infrastructure. First,
miniaturized servers aim to leverage advanced semiconductor
manufacturing processes by integrating a complete server
system on chip (SoC) for increased density
and power efficiency. As a result, legacy memory controllers
and high-speed I/Os are embedded on chip, thus eliminating
the need for an external PCIe bus to support these I/Os.
Second, the shrinking of the DC form factor unit will enable
the deployment of a large number of network-attached FPGAs,
exceeding by far the scaling capacity of traditional PCIe
bus attachments. Once such network-attached FPGAs become
available on a large scale in DCs, vendors will be able to
rent them out on the cloud. However, as existing server-
provisioning mechanisms are not suitable for this purpose,
we propose a new resource-provisioning service in OpenStack
for integrating such standalone FPGAs in the cloud. Also, as
users will be able to request multiple FPGAs from the cloud,
we want to provide them with the possibility to implement a
programmable interconnection network of FPGAs in a cost-
effective, scalable and flexible manner on the cloud. Therefore,
we also propose a framework to interconnect multiple FPGAs
in a user defined topology, and for the cloud vendor to
deploy such a topology in its infrastructure. We expect this
software-defined approach of FPGA networking to offer new
technical perspectives and solutions for processing large and
heterogeneous data sets in the cloud.
The remainder of this paper is organized as follows: Section
II introduces the concept of a hyperscale data center based
on the DOME microserver [4], and Section III analyzes prior
art. The rationale behind the network-attached FPGA proposal
is elaborated in Section IV. Section V presents the network-
attached FPGA architecture, the cloud integration service for
FPGAs and the framework for setting up a fabric of FPGAs on
the cloud. Initial scaling perspectives and sizing are discussed
in Section VI, and we conclude in Section VII.
II. HYPERSCALE DATA CENTERS
The scaling of modern DCs has been fuelled by the con-
tinuous shrinking of the server node infrastructure. After the
tower-, rack- and blade-server form factors, a new class of
hyperscale servers (HSS) is emerging. In an HSS, the form
factor is almost exclusively optimized for the performance-
per-cost metric. This is achieved by increasing the density of
CPUs per real estate and by sharing the cost of resources
such as networking, storage, power supply, management and
cooling.
At the time of writing, there are several HSSs on the market
[5] [6] [7] and at the research stage [4] [8]. Among these, the
HSS of DOME [4] has the objective of building the world’s
highest density and most energy efficient rack unit. In this
paper, we refer to that specific type of HSS for supporting our
discussion on hyperscale data centers (HSDC). Figure 1-(b)
shows the packaging concept of the HSS rack chassis (19” by 2U,
where 1U = one rack unit = 1.75 inches = 44.45 mm) proposed in [4]. In essence, this HSS is disaggregated
[9] into multiple node boards (Figure 1-(a)), each the size
of a double-height dual in-line memory module (DIMM -
133 mm x 55 mm), which are densely plugged into a carrier
base board. Each node board is a standalone resource, and
the DC network is used to interconnect all the resources. A
node board with a CPU and its DRAM is called a compute
module. Similarly, a node board with solid-state disks (SSD)
is called a storage module, and a node board with an Ethernet
switch is referred to as a networking module. The use of a
homogeneous form factor is a significant contributor to the
overall cost minimization of an HSS.
The carrier base board is a passive board that provides
system management, 10 GbE networking between the node
boards and multiple 40 GbE uplinks. Figure 1-(d) shows an
example of a network interconnect based on a fat-tree topology
(a.k.a folded Clos topology) [10] that interconnects four racks
of 16 chassis, each with 128 node boards and two networking
modules. This ultra-dense packaging is combined with an
innovative cooling system [11] that enables the integration of
as many as 128 compute modules into a 2U chassis (Figure 1-
(b)). For a 19” rack with 16 such chassis, this amounts to 2K
compute modules and 100 TB of DRAM.
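The rack-level numbers quoted above follow directly from the packaging parameters. The short Python sketch below merely reproduces this arithmetic; the per-module DRAM size (48 GB) is taken from the caption of Figure 1, and all other figures are the ones stated in the text.

# Back-of-the-envelope sizing for the DOME-style HSS rack described above.
modules_per_chassis = 128        # compute modules per 2U chassis
chassis_per_rack = 16            # 2U chassis per 19" rack
dram_per_module_gb = 48          # DRAM per compute module (Figure 1-(a))

modules_per_rack = modules_per_chassis * chassis_per_rack          # 2048 ("2K")
dram_per_rack_tb = modules_per_rack * dram_per_module_gb / 1024    # ~96 TB (~100 TB)
print(modules_per_rack, round(dram_per_rack_tb))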
III. PRIOR ART
There are a few previous attempts to enable FPGAs in
the cloud. Chen et al. [3] and Byma et al. [12] proposed
frameworks to integrate virtualized FPGAs into the cloud
using the OpenStack infrastructure manager. In both cases,
FPGAs are virtualized by partitioning them to multiple slots,
where each slot or virtual FPGA is a partially reconfigurable
region in a physical FPGA which is attached over a PCIe-bus.
In [3], a virtual FPGA model is present in each virtual ma-
chine (VM) and acts as the communication channel between
the applications running in the VM and the virtual FPGA. The
commands and data communicated by the virtual FPGA model
are transferred to the virtual FPGA by the hypervisor. There
are a few drawbacks in this framework, particularly from the
perspective of a cloud deployment. First, users cannot
deploy their own designs in the FPGAs; instead, they must use a
limited set of applications offered by the cloud vendor.
Second, if a user needs to deploy an application using several
virtual FPGAs connected together, the data have to be copied
back and forth through the VM and hypervisor stack to feed
the next FPGA. Third, VM migration disrupts the use of the
virtual FPGA because the physical FPGA is tightly coupled
to the hypervisor.
In contrast to [3], the framework proposed by [12] allows
users to deploy their own application in the FPGAs and
allows those applications to be accessed over the Ethernet
network. In addition to that, [12] has shown that the OpenStack
Glance image service can be used for managing FPGA bit
streams, which is an important factor when integrating FPGAs
into OpenStack. However, this framework also has a few drawbacks
from the perspective of a cloud deployment. First, only a plain Ethernet
connection is offered to the virtual FPGAs, which limits the flexibility
of the applications that can run on them. Second, even though multiple
virtual FPGAs are enabled in a single physical FPGA, no isolation
method for multi-tenant deployments, such as VLANs or overlay virtual
networks (OVN), is supported, although such isolation is indispensable
when deploying applications in the cloud.
Catapult [1] is a highly customized, application-specific
FPGA-based reconfigurable fabric designed to accelerate page-
ranking algorithms in the Bing web search engine. It is a good
example to show the potential of FPGAs on a large scale in
cloud DCs. The authors claim that, compared with a software-
only approach, Catapult can achieve a 95% improvement in
ranking throughput for a fixed latency.
Fig. 1. Hyperscale Data Center: (a) Compute module with 48 GB DRAM (on the back). (b) 2U Rack chassis with 128 compute modules. (c) Four racks
with 8K compute modules. (d) An example DC network interconnect.
In Catapult, similarly
to the above-mentioned systems, FPGAs are PCIe-attached.
But for scaling, these PCIe-attached FPGAs are connected by
a dedicated serial network, which breaks the homogeneity of
the DC network and increases the complexity of the system.
This complexity overhead can be traded off for the significant
boost in the overall performance. However, in the case of
general-purpose cloud DCs, maintaining such a customized
infrastructure is not acceptable because the FPGAs are used
in diverse kinds of applications at different scales similarly to
other DC resources such as servers and storage.
All these systems deploy FPGAs tightly coupled to a server
over the PCIe bus. In [1] and [3], FPGAs are accessed through
the PCIe bus, whereas [12] uses a plain Ethernet connection.
In [1], FPGAs are chained in a dedicated network for scaling.
In contrast to those systems, the focus of our proposal is to
consider the FPGA as a DC network-connected standalone
resource with compute, memory and networking capabilities.
In the context of an HSDC, the FPGA resource is therefore
considered as a hyperscale server-class computer.
IV. FPGA ATTACHMENT OPTIONS
The miniaturization of the DC servers is a game-changing
requirement that will transform the traditional way of instan-
tiating and operating an FPGA in a DC infrastructure. We
consider three approaches for setting up a large number of
FPGAs into an HSDC.
Fig. 2. Options for Attaching an FPGA to a CPU
One option is to incorporate the FPGA onto the same board
as the CPU when a tight or coherent memory coupling between
the two devices is desired (Figure 2-(a)). We do not expect
such a close coupling to be generalized outside the scope
of very specific applications, such as web searching or text-
analytics processing [13]. First, it breaks the homogeneity
of the compute module in an environment where server
homogeneity is sought to reduce the management overhead
and provide flexibility across compatible hardware platforms.
Second, in large DCs, failed resources can be kept in place
for months and years without being repaired or replaced, in
what is often referred to as a fail-in-place strategy. Therefore,
an FPGA will become unusable and its resources wasted if
its host CPU fails. Third, the footprint of the FPGA takes a
significant real estate away from the compute module –the
layout of a large FPGA on a printed circuit board is somehow
equivalent to the footprint of a DDR3 memory channel, i.e.
8-16GB–, which may require the size of the module to be
increased (e.g., by doubling the height of the standard node
board from 2U to 4U). Finally, the power consumption and
power dissipation of such a duo may exceed the design
capacity of a standard node board.
The second and by far the most popular option in use
today is to implement the FPGA on a daughter-card and
communicate with the CPU over a high-speed point-to-point
interconnect such as the PCIe-bus (Figure 2-(b)). This path
provides a better balance of power and physical space and
is already put to use by FPGAs [1] as well as graphics
processing units (GPU) in current DCs. However, this type
of interface comes with the following two drawbacks when
used in a DC. First, the use of the FPGA(s) is tightly bound
to the workload of the CPU, and the fewer PCIe buses there are
per CPU, the higher the chance of under-provisioning the
FPGA(s), and vice versa. Catapult [1] uses one PCIe-attached
FPGA per CPU and addresses this inelasticity by deploying
a secondary inter-FPGA network at the price of additional
cost, increased cabling and management complexity. Second,
server applications are often migrated within DCs. The PCIe-
attached FPGA(s) affected must then be detached from the
bus before being migrated to a destination where an identical
number and type of FPGAs must exist, thus hindering the
entire migration process. Finally, despite the wide use of this
attachment model in high-performance computing, we do not
believe that it is the way forward for the deployment of FPGAs
in the cloud, because it confines this type of programmable
technology to the role of a coarse-grained accelerator in the service of
a traditional CPU-centric platform.
The third and preferred method for deploying FPGAs in
an HSDC is to set the FPGA free from the traditional CPU-
FPGA attachment by hooking up the FPGA directly to the
HSDC network (Figure 2-(c)). The main implication of this
scheme is that the FPGA must be turned into a standalone
appliance capable of communicating with a host CPU over the
network of the DC. From a practical point of view, and with
respect to the HSDC concept of section II, this is an FPGA
module equipped with an FPGA, some optional local memory
and a network interface controller (NIC). Joining a NIC to
an FPGA enables that FPGA to communicate with other DC
resources, such as servers, disks, I/O and other FPGA modules.
Multiple such FPGA modules can then be deployed in the
HSDC independently of the number of CPUs, thus overcoming
the limitations of the two previous options.
The networking layer of such an FPGA module can be
implemented with either a discrete or an integrated NIC. A
discrete NIC (e.g., dual 10 GbE NIC) is a sizable application-
specific integrated circuit (ASIC) typically featuring 500+
pins, 400+ mm² of packaging, and 5 to 15 W of power
consumption. The footprint and power consumption of such an
ASIC do not favour a shared-board implementation with the
FPGA (see above discussion on sharing board space between
an FPGA and a CPU). Inserting a discrete component also
adds a point of failure in the system. Integrating the NIC into
the reconfigurable fabric of the FPGA alleviates these issues
and is becoming practical with the latest FPGA devices which
can implement a 10 Gb/s network protocol stack in less than 5-
10% of their total resources [14]. Normally, the Ethernet media
access controller (MAC) of such a protocol stack connects
to an external physical layer device (PHY) whose task is
to perform encoding/decoding and serialization/deserialization
functions, as well as a transceiver, such as the enhanced
small form-factor pluggable transceiver (SFP+) whose task is
to physically move the data bits over the media according
to a specific physical layer standard. However, in the case
of an HSDC chassis, the need for an external PHY and an
external transceiver can be eliminated by selecting an appropriate
FPGA device from a given family. First, because of the dense
packing of an HSDC, the modules plugged on the same chassis
base board are all located within short distance and do not
require an external transceiver to communicate with each other.
Second, all mid- and high-end networking-oriented FPGAs
offer integrated high-speed transceivers that already support
most of the popular PHYs. These integrated transceivers
operate at line rates up to 32 Gb/s, and they commonly
support the 10GBASE-KR (10 Gb/s) and 40GBASE-KR4 (40
Gb/s) Ethernet standards, which we seek for interconnecting
our modules over a distance of up to 1 meter of copper
printed circuit board and two connectors. This removal of
an external PHY and transceiver is a key contributor in
the overall power, latency, cost and area savings. Finally,
the integrated version of the NIC provides the agility to
implement a specific protocol stack on demand, such as Virtual
Extensible LAN (VxLAN), Internet Protocol version 4 (IPV4),
version 6 (IPv6), Transmission Control Protocol (TCP), User
Datagram Protocol (UDP) or Remote Direct Memory Access
(RDMA) over Converged Ethernet (RoCE). Alternatively, it
can also adapt to emerging new protocols, such as Generic
Network Virtualization Encapsulation (Geneve) and Network
Virtualization Overlays (NVO3).
In summary, we advocate a direct attachment of the FPGA
to the DC network by means of an integrated NIC, and refer
to such a standalone FPGA as a network-attached FPGA. This
paves the way towards using FPGAs in resource-centric DCs
[9]. The combination of such network-attached FPGAs with
emerging software-defined networking (SDN) technologies
brings new technical perspectives and market value propo-
sitions, such as building large and programmable fabrics of
FPGAs on the cloud.
V. SYSTEM ARCHITECTURE
In this section, we propose and describe A) the architecture
of such a network-attached FPGA, B) the way it is integrated
into a cloud environment, and C) how it can be deployed and
used on a large scale.
A. Network-attached FPGA Architecture
The high-level architecture of the proposed network-
attached FPGA concept is shown in Figure 3. It contains an
FPGA and an optional off-chip memory. The FPGA is split
into three main parts: i) a user logic part used for implementing
customized applications, ii) a network service layer (NSL),
Fig. 3. High-level Architecture of the Network-attached FPGA Module
Fig. 4. Low-level Architecture of the Network-attached FPGA Module
which connects with the DC network, and iii) a management
layer (ML) to run resource-management tasks.
In the context of an HSDC, the FPGA concept of Figure 3
is matched to the double-height DIMM form factor defined
in section II and is therefore referred to as an FPGA module.
The architecture of such a network-attached FPGA module is
now explained in detail with reference to Figure 4.
1) User Logic (vFPGA): Multiple user applications can
be hosted on a single physical FPGA (pFPGA), somewhat
similar to multiple VMs running on the same hypervisor. Each
user gets a partition of the entire user logic and uses it to
implement its applications. This partitioning is achieved by
a feature called partial reconfiguration, a technology used to
dynamically reconfigure a region of the FPGA while other
regions are running untouched (this partial reconfiguration
feature is not further discussed, as it exceeds the scope of this
paper). We refer to such a partition of user logic as a virtual
FPGA (vFPGA), and it is depicted in Figure 4 as vFPGA1 and
vFPGA2 (note that in the figure, vFPGA1 and vFPGA2 are not
proportional in size to the NSL and the ML). For the sake of simplicity,
in this discussion we assume there is only one vFPGA in the
user logic. A vFPGA is assigned an ID (vFPGAID), an IP
address, a MAC address and a tenant ID. The vFPGA connects
to the DC network through the NSL, and can therefore
communicate with other vFPGAs. A vFPGA can also have
off-chip local memory assigned to it.
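To make the identifiers listed above concrete, the following sketch shows the kind of per-vFPGA record the management layer would have to keep. The field names and example values are our own illustration, not part of the proposal.

# Hypothetical per-vFPGA state corresponding to the identifiers described above.
from dataclasses import dataclass

@dataclass
class VFpga:
    vfpga_id: int       # vFPGAID assigned at provisioning time
    ip_addr: str        # IP address used by the NSL for this partition
    mac_addr: str       # MAC address exposed on the DC network
    tenant_id: int      # tenant (overlay network) the vFPGA belongs to
    mem_base: int = 0   # base of its slice of off-chip memory, if any
    mem_size: int = 0   # size of that slice in bytes

vfpga1 = VFpga(vfpga_id=1, ip_addr="10.0.0.11", mac_addr="02:00:0a:00:00:0b",
               tenant_id=42, mem_base=0x0000_0000, mem_size=512 << 20)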
2) Network Service Layer (NSL): The NSL is a HW im-
plementation of the physical, data link, network and transport
layers (L1-L4) used in a typical protocol layered architecture.
These layers are mapped into the following three components:
i) an Ethernet media access controller (MAC), ii) a network
and transport stack, and iii) an application interface.
a) Ethernet MAC: The Ethernet MAC implements the
data link layer of the Ethernet standard. The MAC performs
functions such as frame delineation, cyclic redundancy check
(CRC), virtual LAN extraction and collection of statistics.
b) Network and Transport Stack: The network and
transport stack provides a HW implementation of L3-L4 pro-
tocols. Applications running on a cloud HW infrastructure are
inherently diverse. These applications impose different com-
munication requirements on the infrastructure. For example,
one system may require a reliable, stream-based connection
such as TCP for inter-application communication, whereas
another system may need an unreliable, message-oriented
communication, such as UDP. For applications where latency
is critical, RoCE might be preferred. Having this network
and transport stack implemented in HW within the FPGA
provides low latency and enables these protocols to be instantiated
on demand. Again, we leverage the partial reconfiguration feature
of the FPGA to achieve this flexibility.
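As a sketch of how such on-demand instantiation could be driven from the management side, the fragment below maps a requested transport protocol to a partial bitstream for the NSL region. The file names and the function are hypothetical; the paper does not define this interface.

# Illustrative selection of a partial bitstream for the requested L4 protocol.
PARTIAL_BITSTREAMS = {           # assumed to be pre-built per FPGA family
    "tcp":  "nsl_tcp.partial.bit",
    "udp":  "nsl_udp.partial.bit",
    "roce": "nsl_roce.partial.bit",
}

def select_transport(protocol):
    """Return the partial bitstream implementing the requested transport."""
    try:
        return PARTIAL_BITSTREAMS[protocol.lower()]
    except KeyError:
        raise ValueError("unsupported transport: " + protocol) from None

# e.g., the management stack would load select_transport("udp") into the NSL region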
Usually, a network protocol stack contains a control plane
and a data plane. The control plane learns how to forward
packets, whereas the data plane performs the actual packet
forwarding based on the rules learnt by the control plane.
Usually, these two planes sit close to each other in a network
stack, with the control plane distributed over the network.
With the emergence of SDN, we observe that these planes
are getting separated from each other. In the FPGA, it is
important to use as few resources as possible for the NSL
in order to leave more space for the user logic. To minimize
the complexity of the stack, inspired by the SDN concepts,
we decouple the control plane from the HW implementation
of the data plane and place it in software.
The vFPGAs must be securely isolated in multi-tenant
environments. For this isolation, it is important to use widely
used techniques such as VLANs or OVNs in order to coexist
with other infrastructure resources. Therefore, we implement a
tunnel endpoint (TEP) of an OVN in the network and transport
stack. The TEP implemented in FPGA hardware also provides
an acceleration, as software-based TEPs degrade both the
network and CPU performance significantly [15].
The forwarding data base (FDB) sitting in the network
and transport stack contains the information on established
connections belonging to connection-oriented protocols and
the information on allowed packets from connection-less pro-
tocols. This information includes mac addresses, IP addresses,
OVN IDs and port numbers belonging to source and desti-
nation vFPGAs. The control plane running in a centralized
network management software feeds this information to the
FDB through the ML.
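The sketch below illustrates what such an FDB entry could look like and how the data plane would consult it; the field names are illustrative assumptions, not a format defined by the paper.

# Hypothetical FDB entry installed by the SW control plane via the ML.
from dataclasses import dataclass

@dataclass(frozen=True)
class FdbEntry:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    ovn_id: int     # overlay virtual network (tenant) identifier
    src_port: int   # L4 port of the source vFPGA
    dst_port: int   # L4 port of the destination vFPGA

class ForwardingDataBase:
    def __init__(self):
        self._entries = set()

    def install(self, entry):
        """Called through the ML when the control plane admits a connection."""
        self._entries.add(entry)

    def allows(self, entry):
        """Data-plane check: only packets matching an installed entry pass."""
        return entry in self._entries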
c) Application Interface: The application interface com-
prises FIFO-based connection interfaces resembling socket
buffers in a TCP or UDP connection. The vFPGA reads from
and writes to these FIFOs to communicate with other vFPGAs.
One or more FIFO interfaces can be assigned to a single
vFPGA.
3) Management Layer (ML): The management layer con-
tains a memory manager and a management stack. The mem-
ory manager enables access to memory assigned to vFPGAs
and the management stack enables the vFPGAs to be remotely
managed by a centralized management software.
a) Memory Manager: The memory manager contains a
memory controller and a virtualization layer. The memory con-
troller provides the interface for accessing memory from the
vFPGAs. The virtualization layer allows the physical memory
to be partitioned and shared between different vFPGAs in the
same device. This layer is configured through the management
stack according to the vFPGA memory requirements. It uses
the vFPGAID to calculate the offset when accessing the
physical memory that belongs to a particular vFPGA.
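A minimal sketch of the offset calculation mentioned above is given below. It assumes the virtualization layer keeps one contiguous base/size slice per vFPGAID; this is our assumption, as the paper does not prescribe a particular layout.

# Sketch: translating a vFPGA-local address to a physical DRAM address.
class MemoryVirtualizer:
    def __init__(self):
        self._slices = {}                       # vfpga_id -> (base, size)

    def configure(self, vfpga_id, base, size):  # set through the management stack
        self._slices[vfpga_id] = (base, size)

    def translate(self, vfpga_id, local_addr):
        base, size = self._slices[vfpga_id]
        if not 0 <= local_addr < size:
            raise ValueError("access outside the vFPGA's memory slice")
        return base + local_addr                # offset into physical memory

mmu = MemoryVirtualizer()
mmu.configure(vfpga_id=1, base=0x0000_0000, size=512 << 20)
mmu.configure(vfpga_id=2, base=0x2000_0000, size=512 << 20)
assert mmu.translate(2, 0x100) == 0x2000_0100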
b) Management Stack: The management stack runs a
set of agents to enable the centralized resource-management
software to manage the FPGA remotely. The agents include
functions such as device registration, network and memory
configuration, FPGA reconfiguration, and a service to make
the FPGA nodes discoverable. The management stack may run
on an embedded operating system in a soft core processor or
preferably in a hard core processor, like the processing system
in a Xilinx Zynq device. The network connection of the
embedded OS is then shared with the HW network stack of the
NSL to reduce the number of physical network connections to
the FPGA module.
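To make the role of these agents more concrete, the sketch below lists the kind of remote operations the management stack would have to expose to the centralized software. The method names and example values are placeholders, not a defined API.

# Hypothetical management-stack interface of an FPGA module.
class ManagementAgent:
    def register(self):
        """Report device identity and capabilities for the accelerator DB."""
        return {"pfpga_id": "pfpga-0007", "vfpga_slots": 2, "mem_gb": 8}

    def configure_network(self, vfpga_id, ip_addr, tenant_id):
        """Write the per-vFPGA network identity into the NSL."""

    def configure_memory(self, vfpga_id, base, size):
        """Program the memory-manager virtualization layer."""

    def reconfigure(self, vfpga_id, bitstream):
        """Load a user partial bitstream into the selected vFPGA region."""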
B. Cloud integration
Cloud integration is the process of making the above-
mentioned vFPGAs available in the cloud so that users can rent
them. In this section, we present a framework for integrating
FPGAs in the cloud that consists of a new accelerator service
for OpenStack, a way to integrate FPGAs into OpenStack, a
way to provision FPGAs on the cloud, and a way for the user
to rent an FPGA on the cloud.
1) Accelerator Service for OpenStack: We propose a new
service for OpenStack to enable network-attached FPGAs. In
previous research, FPGAs [12] [3] and GPUs [16] have been
integrated into the cloud by using the Nova compute service
in OpenStack. In those cases, heterogeneous devices are PCIe-
attached and are usually requested as an option with virtual
machines or as a single appliance, which requires a few simple
operations to make the device ready for use.
In our deployment, in contrast, standalone FPGAs are
requested independently of a host because we want to consider
them as a new class of compute resource. Therefore, similar
to Nova, Cinder and Neutron in OpenStack, which translate
high-level service API calls into device-specific commands for
compute, storage and network resources, we propose the ac-
celerator service shown in Figure 5, to integrate and provision
FPGAs in the cloud. In the figure, the parts in red show the
new extensions we propose for OpenStack. To set up network
connections with the standalone FPGAs, we need to carry out
management tasks. For that, we use an SDN stack connected
to the Neutron network service, and we call it the network
manager. Here we explain the high-level functionality of the
accelerator-service and the network-manager components.
Accelerator Service: The accelerator service comprises an
API front end, a scheduler, a queue, a database of FPGA
resources (DB), and a worker. The API front end receives the
accelerator service calls from the users through the OpenStack
dashboard or through a command line interface, and dispatches
them to the relevant components in the accelerator service.
The DB contains the information on pFPGA resources. The
scheduler matches the user-requested vFPGA to the user logic
of a pFPGA by searching the information in the DB, and
forwards the result to the worker. The worker executes four
main tasks: i) registration of FPGA nodes in the DB; ii)
retrieving vFPGA bit streams from the Swift object store; iii)
forwarding service calls to FPGA plug-ins, and iv) forwarding
network management tasks to the network manager through
the Neutron service. The queue passes service
calls between the API front end, the scheduler and the worker.
The FPGA plug-in translates the generic service calls received
from the worker into device-specific commands and forwards
them to the relevant FPGA devices. We foresee the need for
one specific plug-in per FPGA vendor to be hooked to the
worker. Other heterogeneous devices like GPUs and DSPs will
be hooked to the worker in a similar manner.
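The split between the generic worker and the vendor-specific plug-ins could look roughly as follows. This is a structural sketch under our own naming, not OpenStack code; the translation into device-specific commands is deliberately left open.

# Sketch: vendor-neutral accelerator worker dispatching to FPGA plug-ins.
class FpgaPlugin:                        # one subclass per FPGA vendor
    def program(self, mgmt_ip, vfpga_id, bitstream):
        raise NotImplementedError

class ExampleVendorPlugin(FpgaPlugin):
    def program(self, mgmt_ip, vfpga_id, bitstream):
        # Translate the generic call into device-specific commands sent to
        # the management IP of the FPGA module (details are vendor-specific).
        pass

class AcceleratorWorker:
    def __init__(self, plugins):
        self._plugins = plugins          # e.g. {"vendor_x": ExampleVendorPlugin()}

    def boot_vfpga(self, vendor, mgmt_ip, vfpga_id, bitstream):
        self._plugins[vendor].program(mgmt_ip, vfpga_id, bitstream)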
Network Manager: The network manager is connected to
the OpenStack Neutron service through a plug-in. The network
manager has an API front end, a set of applications, a network
topology discovery service, a virtualization layer, and an SDN
controller. The API front end receives network service calls
from the accelerator-worker through Neutron and exposes
applications running in the network manager. These applica-
tions include connection management, security and service
level agreements (shown in red in the network manager in
Figure 5). The virtualization layer provides a simplified view
of the overall DC network, including FPGA devices, to the
above applications. The SDN controller configures both the
FPGAs and network switches according to the commands
received from the applications through the virtualization layer.
2) Integrating FPGAs into OpenStack: In this subsection,
the process of integrating FPGAs into OpenStack is outlined.
The IaaS vendor executes this process as explained below.
First, when the IaaS vendor powers up an FPGA module, the
ML of the FPGA starts up with a pre-configured IP address.
This IP address is called the management IP. The accelerator
service and the network manager use this management IP to
communicate with the ML for executing management tasks.
Second, the network-attached FPGA module is registered in
Fig. 5. OpenStack Architecture with Network-attached FPGAs
the accelerator-DB in the OpenStack accelerator service. This
is achieved by triggering the registration process after entering
the management IP into the accelerator service. Then the
accelerator service acquires the FPGA module information
automatically from the ML over the network and stores it
in the FPGA resource pool in the accelerator-DB. Third, a
few special files, needed by the user to generate a bitstream for
the vFPGA, are uploaded to the OpenStack Swift object store.
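The registration step can be summarized by the short sequence below. The function names are hypothetical, and the information returned by the ML is only an example of what such an exchange could carry.

# Sketch of the FPGA-module registration flow driven by the IaaS vendor.
def register_fpga_module(accel_db, mgmt_ip):
    """Pull module information from the ML and add it to the FPGA pool."""
    info = query_management_layer(mgmt_ip)      # device type, slots, memory, ...
    accel_db[mgmt_ip] = {"info": info, "free_slots": info["vfpga_slots"]}

def query_management_layer(mgmt_ip):
    # Placeholder for the network exchange with the ML at the management IP.
    return {"device": "zynq-7100", "vfpga_slots": 2, "mem_gb": 8}

accelerator_db = {}
register_fpga_module(accelerator_db, "192.168.1.20")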
3) Provisioning an FPGA on the Cloud: From the IaaS
vendors’ perspective, let’s now look at the process of pro-
visioning a single vFPGA. When a request for renting a
vFPGA arrives, the accelerator-scheduler searches the FPGA
pool to find a user logic resource that matches the vFPGA
request. Once matched, the tenant ID and an IP address are
configured for the vFPGA in the associated pFPGA. After that,
the vFPGA is offered to the user with a few special files which
are used to generate a bitstream for the user application.
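A first-fit version of this matching step could be sketched as follows; an actual scheduler would add the placement and utilization policies that the paper discusses later for FPGA fabrics.

# Sketch: first-fit matching of a vFPGA request against the pFPGA pool.
def schedule_vfpga(pool, request):
    """Return the first pFPGA whose free user logic satisfies the request."""
    for pfpga in pool:
        if (pfpga["free_luts"] >= request["luts"]
                and pfpga["free_mem_gb"] >= request["mem_gb"]):
            pfpga["free_luts"] -= request["luts"]
            pfpga["free_mem_gb"] -= request["mem_gb"]
            return pfpga
    return None   # no capacity: the request is rejected or queued

pool = [{"id": "pfpga-0007", "free_luts": 180_000, "free_mem_gb": 8}]
print(schedule_vfpga(pool, {"luts": 40_000, "mem_gb": 1}))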
4) Renting an FPGA on the Cloud: From the user’s
perspective, the process of renting a single vFPGA on the
cloud and configuring a bitstream to it is as follows. First,
the user specifies the resources that it wants to rent by using a
GUI provided by the IaaS vendor. This includes FPGA-internal
resources, such as logic cells, DSP slices and Block RAM as
well as module resources, such as DC network bandwidth and
memory capacity. The IaaS vendor uses this specification to
provision a vFPGA as explained above.
Upon success, a reference to the provisioned vFPGA is
returned to the user with a vFPGAID, an IP address and the
files needed to compile a design for that vFPGA. Second,
the user compiles its design to a bitstream and uploads it to
the OpenStack Swift object store through the Glance image
service. Finally, the user associates the uploaded bitstream
with the returned vFPGAID and requests the accelerator
service to boot that vFPGA. At the successful conclusion of
the renting process, the vFPGA and its associated memory are
accessible over the DC network.
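From the user's point of view, the renting steps above reduce to a handful of service calls. The client sketch below is purely illustrative: the api object, its methods and the helper names are assumptions, not the accelerator-service API itself.

# Hypothetical client-side view of renting and booting a single vFPGA.
def rent_vfpga(api, spec):
    """spec: requested logic cells, DSP slices, BRAM, bandwidth and memory."""
    grant = api.request_vfpga(spec)              # returns vFPGAID, IP, build files
    bitstream = compile_design("user_design/", grant["build_files"])
    image_id = api.upload_bitstream(bitstream)   # stored via the Glance/Swift path
    api.boot_vfpga(grant["vfpga_id"], image_id)
    return grant["vfpga_id"], grant["ip_addr"]

def compile_design(src_dir, build_files):
    # Placeholder for the vendor tool flow that produces a partial bitstream.
    return b"\x00" * 1024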
C. Fabric of FPGAs on the Cloud
Motivated by the success of large-scale SW-based dis-
tributed applications such as those based on MapReduce and
deep learning [17], we want to give users the possibility to
distribute their applications on a large number of FPGAs. This
sub section describes a framework for interconnecting such a
large number of FPGAs in the cloud that offers the potential
for FPGAs to be used in large-scale distributed applications.
We refer to multiple FPGAs connected in a
particular topology as an FPGA fabric. When the intercon-
nects of this fabric are reconfigurable, we refer to it as a
programmable fabric. Users can define their programmable
fabrics of vFPGAs on the cloud and rent them using the
proposed framework. Figure 6 shows two such fabrics, in which
vFPGAs of different sizes are shown in different patterns.
Fig. 6. Two Examples of FPGA Fabrics
Fig. 7. FPGA Fabric Deployment; SW: Network Switch
These two fabrics are used to build two different types of
applications. As an example, a fabric of FPGAs arranged in
a pipeline, shown in Figure 6-(a), is used in Catapult [1] for
accelerating page-ranking algorithms, which we discussed in
prior art. Figure 6-(b) shows a high-level view of a fabric that
can be used for map- and reduce-type operations.
1) Renting an FPGA Fabric on the Cloud: The renting
and provisioning steps of such a fabric in the cloud are as
follows. First, the user decides on the required number of vFPGAs
and customizes them as mentioned above in the case of a
single vFPGA. Second, the user defines its fabric topology
by connecting those customized vFPGAs on a GUI or with
a script. We call this fabric a vFPGA Fabric (vFF). In a vFF,
the number of network connections between two vFPGAs can
be selected. If a network connection is required between a
vFPGA and the SW application that uses the vFF (explained
in the next subsection), it is also configured in this step.
Third, the user rents the defined vFF from the IaaS vendor.
At this step, the user-defined fabric description is passed to
the OpenStack accelerator service. Then, similar to a single
vFPGA explained earlier, the accelerator service matches the
vFF to the hardware infrastructure as shown in Figure 7. In
addition to the steps followed when matching a single vFPGA,
the scheduler considers the proximity of vFPGAs and optimal
resource utilization when matching a vFF to the hardware
infrastructure. After that, the accelerator service requests the
network manager to configure the NSL of the associated pFPGAs
and intermediate network switches to form the fabric in the HW
infrastructure. Fourth, the user associates a bitstream with each
vFPGA of the vFF and requests to boot the fabric. Finally, on
successful provisioning, an ID representing the fabric (vFFID)
is returned to the user; this ID is used in the programming phase
to access the vFF.
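A user-defined fabric description of the kind passed to the accelerator service could, for instance, be expressed as a simple graph of vFPGAs. The format below is our own invention for illustration; the paper only requires that the topology be capturable from a GUI or a script.

# Illustrative vFF description: a 4-stage pipeline (cf. Figure 6-(a)).
# Each edge is (source, destination, number of network connections).
vff_request = {
    "vfpgas": {
        "stage0": {"luts": 40_000, "mem_gb": 1},
        "stage1": {"luts": 40_000, "mem_gb": 1},
        "stage2": {"luts": 40_000, "mem_gb": 1},
        "stage3": {"luts": 40_000, "mem_gb": 1},
    },
    "edges": [
        ("stage0", "stage1", 1),
        ("stage1", "stage2", 1),
        ("stage2", "stage3", 1),
    ],
    # connections between the SW application and the fabric, if requested
    "sw_connections": [("sw_app", "stage0", 1), ("stage3", "sw_app", 1)],
}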
2) Using an FPGA Fabric from SW Applications: The
way a vFF is used from a SW application is explained here. We
consider the pipeline-based vFF shown in Figure 6-(a) as an
example and show how it can be used from a SW application.
Fig. 8. FPGA Fabric Programming Model
We assume this fabric runs an application based on data-flow
computing. The text-analytics acceleration engine explained in
[13] is an example of such an application. Also, we assume
that the L4 protocol used is a connection-oriented protocol
such as TCP.
To make the applications agnostic to the network protocols
and to facilitate the programming, we propose a library and
an API to use both the vFPGAs and vFFs. The vFFID
returned at the end of the fabric-deployment phase is used
to access the vFF from the SW applications. Below are the
steps for accessing a vFF. First,
the vFF is initialized from the SW application. This initiates
a connection for sending data to the vFF as shown by (1)-
a-conn in Figure 8. The first SDN-enabled switch on the path
intercepts this connection-request packet and forwards it to
the network manager. On behalf of the first vFPGA in the
pipeline, the network manager establishes the connection and
updates the relevant FDB entries in the associated pFPGA.
For receiving data, the library starts a local listener and tells
the network manager to initiate a connection on behalf of
the last vFPGA in the pipeline ((1)-b-conn). Then, the SW
application can start sending data and receiving the result
by calling send() and receive() on the vFF handle (vFFH), respectively. If
configured by the user at the vFF definition stage, connections
are created for sending back intermediate results from the
vFPGA to the SW application. When close() is called on the
vFFH, the established connections are closed, detaching the
fabric from the SW application. The connections for accessing
memory associated with each vFPGA are also established in
a similar manner through the network manager in the fabric
initialization phase. The SW applications can write to and read
from the memory using the vFPGAID.
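Putting these steps together, a SW application using the proposed library might look roughly like the sketch below. The init/send/receive/close calls mirror the operations described in the text, whereas the class name, signatures and the stubbed bodies are assumptions made for illustration.

# Hypothetical usage of the proposed vFF library from a SW application.
class VffHandle:                        # "vFFH": handle returned by init()
    def __init__(self, vff_id):
        self.vff_id = vff_id            # connections are set up by the network
                                        # manager on behalf of the edge vFPGAs
    def send(self, data):
        pass                            # feeds the first vFPGA in the pipeline
    def receive(self, nbytes):
        return b""                      # result produced by the last vFPGA
    def close(self):
        pass                            # tears down the established connections

def init(vff_id):
    """Initialize the fabric: the send/receive connections are set up via SDN."""
    return VffHandle(vff_id)

vffh = init("vff-0042")                 # vFFID returned by the provisioning phase
vffh.send(b"document batch 0")
result = vffh.receive(4096)
vffh.close()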
VI. OUTLOOK
A. Resource Estimation
This section provides a preliminary estimate of the FPGA resource
utilization for the NSL and the ML. According to commercial
implementations, a UDP engine and a TCP engine consume approximately
5K [18] and 10K [14] lookup tables (LUTs), respectively. A
memory controller requires around 4K LUTs [19]. Assuming
that the management stack in the ML is implemented in a hard
core processor, we estimate the total resource utilization for
both the NSL and the ML to be approximately 30K LUTs,
which accounts for 15 to 20% of the total LUT resources
available in a modern FPGA device such as the Xilinx 7 series.
This resource utilization is comparable to previous attempts
of enabling FPGAs in the cloud based on PCIe [1] [12],
which use 20-25% of FPGA resources for communication
and management tasks. However, as we already discussed,
the network-attached FPGA we propose alleviates most of
the limitations posed by the PCIe bus, enabling large-scale
deployments in DCs.
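The estimate above can be reproduced from the per-component figures quoted in the text. The LUT budget assumed for everything besides the UDP/TCP engines and the memory controller, and the device sizes used for the percentage, are our assumptions chosen to match the quoted 15-20% share.

# Back-of-the-envelope check of the NSL + ML resource estimate.
udp_luts = 5_000        # UDP engine [18]
tcp_luts = 10_000       # TCP engine [14]
memctrl_luts = 4_000    # memory controller [19]
other_luts = 11_000     # assumed: MAC, TEP, FDB, application interface, glue

nsl_ml_luts = udp_luts + tcp_luts + memctrl_luts + other_luts    # ~30K LUTs
for device_luts in (150_000, 200_000):   # assumed mid-range 7-series LUT counts
    print(round(100 * nsl_ml_luts / device_luts), "%")            # 20% .. 15%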
B. Scaling Perspectives
As explained earlier, the network-attached FPGA module
enables large-scale deployment of FPGAs in DCs similarly to
compute modules. Table I shows the single-precision floating-
point compute performance of a full rack of resources in
an HSDC built using FreeScale T4240-based [20] compute
modules [11] and Xilinx Zynq 7100 SoC-based [21] FPGA
modules. Such a full rack of FPGAs can achieve close to 1000
TFLOPS and provides the user with the impressive number of
71 million configurable logic blocks (CLBs). Here, we assume
that only 32U of rack space, out of 42U, is used for deploying
the above resources.
TABLE I
COMPUTE PERFORMANCE PER RACK

Per Rack        Hyperscale Server   Hyperscale FPGA
Modules         2048                2048
Cores           24586               4096 (ARM)
CLBs (10^6)     -                   71
Memory (TB)     100                 100
TFLOPS          442                 958
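The rack-level figures in Table I follow from the per-module numbers. The per-module values below are back-derived from the table (they are not stated explicitly in the text) and serve only to make the arithmetic visible.

# Reproducing the rack-level numbers of Table I from per-module figures.
modules_per_rack = 2048                  # 32U of a 42U rack, as assumed above

gflops_per_compute_module = 442_000 / modules_per_rack   # ~216 GFLOPS (T4240)
gflops_per_fpga_module = 958_000 / modules_per_rack      # ~468 GFLOPS (Zynq 7100)
clbs_per_fpga_module = 71e6 / modules_per_rack           # ~34.7K CLBs

print(round(gflops_per_compute_module), round(gflops_per_fpga_module),
      round(clbs_per_fpga_module))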
VII. CONCLUSION
FPGAs must be deployed in data centers, and they must
be made available to the cloud users. As FPGAs can be
reconfigured for specific workloads, users can leverage them
to improve the performance and energy efficiency of their pro-
cessing in the cloud. The provisioning of FPGAs as standalone
resources with direct connections to the data center network
is the key enabler for a large-scale deployment of FPGAs in
the cloud.
This is a profound change of paradigm in the CPU-FPGA
interaction. The standalone network attachment promotes the
FPGA to the rank of a peer processor in the data center. Data
centers must take this paradigm shift into account if they want
to host FPGAs and other similar heterogeneous computing
resources on a large scale in the future. Some emerging
hyperscale data centers are already embracing this trend by
moving away from the traditional server-centric architecture
towards a more resource-centric one. These hyperscale data
centers are paving the way for large scale deployment of
standalone network-attached FPGAs.
Once these FPGAs are plentiful in hyperscale data centers,
the accelerator provisioning service of the OpenStack software
can offer them to cloud users. As a result, a user can lease
large numbers of FPGAs on the cloud and can request them
to be interconnected into a preferred topology.
REFERENCES
[1] A. Putnam et al., “A reconfigurable fabric for accelerating large-scale
datacenter services,” in Proceeding of the 41st Annual International
Symposium on Computer Architecuture, ser. ISCA ’14. Piscataway,
NJ, USA: IEEE Press, 2014, pp. 13–24.
[2] M. Woulfe et al., “Programming models for reconfigurable application
accelerators,” in 1st Workshop on Programming Models for Emerging
Architectures (PMEA 2009), sept 2009.
[3] F. Chen et al., “Enabling FPGAs in the cloud,” in Proceedings of the
11th ACM Conference on Computing Frontiers, ser. CF ’14. New York,
NY, USA: ACM, 2014, pp. 3:1–3:10.
[4] R. Luijten and A. Doering, “The DOME embedded 64 bit microserver
demonstrator,” in 2013 International Conference on IC Design Technol-
ogy (ICICDT), May 2013, pp. 203–206.
[5] Hewlett-Packard, “HP Moonshot: An accelerator for hyperscale
workloads,” 2013. [Online]. Available: www.hp.com
[6] SeaMicro, “Seamicro SM15000 fabric compute systems.” [Online].
Available: http://www.seamicro.com/
[7] Dell, “Dell poweredge C5220 microserver.” [Online]. Available:
http://www.dell.com/
[8] Hewlett Packard, “The machine: A new kind of computer.” [Online].
Available: http://www.hpl.hp.com/
[9] S. Han et al., “Network support for resource disaggregation in next-
generation datacenters,” in Proceedings of the Twelfth ACM Workshop
on Hot Topics in Networks, ser. HotNets-XII. New York, NY, USA:
ACM, 2013, pp. 10:1–10:7.
[10] C. E. Leiserson, “Fat-trees: Universal networks for hardware-efficient
supercomputing,” IEEE Trans. Comput., vol. 34, no. 10, pp. 892–901,
Oct. 1985. [Online]. Available: http://dl.acm.org/citation.cfm?id=4492.
4495
[11] R. Luijten et al., “Dual function heat-spreading and performance of
the IBM/ASTRON DOME 64-bit µserver demonstrator,” in IC Design
Technology (ICICDT), 2014 IEEE International Conference on, May
2014, pp. 1–4.
[12] S. Byma et al., “FPGAs in the cloud: Booting virtualized hardware
accelerators with openstack,” in Proceedings of the 2014 IEEE 22Nd
International Symposium on Field-Programmable Custom Computing
Machines, ser. FCCM ’14, 2014, pp. 109–116.
[13] R. Polig et al., “Giving text analytics a boost,” IEEE Micro, vol. 34,
no. 4, pp. 6–14, July 2014.
[14] Mle, “TCP/UDP/IP Network Protocol Accelerator.” [Online]. Available:
http://www.missinglinkelectronics.com
[15] J. Weerasinghe and F. Abel, “On the cost of tunnel endpoint processing
in overlay virtual networks,” in Utility and Cloud Computing (UCC),
2014 IEEE/ACM 7th International Conference on, Dec 2014, pp. 756–
761.
[16] S. Crago et al., “Heterogeneous cloud computing,” in 2011 IEEE
International Conference on Cluster Computing (CLUSTER), Sept 2011,
pp. 378–385.
[17] J. Dean et al., “Large scale distributed deep networks,” in Neural
Information Processing Systems, NIPS 2012.
[18] Xilinx, “UDPIP - Hardware UDP/IP Stack Core.” [Online]. Available:
http://www.xilinx.com
[19] G. Kalokerinos et al., “FPGA implementation of a configurable
cache/scratchpad memory with virtualized user-level RDMA capability,”
in Proc. International Symposium on Systems, Architectures, Modeling and Simulation (SAMOS ’09), July 2009, pp. 149–156.
[20] FreeScale, “T4240 product brief,” Oct 2014. [Online]. Available:
www.freescale.com
[21] Xilinx, “DSP solution.” [Online]. Available: http://www.xilinx.com/
products/technology/dsp.html
... In both academia and industry, increased efforts are being made to extend multitenancy and resource virtualization from CPUs to FPGAs, to enable better management and use of available datacenter resources [12,15,[36][37][38][39][40][41][42][43][44][45][46]. Multitenancy can be achieved through spatial and temporal multiplexing. ...
Article
Full-text available
Side-channel disassembly attacks recover CPU instructions from power or electromagnetic side-channel traces measured during code execution. These attacks typically rely on physical access, proximity to the victim device, and high sampling rate measuring instruments. In this work, however, we analyze the CPU instruction-level power side-channel leakage in an environment that lacks physical access or expensive measuring equipment. We show that instruction leakage is present even in a multitenant FPGA scenario, where the victim uses a soft-core CPU, and the adversary deploys on-chip voltage-fluctuation sensors. Unlike previous remote power side-channel attacks, which either require a considerable number of victim traces or attack large victim circuits such as machine learning accelerators, we take an evaluator’s point of view and provide an analysis of the instruction-level power side-channel leakage of a small open-source RISC-V soft processor core. To investigate whether the power side-channel traces leak secrets, we profile the victim device and implement various instruction opcode classifiers based on both classical machine learning algorithms used in disassembly attacks, and novel, deep learning approaches. We explore how parameters such as placement, trace averaging, profiling templates, and different FPGA families (including a cloud-scale FPGA) impact the classification accuracy. Despite the limited leakage of the soft-core CPU victim and a reduced accuracy and sampling rate of on-chip sensors, we show that in a worst-case scenario for the evaluator, i.e., an attacker breaching physical separation, we can identify the opcode of executed instructions with an average accuracy as high as 86.46%. Our analysis shows that determining the executed instruction type is not a classification bottleneck, while leakages between instructions of the same type can be challenging for deep learning models to distinguish. We also show that the instruction-level leakage is significantly reduced in a cloud-scale FPGA scenario with higher soft-core CPU frequencies. Nevertheless, our results show that even small circuits, such as soft-core CPUs, leak potentially exploitable information through on-chip power side channels, and users should deploy mitigation techniques against disassembly attacks to protect their proprietary code and data.
Article
FPGAs are increasingly common in modern applications, and cloud providers now support on-demand FPGA acceleration in datacenters. Applications in datacenters run on virtual infrastructure, where consolidation, multi-tenancy, and workload migration enable economies of scale that are fundamental to the provider’s business. However, a general strategy for virtualizing FPGAs has yet to emerge. While manufacturers struggle with hardware-based approaches, we propose a compiler/runtime-based solution called Synergy . We show a compiler transformation for Verilog programs that produces code able to yield control to software at sub-clock-tick granularity according to the semantics of the original program. Synergy uses this property to efficiently support core virtualization primitives: suspend and resume, program migration, and spatial/temporal multiplexing, on hardware which is available today . We use Synergy to virtualize FPGA workloads across a cluster of Intel SoCs and Xilinx FPGAs on Amazon F1. The workloads require no modification, run within 3–4 x of unvirtualized performance, and incur a modest increase in FPGA fabric usage.
Article
FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization, by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations including: a) inefficient sharing of memory interfaces across hardware tasks exacerbated by technological limitations and peculiarities, b) insufficient solutions for performance and data isolation and high quality of service, c) absent or simplistic allocation strategies to effectively distribute external FPGA memory across hardware tasks. This paper presents a full-stack solution for enabling multi-tenancy on FPGAs. Specifically, our work proposes an intra-fpga virtualization layer to share FPGA interfaces and its resources across tenants. To achieve efficient inter-connectivity between virtual FPGAs (vFGPAs) and external interfaces, we employ a compact network-on-chip architecture to optimize resource utilization. Dedicated memory management units implement the concept of virtual memory in FPGAs, providing mechanisms to isolate the address space and enable memory protection. We also introduce a memory segmentation scheme to effectively allocate FPGA address space and enhance isolation through hardware-software support, while preserving the efficacy of memory transactions. We assess our solution on an Alveo U250 Data Center FPGA Card, employing ten real-world benchmarks from the Rodinia and Rosetta suites. Our framework preserves the performance of hardware tasks from a non-virtualized environment, while enhancing the device aggregate throughput through resource sharing; up to 3.96x in isolated and up to 2.31x in highly congested settings, where an external interface is shared across four vFPGAs. Finally, our work ensures high-quality of service, with hardware tasks achieving up to 0.95x of their native performance, even when resource sharing introduces interference from other accelerators.
Article
Mobile edge computing has emerged as a prevalent computing paradigm to support applications that demand low latency and high computational capacity. Hardware reconfigurable accelerators exhibit high energy efficiency and low latency compared to general-purpose servers, making them ideal for integration into mobile edge computing systems. This paper investigates the problem of joint task offloading, access point selection, and resource allocation in heterogeneous edge environments for latency minimization. Given the heterogeneity of edge computing devices and the interdependence of the decisions required for offloading, access point selection, and resource allocation, it is challenging to optimize over them simultaneously. We decomposed the proposed problem into two disjoint subproblems and developed algorithms for each of them. The first subproblem is to jointly determine access point selection and communication resource allocation decisions, for which we have proposed an algorithm with a provable approximation ratio of $2.62/(1-8\lambda )$ , where $\lambda$ is a tunable parameter balancing the approximation ratio and time complexity. Additionally, we offer a faster variant of the algorithm with an approximation ratio of $(\sqrt{3}+1)^{2}$ . The second subproblem is to determine offloading and computing resource allocation decisions jointly and is NP-hard, where we developed algorithms based on relaxation and rounding. We conducted comprehensive numerical simulations to evaluate the proposed algorithms, and the results demonstrated that our algorithms outperformed existing baselines and achieved near-optimal performance across various settings.
Article
Field-programmable gate arrays (FPGAs) have become critical components in many cloud computing platforms. These devices possess the fine-grained parallelism and specialization needed to accelerate applications ranging from machine learning to networking and signal processing, among many others. Unfortunately, fine-grained programmability also makes FPGAs a security risk. Here, we review the current scope of attacks on cloud FPGAs and their remediation. Many of the FPGA security limitations are enabled by the shared power distribution network in FPGA devices. The simultaneous sharing of FPGAs is a particular concern. Other attacks on the memory, host microprocessor, and input/output channels are also possible. After examining current attacks, we describe trends in cloud architecture and how they are likely to impact possible future attacks. FPGA integration into cloud hypervisors and system software will provide extensive computing opportunities but invite new avenues of attack. We identify a series of system, software, and FPGA architectural changes that will facilitate improved security for cloud FPGAs and the overall systems in which they are located.
Article
The computing demand for massive applications has led to the ubiquitous deployment of computing power. This trend results in the urgent need for higher-level computing resource scheduling services. The Computing and Network Convergence (CNC), a new type of infrastructure, has become a hot topic. To realize the visions of CNC, such as computing-network integration, ubiquitous collaboration, latency-free, and ready-to-use, an intelligent scheduling strategy for CNC should integrate and collaborate with the network. However, the Computing and Network Convergence is built on the cloud, edge, and endless terminals, making the scheduling problem more difficult due to its wide-area requests, available flexibility arrangements, interconnections, and resource adaptations. In view of this, in this survey, we comprehensively review the literature on scheduling in various scenarios. We cover the scheduling problem of Computing and Network Convergence from heterogeneous resources, multiple-objective optimization, and diverse tasks. Possible explanations and implications are discussed. Finally, we point out important challenges for future work.
Article
The computer architecture landscape is being reshaped by the new opportunities, challenges, and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators. While it is well understood how to do this for, e.g., network function virtualisation and protocols such as TCP/IP, support for higher networking layers is still largely missing, limiting the potential of accelerators. In this paper, we present Strega, an open-source, light-weight HTTP server that enables FPGA-accelerated functions to be invoked through a RESTful protocol (FPGA-as-a-Function). Our experimental analysis shows that a single Strega node sustains a throughput of 1.7 M HTTP requests per second with an end-to-end latency as low as 16 μs, outperforming nginx running on 32 vCPUs in both metrics, and can even be an alternative to the traditional OpenCL flow over the PCIe bus. Through this work, we pave the way for running microservices directly on FPGAs, bypassing CPU overhead and realising the full potential of FPGA acceleration in distributed cloud applications.
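A minimal sketch of what invoking such an FPGA-hosted HTTP endpoint could look like from a client. The address, URL path, and JSON payload format are illustrative assumptions, not Strega's actual interface.

```python
# Hypothetical client for an FPGA-hosted RESTful function endpoint.
# The endpoint address, function name, and payload schema are invented
# placeholders used only to illustrate the FPGA-as-a-Function idea.
import json
import urllib.request

FPGA_ENDPOINT = "http://10.0.0.42:8080/functions/vector_add"  # assumed address

def call_fpga_function(payload: dict) -> dict:
    """POST a JSON payload to the accelerator and return its JSON reply."""
    req = urllib.request.Request(
        FPGA_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    print(call_fpga_function({"a": [1, 2, 3], "b": [4, 5, 6]}))
```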
Conference Paper
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables. In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system when ranking candidate documents. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by a factor of 95% for a fixed latency distribution, or, while maintaining equivalent throughput, reduces the tail latency by 29%.
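A small sketch of how FPGAs in such a 6x8 2-D torus could be addressed and how each device's wrap-around neighbours are computed. The dimensions come from the abstract above; the coordinate and neighbour scheme is an illustrative assumption, not the cited system's routing logic.

```python
# Toy addressing for a 6x8 2-D torus of FPGAs (dimensions from the abstract).
# The neighbour computation with wrap-around is an illustrative assumption.
ROWS, COLS = 6, 8

def neighbors(r: int, c: int):
    """Return the four torus neighbours (with wrap-around) of FPGA (r, c)."""
    return [((r - 1) % ROWS, c), ((r + 1) % ROWS, c),
            (r, (c - 1) % COLS), (r, (c + 1) % COLS)]

if __name__ == "__main__":
    print(neighbors(0, 0))  # -> [(5, 0), (1, 0), (0, 7), (0, 1)]
```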
Conference Paper
Algorithms can be accelerated by offloading compute-intensive operations to application accelerators comprising reconfigurable hardware devices known as Field Programmable Gate Arrays (FPGAs). We examine three types of accelerator programming model – master-worker, message passing and shared memory – and a typical FPGA system configuration that utilises each model. We assess their impact on the partitioning of any given algorithm between the CPU and the accelerators. The ray tracing algorithm is subsequently used to review the advantages and disadvantages of each programming model. We conclude by comparing their attributes and outlining a set of recommendations for determining the most appropriate model for different algorithm types.
Article
Cloud computing is becoming a major trend for delivering and accessing infrastructure on demand via the network. Meanwhile, the usage of FPGAs (Field Programmable Gate Arrays) for computation acceleration has made significant inroads into multiple application domains due to their ability to achieve high throughput and predictable latency, while providing programmability, low power consumption, and short time-to-value. Many types of workloads, e.g. databases, big data analytics, and high-performance computing, can be and have been accelerated by FPGAs. As more and more workloads are being deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. However, such integration is non-trivial due to issues related to FPGA resource abstraction and sharing, compatibility with applications and accelerator logic, and security, among others. In this paper, a general framework for integrating FPGAs into the cloud is proposed and a prototype of the framework is implemented based on OpenStack, Linux-KVM and Xilinx FPGAs. The prototype enables isolation between multiple processes in multiple VMs, precise quantitative allocation of acceleration resources, and priority-based workload scheduling. Experimental results demonstrate the effectiveness of this prototype, an acceptable overhead, and good scalability when hosting multiple VMs and processes.
Article
Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
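A minimal single-process sketch of the parameter-server pattern behind asynchronous schemes such as Downpour SGD: replicas fetch a possibly stale snapshot of the parameters, compute gradients on their own data shard, and push updates back without coordinating with one another. The linear model, learning rate, and sharding are toy assumptions, not the DistBelief implementation.

```python
# Toy parameter-server sketch of asynchronous SGD. Everything below (model,
# data, sharding, learning rate) is an illustrative assumption.
import random

PARAMS = {"w": 0.0, "b": 0.0}          # central parameter-server state
LEARNING_RATE = 0.02

def gradient(w, b, x, y):
    """Gradient of the squared error for y ~ w*x + b on one example."""
    err = (w * x + b) - y
    return 2 * err * x, 2 * err

def replica_step(shard):
    """One asynchronous update by a single model replica on its data shard."""
    w, b = PARAMS["w"], PARAMS["b"]    # fetch a snapshot (may be stale)
    gw = gb = 0.0
    for x, y in shard:
        dw, db = gradient(w, b, x, y)
        gw += dw
        gb += db
    # Push the averaged update back without coordinating with other replicas.
    PARAMS["w"] -= LEARNING_RATE * gw / len(shard)
    PARAMS["b"] -= LEARNING_RATE * gb / len(shard)

if __name__ == "__main__":
    data = [(x, 3.0 * x + 1.0) for x in range(-5, 6)]
    shards = [data[i::4] for i in range(4)]     # four model replicas
    for _ in range(200):
        replica_step(random.choice(shards))     # replicas run out of order
    print(PARAMS)  # should end up close to w = 3, b = 1
```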
Conference Paper
For the IBM-ASTRON DOME μServer project, we are currently building two types of memory DIMM-like form factor compute node boards. The first is based on a 4 core 2.2 GHz SoC and the second on a 12 core / 24 thread 1.8 GHz SoC. Both employ the 64-bit Power instruction set. Our innovative hot-water based cooling infrastructure also supplies the electrical power to our compute node board. We show initial performance results and conclude with the key lessons we have learnt and an outlook on our next activities.
Conference Paper
We present a new approach for integrating virtualized FPGA-based hardware accelerators into commercial-scale cloud computing systems, with minimal virtualization overhead. Partially reconfigurable regions across multiple FPGAs are offered as generic cloud resources through OpenStack (open-source cloud software), thereby allowing users to "boot" custom designed or predefined network-connected hardware accelerators with the same commands they would use to boot a regular Virtual Machine. We propose a hardware and software framework to enable this virtualization. This is a first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability. Our system can set up and tear down virtual accelerators in approximately 2.6 seconds on average, much faster than regular virtual machines. The static virtualization hardware on the physical FPGAs causes only a three cycle latency increase and a one cycle pipeline stall per packet in accelerators when compared to a non-virtualized system. We present a case study analyzing the design and performance of an application-level load balancer using a fully implemented prototype of our system. Our study shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduces design iteration time and design complexity.
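A rough sketch of how "booting" such a network-attached accelerator with the same workflow as a regular virtual machine could look when driven through the generic openstacksdk API. The cloud entry, image, flavor, and network names are hypothetical placeholders and are not part of the cited system.

```python
# Hypothetical illustration only: requesting an FPGA accelerator through the
# standard OpenStack compute workflow, in the spirit of the approach above.
# Cloud, image, flavor, and network names are invented placeholders.
import openstack

conn = openstack.connect(cloud="example-cloud")  # assumed clouds.yaml entry

image = conn.compute.find_image("fpga-loadbalancer-bitstream")   # hypothetical accelerator "image"
flavor = conn.compute.find_flavor("fpga.partial-region.small")   # hypothetical accelerator flavor
network = conn.network.find_network("tenant-net")                # hypothetical tenant network

# The same create_server call used to boot a VM; in a system like the one
# described above, the scheduler would map this request to a partially
# reconfigurable region on a physical FPGA.
server = conn.compute.create_server(
    name="my-accelerator",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
)
server = conn.compute.wait_for_server(server)
print(server.status, server.addresses)
```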
Article
The amount of textual data has reached a new scale and continues to grow at an unprecedented rate. IBM's SystemT software is a powerful text-analytics system that offers a query-based interface to reveal the valuable information that lies within these mounds of data. However, traditional server architectures are not capable of analyzing so-called big data efficiently, despite the high memory bandwidth that is available. The authors show that by using a streaming hardware accelerator implemented in reconfigurable logic, the throughput of SystemT's information extraction queries can be improved by an order of magnitude. They also show how such a system can be deployed by extending SystemT's existing compilation flow and by using a multithreaded communication interface that can efficiently use the accelerator's bandwidth.
Conference Paper
Datacenters have traditionally been architected as a collection of servers wherein each server aggregates a fixed amount of computing, memory, storage, and communication resources. In this paper, we advocate an alternative construction in which the resources within a server are disaggregated and the datacenter is instead architected as a collection of standalone resources. Disaggregation brings greater modularity to datacenter infrastructure, allowing operators to optimize their deployments for improved efficiency and performance. However, the key enabling or blocking factor for disaggregation will be the network since communication that was previously contained within a single server now traverses the datacenter fabric. This paper thus explores the question of whether we can build networks that enable disaggregation at datacenter scales.
Conference Paper
We describe the motivation, goals, and decision process of the IBM-ASTRON DOME microserver project. With our research demonstrator, we aim to evaluate the applicability of this technology for performing the processing required for the Square Kilometer Array instrument as well as for new business workloads. Our focus is on energy efficiency, achieved through hot-water cooling, and on cost-effectiveness. We show how we were able to get business applications running on high-performance system-on-a-chip parts designed for embedded systems, and show the current status of our project.