An FPGA Platform for Hyperscalers
F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes
IBM Research – Zurich
Säumerstrasse 4, 8803 Rüschlikon, Switzerland
{fab, wee, hle, wei, spa}@zurich.ibm.com
Abstract—FPGAs (Field Programmable Gate Arrays) are
making their way into data centers (DC). They are used as
accelerators to boost the compute power of individual server
nodes and to improve the overall power efficiency. Meanwhile,
DC infrastructures are being redesigned to pack ever more
compute capacity into the same volume and power envelopes.
This redesign leads to the disaggregation of the server and its
resources into a collection of standalone computing, memory, and
storage modules.
To embrace this evolution, we developed a platform that
decouples the FPGA from the CPU of the server by connecting
the FPGA directly to the DC network. This proposal turns the
FPGA into a disaggregated standalone computing resource that
can be deployed at large scale into emerging hyperscale data
centers.
This paper describes an infrastructure which integrates 64
FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19” ×
2U chassis (1U = one rack unit = 1.75 inches, 44.45 mm), and
provides a bisection bandwidth of 640 Gb/s. The platform is
designed for cost effectiveness and makes use of hot-water
cooling for optimized energy efficiency. As a result, a DC rack
can fit 16 platforms, for a total of 1024 FPGAs + 16 TB of
DDR4 memory.
I. INTRODUCTION
Data-center disaggregation refers to the break-up of the
traditional server architecture into a collection of standalone
and modular computing, memory, and storage resources. In
practice, it translates into the transmutation of the traditional
rack and blade servers into sled and micro-servers. This move
is purely driven by the performance-per-dollar metric, which is
improved by increasing the density of the servers and by
sharing resources, such as power supplies, PCB backplanes,
cooling, fans, networking uplinks, and other management
infrastructure.
The density of a server is increased by stripping from the
motherboard just about all parts extraneous to the processor,
the local memory, and the boot storage. This pruning process
leads to servers with shrunken form factors.
For example, Facebook* has recently contributed the design of
its system-on-chip (SoC) server called Mono Lake to the Open
Compute Project. This single-socket server assembles an Intel*
Xeon*-D, 32 GB of DRAM and 128 GB of boot storage on a
motherboard of just 210×160 mm.
At the same time, FPGAs are getting their foot in the door
of DCs: They are starting to be used for offloading and accelerating
application-specific workloads, such as network encryption,
web-page ranking, memory caching, deep learning, high-
frequency trading, user authentication, and privacy protection.
While graphics processing units (GPUs) offer unprecedented
compute density with tens of TFLOPS for double-, single-, and
half-precision floating-point and fixed-point operations, the
advantages of FPGAs are their flexibility for building a custom
control path and deep execution pipelines. At the level of DC
applications, this translates into substantial improvements in
energy efficiency, price per performance, and latency.
Unfortunately, these advantages cannot be realized using
the common approach of deploying high-end FPGA boards as
PCIe-attached extension cards in standard 2-socket servers,
because the additional cost and power consumption diminish
the energy-efficiency gains and cost savings. Moreover, this
bus attachment limits the number of FPGAs that can be
deployed per server and therefore hinders the potential of
offloading large-scale applications. Finally, the form factor of
traditional PCIe cards is typically no longer compatible with
the emerging dense and cost-optimized servers.
We observed these trends and concluded that if FPGAs
are to continue gaining ground in future DCs, a change of
paradigm is required for the FPGA-to-CPU attachment as well
as for the form factor of the FPGA cards.
This paper showcases a hyperscale infrastructure based on
the concepts of disaggregated FPGAs that we introduced in [1]
and evaluated in [2][3].
II. DISAGGREGATED FPGAS
In [1], we advocated the disaggregation of the FPGA from
the server by means of an integrated 10GbE network interface
controller (NIC) that connects the FPGA directly to the DC
network as a standalone resource. This approach sets the
FPGA free from the CPU and its traditional bus attachment,
and becomes the key enabler for large-scale deployments of
FPGAs in DCs.
Figure 1a shows the implementation of such a standalone
disaggregated FPGA based on a Kintex* UltraScale*
XCKU060 from Xilinx. The card is physically similar to a
half-length low-profile merchant PCIe x16 card without its two
Small Form-factor Pluggable (SFP+) optical transceiver cages.
Instead, the high-speed serial links are routed to the card-edge
connector and operated over the backplane version of the 10
Gb Ethernet standard (10GBASE-KR). This configuration
saves 30% of board space, 2–4 W of power consumption,
and $50–$100 per 10 Gb/s duplex interconnect.
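Scaled to a full chassis, these per-link savings add up quickly. The following Python sketch performs the arithmetic; the chassis-level totals are our extrapolation from the per-link figures above, not a result reported in this paper:

# Savings from replacing SFP+ cages with 10GBASE-KR backplane links,
# using the per-link figures quoted above; the chassis-level totals
# are an extrapolation, not a measured result.
MODULES_PER_CHASSIS = 64                                      # two sleds of 32 FPGA modules
watts_saved = [w * MODULES_PER_CHASSIS for w in (2, 4)]       # 2-4 W saved per link
dollars_saved = [d * MODULES_PER_CHASSIS for d in (50, 100)]  # $50-$100 saved per link
print(f"power saved per chassis: {watts_saved[0]}-{watts_saved[1]} W")     # 128-256 W
print(f"cost saved per chassis: ${dollars_saved[0]}-${dollars_saved[1]}")  # $3200-$6400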
One major implication of the FPGA being decoupled from
the server host is that the card must be turned into a self-
contained appliance capable of executing tasks that were
previously under the control of a host CPU. These tasks
include the ability to perform power-up and -down actions, to
hook itself up to the network after power-up, and to perform
all sorts of local health-monitoring and system-management
duties. On the disaggregated FPGA card, these tasks are
handled by a pervasive 32-bit ARM* controller implemented
with a programmable system-on-chip (PSoC*) device from
Cypress*.
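To make these duties concrete, the following Python sketch outlines the shape of such a self-management loop. It is purely illustrative; all function names, sensor readings, and thresholds are hypothetical rather than taken from the actual PSoC firmware:

import time

def power_up():
    # Sequence the module's voltage rails and release the FPGA from reset.
    print("rails up, FPGA out of reset")

def join_network():
    # Hook the module up to the DC network after power-up (e.g., via DHCP).
    print("network lease acquired")

def read_temperature_c():
    # Stand-in for a board-level health sensor.
    return 45.0

def manage(cycles=3, t_max_c=85.0):
    # Bounded to a few cycles for illustration; real firmware loops forever.
    power_up()
    join_network()
    for _ in range(cycles):
        if read_temperature_c() > t_max_c:
            print("over-temperature: powering the module down")
            return
        time.sleep(1.0)

manage()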
III. PLATFORM SLED
At the time of writing, DC networks are being upgraded to
40/100 GbE. Attaching each FPGA to a 100 GbE network is
not justified and would be too expensive. Instead, the platform
assembles a cluster of 32 FPGAs onto a passive carrier board
and interconnects them via an Ethernet switch. The switch
then acts as the network point of delivery, and its fast uplinks
expose the individual FPGAs to the DC network.
Figure 1b shows the passive carrier board with 32
connectors (organized into 4 banks of 8 connectors) for
plugging 32 disaggregated FPGA cards, referred to here as FPGA
modules, each about the size of a double-height dual-inline
memory module (140×62 mm). Each FPGA module connects
via a 10 GbE link to the south side of an Intel FM6000 Ethernet
L2/L3/L4 switch, for a total of 320 Gb/s of aggregate
bandwidth. The north side of the FM6000 switch connects to
eight 40 GbE up-links, which expose the FPGA cluster to the
DC network with another 320 Gb/s. This provides a uniform
and balanced (no over-subscription) distribution between the
north and south links of the Ethernet switch, which is desirable
when building large and scalable fat-tree topologies (a.k.a.
folded-Clos topologies). The Ethernet switch is visible in the
center of Figure 1b. It provides the same aggregate throughput
as a top-of-rack switch (i.e., 640 Gb/s) and was shrunk down to
the size of a smartphone (140×62 mm) to fit vertically into a
2U-height chassis.
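The north/south balance can be verified with a few lines of arithmetic, using only the figures quoted above:

SOUTH_LINKS, SOUTH_GBPS = 32, 10    # one 10 GbE link per FPGA module
NORTH_LINKS, NORTH_GBPS = 8, 40     # 40 GbE uplinks to the DC network

south = SOUTH_LINKS * SOUTH_GBPS    # 320 Gb/s into the FM6000 switch
north = NORTH_LINKS * NORTH_GBPS    # 320 Gb/s out to the DC network
print(south, north, south / north)  # 320 320 1.0 -> no over-subscription
print(south + north)                # 640 Gb/s aggregate, as in a ToR switch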
A fully populated carrier board is referred to as a sled. Its
various I/O voltage rails are generated by two shared power
controllers (cf. modules at the far left and far right of Figure
1b), and the entire sled is managed by a 64-bit T4240 service
processor from Freescale*/NXP*/Qualcomm* running Fedora*
23.
The sled implements a universal serial bus (USB) between
every PSoC of the FPGA modules and the service processor.
This dedicated management connection is used to transport a
new bitstream to a PSoC when the reconfiguration of an FPGA
is requested. Alternatively, a new partial bitstream can also be
delivered over the DC network to an internal configuration
access port (ICAP) of the FPGA.
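As an illustration of the two delivery paths, consider the following sketch; the transport functions are hypothetical placeholders, not an API of the platform:

def send_over_usb_to_psoc(module_id, bitstream):
    # Dedicated management path: service processor -> USB -> PSoC.
    print(f"USB -> PSoC of module {module_id}: {len(bitstream)} bytes")

def send_over_network_to_icap(module_id, bitstream):
    # In-band path: DC network -> FPGA-internal ICAP (partial bitstreams).
    print(f"10GbE -> ICAP of module {module_id}: {len(bitstream)} bytes")

def reconfigure(module_id, bitstream, partial):
    if partial:
        send_over_network_to_icap(module_id, bitstream)
    else:
        send_over_usb_to_psoc(module_id, bitstream)

reconfigure(7, b"\x00" * 1024, partial=True)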
IV. PLATFORM CHASSIS
Two sleds fit in a 19” × 2U chassis, for a total of 64 FPGA
modules. Figure 2 shows the 10 GbE and 40 GbE
interconnection networks between the various connectors of
such an assembly. The chassis implements two identical sleds,
S0 and S1, each consisting of the following interconnects: the
red wiring within a sled corresponds to 10 GbE links
connecting the 32 FPGA modules to the south side of the
FM6000 Ethernet switch. The blue wiring within a sled
corresponds to 40 GbE up-links connecting the north side of
the same Ethernet switch to 8 Quad Small-Form-factor
Pluggable (QSFP) transceivers. The purple wires correspond to
10 GbE links which provide a low-latency ring topology
between every four neighboring FPGA modules of a given
sled. The green wiring also consists of 10 GbE links that
interconnect two sleds for providing a redundant path to
failover from the Ethernet switch of one sled to the switch of
the neighbor sled. Finally, the black wiring between pairs of
neighboring slots provides a PCIe x8 Gen3 interface.
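For reference, the five interconnect classes can be captured as a small table in code (a restatement of Figure 2; link counts that the text leaves unspecified are omitted):

INTERCONNECTS = {
    "red":    ("10 GbE",       "FPGA module to south side of the FM6000 switch"),
    "blue":   ("40 GbE",       "north side of the FM6000 switch to QSFP uplinks"),
    "purple": ("10 GbE",       "low-latency ring among four neighboring modules"),
    "green":  ("10 GbE",       "failover path between the switches of two sleds"),
    "black":  ("PCIe x8 Gen3", "direct link between neighboring module slots"),
}
for color, (speed, role) in INTERCONNECTS.items():
    print(f"{color:<6} {speed:<12} {role}")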
The FPGA platform achieves its high packaging density by
implementing a module every 7.6 mm. This very small
stride does not allow air-cooled heatsinks and fans. Hence, we
deployed a combination of a passive cooling solution at the
FPGA module level and an actively cooled element at the
chassis level. Our implementation replaces the FPGA lid with
a custom-made heat spreader (cf. Figure 3) that transports the
thermal energy laterally away from the chip to the borders of
the module board, where the heat spreader is coupled to an
active water-cooled heat sink (cf. blue rails in Figure 4). The
passive heat sink is built using
standard PCB lamination processes and materials.
V. APPLICATIONS
In [2], we first compared the network performance of our
disaggregated FPGA with that obtained from bare-metal
servers, virtual machines, and containers. The results showed
that standalone disaggregated FPGAs outperform them in
terms of network latency and throughput by factors of up to
35x and 73x, respectively. We also observed that the Ethernet
NIC integrated within the FPGA fabric was consuming less
than 10% of the total FPGA resources.
The first application that we ported was a distributed text-
analytics application [3]. We compared it with i) a SW-only
implementation and ii) an implementation accelerated with
PCIe-attached FPGAs. The results showed that the
disaggregated FPGAs outperformed the two other
implementations, improving latency, throughput, and
latency variation by factors of 40, 18, and 5, respectively.
These results confirm our assumptions about the
performance and efficiency of the platform. The first large-
scale applications that we target include a cloud deployment, a
deep-learning application (using reduced-precision logic), an
HPC application (e.g., stencils), and a data-management
application that combines FPGA modules with NVMe drive
modules within the same platform.
VI. RELATED WORK
The prevailing way of incorporating FPGAs in a server is
to connect them to the CPU through a PCIe interface and to
use them as co-processors. The survey by Kachris et al. [4]
shows that this practice remains the common case for the
FPGA-based accelerators that have recently been proposed and
implemented to off-load some of the most widely used cloud
computing applications. The Microsoft* Catapult system,
which pioneered the use of FPGAs at scale in DCs, was built
in a similar way: a daughter card equipped with a single high-
end FPGA was added to every Microsoft Open Cloud Server
[5].
However, this PCIe attachment has two major issues in DC
deployment. First, the power consumption of a server is an
order of magnitude higher than that of an FPGA. Hence, the
power efficiency that can be gained by offloading tasks from
the server to 1 or 2 FPGAs is very limited [6]. Second, in DCs,
the workloads are heterogeneous and run at different scales.
Therefore, the scalability and the flexibility of the FPGA
infrastructure are vital to meet the dynamic processing
demands.

Figure 1: (a) The disaggregated FPGA and (b) the carrier board.
Figure 2: Block diagram of the PCB interconnect (left) and photo of two sleds in a chassis (right).
Figure 3: Passive cooling concept.
Figure 4: Rendering of the 2U × 19" chassis.

With PCIe attachment, a large number of FPGAs
cannot be assigned to run a workload independently of the
number of CPUs. The above-mentioned Catapult system got
around that limitation by implementing a secondary network
between FPGAs, at the cost of increased complexity and loss of
homogeneity in the DC. These drawbacks were later alleviated
in Catapult v2 by placing the FPGA between the servers’ NIC
and the Ethernet network in a so-called “bump-in-the-wire”
architecture [7]. This new arrangement enables FPGAs to reach
each other over the DC network, but does not remove the
dependency between the number of FPGAs and the number of
servers.
Enabling the FPGAs to generate and consume their own
networking packets independently of the servers opens new
opportunities, such as linking multiple FPGAs with low latency
and in any type of topology. For example, multiple FPGAs can
be configured into a pipeline, a ring, or a tree according to the
application demands.
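As a sketch with hypothetical FPGA identifiers, such topologies reduce to simple edge lists over the network addresses of the FPGAs:

def pipeline(nodes):
    # Each FPGA forwards its packets to the next stage.
    return [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]

def ring(nodes):
    # A pipeline closed back onto its first stage.
    return pipeline(nodes) + [(nodes[-1], nodes[0])]

def tree(nodes, fanout=2):
    # Node i feeds the children at positions fanout*i+1 .. fanout*i+fanout.
    return [(nodes[(i - 1) // fanout], nodes[i]) for i in range(1, len(nodes))]

fpgas = [f"fpga{i:02d}" for i in range(8)]
print(ring(fpgas))  # [('fpga00', 'fpga01'), ..., ('fpga07', 'fpga00')]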
Finally, as FPGAs become plentiful in hyperscale data
centers, cloud vendors are willing to offer them for rent to their
users in a similar way as a standard server. The SuperVessel
cloud from IBM* [8] and the EC2 F1 Instances from Amazon*
[9] are two emerging ecosystems that propose remote access to
FPGAs in the cloud for students, developers, and other
customers. The first offering is limited to a single FPGA
attached to a CPU via a PCIe bus or via a coherent accelerator
processor interface (CAPI). The second provider can
instantiate up to eight PCIe-attached FPGAs per server.
VII. SUMMARY
Our platform paves the way for the large-scale use of
standalone disaggregated FPGAs in DCs. This deployment is
particularly cost- and energy-efficient. First, the number of
deployed FPGAs becomes independent of the number of two-
socket servers. Second, a large number of network cables and
transceivers is removed and replaced by PCB traces inside a
passive backplane. Finally, this network attachment promotes
the FPGA to the rank of a remote peer processor, which opens
new perspectives for using FPGAs in a distributed fashion.
REFERENCES
[1] J. Weerasinghe et al., “Enabling FPGAs in hyperscale data centers,” in
2015 IEEE International Conference on Cloud and Big Data
Computing (CBDCom), Beijing, China, 2015.
[2] J. Weerasinghe et al., “Disaggregated FPGAs: Network performance
comparison against bare-metal servers, virtual machines and Linux
containers,” in IEEE International Conference on Cloud Computing
Technology and Science (CloudCom), Luxembourg, 2016.
[3] J. Weerasinghe et al., “Network-attached FPGAs for data center
applications,” in IEEE International Conference on Field-Programmable
Technology (FPT '16), Xi'an, China, 2016.
[4] C. Kachris and D. Soudris, “A survey on reconfigurable accelerators for
cloud computing,” in 26th International Conference on Field
Programmable Logic and Applications (FPL), Lausanne, Switzerland,
2016.
[5] A. Putnam et al., “A reconfigurable fabric for accelerating large-scale
datacenter services,” in Proceedings of the 41st Annual International
Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ,
USA: IEEE Press, 2014.
[6] H. Giefers et al., “Analyzing the energy-efficiency of dense linear
algebra kernels by power-profiling a hybrid CPU/FPGA system,” in
IEEE 25th International Conference on Application-Specific Systems,
Architectures and Processors (ASAP), Zurich, Switzerland, 2014.
[7] A.M. Caulfield et al., “A cloud-scale acceleration architecture,” in
Proceedings of the 49th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), Taipei, Taiwan, 2016.
[8] “SuperVessel cloud” [Online]. Available: https://www.ptopenlab.com/
[9] “Amazon EC2 F1 Instances,” [Online]. Available:
https://aws.amazon.com/ec2/instance-types/f1/
* These are trademarks or registered trademarks of the respective
companies in the United States and other countries. Other product or service
names may be trademarks or service marks of IBM or other companies.
ACKNOWLEDGMENTS
This work was conducted in the context of the joint ASTRON and IBM
DOME project and was funded by the Netherlands Organization for Scientific
Research (NWO), the Dutch Ministry of EL&I, and the Province of Drenthe,
the Netherlands. We would like to thank Martin Schmatz, Ronald Luijten,
and Andreas Doering, who initiated this new packaging concept for their
microserver needs. Special thanks go to Ronald Otter, from Roneda PCB
Design & Consultancy, for the design of the PCBs, and to Alex Raimondi,
from Miromico AG, for the bring-up of the FPGA boards.
Figure 5: Rendering of a 42U rack equipped
with 16 chassis.