An FPGA Platform for Hyperscalers

August 2017

DOI:10.1109/HOTI.2017.13

Conference: IEEE Hot Interconnects - 25th Annual Symposium on High-Performance Interconnects
At: Santa Clara, CA

F. Abel, J. Weerasinghe, C. Hagleitner, B. Weiss, S. Paredes

IBM Research – Zurich

Säumerstrasse 4, 8803 Rüschlikon, Switzerland

{fab, wee, hle, wei, spa}@zurich.ibm.com

Abstract—FPGAs (Field Programmable Gate Arrays) are making their way into data centers (DC). They are used as accelerators to boost the compute power of individual server nodes and to improve the overall power efficiency. Meanwhile, DC infrastructures are being redesigned to pack ever more compute capacity into the same volume and power envelopes. This redesign leads to the disaggregation of the server and its resources into a collection of standalone computing, memory, and storage modules. To embrace this evolution, we developed a platform that decouples the FPGA from the CPU of the server by connecting the FPGA directly to the DC network. This proposal turns the FPGA into a disaggregated standalone computing resource that can be deployed at large scale into emerging hyperscale data centers. This paper describes an infrastructure which integrates 64 FPGAs (Kintex* UltraScale* XCKU060) from Xilinx* in a 19" × 2U¹ chassis, and provides a bisection bandwidth of 640 Gb/s. The platform is designed for cost effectiveness and makes use of hot-water cooling for optimized energy efficiency. As a result, a DC rack can fit 16 platforms, for a total of 1024 FPGAs and 16 TB of DDR4 memory.

I. INTRODUCTION

Data-center disaggregation refers to the break-up of the traditional server architecture into a collection of standalone and modular computing, memory, and storage resources. In practice, it translates into the transformation of traditional rack and blade servers into sleds and micro-servers. This move is purely driven by the performance-per-dollar metric, which is improved by increasing the density of the servers and by sharing resources, such as power supplies, PCB backplanes, cooling, fans, networking uplinks, and other management infrastructure.

The density of a server is increased when nearly every part other than the processor, the local memory, and the boot storage is stripped from the motherboard. This pruning process leads to servers with shrunken form factors. For example, Facebook* has recently contributed the design of its system-on-chip (SoC) server called Mono Lake to the Open Compute Project. This single-socket server assembles an Intel* Xeon*-D, 32 GB of DRAM, and 128 GB of boot storage on a motherboard of just 210×160 mm.

At the same time, FPGAs are getting their foot in the door of DCs: they are starting to be used for offloading and accelerating application-specific workloads, such as network encryption, web-page ranking, memory caching, deep learning, high-frequency trading, user authentication, and privacy protection. While graphics processing units (GPUs) offer unprecedented compute density with tens of TFLOPS for double/single/half-precision floating-point and fixed-point operations, the advantages of FPGAs are their flexibility for building a custom control path and deep execution pipelines. At the level of DC applications, this translates into substantial improvements in energy efficiency, price per performance, and latency.

¹ 1U = one rack unit = 1.75 inches (44.45 mm)

Unfortunately, these advantages cannot be realized using the common approach of deploying high-end FPGA boards as PCIe-attached extension cards in standard 2-socket servers, because the additional cost and power consumption diminish the energy-efficiency gains and cost savings. Moreover, this bus attachment limits the number of FPGAs that can be deployed per server and therefore hinders the potential of offloading large-scale applications. Finally, the form factor of the traditional PCIe interface is typically no longer compatible with the emerging dense and cost-optimized servers.

We observed those trends and concluded that if FPGAs are to continue gaining ground in future DCs, a change of paradigm is required for the FPGA-to-CPU attachment as well as for the form factor of the FPGA cards.

This paper showcases a hyperscale infrastructure based on the concepts of disaggregated FPGAs that we introduced in [1] and evaluated in [2], [3].

II. DISAGGREGATED FPGAS

In [1], we advocated the disaggregation of the FPGA from the server by means of an integrated 10GbE network interface controller (NIC) that connects the FPGA directly to the DC network as a standalone resource. This approach sets the FPGA free from the CPU and its traditional bus attachment, and becomes the key enabler for large-scale deployments of FPGAs in DCs.

Figure 1a shows the implementation of such a standalone disaggregated FPGA based on a Kintex® UltraScale™ XCKU060 from Xilinx. The card is physically similar to a half-length low-profile merchant PCIe x16 card without its two Small Form-factor Pluggable (SFP+) optical transceiver cages. Instead, the high-speed serial links are routed to the card-edge connector and operated over the backplane version of the 10 Gb Ethernet standard (10GBASE-KR). This configuration saves 30% of board space, 2/4 W of power consumption, and $50/100 per 10 Gb/s duplex interconnect.

One major implication of detaching the FPGA from the server host is that the card must be turned into a self-contained appliance capable of executing tasks that were previously under the control of a host CPU. These tasks include the ability to perform power-up and power-down actions, to hook itself up to the network after power-up, and to perform all sorts of local health-monitoring and system-management duties. On the disaggregated FPGA card, these tasks are handled by a pervasive 32-bit ARM* controller implemented with a programmable system-on-chip (PSoC*) device from Cypress*.

III. PLATFORM SLED

At the time of writing, DC networks are being upgraded to 40/100 GbE. Attaching each FPGA to a 100 GbE network is not justified and too expensive. Instead, the platform assembles a cluster of 32 FPGAs onto a passive carrier board, and interconnects them via an Ethernet switch. The switch then acts as the network point-of-delivery, and its fast up-links are used to expose the individual FPGAs to the DC network.

Figure 1b shows the passive carrier board with 32 connectors (organized into 4 banks of 8 connectors) for plugging 32 disaggregated FPGA cards, referred to here as FPGA modules, each about the size of a double-height dual inline memory module (140×62 mm). Each FPGA module connects via a 10 GbE link to the south side of an Intel FM6000 Ethernet L2/L3/L4 switch, for a total of 320 Gb/s of aggregate bandwidth. The north side of the FM6000 switch connects to eight 40 GbE up-links, which expose the FPGA cluster to the DC network with another 320 Gb/s. This provides a uniform and balanced (no over-subscription) distribution between the north and south links of the Ethernet switch, which is desirable when building large and scalable fat-tree topologies (a.k.a. folded Clos topologies). The Ethernet switch is visible in the center of Figure 1b. It provides the same aggregate throughput as a top-of-rack switch (i.e., 640 Gb/s) and was shrunk down to the size of a smartphone (140×62 mm) to fit vertically into a 2U-height chassis.
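The port arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch: the class and attribute names are hypothetical, and the link counts and speeds are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPorts:
    """Hypothetical model of the sled switch; names are illustrative only."""
    south_links: int   # links facing the FPGA modules
    south_gbps: int    # speed of each south link, in Gb/s
    north_links: int   # up-links facing the DC network
    north_gbps: int    # speed of each up-link, in Gb/s

    @property
    def south_bandwidth(self) -> int:
        return self.south_links * self.south_gbps

    @property
    def north_bandwidth(self) -> int:
        return self.north_links * self.north_gbps

    @property
    def aggregate_bandwidth(self) -> int:
        return self.south_bandwidth + self.north_bandwidth

    @property
    def oversubscription(self) -> float:
        # A ratio of 1.0 means the up-links can carry all south-side traffic.
        return self.south_bandwidth / self.north_bandwidth


# Values from the text: 32 x 10 GbE south, 8 x 40 GbE north.
sled_switch = SwitchPorts(south_links=32, south_gbps=10,
                          north_links=8, north_gbps=40)

print(sled_switch.south_bandwidth)      # south side, Gb/s
print(sled_switch.north_bandwidth)      # north side, Gb/s
print(sled_switch.aggregate_bandwidth)  # total switch throughput, Gb/s
print(sled_switch.oversubscription)     # south/north ratio
```

A ratio of 1.0 corresponds to the "no over-subscription" property that the text highlights for fat-tree designs.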

A fully populated carrier board is referred to as a sled. Its various I/O voltage rails are generated by two shared power controllers (cf. modules at the far left and far right of Figure 1b), and the entire sled is managed by a 64-bit T4240 service processor from Freescale*/NXP*/Qualcomm* running Fedora* 23.

The sled implements a universal serial bus (USB) between every PSoC of the FPGA modules and the service processor. This dedicated management connection is used to transport a new bitstream to a PSoC when the reconfiguration of an FPGA is requested. Alternatively, a new partial bitstream can also be delivered over the DC network to an internal configuration access port (ICAP) of the FPGA.

IV. PLATFORM CHASSIS

Two sleds fit a 19" × 2U chassis, for a total of 64 FPGA modules. Figure 2 shows the 10 GbE and 40 GbE interconnection networks between the various connectors of such an assembly. The chassis implements two identical sleds, S0 and S1, each consisting of the following interconnects. The red wiring within a sled corresponds to 10 GbE links connecting the 32 FPGA modules to the south side of the FM6000 Ethernet switch. The blue wiring within a sled corresponds to 40 GbE up-links connecting the north side of the same Ethernet switch to 8 Quad Small Form-factor Pluggable (QSFP) transceivers. The purple wires correspond to 10 GbE links which provide a low-latency ring topology between every four neighboring FPGA modules of a given sled. The green wiring also consists of 10 GbE links; these interconnect the two sleds and provide a redundant path to fail over from the Ethernet switch of one sled to the switch of the neighboring sled. Finally, the black wiring between pairs of neighboring slots provides a PCIe x8 Gen3 interface.
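The packaging figures quoted in this paper (32 modules per sled, two sleds per chassis, 16 chassis and 16 TB of memory per rack in the abstract, a 42U rack in Figure 5) can be cross-checked with the short sketch below. The per-FPGA memory value is derived here and is not stated in the paper; it assumes the memory is spread evenly and that 1 TB = 1024 GB.

```python
# Figures taken from the text of the paper.
FPGAS_PER_SLED = 32
SLEDS_PER_CHASSIS = 2
CHASSIS_HEIGHT_U = 2
CHASSIS_PER_RACK = 16
RACK_HEIGHT_U = 42
RACK_MEMORY_TB = 16

# Assumption (not from the paper): binary terabytes.
GB_PER_TB = 1024

fpgas_per_chassis = FPGAS_PER_SLED * SLEDS_PER_CHASSIS
fpgas_per_rack = fpgas_per_chassis * CHASSIS_PER_RACK
rack_units_used = CHASSIS_PER_RACK * CHASSIS_HEIGHT_U
rack_units_free = RACK_HEIGHT_U - rack_units_used

# Derived value, assuming an even split of memory across all FPGAs.
memory_per_fpga_gb = RACK_MEMORY_TB * GB_PER_TB / fpgas_per_rack

print(f"FPGAs per chassis: {fpgas_per_chassis}")
print(f"FPGAs per rack:    {fpgas_per_rack}")
print(f"Rack units used:   {rack_units_used} of {RACK_HEIGHT_U}")
print(f"Memory per FPGA:   {memory_per_fpga_gb:.0f} GB (derived)")
```

Under these assumptions, 16 chassis occupy 32 of the 42 rack units, which leaves room for other equipment in the rack.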

The FPGA platform achieves its high packaging density by implementing a module every 7.6 mm. This very small stride does not leave room for air-cooled heat sinks and fans. Hence, we deployed a combination of a passive cooling solution at the FPGA module level and an actively cooled element at the chassis level. We replace the FPGA lid with a custom-made heat spreader (cf. Figure 3) that transports the thermal energy laterally from the chip to the borders of the module board, where the heat spreader is coupled to an active water-cooled heat sink (cf. blue rails in Figure 4). The passive heat sink is built using standard PCB lamination processes and materials.

V. APPLICATIONS

In [2], we first compared the network performance of our disaggregated FPGA with that obtained from bare-metal servers, virtual machines, and containers. The results showed that standalone disaggregated FPGAs outperform them in terms of network latency and throughput by a factor of up to 35x and 73x, respectively. We also observed that the Ethernet NIC integrated within the FPGA fabric was consuming less than 10% of the total FPGA resources.

The first application that we ported was a distributed text-analytics application [3]. We compared it with i) a SW-only implementation and ii) an implementation accelerated with PCIe-attached FPGAs. The results showed that the disaggregated FPGAs outperformed the two other implementations, with latency, throughput, and latency variation improved by a factor of 40, 18, and 5, respectively.

These results confirm our assumptions about the performance and efficiency of the platform. The first large-scale applications that we target include a cloud deployment, a deep-learning application (using reduced-precision logic), an HPC application (e.g., stencils), and a data-management application that combines FPGA modules with NVMe drive modules within the same platform.

VI. RELATED WORK

The prevailing way of incorporating FPGAs in a server is to connect them to the CPU through a PCIe interface and to use them as co-processors. The survey by Kachris et al. [4] shows that this practice remains the common case for the FPGA-based accelerators that have recently been proposed and implemented to off-load some of the most widely used cloud computing applications. The Microsoft* Catapult system, which pioneered the use of FPGAs at scale in DCs, was built in a similar way: a daughter card equipped with a single high-end FPGA was added to every Microsoft Open Cloud Server [5].

Figure 1: (a) The disaggregated FPGA and (b) the carrier board.

Figure 2: Block diagram of the PCB interconnect (left) and photo of two sleds in a chassis (right).

Figure 3: Passive cooling concept.

Figure 4: Rendering of the 2U × 19" chassis.

However, this PCIe attachment has two major issues in DC deployment. First, the power consumption of a server is an order of magnitude higher than that of an FPGA. Hence, the power efficiency that can be gained by offloading tasks from the server to 1 or 2 FPGAs is very limited [6]. Second, in DCs, the workloads are heterogeneous and run at different scales. Therefore, the scalability and the flexibility of the FPGA infrastructure are vital to meet the dynamic processing demands. With PCIe attachment, a large number of FPGAs cannot be assigned to run a workload independently of the number of CPUs. The above-mentioned Catapult system got around that limitation by implementing a secondary network between FPGAs, at the cost of increased complexity and loss of homogeneity in the DC. These drawbacks were later alleviated in Catapult v2 by placing the FPGA between the servers' NIC and the Ethernet network in a so-called "bump-in-the-wire" architecture [7]. This new arrangement enables FPGAs to reach each other over the DC network, but does not remove the dependency between the number of FPGAs and the number of servers.
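The first issue (host power dominating the energy bill) can be illustrated with a deliberately simplified energy model. The sketch below is an assumption made for illustration and is not the measurement methodology of [6]: it supposes that a task takes time t on the host alone, that offloading speeds it up by a factor `speedup`, and that the host keeps drawing its full power while a PCIe-attached FPGA works. The function names and the wattages are hypothetical.

```python
def energy_gain_attached(speedup: float, host_w: float, fpga_w: float) -> float:
    """Energy-per-task gain over host-only execution for a PCIe-attached FPGA.

    Simplified model: the host stays powered while the FPGA runs, so the
    system draws host_w + fpga_w for a time t / speedup.
    """
    return speedup * host_w / (host_w + fpga_w)


def energy_gain_standalone(speedup: float, host_w: float, fpga_w: float) -> float:
    """Same gain for a standalone FPGA that needs no host while it runs."""
    return speedup * host_w / fpga_w


# Hypothetical numbers: the host draws ten times the power of the FPGA.
host_w, fpga_w, speedup = 300.0, 30.0, 2.0

attached = energy_gain_attached(speedup, host_w, fpga_w)
standalone = energy_gain_standalone(speedup, host_w, fpga_w)

print(f"PCIe-attached gain: {attached:.2f}x")
print(f"Standalone gain:    {standalone:.2f}x")
```

In this model the attached case can never gain more than the raw speedup, whereas the standalone case also benefits from the power ratio between server and FPGA.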

Enabling the FPGAs to generate and consume their own networking packets independently of the servers opens new opportunities, such as linking multiple FPGAs with low latency and in any type of topology. For example, multiple FPGAs can be configured into a pipeline, a ring, or a tree according to the application demands.
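As a minimal sketch of what such logical topologies look like, the Python functions below build the list of directed links for a pipeline, a ring, and a tree over a set of network-attached FPGAs. The function names and node labels are hypothetical, and the paper does not prescribe how links are described; this only shows that the same set of FPGAs can be wired in different shapes over the network.

```python
from typing import List, Sequence, Tuple

Link = Tuple[str, str]  # (sender, receiver)


def pipeline(nodes: Sequence[str]) -> List[Link]:
    """Chain each node to the next one."""
    return [(nodes[i], nodes[i + 1]) for i in range(len(nodes) - 1)]


def ring(nodes: Sequence[str]) -> List[Link]:
    """Pipeline whose last node feeds back into the first."""
    links = pipeline(nodes)
    if len(nodes) > 2:  # with two nodes the closing link would be redundant
        links.append((nodes[-1], nodes[0]))
    return links


def tree(nodes: Sequence[str], fanout: int = 2) -> List[Link]:
    """Complete tree in level order: node i has parent (i - 1) // fanout."""
    return [(nodes[(i - 1) // fanout], nodes[i]) for i in range(1, len(nodes))]


fpgas = ["f0", "f1", "f2", "f3"]
print(pipeline(fpgas))
print(ring(fpgas))
print(tree(fpgas))
```

Because every FPGA is reachable over the DC network, switching from one shape to another only changes this link list, not the cabling.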

Finally, as FPGAs become plentiful in hyperscale data centers, cloud vendors are willing to rent them to their users in the same way as a standard server. The SuperVessel cloud from IBM* [8] and the EC2 F1 Instances from Amazon* [9] are two emerging ecosystems that offer remote access to FPGAs in the cloud for students, developers, and other customers. The first offer is limited to a single FPGA attached to a CPU via a PCIe bus or via a coherent accelerator processor interface (CAPI). The second provider can instantiate up to 8 PCIe-attached FPGAs per server.

VII. SUMMARY

Our platform paves the way for the large-scale use of standalone disaggregated FPGAs in DCs. This deployment is particularly cost- and energy-efficient. First, the number of spread-out FPGAs becomes independent of the number of two-socket servers. Second, a large number of network cables and transceivers have been removed and replaced by PCB traces inside a passive backplane. Finally, this network attachment promotes the FPGA to the rank of remote peer processor, which opens new perspectives for using FPGAs in a distributed fashion.

REFERENCES

[1] J. Weerasinghe et al., "Enabling FPGAs in hyperscale data centers," in 2015 IEEE International Conference on Cloud and Big Data Computing (CBDCom), Beijing, China, 2015.

[2] J. Weerasinghe et al., "Disaggregated FPGAs: Network performance comparison against bare-metal servers, virtual machines and Linux containers," in IEEE International Conference on Cloud Computing Technology and Science (CloudCom), Luxembourg, 2016.

[3] J. Weerasinghe et al., "Network-attached FPGAs for data center applications," in IEEE International Conference on Field-Programmable Technology (FPT '16), Xian, China, 2016.

[4] C. Kachris and D. Soudris, "A survey on reconfigurable accelerators for cloud computing," in 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 2016.

[5] A. Putnam et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in Proceedings of the 41st Annual International Symposium on Computer Architecture, ser. ISCA '14. Piscataway, NJ, USA: IEEE Press, 2014.

[6] H. Giefers et al., "Analyzing the energy-efficiency of dense linear algebra kernels by power-profiling a hybrid CPU/FPGA system," in IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), Zurich, Switzerland, 2014.

[7] A. M. Caulfield et al., "A cloud-scale acceleration architecture," in Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 2016.

[8] "SuperVessel cloud," [Online]. Available: https://www.ptopenlab.com/

[9] "Amazon EC2 F1 Instances," [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/

* These are trademarks or registered trademarks of the respective companies in the United States and other countries. Other product or service names may be trademarks or service marks of IBM or other companies.

ACKNOWLEDGMENTS

This work was conducted in the context of the joint ASTRON and IBM DOME project and was funded by the Netherlands Organization for Scientific Research (NWO), the Dutch Ministry of EL&L, and the Province of Drenthe, the Netherlands. We would like to thank Martin Schmatz, Ronald Luijten, and Andreas Doering, who initiated this new packaging concept for their microserver needs. Special thanks go to Ronald Otter, from Roneda PCB Design & Consultancy, for the design of the PCBs, and to Alex Raimondi, from Miromico AG, for the bring-up of the FPGA boards.

Figure 5: Rendering of a 42U rack equipped with 16 chassis.
