FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic
Encryption
Michiel Van Beirendonck, Jan-Pieter D’Anvers, Ingrid Verbauwhede
{firstname.lastname}@esat.kuleuven.be
COSIC KU Leuven
Leuven, Belgium
ABSTRACT
Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool that must be invoked after every encrypted gate computation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs.
FPT’s microarchitecture is built as a streaming processor inspired by traditional streaming DSPs: it instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. FPT’s streaming approach allows 100% utilization of arithmetic units and requires only small bootstrapping key caches, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35 µs. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which tends to be heavily memory- and bandwidth-constrained.
FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves almost three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5× higher throughput compared to recent ASIC emulation experiments.
1 INTRODUCTION AND MOTIVATION
Machine Learning (ML), driven by the availability of an abundance of data, has seen rapid advances in recent years [48], leading to new applications from autonomous driving [63] to medical diagnosis [38]. In many applications, ML models are developed by one party, who makes them available to users as a cloud service [3]. The deployment of such applications comes with the risk of privacy breaches, where user data might be leaked, and of IP theft, where users steal the ML model from the developing party [46].
The “silver bullet” solution to prevent the leakage of this data is to encrypt it with Fully Homomorphic Encryption (FHE) [27, 50], a technique that allows one to compute on encrypted data.
Figure 1: FHE allows outsourced computation on data that remains encrypted. The cloud receives encrypted data on which it can compute and the public key (green), but does not receive the secret decryption key (red). The cloud can run computations on the data, but only the client can finally decrypt and obtain the result. Cloud instances with FPGAs enable custom hardware accelerators and have the potential to drastically speed up FHE computations.
Fig. 1 illustrates a possible application of FHE to protect user data
in an ML environment. In this scenario, a client wants to use an
online-server-based ML service, without leaking any sensitive data.
To this end, the client encrypts their data with FHE, before sending
it to the cloud. The cloud service then computes an FHE program
on the encrypted data without obtaining any information about
the input and sends the (still encrypted) result back to the client.
Only the client can nally decrypt and obtain the result.
The drawback of FHE is that it is at the moment still orders of magnitude slower than unencrypted calculation. The first algorithm to calculate an encrypted AND gate took up to 30 minutes to finish [28]. FHE schemes and algorithms have seen significant improvements in recent years; e.g., the recent TFHE scheme computes encrypted AND gates in only 13 ms [10, 11] on a CPU. However, even with these improvements, it is not uncommon to still see slowdown factors of 10,000× compared to calculations on unencrypted data [13, 31, 39], which currently still prevents practical deployment of FHE in many applications.
To work around the speed limitations of FHE, designers have shifted their focus from general-purpose CPUs to more dedicated hardware implementations. Of these dedicated implementations, GPU-based FHE accelerators are the easiest to develop, but they typically provide only modest speedups [5, 15, 35, 58]. At the other end of the spectrum, ASIC emulations in advanced technology nodes promise better FHE acceleration [25, 36, 37, 52, 53]. However, it can take years for these ASICs to be fabricated and become available [44], and they are typically specialized for a limited range of parameter sets. Finally, FPGA-based implementations can be developed more quickly than ASIC implementations, are flexible to change parameter sets, and can be readily deployed in FPGA-equipped cloud instances while boasting large speedups. As a result, they have been a popular target for FHE acceleration [1, 18, 41, 47, 49, 51, 57].
arXiv:2211.13696v1 [cs.CR] 24 Nov 2022
One costly operation in FHE calculations is bootstrapping. All currently available FHE schemes have an inherent noise that increases with each operation. After a certain number of operations, this noise must be reduced to allow further calculations, which is done using the so-called bootstrapping procedure.
Second-generation FHE schemes BFV [20], BGV [7], and CKKS [9] have been the main focus of prior hardware accelerators. These schemes require bootstrapping only after a certain number of operations. For these schemes, bootstrapping is a complex algorithm that requires large data caches [16] and exhibits low arithmetic intensity, and essentially all prior architectures that support bootstrapping have hit the off-chip memory-bandwidth wall [37, 52].
Third-generation schemes like FHEW [19] and its successor Torus FHE (TFHE) [10, 11] have revisited the bootstrapping approach, making it cheaper but inherently linked to homomorphic calculations. In these schemes, most of the homomorphic operations require an immediate bootstrap of the ciphertext. Moreover, bootstrapping in TFHE is a versatile tool that can additionally be “programmed” with an arbitrary function applied to the ciphertext, e.g. non-linear activation functions in ML neural networks [13]. This approach is called Programmable Bootstrapping (PBS), and it constitutes the main cost of TFHE homomorphic calculations. Taking up to 99% of an encrypted gate computation, PBS is a prime target for high-throughput hardware acceleration of TFHE.
In this work, we propose FPT, an FPGA-based accelerator for TFHE Programmable Bootstrapping. FPT achieves a significant speedup over the previous state-of-the-art, which is attributable to two major contributions:

• FPT’s microarchitecture is built as a streaming processor, challenging the established classical CPU approach to FHE bootstrapping accelerators. Inspired by traditional streaming DSPs, FPT instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. FPT’s streaming approach allows 100% utilization of arithmetic units during bootstrapping, including tool-generated high-radix and heavily optimized negacyclic FFT units with user-configurable streaming widths. Our streaming architecture is discussed in Section 3.

• FPT (Fixed-Point TFHE) is the first hardware accelerator to extensively optimize the representation of intermediate variables. TFHE PBS is dominated by FFT calculations, which work on irrational (complex) numbers and need to be implemented with sufficient accuracy. Instead of using double floating-point arithmetic or large integers as in previous works, FPT implements PBS entirely with compact fixed-point arithmetic. We analyze in depth the noise due to the compact fixed-point representation that we use inside PBS, and we match it to the noise that is natively present in FHE. Through this analysis, FPT is able to use fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs. In turn, these 50% smaller fixed-point representations enable up to 80% smaller FFT kernels. Our fixed-point analysis is discussed in Section 4.
FPT shows, for the rst time, that PBS can remain entirely compute-
bound with only small bootstrapping key data caches. FPT achieves
a massive PBS throughput of 1 PBS / 35
𝜇
s, which requires only mod-
est o-chip memory bandwidth, and is entirely bound by the logic
resources on our target Xilinx Alveo U280 FPGA. This represents
almost three orders of magnitude speedup over the popular TFHE
software library CONCRETE [
12
] on an Intel Xeon Silver 4208 CPU
at 2.1 GHz, a factor 7.1
×
speedup over a concurrently-developed
FPGA architecture [
62
], and a factor 2.5
×
speedup over recent 16nm
ASIC emulation experiments [33].
2 BACKGROUND
This section gives an intuitive idea of the workings of TFHE, with a focus on the Programmable Bootstrapping step that is accelerated by FPT. We refer the reader to [10, 11, 34] for a more in-depth overview of TFHE.
2.1 Torus Fully Homomorphic Encryption
Torus Fully Homomorphic Encryption (TFHE) is a homomorphic encryption scheme based on the Learning With Errors (LWE) problem. It operates on elements that are defined over the real torus $\mathbb{T} = \mathbb{R}/\mathbb{Z}$, i.e. the set $[0, 1)$ of real numbers modulo 1. In practice, Torus elements are discretized as 32-bit or 64-bit integers.
A TFHE ciphertext can be constructed by combining three elements: a secret vector $s$ with $n$ coefficients following a uniform binary distribution $s \leftarrow \mathcal{U}(\mathbb{B}^n)$, a public vector $a \leftarrow \mathcal{U}(\mathbb{T}^n)$ sampled from a uniform distribution, and a small error $e \leftarrow \chi$ from a small distribution $\chi(\mathbb{T})$. A message $\mu \in \mathbb{T}$ can be encrypted as a tuple $c = (a,\, b = a \cdot s + e + \mu) \in \mathbb{T}^{n+1}$. Using the secret $s$, one can decrypt the ciphertext back into (a noisy version of) the message by computing $b - a \cdot s = \mu + e$. This type of ciphertext is called a Torus LWE (TLWE) ciphertext.
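The encryption and decryption equations above can be sketched in a few lines over the discretized torus. This is a toy illustration only: the noise width is chosen for demonstration and is not a secure parameter choice.

```python
# Toy TLWE encrypt/decrypt over the torus discretized as 32-bit integers.
# Illustrative sketch only: the noise width below is an assumption for
# demonstration, NOT a secure parameter from Table 1.
import numpy as np

Q = 1 << 32                      # torus [0,1) scaled to 32-bit integers
n = 586                          # TLWE dimension (Parameter Set I)
rng = np.random.default_rng(0)

def encrypt(mu, s):
    a = rng.integers(0, Q, n, dtype=np.uint64)   # uniform public vector
    e = int(rng.normal(0, 2**20)) % Q            # small Gaussian error (toy width)
    b = (int(a @ s) + e + mu) % Q                # b = a.s + e + mu
    return a, b

def decrypt(c, s):
    a, b = c
    return (b - int(a @ s)) % Q                  # = mu + e (noisy message)

s = rng.integers(0, 2, n, dtype=np.uint64)       # uniform binary secret
mu = Q // 4                                      # message 1/4 on the torus
noisy = decrypt(encrypt(mu, s), s)
# the message sits in the top bits; rounding removes the small error e
assert round(noisy / (Q // 4)) % 4 == 1
```

The final rounding step is exactly why a bounded amount of noise is tolerable: decryption recovers $\mu + e$, and as long as $e$ stays below the message quantization step, rounding recovers $\mu$.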
TFHE additionally describes two variant ciphertexts. First, a generalized version (TGLWE), where $e$ and $\mu$ are polynomials in $\mathbb{T}_N[X] = \mathbb{T}[X]/(X^N + 1)$, and where $a$ and $s$ are vectors of polynomials of the form $\mathbb{T}_N[X]^k$. The TGLWE ciphertext is then similarly formed as a tuple: $c = (a,\, b = a \cdot s + e + \mu) \in \mathbb{T}_N[X]^{k+1}$. The second variant is a generalized version of a GSW [29] ciphertext (TGGSW), which is essentially a matrix where each row is a TGLWE ciphertext: $c \in \mathbb{T}_N[X]^{(k+1)l \times (k+1)}$.
The reason for dening TGLWE and TGGSW ciphertexts is that
they permit a homomorphic multiplication:
TGLWE(𝜇1)TGGSW(𝜇2)=TGLWE(𝜇1·𝜇2),
known as the External Product (
). First, it decomposes each of the
polynomials in the TGLWE ciphertext into
𝑙
polynomials of
𝛽
bits,
an operation termed gadget decomposition. Next, the decomposed
TGLWE ciphertext and TGGSW are multiplied in a
(𝑘+
1
)𝑙
vector
times
(𝑘+
1
)𝑙× (𝑘+
1
)
-matrix product where the elements of this
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
vector and matrix are polynomials in
T𝑁[𝑋]
. The output is again a
TGLWE ciphertext encrypting 𝜇1·𝜇2.
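The gadget decomposition step can be illustrated on a single 32-bit torus element. The signed-digit convention below is one common choice in TFHE implementations, not necessarily FPT's exact one; the parameter names follow Table 1 (Set I).

```python
# Sketch of gadget decomposition: split a 32-bit torus element into
# l signed base-2^beta digits taken from its most-significant bits.
# One common signed-digit convention; an illustrative assumption, not
# FPT's exact implementation.
Q_BITS, beta, l = 32, 8, 2

def decompose(x):
    shift = Q_BITS - l * beta
    x = (x + (1 << (shift - 1))) >> shift    # round away the discarded low bits
    digits = []
    for _ in range(l):
        d = x & ((1 << beta) - 1)
        x >>= beta
        if d >= 1 << (beta - 1):             # centre digit into [-2^(beta-1), 2^(beta-1))
            d -= 1 << beta
            x += 1                           # carry into the next digit
        digits.append(d)
    return digits[::-1]                      # most-significant digit first

def recompose(digits):
    acc = 0
    for i, d in enumerate(digits):
        acc += d << (Q_BITS - (i + 1) * beta)
    return acc % (1 << Q_BITS)

x = 0xDEADBEEF
# recomposition matches x up to the rounding of the discarded low bits
assert abs(recompose(decompose(x)) - x) <= 1 << (Q_BITS - l * beta - 1)
```

Keeping only $l \cdot \beta$ most-significant bits is what makes the external product cheap; the rounding error it introduces is absorbed by the ciphertext noise budget.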
2.2 Programmable Bootstrapping
The main goal of bootstrapping is to reduce the noise in the ciphertext. One way to reduce the ciphertext noise would be to decrypt the ciphertext, after which the noise can be suppressed, but this would not be secure. Bootstrapping does in essence decrypt the ciphertext, but for security reasons this operation is performed homomorphically, inside the encrypted domain. This means that one wants to homomorphically compute $b - a \cdot s = e + \mu$, and more specifically, as it is “programmable” bootstrapping, one wants to additionally compute a function $f(\mu)$ on the data.
To achieve this programmable bootstrapping, one first sets a “test” polynomial $F = \sum_{i=0}^{N-1} f(i) \cdot X^i \in \mathbb{T}_N[X]$ that encodes $N$ relevant values of the function $f$. This polynomial is then rotated by $b - a \cdot s$ positions by calculating $F \cdot X^{-(b - a \cdot s)}$, after which the output of the function can be found in the first position of the resulting polynomial. However, all of these calculations should be done without revealing the value of $s$.
The high-level idea of how to achieve this is to first rewrite the above expression as follows:
$$F \cdot X^{-(b - a \cdot s)} = F \cdot X^{-b} \cdot \prod_{i=1}^{n} X^{a_i \cdot s_i}. \quad (1)$$
This expression can be calculated iteratively. Starting with the polynomial $ACC = F \cdot X^{-b}$, one iteratively calculates:
$$ACC \leftarrow ACC \cdot X^{a_i \cdot s_i}, \quad (2)$$
which can be further rewritten, using the fact that $s_i$ is either zero or one, to:
$$ACC \leftarrow (ACC \cdot X^{a_i} - ACC) \cdot s_i + ACC. \quad (3)$$
However, as we cannot reveal $s_i$, we encode the $s_i$ value in a TGGSW ciphertext $BK_i$, and the $ACC$ value in a TGLWE ciphertext, after which the expression becomes:
$$ACC \leftarrow (ACC \cdot X^{a_i} - ACC) \boxdot BK_i + ACC, \quad (4)$$
using the homomorphic multiplication operation $\boxdot$. Eq. (4) homomorphically multiplexes on the secret value $s_i$, and is known as the Controlled MUX (CMUX).
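The rewrite from Eq. (2) to Eq. (3) relies only on $s_i$ being 0 or 1, and can be sanity-checked in plaintext. The sketch below (purely illustrative, no encryption) implements monomial multiplication as a negacyclic rotation, using $X^N = -1$:

```python
# Plaintext check of the CMUX identity: for s in {0, 1},
#   ACC * X^(a*s) == (ACC * X^a - ACC) * s + ACC.
# Monomial multiplication mod X^N + 1 is a negacyclic rotation.
import numpy as np

N = 8

def mul_monomial(p, t):
    """Compute p(X) * X^t mod X^N + 1 (negacyclic rotation by t)."""
    t %= 2 * N
    sign = 1
    if t >= N:                  # X^N = -1
        t -= N
        sign = -1
    if t == 0:
        return sign * p
    return sign * np.concatenate([-p[N - t:], p[:N - t]])

acc = np.arange(1, N + 1)
a_i = 3
for s_i in (0, 1):
    lhs = mul_monomial(acc, a_i * s_i)                  # Eq. (2)
    rhs = (mul_monomial(acc, a_i) - acc) * s_i + acc    # Eq. (3)
    assert np.array_equal(lhs, rhs)
```

Eq. (4) computes exactly this mux form, but with $s_i$ hidden inside $BK_i$ and the select implemented by the external product.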
Collectively, the dierent TGGSW ciphertexts
𝐵𝐾1, . . . , 𝐵𝐾𝑛
, each
encrypting one secret coecient
𝑠1,·· · , 𝑠𝑛
, are known as the boot-
strapping key. The result of the operations described above is a
TGLWE accumulator
𝐴𝐶𝐶
which is “blindly” rotated with a secret
amount of
𝑏𝑎·𝑠
positions, from which the output TLWE cipher-
text can be straightforwardly extracted. The computations during
PBS are given in Algorithm 1.
FPT implements two parameter sets of TFHE, given in Table 1. Parameter Set I is a parameter set used by the CONCRETE Boolean library with 128-bit security [12]. Parameter Set II is a 110-bit security parameter set that has previously been employed for benchmarking purposes, allowing a direct comparison of FPT with prior work [11, 64].
Algorithm 1: TFHE’s Programmable Bootstrapping
Input: TLWE ciphertext $c_{in} = (a_1, \ldots, a_n, b) \in \mathbb{T}^{n+1}$
Input: TGGSW bootstrapping key $BK = (BK_1, \ldots, BK_n) \in \mathbb{T}_N[X]^{n \times (k+1)l \times (k+1)}$
Input: TGLWE test polynomial LUT $F \in \mathbb{T}_N[X]^{(k+1)}$
Output: TLWE ciphertext $c_{out} \in \mathbb{T}^{kN+1}$
1: $ACC \leftarrow F \cdot X^{-b}$ /* Test Polynomial LUT */
2: for $i \leftarrow 1$ to $n$ do /* Blind Rotation */
3:   $ACC \leftarrow (ACC \cdot X^{\lfloor 2N a_i / q \rceil} - ACC) \boxdot BK_i + ACC$ /* CMUX */
4: end
5: return $c_{out} = \mathrm{SampleExtract}(ACC)$
Parameter Set              (I)    (II)
TLWE dimension $n$         586    500
TGLWE dimension $k$        2      1
Polynomial size $N$        512    1024
Decomp. base log $\beta$   8      10
Decomp. level $l$          2      2

Table 1: Parameter Sets: (I) is a parameter set used by the CONCRETE Boolean library [12] with 128-bit security. (II) is a 110-bit security parameter set popular for benchmarking purposes [11, 64].
2.3 FFT polynomial multiplications
As can be seen in Algorithm 1, TFHE programmable bootstrapping mainly consists of the iterative calculation of the external product $\boxdot$, which is a vector-matrix multiplication where the elements are large polynomials of order $N$. Bootstrapping is therefore dominated by the calculation of the polynomial multiplications.

A schoolbook approach to polynomial multiplication would result in a computational complexity of $O(N^2)$. However, utilizing the convolution theorem, the FFT can be used to compute these polynomial multiplications in time $O(N \log(N))$, as the multiplication of polynomials modulo $X^N - 1$ corresponds to a cyclic convolution of the input vectors. FHE schemes, however, need polynomial multiplications modulo $X^N + 1$, requiring negacyclic FFTs to compute negative-wrapped convolutions. This negacyclic convolution has a period of $2N$, and thus a straightforward implementation would require size-$2N$ FFTs. The cost of the negacyclic FFT on real input data can be reduced using two techniques.
The fact that the FFT computes on complex numbers offers the first opportunity for optimization. Since the input polynomials are purely real and have an imaginary component equal to zero, real-to-complex (r2c) optimized FFTs can be used, which achieve roughly a factor of two improvement in speed and memory usage [21]. This is the approach taken by the TFHE and FHEW software libraries, which compute size-$2N$ r2c FFTs.
A second possible optimization is that negacyclic FFTs, which would have a period and size of $2N$, can instead be computed as a regular FFT with period and size $N$ by using a “twisting” pre-processing step [2]. During twisting, the coefficients of the input polynomial $a$ are multiplied with the powers of the $2N$-th root of unity $\psi = \omega_{2N}$:
$$\hat{a} = (a[0],\, \psi a[1],\, \ldots,\, \psi^{N-1} a[N-1]). \quad (5)$$
After twisting, one can perform multiplication using a regular cyclic FFT on $\hat{a}$, halving the required FFT size to $N$.
While both optimizations are well-known individually, it is less straightforward to combine them. Intuitively, the twisting step is incompatible with the r2c optimization, because it makes the polynomial complex.

We use a third, but not-so-well-known, technique from NuFHE [45] based on the tangent FFT [6]. The crux of this method is to “fold” polynomial coefficients $a[i]$ and $a[i + N/2]$ into a complex number $a[i] + j\,a[i + N/2]$ before applying the twisting step and a subsequent cyclic size-$N/2$ FFT. This quarters the size of the FFT required from the original naive size-$2N$ FFT. We adopt this technique in FPT and use FFTs of size $N/2 = 256$ and $N/2 = 512$ for Parameter Sets I and II (Table 1), respectively.
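The fold-then-twist technique can be sketched with numpy and checked against a schoolbook negacyclic product. The folding maps $a(X) \in \mathbb{R}[X]/(X^N+1)$ to $t(X) \in \mathbb{C}[X]/(X^{N/2} - j)$, and twisting by $\delta = e^{j\pi/N}$ (so that $\delta^{N/2} = j$) turns the product in that ring into a plain cyclic convolution of size $N/2$:

```python
# Negacyclic polynomial multiplication via the fold + twist technique of
# Section 2.3: fold t[i] = a[i] + j*a[i+N/2], twist by the 2N-th root of
# unity, and use a size-N/2 complex cyclic FFT. Checked against schoolbook.
import numpy as np

def negacyclic_mul_folded(a, b):
    N = len(a)
    h = N // 2
    twist = np.exp(1j * np.pi * np.arange(h) / N)    # delta^i, delta^(N/2) = j
    ta = (a[:h] + 1j * a[h:]) * twist                # fold, then twist
    tb = (b[:h] + 1j * b[h:]) * twist
    r = np.fft.ifft(np.fft.fft(ta) * np.fft.fft(tb)) / twist
    return np.concatenate([r.real, r.imag])          # unfold the result

def negacyclic_mul_schoolbook(a, b):
    N = len(a)
    full = np.concatenate([np.convolve(a, b), [0.0]])  # pad to length 2N
    return full[:N] - full[N:]                         # X^(N+k) = -X^k

rng = np.random.default_rng(1)
N = 512                                  # Parameter Set I: FFT size N/2 = 256
a, b = rng.standard_normal(N), rng.standard_normal(N)
assert np.allclose(negacyclic_mul_folded(a, b), negacyclic_mul_schoolbook(a, b))
```

Note that the FFT here operates on only $N/2 = 256$ complex points, a quarter of the naive size-$2N$ transform, which is exactly the saving FPT exploits in hardware.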
3 FPT MICROARCHITECTURE
In this section, we discuss FPT’s microarchitecture. First, we describe how FPT’s architecture is designed as a streaming processor targeting maximum throughput. Next, we detail a batch bootstrapping technique, which significantly reduces FPT’s on-chip caches and off-chip bandwidth. Finally, we present balanced implementations of the various computational stages, which enable 100% utilization of the arithmetic units during FPT’s bootstrapping operation.
3.1 Streaming Processor
FHE accelerators for second-generation schemes have mostly been built after a classical CPU architecture [25, 36, 52]. They include a control unit that executes an instruction set, together with a set of arithmetic Processing Elements (PEs) that support different operations, e.g. ciphertext multiplication, key-switching, or bootstrapping. Different operations utilize different PEs, requiring careful profiling of FHE programs to balance relative PE throughputs and utilization [37, 53].

These accelerators are often memory-bound during bootstrapping, and in order to keep a high utilization level of PEs, an increasing focus is spent on optimizing the memory hierarchy, often including a multi-layer on-chip memory hierarchy with a large ciphertext register file at the lowest level.
FPT challenges this established classical CPU approach to FHE bootstrapping acceleration, and instead adopts a microarchitecture that is inspired by streaming Digital Signal Processors (DSPs). Data flows naturally through FPT’s wide and directly cascaded computational stages, with simplified hard-wired routing paths and without complicated control logic. During FPT’s bootstrapping operation, utilization of arithmetic units is 100%.

As illustrated in Fig. 2, FPT defines only a single fixed PE, the CMUX PE, and instantiates only a single instance of this PE with wide datapaths and massive throughput. Taking advantage of the regular structure of TFHE’s PBS, consisting of $n$ repeated CMUX iterations, this single high-throughput PE suffices to run PBS to completion. The CMUX PE computes a single PBS CMUX iteration, after which its datapath output hard-wires back into its datapath input.
Internally, the CMUX PE computes a fixed sequence of monomial multiplication, gadget decomposition, and polynomial multiply-add operations of the external product. Rather than dividing the CMUX into sub-PEs that are sequenced to run from a register file, FPT builds the CMUX with directly cascaded computational stages. Stages are throughput-balanced in the most conceivably simple way: each stage operates at the same throughput and processes a number of polynomial coefficients per clock cycle that we call the streaming width. Stages are interconnected in a simple fixed pipeline with static latency, avoiding complicated control logic and simplifying routing paths.
FPT is built to achieve maximum PBS throughput. As a general trend that we will detail later (Fig. 3b), the Throughput/Area (TP/A) of computational stages increases together with the streaming width. This motivates FPT to instantiate only a single wide CMUX PE with a high streaming width, as opposed to many CMUX PEs with smaller streaming widths.
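A back-of-envelope model shows how streaming width translates into bootstrapping time. The clock frequency and streaming width below are illustrative assumptions, not FPT's reported configuration; only $n$, $N$, $k$ (Table 1, Set I) and the roughly 35 µs target come from the text.

```python
# Back-of-envelope throughput model for one fully-utilized streaming CMUX PE.
# ASSUMPTIONS: streaming width w and clock f are illustrative choices, not
# FPT's published configuration. n, N, k are Parameter Set I from Table 1.
n, N, k = 586, 512, 2
w = 128                        # assumed streaming width (coefficients/cycle)
f = 200e6                      # assumed clock frequency (Hz)

coeffs_per_cmux = N * (k + 1)              # coefficients of one TGLWE accumulator
cycles_per_pbs = n * coeffs_per_cmux / w   # batching keeps the pipeline full
t_pbs = cycles_per_pbs / f
print(f"{t_pbs * 1e6:.1f} us per bootstrap")   # lands near the 35 us figure
```

The model makes the scaling explicit: doubling the streaming width halves the bootstrap time, provided the pipeline stays full, which is what the batching technique of Section 3.2 guarantees.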
In summary, FPT’s CMUX architecture enables massive PBS throughput by more closely resembling the architecture of a streaming Digital Signal Processor (DSP) than the classical CPU architecture employed by prior FHE processors.
3.2 Batch Bootstrapping
TFHE bootstrapping requires two major inputs: the input ciphertext coefficients $a_1, \ldots, a_n$ and the bootstrapping keys $BK_1, \ldots, BK_n$. Each iteration of the CMUX PE requires one element of both. The ciphertext coefficients $a_i$ are relatively small in size and are therefore easy to accommodate. In contrast, a bootstrapping key coefficient $BK_i \in \mathbb{T}_N[X]^{(k+1)l \times (k+1)}$ is a large matrix of up to tens of kBs. Since the full $BK$ is typically too large to fit entirely on-chip, the $BK_i$ must be loaded from off-chip memory for every iteration. However, at high CMUX throughput levels, the required bandwidth for $BK_i$ could easily exceed 1.0 TB/s. This is larger even than the bandwidth of HBM, and thus poses a memory bottleneck.
We propose a method, termed batch bootstrapping, to amortize loading the bootstrapping key for each iteration. The result is that FPT can operate entirely compute-bound, with modest off-chip bandwidth and small on-chip caches. In contrast, prior FHE processors that supported bootstrapping of second-generation schemes were often bottlenecked by the required memory bandwidth [37, 52]. In fact, a recent architectural analysis of bootstrapping [16] found that it exhibits low arithmetic intensity and requires large caches. Its conclusion was that FHE processors benefit only marginally from bespoke high-throughput arithmetic units. With our design, we show that the situation can be very different for TFHE’s PBS.
In FPT, we solve the memory bottleneck problem as follows. First, due to internal pipelining, the latency of the CMUX will be much larger than its throughput. To operate at peak throughput, FPT processes multiple ciphertexts to keep its CMUX pipeline stages full. Next, we enforce that the different ciphertexts processed concurrently in the CMUX’s pipeline stages arrive in a single batch of size $b$, encrypted under the same $BK$. This ensures that these ciphertexts are at the same CMUX iteration and, as a result, all require the exact same input coefficient $BK_i$.

Figure 2: FPT’s microarchitecture. FPT instantiates only a single PE, the CMUX PE. The CMUX is built with wide, directly cascaded datapaths, targeting massive throughput. In light grey are illustrated two throughput-balanced architectures for the external product (with $k = 1$, $l = 2$): dotproduct-unrolled (left) and FFT-unrolled (right). Host-FPGA communication includes three different interfaces: an input ciphertext FIFO, a ping-pong bootstrapping-key buffer, and a test polynomial $F$ SRAM.
Batch bootstrapping then proceeds as follows. We instantiate a simple BRAM ping-pong buffer that holds two coefficients of $BK$. The CMUX reads $BK_i$ from one half with the required bandwidth of 1.0 TB/s, while the off-chip memory fills $BK_{i+1}$ inside the other half with a bandwidth of $1.0/b$ TB/s. In a technique similar to C-slow retiming [40], we can arbitrarily increase the batch size $b$ by introducing more pipeline registers within the CMUX, without throughput penalty. With a batch size of $b = 16$, the required bandwidth can already be supplied by DDR4 instead of HBM.
Our simple but crucial batch bootstrapping technique exploits locality of reference to decouple the on-chip bandwidth from the off-chip bandwidth. As a result, in our architecture, TFHE’s PBS is entirely compute-bound with only kB-sized caches, no larger than the size of two coefficients of the bootstrapping key.
3.3 Balancing the External Product
The external product ($\boxdot$), computing a vector-matrix negacyclic polynomial product, represents the bulk of the CMUX logic. As discussed before, the polynomial multiplications are performed using an FFT, and thus the $\boxdot$ operations include forward and inverse negacyclic FFT computations, and pointwise dot-products with $BK_i$ (the bootstrapping key $BK_i$ is already in the FFT domain).
In a streaming architecture, it is important to balance the throughputs of processing elements, which is not trivial as the external product includes $(k+1)l$ forward FFTs, but only $(k+1)$ inverse FFT operations. We explore two different throughput-balanced architectures for the external product, shown in light grey in Fig. 2: a dotproduct-unrolled architecture (left) and an FFT-unrolled architecture (right).
The dotproduct-unrolled architecture (left) represents the more obvious choice for parallelism, where we instantiate $l$ times more FFT kernels than IFFT kernels. With the FFT-unrolled architecture on the right, we make a more unconventional choice: we balance throughputs by instantiating the FFT with $l$ times the streaming width of the IFFT. These two architectural trade-offs can be understood as exploiting different types of “loop unrolling” inside the external product. On the left, we first loop-unroll the dot-product before unrolling the FFT, while on the right, we loop-unroll the FFT maximally.
The drawback of the FFT-unrolled architecture is that it is more complex than the dotproduct-unrolled one. First, multiply-add operations must be replaced by MACs, since polynomial coefficients that must be added are now spaced temporally over different clock cycles. Second, the inverse FFT can only start processing once a full MAC has been completed, requiring a Parallel-In Serial-Out (PISO) block that double-buffers the MAC output and matches throughputs. Third and most importantly, FFT blocks can be challenging to unroll and implement for arbitrary throughputs, and supporting two FFT blocks with differing throughputs requires non-negligible extra engineering effort.
The main advantage of the more unconventional FFT-unrolled architecture is that it features fewer FFT kernels, which can therefore utilize higher streaming widths. As we will detail in the next section, this favors the general (and often-neglected) trend of pipelined FFTs, which typically feature significantly higher TP/A as the streaming width increases. At the most extreme end, a fully parallel FFT is a circuit with only constant multiplications and fixed routing paths, featuring up to 300% more throughput per DSP or per LUT on our target FPGA (Fig. 3b). FPT alleviates the extra engineering effort and extra complexity of the FFT-unrolled architecture by extending and optimizing an existing FFT generator tool to support negacyclic FFTs.
3.4 Streaming Negacyclic FFTs
State-of-the-art FHE processors have mostly implemented iterative FFTs or NTTs that process polynomials in multiple passes [1, 25, 41, 49]. In these architectures, it can be difficult to support arbitrary throughputs, as memory conflicts arise when each pass requires data at different strides. Instead, FPT instantiates pipelined FFTs that naturally support a streaming architecture. Pipelined FFT architectures consist of $\log(N)$ stages that are connected in series. The main advantage of these architectures is that they process a continuous flow of data, which lends itself well to a fully streaming external product design.
There are many pipelined FFT architectures that target high throughput and support arbitrary streaming widths; we refer to [22] for a recent survey. Generally, pipelined FFTs cascade two types of units: first, the well-known butterflies with complex twiddle-factor multipliers, and, second, shuffling circuits that compute stride permutations. Pipelined FFTs feature a large design space, with different possible overall architectures, area/precision trade-offs in computing twiddle factor “rotations”, varying radix structures that determine which twiddle factors appear at which stages, and more. As such, they are an excellent target for tool-generated circuits, and we follow this approach for FPT.
Several FFT generator tools have been proposed in the literature. Some IP cores do not offer the massive parallelism and arbitrary streaming widths that we target for FPT [30, 61]. At the other end of the spectrum, a recent generator [24] built on top of FloPoCo [17] can only generate fully-parallel FFTs, instead of supporting arbitrary streaming widths. We synthesized, at different streaming widths, the High-Level Synthesis (HLS) Super Sample Rate (SSR) FFTs included in the Vitis DSP libraries of Xilinx [60], but found that they are outperformed by the RTL Verilog FFTs generated by the Spiral FFT IP Core generator [43]. Unfortunately, Spiral is not open-source and offers only a web interface to its generated RTL [42].
Eventually, we settled on SGen [54–56] as the FFT generator tool that provided the necessary configurability, extensibility, and performance we targeted for FPT. SGen is an open-source generator implemented in Scala that employs concepts introduced in Spiral. It generates arbitrary-streaming-width FFTs through four Intermediate Representations (IRs) with different levels of optimization: an algorithm-level representation SPL, a streaming-block-level representation Streaming-SPL, an acyclic streaming IR, and an RTL-level IR. Apart from the streaming width, SGen features a configurable FFT point size, radix, and hardware arithmetic representations such as fixed-point, IEEE 754 floating-point, or FloPoCo floating-point.
Most importantly, SGen is fully open-source and extensible,
which we make heavy use of to generate streaming FFTs for FPT.
First, we have extended SGen with operators for the forward and inverse twisting step, necessary to support negacyclic FFTs (Section 2.3). Next, we have implemented a set of optimizations aimed at higher precision and better TP/A. In this category, first, we have extended SGen with radix-2^k structures [23, 32], finding that radix-2^4 FFTs are on average 10% smaller than SGen-generated radix-4 or radix-16 FFTs. Second, we replace the schoolbook complex multiplication in SGen, requiring 4 real multiplies and 2 real additions, with a variant of Karatsuba multiplication that is sometimes attributed to Gauss:
X + jY = (A + jB) · (C + jD)
Z = C(A − B)
X = (C − D)B + Z
Y = (C + D)A − Z    (6)
By pre-computing C − D and C + D for the constant twiddle factors, this multiplication requires only 3 real multiplies and 3 adds, saving scarce FPGA DSP units.
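As a standalone sanity check (our illustrative sketch, not FPT code), Eq. (6) can be verified numerically against the schoolbook complex product:

```python
# Numeric check of Eq. (6): Gauss's 3-multiplication form of the complex
# product. For constant twiddles, c - d and c + d are precomputed offline,
# leaving 3 real multiplies and 3 real additions per complex multiplication.

def gauss_complex_mul(a, b, c, d):
    """(a + jb) * (c + jd) -> (x, y) with x + jy the product."""
    z = c * (a - b)
    x = (c - d) * b + z
    y = (c + d) * a - z
    return x, y

ref = complex(1, 2) * complex(3, 4)
assert gauss_complex_mul(1.0, 2.0, 3.0, 4.0) == (ref.real, ref.imag)
```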
Third, we decouple the twiddle bit-width from the input bit-width. This allows us to take advantage of the asymmetric 27×18 multipliers found in FPGA DSP blocks, while, at the same time, it has been found that twiddles can be quantized with approximately four fewer bits without affecting output noise [8, 14].
Finally, as data grows throughout the FFT stages, it must initially be padded with zeros to prevent overflows. We have extended SGen with a scaling schedule that instead divides the data by two whenever the most-significant bit must grow. Since the least-significant bits have mostly accumulated noise [59], scaling increases the precision for a fixed input bit-width. Adding a scaling schedule allows us, on average, to use FFTs with 2-bit-smaller fixed-point intermediate variables while meeting the same precision targets, which proves crucial to efficiently map multipliers to DSP units, as will be detailed later in Section 4, Fig. 4.
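The idea behind the scaling schedule can be illustrated with a toy fixed-point model (ours, not the actual FFT datapath): a tree of pairwise additions where each stage divides by two instead of growing an MSB, discarding only LSBs that carry mostly noise. The sketch computes an average instead of a growing sum, but the width behavior is the point:

```python
import random

def staged_sums(x, frac_bits, scale):
    """Pairwise-add `x` down to one value in fixed point with `frac_bits`
    fractional bits. If `scale`, divide by two after each stage (so the
    result is the average and the word length never grows); otherwise the
    sum needs one extra MSB per stage."""
    q = 1 << frac_bits
    v = [round(t * q) for t in x]           # quantize inputs
    while len(v) > 1:
        v = [a + b for a, b in zip(v[::2], v[1::2])]
        if scale:
            v = [(t + 1) >> 1 for t in v]   # round-to-nearest divide-by-2
    return v[0] / q

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(256)]
exact_avg = sum(x) / len(x)
# 8 stages of halving lose well under one bit of precision each:
assert abs(staged_sums(x, 14, True) - exact_avg) < 2 ** -12
```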
Figure 3a illustrates the resource usage of negacyclic size-256 FFTs produced by our optimized variant of SGen at different streaming widths. To quantify our improvements over SGen, we also add cyclic FFTs both with and without our introduced changes to the tool. Our changes result in significantly fewer logic resources: over 60% fewer DSP blocks are utilized while keeping LUTs comparable. As DSP blocks are the main limiting resource for FPT (Table 3), our optimizations are a key enabler to building FPT with high streaming widths.
Figure 3b illustrates our main motivation to propose the FFT-unrolled architecture for the external product. We plot the relative throughput per area unit (DSPs or LUTs) of tool-generated FFTs for different streaming widths. The trend is clear: FFTs with higher streaming widths feature up to 300% more throughput per DSP or per LUT. Intuitively, as the streaming width increases, FFTs can take more advantage of the native strengths of hardware circuits. First, shuffling circuits with MUXes and storage blocks are replaced with fixed routing paths. Second, twiddle factor multipliers can be specialized to the specific set of twiddles they need to handle, taking advantage of optimized algorithms for Single- or Multiple-Constant Multiplication (SCM, MCM).
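The specialization to constant twiddles can be sketched in a simplistic model (ours; production SCM/MCM algorithms go further, with signed-digit recoding and adder sharing): a constant multiplier reduces to a small shift-and-add network derived from the constant's nonzero bits.

```python
# Simplistic single-constant-multiplication (SCM) model: a multiplier by a
# known constant becomes a shift-and-add network, with no general-purpose
# multiplier needed.

def shift_add_terms(c):
    """Positions of the nonzero bits of a positive constant c."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def const_mul(x, c):
    """Multiply by c using only shifts and adds."""
    return sum(x << i for i in shift_add_terms(c))

assert shift_add_terms(45) == [0, 2, 3, 5]   # 45 = 1 + 4 + 8 + 32
assert const_mul(7, 45) == 7 * 45
```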
3.5 Other operations
Compared to the external product and its streaming FFTs, the remainder of the CMUX (Fig. 2, dark grey) represents mostly simple circuitry: additions, subtractions, gadget decomposition, and monomial multiplication. Whereas the first three can be streamed straightforwardly, monomial multiplication requires special treatment.
Monomial multiplication multiplies the accumulator ACC with the ciphertext-dependent monomial X^(2N·a_i/q). Its effect is to rotate the polynomials of ACC by 2N·a_i/q, and additionally negate those coefficients that wrap around. First, we truncate 2N·a_i/q already in software to limit host-FPGA bandwidth. Next, an efficient architecture for monomial multiplication is a coefficient-wise barrel shifter in log(N) stages.
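A software model of this barrel shifter (our illustrative sketch, not FPT's RTL) multiplies a(X) by X^r in Z[X]/(X^N + 1) using one conditional stage per bit of r:

```python
# Negacyclic barrel shifter: log2(N) conditional fixed-rotation stages,
# plus one negation stage for the top bit of r (X^N = -1, so the rotation
# amount r is taken modulo 2N).

def negacyclic_rotate(a, k):
    """Multiply a(X) by X^k for 0 <= k < N: rotate up by k positions and
    negate the k coefficients that wrap around."""
    n = len(a)
    return [-a[i - k + n] if i < k else a[i - k] for i in range(n)]

def barrel_shifter(a, r):
    n = len(a)
    r %= 2 * n
    if r >= n:                      # X^N = -1: full negation
        a, r = [-t for t in a], r - n
    shift = 1
    while shift < n:                # stages rotate by 1, 2, 4, ..., N/2
        if r & shift:
            a = negacyclic_rotate(a, shift)
        shift <<= 1
    return a

# Cross-check against r repeated multiplications by X.
a = list(range(1, 9))               # N = 8
b = a
for _ in range(11):
    b = [-b[-1]] + b[:-1]           # multiply by X once
assert barrel_shifter(a, 11) == b
```

In hardware, each stage is fixed wiring plus conditional sign flips, so the only logic cost is the per-stage MUX.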
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Figure 3: FPGA resource utilization (a) and throughput / resource utilization (b) of a size-256 FFT at different streaming widths. At iso-precision, SGen FFTs use 31-bit intermediate variables without a scaling schedule, and our FFTs use 29-bit intermediate variables with scaling.
To stream this operation, we define two streaming approaches: coefficient-wise streaming and bit-wise streaming. In coefficient-wise streaming, different coefficients of a polynomial are spaced temporally over different clock cycles. In bit-wise streaming, all coefficients arrive in parallel within the same clock cycle, but we instead divide different bit chunks of each coefficient over different clock cycles. One can then make a simple observation: a rotation is a difficult permutation to stream coefficient-wise, as it must interchange coefficients that are spaced over different clock cycles, but it is a simple operation to stream bit-wise, as we must simply rotate all the individual bit chunks. We therefore add stream-reordering blocks that switch a polynomial from coefficient-wise streaming to bit-wise streaming and vice versa. At the same time, we merge the stream-reordering with the folding operation of the negacyclic FFT, which packs coefficients a[i] and a[i + N/2]. The reordering block can be implemented at full throughput either in a R/W memory block or with a simple series of registers and MUXes.
Signed gadget decomposition involves taking unsigned 32-bit coefficients and decomposing them into l signed coefficients of β bits. In hardware, this involves a simple reinterpretation of the bits and conditional subtraction. We merge this logic at the output of monomial multiplication to take advantage of LUT packing. In the bit-wise streamed representation, these operations must track the propagating carries in flip-flops.
Gadget decomposition is approximate, e.g., for Parameter Set I, l·β = 16 bits < 32 bits. Contrary to software implementations, FPT employs a CMUX datapath that is natively adjusted to approximate gadget decomposition. We prematurely discard bits that would later be rounded, allowing us to stick to a native 16-bit datapath rather than growing back to 32 bits outside of the external product.
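A bit-level model of this approximate signed decomposition (our sketch for the Parameter Set I shape, l = 2 and β = 8; the actual datapath fuses it with monomial multiplication and keeps the carries in flip-flops) looks as follows:

```python
# Approximate signed gadget decomposition: round the unsigned w-bit
# coefficient to its top l*beta bits, then split into l signed digits in
# [-2^(beta-1), 2^(beta-1)), least-significant digit first.

def gadget_decompose(c, l=2, beta=8, w=32):
    """Returns (digits, carry_out)."""
    drop = w - l * beta
    v = (c + (1 << (drop - 1))) >> drop      # discard the bits that would be rounded away
    mask = (1 << beta) - 1
    digits, carry = [], 0
    for _ in range(l):
        d = (v & mask) + carry
        v >>= beta
        carry = 0
        if d >= 1 << (beta - 1):             # re-center into the signed range
            d -= 1 << beta
            carry = 1
        digits.append(d)
    return digits, carry

digits, carry = gadget_decompose(0x9ABCDEF0)
# The signed digits reconstruct the rounded 16-bit value exactly:
assert sum(d << (8 * i) for i, d in enumerate(digits)) + (carry << 16) \
       == (0x9ABCDEF0 + (1 << 15)) >> 16
assert all(-128 <= d < 128 for d in digits)
```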
4 COMPACT FIXED-POINT REPRESENTATION
FFT calculations involve irrational (complex) numbers, and approximation errors arise when those numbers are represented with finite precision during computation. However, if enough precision is used, implementations of TFHE tolerate these approximation errors. More specifically, one typically aims for the total approximation error to be lower than the noise inherently present in the FHE calculations.
On a CPU, the typical method is to use floating-point numbers with single or double precision. This is efficient due to the integration of an existing Floating-Point Unit (FPU), and it is therefore the typical representation of choice for software designers. CPU and GPU implementations of TFHE have been restricted to double-precision floating-point FFTs because single-precision FFTs were found to introduce too much noise to guarantee successful decryption of bootstrapped ciphertexts [11].
In dedicated hardware implementations, FPUs are not natively available and are costly to include. To simplify the implementation and the analysis of the approximation error, some prior implementations opted to change the scheme to work with a prime modulus instead of a power-of-two modulus [45, 62], allowing the use of exact NTTs instead of approximate FFTs for polynomial multiplication. The downside of this approach is that one needs to include costly modular reduction units.
FPT is the rst TFHE accelerator to instead utilize xed-point
calculations, which avoids the costly implementation of FPUs or
modular reduction units. Moreover, instead of initializing very
large xed-point calculations to guarantee sucient accuracy, we
conduct an in-depth analysis that optimizes the xed-point bitwidth
to be just large enough so that the approximation noise is smaller
than the inherent TFHE noise. FPT’s optimized approach in which
there is no need for a costly FPU or modular reduction unit allows
a more lean and ecient design, coming at the cost of a one-time
engineering eort to nd the optimal parameters.
The potential eect of our xed-point analysis on the area usage
of our implementation is illustrated in Figure 4. In this gure, we
Van Beirendonck et al.
Figure 4: FPGA Relative LUT and DSP utilization of a size-
256 FFT for various intermediate variable bitwidths.
plotted the LUT and DSP usage of a size-256 FFT, in function of
the bit width of the intermediate variables. The plot gives relative
numbers compared to the resource use at bitwidth 53 (loosely corre-
sponding to the signicand precision of IEEE 754 double-precision
oating-point). As illustrated, reducing the bitwidth of the inter-
mediate variables can result in a large reduction of the resource
utilization, with only 20% of the LUT and DSP usage for bitwidths
below 24.
Reduction of the bitwidth of intermediate variables relies on two parts: the location of the most significant bit and the location of the least significant bit. We will first look at our strategy to set the MSB position of intermediate variables, and then focus on the LSB.
4.1 Setting the MSBs
The location of the most significant bit is important to avoid overflows. If an overflow occurs, the intermediate variable will be completely distorted and thus the result of the calculation will be unusable. Two strategies can be adopted to deal with overflows: a worst-case approach, where one chooses parameters to avoid any overflow, or an average-case approach, where one allows overflows with sufficiently low probability.
Avoiding any overow comes at a signicant enlargement of the
parameters and thus at a signicant cost, which is why we adopt
the strategy to avoid overows with a maximal overow probability
of
𝑃𝑜 𝑓 =
2
64
. To determine the ideal MSB position we measure the
variance and then assume a Gaussian distribution to calculate the
overow probability. For a given MSB position
𝑝𝑀𝑆 𝐵
and standard
deviation 𝜎, the probability of overow is:
𝑃𝑜 𝑓 =𝑃[|𝜒|>2𝑝𝑀𝑆 𝐵 /2|𝜒$
N(0, 𝜎 )] (7)
=1erf 2𝑝𝑀𝑆𝐵
22𝜎.(8)
Using this equation we determine the lowest
𝑝𝑀𝑆 𝐵
that fullls the
maximal overow probability of
𝑃𝑜 𝑓 =
2
64
for each intermediate
variable.
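This search can be sketched directly from Eq. (7)-(8); the example σ below is hypothetical, and erfc(x) = 1 − erf(x) is used so the tiny probabilities do not round to zero:

```python
# Pick the smallest MSB position whose overflow probability stays below
# 2^-64, given a measured standard deviation (Section 4.1).
import math

def overflow_prob(p_msb, sigma):
    """P[|X| > 2^p_msb / 2] for X ~ N(0, sigma), i.e. Eq. (8)."""
    return math.erfc(2.0 ** p_msb / (2.0 * math.sqrt(2.0) * sigma))

def min_msb(sigma, p_of=2.0 ** -64):
    """Lowest MSB position meeting the overflow budget."""
    p = 0
    while overflow_prob(p, sigma) > p_of:
        p += 1
    return p

# Example with an assumed measured sigma of 900: the returned position
# meets the budget, while one bit less does not.
p = min_msb(900.0)
assert overflow_prob(p, 900.0) <= 2.0 ** -64 < overflow_prob(p - 1, 900.0)
```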
Parameter Set   (I)                    (II)
BK              FixedPoint26(7, 19)    FixedPoint27(8, 19)
FFT             FixedPoint29(15, 14)   FixedPoint30(18, 12)
IFFT            FixedPoint29(23, 6)    FixedPoint30(27, 3)

Table 2: Fixed-point data representations used by intermediate variables, in the format FixedPointwidth(integerBits, fractionalBits).
4.2 Setting the LSBs
The position of the least significant bits has an influence on the approximation noise that is introduced during the calculations. This approximation noise can be tolerated up to a certain level. More specifically, the approximation noise should be small enough that the combination of the approximation noise and the inherent TFHE noise still leads to a correct bootstrap with high probability. We divide the total acceptable noise, for which we use the theoretical noise bounds of [11], into two equal parts for the approximation noise and the inherent noise, thus allowing our approximation noise to be at most half the total acceptable noise.
In our design, we focus on three main parameters: the intermediate variable widths during the forward and inverse FFT calculations, and the bitwidth of the coefficients of the bootstrapping key BK. We assume the noise introduced due to each parameter is independent (as each parameter comes from a separate block in our design), which means that the variance of the total noise is equal to the sum of the variances of each noise source (σ²_tot = σ²_FFT + σ²_IFFT + σ²_BK). We then limit the noise variance due to each noise source to 1/3 of the total noise variance.
To nd optimal xed-point parameter values, we perform a pa-
rameter sweep by setting the parameters to very high widths (in our
example 53) resulting in very low noise, and then iteratively reduc-
ing one parameter until it hits the noise threshold while keeping the
other parameters at high widths. The result of this experiment can
be seen in Fig. 5, and our nal xed-point parameters are illustrated
in Table 2. Note that we give the IFFT data representation before
outputs are scaled by 1/𝑁.
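The sweep procedure can be sketched as follows (our toy model: the noise source here is plain round-to-nearest quantization and the budget is an arbitrary stand-in, whereas the paper derives its thresholds from the theoretical TFHE noise bounds, one third of the variance per source):

```python
# One-parameter-at-a-time sweep: empirically measure the noise of rounding
# to f fractional bits and lower f until the next reduction would exceed
# the noise budget.
import random

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(10000)]

def noise_std(f):
    """Empirical std of the round-to-nearest error at 2^-f resolution
    (theory: 2^-f / sqrt(12) for uniform rounding error)."""
    q = 2.0 ** -f
    errs = [round(x / q) * q - x for x in xs]
    mean = sum(errs) / len(errs)
    return (sum((e - mean) ** 2 for e in errs) / len(errs)) ** 0.5

def sweep(threshold, start=53):
    """Start from a conservative 53-bit-style width and shave fractional
    bits while the measured noise stays under the threshold."""
    f = start
    while f > 1 and noise_std(f - 1) <= threshold:
        f -= 1
    return f

f = sweep(threshold=1e-6)
assert noise_std(f) <= 1e-6 < noise_std(f - 1)
```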
4.3 Related and Future Work
One prior implementation proposing a custom hardware format for TFHE's FFTs is MATCHA [33], which proposes to use (integer) Dyadic-Value-Quantized Twiddle Factors (DVQTFs). Our fixed-point parameter analysis improves on MATCHA's in two key ways.
First, MATCHA only considers the bitwidth of twiddle factors, and sets a uniform bitwidth (either 38-bit or 64-bit) that is employed throughout their external product calculations. Our analysis instead shows that different intermediate variables can profit from different fixed-point representations, allowing for an overall smaller resource utilization (Fig. 5, Table 2). Moreover, our analysis allows us to quantize BK smaller than other parameters, limiting both on-chip BK_i buffers and off-chip bandwidth.
Second, in MATCHA, instead of measuring the noise variance, the authors conduct 10^8 tests for a parameter set to check that there are no decryption failures at the end of bootstrapping. The downside of this approach is that it becomes expensive when multiple parameters have to be set. Furthermore, this methodology does not give exact values of the failure probability, as one only has the information that no errors were found in 10^8 tests. Our approach of measuring the approximation noise and matching it with the theoretical noise bounds provides for a more rigorous and lean design.

Figure 5: Output approximation noise versus the number of fractional bits for the representation of the bootstrapping key and intermediate variables during the forward and inverse FFT.
Finally, we note that there are other intermediate variables that could be optimized, for example the widths of the twiddle factors in the FFT calculations. We heuristically set them to the width of the intermediate variables minus 4, which gave a good balance between failure probability and cost, as also explained in Section 3.4. Interesting future work could include a full search over all possible parameters, which could result in improved fixed-point parameters over our heuristic approach.
5 IMPLEMENTATION
We implemented FPT for a Xilinx Alveo U280 datacenter accelerator FPGA featuring 1.3M LUTs, 2.6M FFs, 9024 DSPs, and 41 MB of on-chip SRAM. For both parameter sets, we employ our FFT-unrolled architecture with a forward FFT streaming width of 128 complex coefficients per clock cycle, and an IFFT streaming width of 128/l = 64 complex coefficients per clock cycle. For Parameter Set II with N = 1024, we have also implemented the dotproduct-unrolled architecture with (k+1)·l = 4 forward FFT kernels and (k+1) = 2 IFFT kernels, both of uniform streaming width 32. At this datapoint, providing iso-throughput to the FFT-unrolled architecture, we found that the dotproduct-unrolled architecture incurs 10% more average DSP and LUT usage, and we therefore do not evaluate it further.
Our FFT-unrolled architectures feature massive throughput, completing one CMUX every (256/128)·(k+1)·l = 12 clock cycles for Parameter Set I, and every (512/128)·(k+1)·l = 16 clock cycles for Parameter Set II. The latency of the CMUX is larger: 156 cycles for Parameter Set I and 224 cycles for Parameter Set II. In both cases, we operate at peak throughput by filling the CMUX pipeline with a batch of ciphertexts, of sizes b = 156/12 = 13 and b = 224/16 = 14, respectively.
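The scheduling arithmetic above can be reproduced directly. Note that k and l are not stated explicitly in this excerpt; k = 2, l = 2 (Set I) and k = 1, l = 2 (Set II) are inferred from the quoted (k+1)·l products, the MAC counts in Table 3, and the IFFT streaming width of 128/l = 64:

```python
# Initiation interval of the streaming CMUX and the ciphertext batch size
# that fills its pipeline (Section 5 numbers).
import math

def cmux_schedule(fft_size, stream_width, k, l, cmux_latency):
    """Return (cycles per CMUX, ciphertexts in flight)."""
    ii = (fft_size // stream_width) * (k + 1) * l
    batch = math.ceil(cmux_latency / ii)
    return ii, batch

assert cmux_schedule(256, 128, 2, 2, 156) == (12, 13)   # Parameter Set I
assert cmux_schedule(512, 128, 1, 2, 224) == (16, 14)   # Parameter Set II
```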
5.1 External I/O
The Alveo U280 includes three different host-FPGA memory interfaces: 32 GB of DDR4, 8 GB of HBM accessed through 32 Pseudo-Channels (PCs), and 24 MB of PLRAM. PBS also requires three host-side inputs: a batch of b input ciphertexts c_in, the long-term bootstrapping key BK, and the test polynomial LUT F to evaluate over the ciphertext.
For the long-term bootstrapping key, we note that it is not absolutely necessary to instantiate a ping-pong BK_i buffer, as discussed in Section 3.2, on our target Alveo U280 FPGA. For our parameter sets and fixed-point-trimmed BK bitwidths, the full BK measures approximately 15 MB and fits entirely in a combination of the on-chip BRAM and URAM. Nevertheless, we instantiate a small ping-pong BK_i cache as a proof of concept. This requires an on-chip ping-pong buffer of only 2/n of the full size of BK, allowing our architecture to remain compute-bound on architectures with less on-chip SRAM, such as smaller FPGAs or heavily memory-trimmed ASICs. Moreover, our technique ensures that our architecture scales to new TFHE algorithms or related schemes like FHEW that increase the size of the bootstrapping key.
For our batch sizes b, the required BK bandwidth is only tens of GB/s, which we provide by splitting BK over a limited number of HBM PCs 0-7, each providing 14 GB/s of bandwidth. The input and output ciphertext batches are small and require negligible bandwidth, which we allocate in a single HBM PC. Each HBM channel is served by a separate AXI master on the PL side, which is R/W for the ciphertexts and read-only for BK. For the test polynomial LUT F, we allocate an on-chip RAM that can store a configurable number of test polynomials. Each input ciphertext is tagged with an index of the LUT to apply, and correspondingly the test polynomial F to select from the RAM as input to the first CMUX iteration. LUTs depend on the specific FHE program, are typically limited in number, and do not change often. For example, bootstrapped Boolean gates require only a single LUT. As such, we keep the RAM small, and we share the same HBM PC and AXI master that is used by the input and output ciphertexts.
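A back-of-the-envelope check of the "tens of GB/s" claim, under our modeling assumption that the full ~15 MB bootstrapping key streams on-chip exactly once per ciphertext batch:

```python
# BK bandwidth estimate: the full key is read once per batch, so the
# required off-chip bandwidth scales with batches per second.

def bk_bandwidth_gbps(bk_bytes, pbs_per_ms, batch_size):
    batches_per_second = pbs_per_ms * 1000.0 / batch_size
    return bk_bytes * batches_per_second / 1e9

bw = bk_bandwidth_gbps(15e6, 28.4, 13)   # Parameter Set I figures
assert 30.0 < bw < 35.0                  # ~33 GB/s, well under 8 PCs x 14 GB/s
```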
5.2 Xilinx Run Time Kernel
FPT is accessible from the host as a Xilinx Run Time (XRT) RTL kernel and managed through XRT API calls. FPT's CMUX pipeline features 100% utilization during a single ciphertext-batch bootstrap, and does not require complex kernel overlapping to reach peak throughput. To ensure that there are no pipeline bubbles between the bootstrapping of different batches, we allow early pre-fetching of the next ciphertext batch into an on-chip FIFO. As such, we build FPT to support the Vitis ap_ctrl_chain kernel block-level control protocol, which permits overlapping kernel executions and allows FPT to queue future ciphertext-batch base HBM addresses.
                 LUT        FF         DSP    BRAM
                 (40% av.)  (35% av.)  (61% av.)  (25% av.)
FPT (I)          526K       916K       5494   505
  CMUX           384K       707K       5494   310
    MAC (384×)   97K        114K       2304   310
    FFT256,128   159K       366K       2126   0
    IFFT256,64   97K        192K       1064   0
                 (46% av.)  (39% av.)  (66% av.)  (20% av.)
FPT (II)         595K       1024K      5980   412
  CMUX           458K       827K       5980   215
    MAC (256×)   66K        79K        1536   215
    FFT512,128   222K       449K       2958   0
    IFFT512,64   130K       255K       1486   0

Table 3: FPT hardware resource utilization breakdown for Parameter Sets I and II. DSP blocks are the main limiting resource, with up to 66% of available FPGA resources utilized by FPT.
5.3 Fixed-point Streaming Design in Chisel
While the outer host-FPGA communication logic of FPT is implemented in SystemVerilog, we use Chisel [4], an open-source HDL embedded in Scala, to construct the inner streaming CMUX kernel. Like SystemVerilog, Chisel is a full-fledged HDL with direct constructs to describe synthesizable combinational and sequential logic, and not a High-Level Synthesis (HLS) language. Our motivation to select Chisel over SystemVerilog for the CMUX is that it makes the full capabilities of the Scala language available to describe circuit generators. We make heavy use of object-oriented and functional programming tools to describe our CMUX streaming architecture for a configurable streaming width, and in both realizations shown in Fig. 2. Moreover, Chisel has a rich type system that is further supported by external libraries. In FPT, the existing DspComplex[FixedPoint] is the main hardware datatype that we use within our architecture. Building on existing FixedPoint test infrastructure that we extended for FPT, our experiments in Section 4 are run directly on the Chisel-generated Verilog rather than an intermediate fixed-point software model.
6 EVALUATION AND COMPARISON
6.1 Resource Utilization
FPT is implemented using Xilinx Vivado 2022.2 and packaged as an XRT kernel using Vitis 2022.2, targeting a clock frequency of 200 MHz. Table 3 presents a resource utilization breakdown of FPT for both Parameter Sets I and II. In both cases, DSP blocks are the main limiting resource that prevents increasing to the next available streaming width, with up to 66% of available DSP blocks utilized by FPT. Note that whereas Fig. 2 presented our ping-pong BK buffer as a monolithic memory block, it is physically split into many smaller memory blocks that are placed inside the MAC units that consume them.
6.2 PBS Benchmarks
Table 4 compares FPT quantitatively with a number of prior TFHE baselines. For our CPU baseline, we benchmark single-core PBS in CONCRETE [12] on an Intel Xeon Silver 4208 CPU at 2.1 GHz. A recent ASIC baseline is provided by MATCHA [33], which presents emulations of a 36.96 mm² ASIC in a 16nm PTM process technology. As FPGA baseline, we include a recent architecture of Ye et al. [62], which was developed concurrently with our work and significantly improves the prior baseline of Gener et al. [26]. We refer to this architecture by the author initials YKP, and we also include YKP's benchmarks of cuFHE [15], a GPU-based implementation benchmarked on an NVIDIA GeForce RTX 3090 GPU at 1.7 GHz, in our comparison.
The main design goal of FPT is PBS throughput. Table 4 illustrates the massive PBS throughput that is enabled through FPT's streaming architecture: 937× more than CONCRETE, 7.1× more than YKP, and 2.5× more than MATCHA or cuFHE.
In FPT's current instantiation, we did not optimize for latency. As the PCIe and AXI latencies of communicating the input and output ciphertext batches are negligible, FPT's PBS latency is mostly determined by its CMUX pipeline depth. In this work, we kept the CMUX pipeline depth large, fitting b ciphertexts and enabling small off-chip bandwidth through our batched bootstrapping technique. Lower-latency implementations of FPT can opt to decrease the CMUX pipeline depth, requiring either more off-chip bandwidth to load BK or caching the full BK on-chip. Nevertheless, FPT's latency even in its current instantiation is competitive with MATCHA. We note that FPT is estimated at 99 W total on-chip power (FPGA and HBM), offering a similar TP/W as MATCHA (40 W) and significantly more than cuFHE (>200 W) or YKP (50 W).
6.3 Related Work
Qualitatively, FPT makes different design choices than either YKP or MATCHA. MATCHA is built after the classical CPU approach to FHE accelerators. It includes a set of TGGSW clusters with external product cores that operate from a register file. As one result, MATCHA is bottlenecked by data movement and cache memory access conflicts.
YKP is an HLS-based architecture that redefines TFHE to use the NTT, breaking compatibility with existing TFHE libraries like CONCRETE and disabling the fixed-point optimizations of FPT. At the architectural level, YKP includes some concepts also employed by FPT. Similar to FPT, they include a pipelined implementation of the CMUX that processes multiple ciphertext instances. However, unlike FPT, which builds a single streaming CMUX PE with a large and configurable streaming width, YKP implements and instantiates multiple smaller CMUX PEs with inferior TP/A. Each CMUX pipeline instance in YKP includes an SRAM that stores a coefficient of BK_i. However, unlike FPT, where these SRAMs are loaded from off-chip memory in ping-pong fashion, YKP loads coefficients from DRAM only after a full coefficient has been consumed. This limits the number of CMUX PEs they can instantiate to what the off-chip memory bandwidth supports, whereas FPT's design choices make it entirely compute-bound.
               Param. Set  Platform / Resources (LUT / FF / DSP / BRAM)  Clock (MHz)  Latency (ms)  TP (PBS/ms)
FPT            (I)         526K / 916K / 5494 / 17.5Mb                   200          0.48          28.4
               (II)        595K / 1024K / 5980 / 14.5Mb                  200          0.58          25.0
YKP [62]       (II)        842K / 662K / 7202 / 338Mb                    180          3.76          3.5
               (II)        442K / 342K / 6910 / 409Mb                    180          1.88          2.7
MATCHA [33]    (II)        36.96mm² ASIC, 16nm PTM                       2000         0.2           10
CONCRETE [12]  (I)         Intel Xeon Silver 4208                        2100         33            0.03
               (II)        Intel Xeon Silver 4208                        2100         32            0.03
cuFHE [15]     (II)        NVIDIA GeForce RTX 3090                       1700         9.34          9.6

Table 4: Comparison of TFHE PBS on a variety of platforms.
Both MATCHA and YKP focus on an algorithmic technique called bootstrapping key unrolling. This technique unrolls m iterations of the Blind Rotation loop (Algorithm 1, Line 2), requiring an (exponentially) more expensive CMUX equation and a larger BK, but reducing the total number of iterations from n to n/m. In FPT's design spirit of maximum throughput, bootstrapping key unrolling is a bad trade-off. Already at m = 2, the adapted CMUX requires 3× more external products and 3× larger bootstrapping keys for only 2× fewer iterations, resulting in inherently smaller overall PBS TP/A. Bootstrapping key unrolling is essential to extract parallelism for designs like MATCHA and YKP, which have many smaller functional units with inferior TP/A. FPT, with its FFT-unrolled architecture and large streaming width, finds ample parallelism and larger TP/A within a single CMUX.
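The trade-off can be captured in a small cost model (ours, for k = 1, matching the m = 2 figures quoted in the text):

```python
# Bootstrapping-key unrolling cost model for k = 1: an unrolled CMUX over
# m key bits needs 2^m - 1 external products (and key elements), while the
# blind-rotation iteration count only drops by a factor m.

def unroll_cost(m):
    """(products per unrolled CMUX, total external products relative to
    the non-unrolled m = 1 loop)."""
    products = 2 ** m - 1
    return products, products / m

assert unroll_cost(1) == (1, 1.0)
assert unroll_cost(2) == (3, 1.5)   # 1.5x total work for 2x fewer iterations
```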
For completeness, we note that both MATCHA and YKP include key-switching as an operation of PBS. Key-switching includes coefficient-wise multiplication of a TLWE ciphertext with a key-switching key. We opted not to include key-switching in FPT, because different FHE programs may choose to key-switch either before or after PBS [11]. Nevertheless, key-switching is an operation with much lower throughput requirements than the CMUX [62]. In FPT, key-switching of the output ciphertext can be supported without throughput penalty (but with slightly increased latency) by instantiating a few integer multipliers on the AXI write-back path.
7 CONCLUSION
In this paper, we introduced FPT, an accelerator for the Torus Fully Homomorphic Encryption (TFHE) scheme. In contrast to previous FHE architectures, our design follows a streaming approach with high throughput and low control overhead. Owing to a batched design and balanced streaming architecture, our accelerator is the first FHE bootstrapping implementation that is compute-bound and not memory-bound, with small data caches and 100% utilization of the arithmetic units. Instead of using an NTT or floating-point FFT, FPT achieves a significant throughput increase by utilizing up to 80% area-reduced fixed-point FFTs with compact and optimized variable representations. In the end, FPT achieves a TFHE bootstrapping throughput of 28.4 bootstrappings per millisecond, which is 937× faster than CPU implementations, 7.1× faster than a concurrent FPGA implementation, and 2.5× faster than state-of-the-art ASIC and GPU designs.
ACKNOWLEDGMENTS
This work was supported in part by CyberSecurity Research Flanders with reference number VR20192203, the Research Council KU Leuven (C16/15/058), the Horizon 2020 ERC Advanced Grant (101020005 Belfort), and the AMD Xilinx University Program through the donation of a Xilinx Alveo U280 datacenter accelerator card. Michiel Van Beirendonck is funded by FWO as Strategic Basic (SB) PhD fellow (project number 1SD5621N). Jan-Pieter D'Anvers is funded by FWO (Research Foundation Flanders) as junior postdoctoral fellow (contract number 133185).
Finally, the authors would like to thank Wouter Legiest for experimenting with a variety of FFT generator tools.
REFERENCES
[1]
Rashmi Agrawal, Leo de Castro, Guowei Yang, Chiraag Juvekar, Rabia Tugce
Yazicigil, Anantha P. Chandrakasan, Vinod Vaikuntanathan, and Ajay Joshi.
2022. FAB: An FPGA-based Accelerator for Bootstrappable Fully Homomorphic
Encryption. CoRR abs/2207.11872 (2022). https://doi.org/10.48550/arXiv.2207.
11872 arXiv:2207.11872
[2]
Alfred V. Aho, John E. Hopcroft, and Jerey D. Ullman. 1974. The Design and
Analysis of Computer Algorithms. Addison-Wesley.
[3]
Michael Armbrust, Armando Fox, Rean Grith, Anthony D. Joseph, Randy H.
Katz, Andy Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica,
and Matei Zaharia. 2010. A view of cloud computing. Commun. ACM 53, 4 (2010),
50–58. https://doi.org/10.1145/1721654.1721672
[4]
Jonathan Bachrach, Huy Vo, Brian C. Richards, Yunsup Lee, Andrew Water-
man, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. 2012. Chisel:
constructing hardware in a Scala embedded language. In The 49th Annual Design
Automation Conference 2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012,
Patrick Groeneveld, Donatella Sciuto, and Soha Hassoun (Eds.). ACM, 1216–1225.
https://doi.org/10.1145/2228360.2228584
[5]
Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun, and Khin Mi Mi Aung.
2018. High-Performance FV Somewhat Homomorphic Encryption on GPUs: An
Implementation using CUDA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2
(2018), 70–95. https://doi.org/10.13154/tches.v2018.i2.70-95
[6]
Daniel J. Bernstein. 2007. The Tangent FFT. In Applied Algebra, Algebraic Al-
gorithms and Error-Correcting Codes, 17th International Symposium, AAECC-17,
Bangalore, India, December 16-20, 2007, Proceedings (Lecture Notes in Computer
Science, Vol. 4851), Serdar Boztas and Hsiao-feng Lu (Eds.). Springer, 291–300.
https://doi.org/10.1007/978-3- 540-77224- 8_34
[7]
Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully
Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory
6, 3 (2014), 13:1–13:36. https://doi.org/10.1145/2633600
[8]
Wei-Hsin Chang and Truong Q. Nguyen. 2008. On the Fixed-Point Accuracy
Analysis of FFT Algorithms. IEEE Trans. Signal Process. 56, 10-1 (2008), 4673–4682.
https://doi.org/10.1109/TSP.2008.924637
Van Beirendonck et al.
[9] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yong Soo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 10624), Tsuyoshi Takagi and Thomas Peyrin (Eds.). Springer, 409–437. https://doi.org/10.1007/978-3-319-70694-8_15
[10] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2016. Faster Fully Homomorphic Encryption: Bootstrapping in Less Than 0.1 Seconds. In Advances in Cryptology - ASIACRYPT 2016 - 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, December 4-8, 2016, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 10031), Jung Hee Cheon and Tsuyoshi Takagi (Eds.). 3–33. https://doi.org/10.1007/978-3-662-53887-6_1
[11] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020. TFHE: Fast Fully Homomorphic Encryption Over the Torus. J. Cryptol. 33, 1 (2020), 34–91. https://doi.org/10.1007/s00145-019-09319-x
[12] Ilaria Chillotti, Marc Joye, Damien Ligier, Jean-Baptiste Orfila, and Samuel Tap. 2020. CONCRETE: Concrete operates on ciphertexts rapidly by extending TfhE. In WAHC 2020 - 8th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, Vol. 15.
[13] Ilaria Chillotti, Marc Joye, and Pascal Paillier. 2021. Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks. In Cyber Security Cryptography and Machine Learning - 5th International Symposium, CSCML 2021, Be’er Sheva, Israel, July 8-9, 2021, Proceedings (Lecture Notes in Computer Science, Vol. 12716), Shlomi Dolev, Oded Margalit, Benny Pinkas, and Alexander A. Schwarzmann (Eds.). Springer, 1–19. https://doi.org/10.1007/978-3-030-78086-9_1
[14] Ainhoa Cortés, Igone Vélez, Ibon Zalbide, Andoni Irizar, and Juan F. Sevillano. 2008. An FFT Core for DVB-T/DVB-H Receivers. VLSI Design 2008 (2008), 610420:1–610420:9. https://doi.org/10.1155/2008/610420
[15] Wei Dai and Berk Sunar. 2015. cuHE: A Homomorphic Encryption Accelerator Library. In Cryptography and Information Security in the Balkans - Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, September 3-4, 2015, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 9540), Enes Pasalic and Lars R. Knudsen (Eds.). Springer, 169–186. https://doi.org/10.1007/978-3-319-29172-7_11
[16] Leo de Castro, Rashmi Agrawal, Rabia Tugce Yazicigil, Anantha P. Chandrakasan, Vinod Vaikuntanathan, Chiraag Juvekar, and Ajay Joshi. 2021. Does Fully Homomorphic Encryption Need Compute Acceleration? CoRR abs/2112.06396 (2021). arXiv:2112.06396 https://arxiv.org/abs/2112.06396
[17] Florent de Dinechin and Bogdan Pasca. 2011. Designing Custom Arithmetic Data Paths with FloPoCo. IEEE Des. Test Comput. 28, 4 (2011), 18–27. https://doi.org/10.1109/MDT.2011.44
[18] Yarkin Doröz, Erdinç Öztürk, and Berk Sunar. 2015. Accelerating Fully Homomorphic Encryption in Hardware. IEEE Trans. Computers 64, 6 (2015), 1509–1521. https://doi.org/10.1109/TC.2014.2345388
[19] Léo Ducas and Daniele Micciancio. 2015. FHEW: Bootstrapping Homomorphic Encryption in Less Than a Second. In Advances in Cryptology - EUROCRYPT 2015 - 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April 26-30, 2015, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 9056), Elisabeth Oswald and Marc Fischlin (Eds.). Springer, 617–640. https://doi.org/10.1007/978-3-662-46800-5_24
[20] Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptol. ePrint Arch. (2012), 144. http://eprint.iacr.org/2012/144
[21] M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. https://doi.org/10.1109/JPROC.2004.840301
[22] Mario Garrido. 2021. A Survey on Pipelined FFT Hardware Architectures. Journal of Signal Processing Systems (06 Jul 2021). https://doi.org/10.1007/s11265-021-01655-1
[23] Mario Garrido, Jesús Grajal, Miguel A. Sánchez Marcos, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (2013), 23–32. https://doi.org/10.1109/TVLSI.2011.2178275
[24] Mario Garrido, Konrad Möller, and Martin Kumm. 2019. World’s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s. IEEE Trans. Circuits Syst. I Regul. Pap. 66-I, 4 (2019), 1507–1516. https://doi.org/10.1109/TCSI.2018.2886626
[25] Robin Geelen, Michiel Van Beirendonck, Hilder V. L. Pereira, Brian Huffman, Tynan McAuley, Ben Selfridge, Daniel Wagner, Georgios Dimou, Ingrid Verbauwhede, Frederik Vercauteren, and David W. Archer. 2022. BASALISC: Flexible Asynchronous Hardware Accelerator for Fully Homomorphic Encryption. CoRR abs/2205.14017 (2022). https://doi.org/10.48550/arXiv.2205.14017 arXiv:2205.14017
[26] Serhan Gener, Parker Newton, Daniel Tan, Silas Richelson, Guy Lemieux, and Philip Brisk. 2021. An FPGA-based Programmable Vector Engine for Fast Fully Homomorphic Encryption over the Torus. In SPSL: Secure and Private Systems for Machine Learning (ISCA Workshop).
[27] Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, Michael Mitzenmacher (Ed.). ACM, 169–178. https://doi.org/10.1145/1536414.1536440
[28] Craig Gentry and Shai Halevi. 2011. Implementing Gentry’s Fully-Homomorphic Encryption Scheme. In Advances in Cryptology - EUROCRYPT 2011 - 30th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings (Lecture Notes in Computer Science, Vol. 6632), Kenneth G. Paterson (Ed.). Springer, 129–148. https://doi.org/10.1007/978-3-642-20465-4_9
[29] Craig Gentry, Amit Sahai, and Brent Waters. 2013. Homomorphic Encryption from Learning with Errors: Conceptually-Simpler, Asymptotically-Faster, Attribute-Based. In Advances in Cryptology - CRYPTO 2013 - 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (Lecture Notes in Computer Science, Vol. 8042), Ran Canetti and Juan A. Garay (Eds.). Springer, 75–92. https://doi.org/10.1007/978-3-642-40041-4_5
[30] Gisselquist Technology, LLC. [n. d.]. A Generic Pipelined FFT Core Generator. https://github.com/ZipCPU/dblclockfft
[31] Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. 2019. Logistic Regression on Homomorphic Encrypted Data at Scale. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 9466–9471. https://doi.org/10.1609/aaai.v33i01.33019466
[32] Shousheng He and Mats Torkelson. 1996. A New Approach to Pipeline FFT Processor. In Proceedings of IPPS ’96, The 10th International Parallel Processing Symposium, April 15-19, 1996, Honolulu, Hawaii, USA. IEEE Computer Society, 766–770. https://doi.org/10.1109/IPPS.1996.508145
[33] Lei Jiang, Qian Lou, and Nrushad Joshi. 2022. MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus. In DAC ’22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10 - 14, 2022, Rob Oshana (Ed.). ACM, 235–240. https://doi.org/10.1145/3489517.3530435
[34] Marc Joye. 2022. SoK: Fully Homomorphic Encryption over the [Discretized] Torus. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 4 (2022), 661–692. https://doi.org/10.46586/tches.v2022.i4.661-692
[35] Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 4 (2021), 114–148. https://doi.org/10.46586/tches.v2021.i4.114-148
[36] Jongmin Kim, Gwangho Lee, Sangpyo Kim, Gina Sohn, Minsoo Rhu, John Kim, and Jung Ho Ahn. 2022. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse. In 55th IEEE/ACM International Symposium on Microarchitecture, MICRO 2022, Chicago, IL, USA, October 1-5, 2022. IEEE, 1237–1254. https://doi.org/10.1109/MICRO56248.2022.00086
[37] Sangpyo Kim, Jongmin Kim, Michael Jaemin Kim, Wonkyung Jung, John Kim, Minsoo Rhu, and Jung Ho Ahn. 2022. BTS: an accelerator for bootstrappable fully homomorphic encryption. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 711–725. https://doi.org/10.1145/3470496.3527415
[38] Igor Kononenko. 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Medicine 23, 1 (2001), 89–109. https://doi.org/10.1016/S0933-3657(01)00077-X
[39] Joon-Woo Lee, HyungChul Kang, Yongwoo Lee, Woosuk Choi, Jieun Eom, Maxim Deryabin, Eunsang Lee, Junghyun Lee, Donghoon Yoo, Young-Sik Kim, and Jong-Seon No. 2022. Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network. IEEE Access 10 (2022), 30039–30054. https://doi.org/10.1109/ACCESS.2022.3159694
[40] Charles E. Leiserson and James B. Saxe. 1991. Retiming Synchronous Circuitry. Algorithmica 6, 1 (1991), 5–35. https://doi.org/10.1007/BF01759032
[41] Ahmet Can Mert, Aikata, Sunmin Kwon, Youngsam Shin, Donghoon Yoo, Yongwoo Lee, and Sujoy Sinha Roy. 2022. Medha: Microcoded Hardware Accelerator for Computing on Encrypted Data. CoRR abs/2210.05476 (2022). https://doi.org/10.48550/arXiv.2210.05476 arXiv:2210.05476
[42] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. [n. d.]. Spiral DFT/FFT IP Core Generator. https://www.spiral.net/hardware/dftgen.html
[43] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. 2012. Computer Generation of Hardware for Linear Digital Signal Processing Transforms. ACM Trans. Design Autom. Electr. Syst. 17, 2 (2012), 15:1–15:33. https://doi.org/10.1145/2159542.2159547
[44] Mohammed Nabeel, Deepraj Soni, Mohammed Ashraf, Mizan Abraha Gebremichael, Homer Gamil, Eduardo Chielle, Ramesh Karri, Mihai Sanduleanu, and Michail Maniatakos. 2022. CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution. CoRR abs/2204.08742 (2022). https://doi.org/10.48550/arXiv.2204.08742 arXiv:2204.08742
[45] NuCypher. [n. d.]. NuFHE, a GPU-powered Torus FHE implementation. https://github.com/nucypher/nufhe/
[46] Nicolas Papernot, Patrick D. McDaniel, Arunesh Sinha, and Michael P. Wellman. 2018. SoK: Security and Privacy in Machine Learning. In 2018 IEEE European Symposium on Security and Privacy, EuroS&P 2018, London, United Kingdom, April 24-26, 2018. IEEE, 399–414. https://doi.org/10.1109/EuroSP.2018.00035
[47] Thomas Pöppelmann, Michael Naehrig, Andrew Putnam, and Adrián Macías. 2015. Accelerating Homomorphic Evaluation on Reconfigurable Hardware. In Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9293), Tim Güneysu and Helena Handschuh (Eds.). Springer, 143–163. https://doi.org/10.1007/978-3-662-48324-4_8
[48] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria E. Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and S. S. Iyengar. 2019. A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv. 51, 5 (2019), 92:1–92:36. https://doi.org/10.1145/3234150
[49] M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. 2020. HEAX: An Architecture for Computing on Encrypted Data. In ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1295–1309. https://doi.org/10.1145/3373376.3378523
[50] Ronald L. Rivest, Len Adleman, Michael L. Dertouzos, et al. 1978. On data banks and privacy homomorphisms. Foundations of Secure Computation 4, 11 (1978), 169–180.
[51] Sujoy Sinha Roy, Furkan Turan, Kimmo Järvinen, Frederik Vercauteren, and Ingrid Verbauwhede. 2019. FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data. In 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, DC, USA, February 16-20, 2019. IEEE, 387–398. https://doi.org/10.1109/HPCA.2019.00052
[52] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald G. Dreslinski, Christopher Peikert, and Daniel Sánchez. 2021. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. In MICRO ’21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18-22, 2021. ACM, 238–252. https://doi.org/10.1145/3466752.3480070
[53] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Nathan Manohar, Nicholas Genise, Srinivas Devadas, Karim Eldefrawy, Chris Peikert, and Daniel Sánchez. 2022. CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 173–187. https://doi.org/10.1145/3470496.3527393
[54] François Serre and Markus Püschel. 2018. A DSL-Based FFT Hardware Generator in Scala. In 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018. IEEE Computer Society, 315–322. https://doi.org/10.1109/FPL.2018.00060
[55] François Serre and Markus Püschel. 2019. DSL-Based Modular IP Core Generators: Example FFT and Related Structures. In 26th IEEE Symposium on Computer Arithmetic, ARITH 2019, Kyoto, Japan, June 10-12, 2019, Naofumi Takagi, Sylvie Boldo, and Martin Langhammer (Eds.). IEEE, 190–191. https://doi.org/10.1109/ARITH.2019.00043
[56] François Serre and Markus Püschel. 2020. DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks. ACM Trans. Reconfigurable Technol. Syst. 13, 1 (2020), 1:1–1:23. https://doi.org/10.1145/3359754
[57] Furkan Turan, Sujoy Sinha Roy, and Ingrid Verbauwhede. 2020. HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA. IEEE Trans. Computers 69, 8 (2020), 1185–1196. https://doi.org/10.1109/TC.2020.2988765
[58] Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, and Berk Sunar. 2012. Accelerating fully homomorphic encryption using GPU. In IEEE Conference on High Performance Extreme Computing, HPEC 2012, Waltham, MA, USA, September 10-12, 2012. IEEE, 1–5. https://doi.org/10.1109/HPEC.2012.6408660
[59] P. Welch. 1969. A fixed-point fast Fourier transform error analysis. IEEE Transactions on Audio and Electroacoustics 17, 2 (1969), 151–157. https://doi.org/10.1109/TAU.1969.1162035
[60] Xilinx. [n. d.]. Vitis DSP Library. https://xilinx.github.io/Vitis_Libraries/dsp
[61] Xilinx. 2022. Fast Fourier Transform v9.1. LogiCORE IP Product Guide. PG109.
[62] Tian Ye, Rajgopal Kannan, and Viktor K. Prasanna. 2022. FPGA Acceleration of Fully Homomorphic Encryption over the Torus. In IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, September 19-23, 2022. IEEE, 1–7. https://doi.org/10.1109/HPEC55821.2022.9926381
[63] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. 2020. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access 8 (2020), 58443–58469. https://doi.org/10.1109/ACCESS.2020.2983149
[64] Zama. 2022. Announcing Concrete-core v1.0.0-gamma with GPU acceleration. https://www.zama.ai/post/concrete-core-v1-0-gamma-with-gpu-acceleration