FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic
Encryption
Michiel Van Beirendonck, Jan-Pieter D’Anvers, Ingrid Verbauwhede
{firstname.lastname}@esat.kuleuven.be
COSIC KU Leuven
Leuven, Belgium
ABSTRACT
Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool that must be invoked after every encrypted gate computation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double- or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs.
FPT’s microarchitecture is built as a streaming processor inspired by traditional streaming DSPs: it instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. FPT’s streaming approach allows 100% utilization of arithmetic units and requires only small bootstrapping key caches, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35 µs. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which tends to be heavily memory- and bandwidth-constrained.
FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves almost three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5× higher throughput compared to recent ASIC emulation experiments.
1 INTRODUCTION AND MOTIVATION
Machine Learning (ML), driven by the availability of an abundance of data, has seen rapid advances in recent years [48], leading to new applications from autonomous driving [63] to medical diagnosis [38]. In many applications, ML models are developed by one party, who makes them available to users as a cloud service [3]. The deployment of such applications comes at the risk of privacy breaches, where the user data might be leaked, or IP theft, where users steal the ML model from the developing party [46].
The “silver bullet” solution to prevent the leakage of this data is to encrypt it with Fully Homomorphic Encryption (FHE) [27, 50], which is a technique that allows one to compute on encrypted data.
Figure 1: FHE allows outsourced computation on data that remains encrypted. The cloud receives encrypted data on which it can compute and the public key (green), but does not receive the secret decryption key (red). The cloud can run computations on the data, but only the client can finally decrypt and obtain the result. Cloud instances with FPGAs enable custom hardware accelerators and have the potential to drastically speed up FHE computations.
Fig. 1 illustrates a possible application of FHE to protect user data in an ML environment. In this scenario, a client wants to use an online-server-based ML service without leaking any sensitive data. To this end, the client encrypts their data with FHE before sending it to the cloud. The cloud service then computes an FHE program on the encrypted data without obtaining any information about the input and sends the (still encrypted) result back to the client. Only the client can finally decrypt and obtain the result.
The drawback of FHE is that it is at the moment still orders of magnitude slower than unencrypted calculations. The first algorithm to calculate an encrypted AND gate took up to 30 minutes to finish [28]. FHE schemes and algorithms have seen significant improvements in recent years; e.g., the recent TFHE scheme computes encrypted AND gates in only 13 ms [10, 11] on a CPU. However, even with these improvements, it is not uncommon to still see slowdown factors of 10,000× compared to calculations on unencrypted data [13, 31, 39], which currently still prevents practical deployment of FHE in many applications.
To work around the speed limitations of FHE, designers have shifted their focus from general-purpose CPUs to more dedicated hardware implementations. Of these dedicated implementations, GPU-based FHE accelerators are easiest to develop, but they typically only provide modest speedups [5, 15, 35, 58]. At the other end of the spectrum, ASIC emulations in advanced technology nodes promise better FHE acceleration [25, 36, 37, 52, 53]. However, it can take years for these ASICs to be fabricated and become available
arXiv:2211.13696v1 [cs.CR] 24 Nov 2022
[44], and they are typically specialized for a limited range of parameter sets. Finally, FPGA-based implementations can be developed more quickly than ASIC implementations, are flexible to change parameter sets, and can be readily deployed in FPGA-equipped cloud instances while boasting large speedups. As a result, they have been a popular target for FHE acceleration [1, 18, 41, 47, 49, 51, 57].
One costly operation in FHE calculations is bootstrapping. All currently available FHE schemes have an inherent noise that increases with each operation. After a certain number of operations, this noise needs to be reduced to allow further calculations, which is done using the so-called bootstrapping procedure.
Second-generation FHE schemes BFV [20], BGV [7], and CKKS [9] have been the main focus of prior hardware accelerators. These schemes require bootstrapping only after a certain number of operations. For these schemes, bootstrapping is a complex algorithm that requires large data caches [16] and exhibits low arithmetic intensity, and essentially all prior architectures that support bootstrapping have hit the off-chip memory-bandwidth wall [37, 52].
Third-generation schemes like FHEW [19] and its successor Torus FHE (TFHE) [10, 11] have revisited the bootstrapping approach, making it cheaper but inherently linked to homomorphic calculations. In these schemes, most of the homomorphic operations require an immediate bootstrap of the ciphertext. Moreover, bootstrapping in TFHE is a versatile tool, which can additionally be “programmed” with an arbitrary function that is applied to the ciphertext, e.g. non-linear activation functions in ML neural networks [13]. This approach is called Programmable Bootstrapping (PBS) and it constitutes the main cost of TFHE homomorphic calculations. Taking up to 99% of an encrypted gate computation, PBS is a prime target for high-throughput hardware acceleration of TFHE.
In this work, we propose FPT, an FPGA-based accelerator for TFHE Programmable Bootstrapping. FPT achieves a significant speedup over the previous state-of-the-art, which is attributable to two major contributions:
• FPT’s microarchitecture is built as a streaming processor, challenging the established classical CPU approach to FHE bootstrapping accelerators. Inspired by traditional streaming DSPs, FPT instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. FPT’s streaming approach allows 100% utilization of arithmetic units during bootstrapping, including tool-generated high-radix and heavily optimized negacyclic FFT units with user-configurable streaming widths. Our streaming architecture is discussed in Section 3.
• FPT (Fixed-Point TFHE) is the first hardware accelerator to extensively optimize the representation of intermediate variables. TFHE PBS is dominated by FFT calculations, which work on irrational (complex) numbers and need to be implemented with sufficient accuracy. Instead of using double floating-point arithmetic or large integers as in previous works, FPT implements PBS entirely with compact fixed-point arithmetic. We analyze in-depth the noise due to the compact fixed-point representation that we use inside PBS, and we match it to the noise that is natively present in FHE. Through this analysis, FPT is able to use fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs. In turn, these 50% smaller fixed-point representations enable up to 80% smaller FFT kernels. Our fixed-point analysis is discussed in Section 4.
FPT shows, for the first time, that PBS can remain entirely compute-bound with only small bootstrapping key data caches. FPT achieves a massive PBS throughput of 1 PBS / 35 µs, which requires only modest off-chip memory bandwidth, and is entirely bound by the logic resources on our target Xilinx Alveo U280 FPGA. This represents almost three orders of magnitude speedup over the popular TFHE software library CONCRETE [12] on an Intel Xeon Silver 4208 CPU at 2.1 GHz, a factor 7.1× speedup over a concurrently-developed FPGA architecture [62], and a factor 2.5× speedup over recent 16nm ASIC emulation experiments [33].
2 BACKGROUND
This section gives an intuitive idea of the workings of TFHE, with a focus on the Programmable Bootstrapping step that is accelerated by FPT. We refer the reader to [10, 11, 34] for a more in-depth overview of TFHE.
2.1 Torus Fully Homomorphic Encryption
Torus Fully Homomorphic Encryption (TFHE) is a homomorphic encryption scheme based on the Learning With Errors (LWE) problem. It operates on elements that are defined over the real torus T = R/Z, i.e. the set [0, 1) of real numbers modulo 1. In practice, torus elements are discretized as 32-bit or 64-bit integers.
A TFHE ciphertext can be constructed by combining three elements: a secret vector 𝑠 with 𝑛 coefficients following a uniform binary distribution 𝑠 ← U(B^𝑛), a public vector 𝑎 ← U(T^𝑛) sampled from a uniform distribution, and a small error 𝑒 ← 𝜒 from a small distribution 𝜒(T). A message 𝜇 ∈ T can be encrypted as a tuple: 𝑐 = (𝑎, 𝑏 = 𝑎·𝑠 + 𝑒 + 𝜇) ∈ T^(𝑛+1). Using the secret 𝑠, one can decrypt the ciphertext back into (a noisy version of) the message by computing 𝑏 − 𝑎·𝑠 = 𝜇 + 𝑒. This type of ciphertext is called a Torus LWE (TLWE) ciphertext.
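The encryption and decryption equations above can be mimicked with plain integers. The following toy sketch (function names and noise parameters are our own illustrative choices, not the API of any TFHE library, and the parameters are far too small to be meaningful for security) discretizes the torus as 32-bit integers and shows that decryption recovers the message up to the small error 𝑒:

```python
import random

Q = 1 << 32  # the torus T discretized as 32-bit integers

def tlwe_encrypt(mu, s, noise_bits=8):
    # c = (a, b = a·s + e + mu) with a uniform and e small.
    n = len(s)
    a = [random.randrange(Q) for _ in range(n)]
    e = random.randrange(-(1 << noise_bits), (1 << noise_bits) + 1)
    b = (sum(ai * si for ai, si in zip(a, s)) + e + mu) % Q
    return (a, b)

def tlwe_decrypt(c, s):
    # b - a·s = mu + e: a noisy version of the message.
    a, b = c
    return (b - sum(ai * si for ai, si in zip(a, s))) % Q

s = [random.randrange(2) for _ in range(586)]  # binary secret, n = 586 (Set I)
mu = Q // 8                                    # message encoded in the top bits
noisy = tlwe_decrypt(tlwe_encrypt(mu, s), s)
dist = (noisy - mu) % Q
err = min(dist, Q - dist)                      # |e|, the decryption noise
```

Because the message sits in the most-significant bits while the error stays in the least-significant bits, rounding the decrypted value removes the noise, which is exactly the slack that bootstrapping must maintain.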
TFHE additionally describes two variant ciphertexts. First, a generalized version (TGLWE), where 𝑒 and 𝜇 are polynomials in T_𝑁[𝑋] = T[𝑋]/(𝑋^𝑁 + 1), and where 𝑎 and 𝑠 are vectors of polynomials of the form T_𝑁[𝑋]^𝑘. The TGLWE ciphertext is then similarly formed as a tuple: 𝑐 = (𝑎, 𝑏 = 𝑎·𝑠 + 𝑒 + 𝜇) ∈ T_𝑁[𝑋]^(𝑘+1). The second variant is a generalized version of a GSW [29] ciphertext (TGGSW), which is essentially a matrix where each row is a TGLWE ciphertext: 𝑐 ∈ T_𝑁[𝑋]^((𝑘+1)𝑙×(𝑘+1)).
The reason for defining TGLWE and TGGSW ciphertexts is that they permit a homomorphic multiplication:
TGLWE(𝜇₁) ⊡ TGGSW(𝜇₂) = TGLWE(𝜇₁·𝜇₂),
known as the External Product (⊡). First, it decomposes each of the polynomials in the TGLWE ciphertext into 𝑙 polynomials of 𝛽 bits, an operation termed gadget decomposition. Next, the decomposed TGLWE ciphertext and TGGSW are multiplied in a (𝑘+1)𝑙-vector times (𝑘+1)𝑙×(𝑘+1)-matrix product, where the elements of this vector and matrix are polynomials in T_𝑁[𝑋]. The output is again a TGLWE ciphertext encrypting 𝜇₁·𝜇₂.
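The gadget decomposition step can be sketched in a few lines. Note this is the unsigned variant for illustration only; TFHE implementations typically use a signed, rounded decomposition, and the function names here are hypothetical:

```python
def gadget_decompose(x, beta, l, q_bits=32):
    # Split the top l*beta bits of a q_bits-wide torus element into
    # l digits of beta bits each (most-significant digit first).
    return [(x >> (q_bits - j * beta)) & ((1 << beta) - 1)
            for j in range(1, l + 1)]

def gadget_recompose(digits, beta, l, q_bits=32):
    # sum_j d_j * 2^(q_bits - j*beta) approximates x up to the dropped low bits.
    return sum(d << (q_bits - j * beta)
               for j, d in enumerate(digits, start=1))

x = 0xDEADBEEF
d = gadget_decompose(x, beta=8, l=2)    # Parameter Set I uses beta=8, l=2
approx = gadget_recompose(d, beta=8, l=2)
```

The decomposition error |x − approx| stays below 2^(q_bits − l·β); this deliberately dropped tail is one of the noise sources that the fixed-point analysis of Section 4 must account for.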
2.2 Programmable Bootstrapping
The main goal of bootstrapping is to reduce the noise in the ciphertext. One way to reduce the ciphertext noise would be to decrypt the ciphertext, after which the noise can be suppressed, but this would not be secure. Bootstrapping does in essence decrypt the ciphertext, but for security reasons this operation is performed homomorphically, inside the encrypted domain. This means that one wants to homomorphically compute 𝑏 − 𝑎·𝑠 = 𝑒 + 𝜇, and more specifically, as it is “programmable” bootstrapping, one wants to additionally compute a function 𝑓(𝜇) on the data.
To achieve this programmable bootstrapping, one first sets a “test” polynomial 𝐹 = Σ_{𝑖=0}^{𝑁−1} 𝑓(𝑖)·𝑋^𝑖 ∈ T_𝑁[𝑋] that encodes 𝑁 relevant values of the function 𝑓. This polynomial is then rotated by 𝑏 − 𝑎·𝑠 positions by calculating 𝐹·𝑋^{−(𝑏−𝑎·𝑠)}, after which the output of the function can be found in the first position of the resulting polynomial. However, all of these calculations should be done without revealing the value of 𝑠.
The high-level idea of how to achieve this is to first rewrite the above expression as follows:
𝐹·𝑋^{−(𝑏−𝑎·𝑠)} = 𝐹·𝑋^{−𝑏} · ∏_{𝑖=1}^{𝑛} 𝑋^{𝑎_𝑖·𝑠_𝑖}. (1)
This expression can be calculated iteratively. Starting with the polynomial 𝐴𝐶𝐶 = 𝐹·𝑋^{−𝑏}, one iteratively calculates:
𝐴𝐶𝐶 ← 𝐴𝐶𝐶 · 𝑋^{𝑎_𝑖·𝑠_𝑖}, (2)
which can be further rewritten, using the fact that 𝑠_𝑖 is either zero or one, to:
𝐴𝐶𝐶 ← (𝐴𝐶𝐶·𝑋^{𝑎_𝑖} − 𝐴𝐶𝐶)·𝑠_𝑖 + 𝐴𝐶𝐶. (3)
However, as we cannot reveal 𝑠_𝑖, we encode the 𝑠_𝑖 value in a TGGSW ciphertext 𝐵𝐾_𝑖, and the 𝐴𝐶𝐶 value in a TGLWE ciphertext, after which the expression becomes:
𝐴𝐶𝐶 ← (𝐴𝐶𝐶·𝑋^{𝑎_𝑖} − 𝐴𝐶𝐶) ⊡ 𝐵𝐾_𝑖 + 𝐴𝐶𝐶, (4)
using the homomorphic multiplication operation ⊡. Eq. (4) homomorphically multiplexes on the secret value 𝑠_𝑖, and is known as the Controlled MUX (CMUX).
Collectively, the different TGGSW ciphertexts 𝐵𝐾_1, . . . , 𝐵𝐾_𝑛, each encrypting one secret coefficient 𝑠_1, . . . , 𝑠_𝑛, are known as the bootstrapping key. The result of the operations described above is a TGLWE accumulator 𝐴𝐶𝐶 which is “blindly” rotated by a secret amount of 𝑏 − 𝑎·𝑠 positions, from which the output TLWE ciphertext can be straightforwardly extracted. The computations during PBS are given in Algorithm 1.
FPT implements two parameter sets of TFHE, given in Table 1. Parameter Set I is a parameter set used by the CONCRETE Boolean library with 128-bit security [12]. Parameter Set II is a 110-bit security parameter set that has previously been employed for benchmarking purposes, allowing a direct comparison of FPT with prior work [11, 64].
Algorithm 1: TFHE’s Programmable Bootstrapping
input  : TLWE ciphertext 𝑐_in = (𝑎_1, . . . , 𝑎_𝑛, 𝑏) ∈ T^(𝑛+1)
input  : TGGSW bootstrapping key 𝐵𝐾 = (𝐵𝐾_1, . . . , 𝐵𝐾_𝑛) ∈ T_𝑁[𝑋]^(𝑛×(𝑘+1)𝑙×(𝑘+1))
input  : TGLWE test polynomial LUT 𝐹 ∈ T_𝑁[𝑋]^(𝑘+1)
output : TLWE ciphertext 𝑐_out ∈ T^(𝑘𝑁+1)
1  𝐴𝐶𝐶 ← 𝐹·𝑋^{−𝑏}                                            /* Test polynomial LUT */
2  for 𝑖 ← 1 to 𝑛 do                                          /* Blind rotation */
3      𝐴𝐶𝐶 ← (𝐴𝐶𝐶·𝑋^{⌊2𝑁𝑎_𝑖/𝑞⌉} − 𝐴𝐶𝐶) ⊡ 𝐵𝐾_𝑖 + 𝐴𝐶𝐶          /* CMUX */
4  end
5  return 𝑐_out = SampleExtract(𝐴𝐶𝐶)
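The blind rotation can be checked on plaintext data: with the encryption stripped away, so that the 𝑠_𝑖 are bare bits and the CMUX of Eq. (3) is an ordinary multiplexer, the loop must rotate 𝐹 by exactly −(𝑏 − 𝑎·𝑠) positions. A small sketch (illustrative only; real PBS operates on ciphertexts via Eq. (4), and these function names are our own):

```python
def monomial_mul(poly, k):
    # poly * X^k mod X^N + 1: rotation with a sign flip on wrap-around.
    n = len(poly)
    k %= 2 * n
    out = [0] * n
    for i, c in enumerate(poly):
        j = (i + k) % (2 * n)
        if j < n:
            out[j] += c
        else:
            out[j - n] -= c
    return out

def blind_rotate_plain(F, a, b, s):
    acc = monomial_mul(F, -b)                # ACC <- F * X^(-b)
    for ai, si in zip(a, s):                 # CMUX of Eq. (3), with bare bits
        acc = [(r - x) * si + x for r, x in zip(monomial_mul(acc, ai), acc)]
    return acc

F = list(range(8))                           # toy test polynomial, N = 8
a, b, s = [3, 5, 2], 4, [1, 0, 1]
rot = sum(ai * si for ai, si in zip(a, s)) - b   # -(b - a·s)
```

Each iteration multiplies the accumulator by X^(a_i·s_i), so the final accumulator equals monomial_mul(F, rot), i.e. 𝐹·𝑋^{−(𝑏−𝑎·𝑠)}.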
Parameter Set            (I)    (II)
TLWE dimension 𝑛         586    500
TGLWE dimension 𝑘          2      1
Polynomial size 𝑁        512   1024
Decomp. base log 𝛽         8     10
Decomp. level 𝑙            2      2
Table 1: Parameter sets: (I) is a parameter set used by the CONCRETE Boolean library [12] with 128-bit security. (II) is a 110-bit security parameter set popular for benchmarking purposes [11, 64].
2.3 FFT polynomial multiplications
As can be seen in Algorithm 1, TFHE programmable bootstrapping mainly consists of the iterative calculation of the external product ⊡, which is a vector-matrix multiplication where the elements are large polynomials of order 𝑁. Bootstrapping is therefore dominated by the calculation of the polynomial multiplications.
A schoolbook approach to polynomial multiplication would result in a computational complexity of 𝑂(𝑁²). However, utilizing the convolution theorem, the FFT can be used to compute these polynomial multiplications in time 𝑂(𝑁 log(𝑁)), as the multiplication of polynomials modulo 𝑋^𝑁 − 1 corresponds to a cyclic convolution of the input vectors. FHE schemes, however, need polynomial multiplications modulo 𝑋^𝑁 + 1, requiring negacyclic FFTs to compute negative-wrapped convolutions. This negacyclic convolution has a period of 2𝑁, and thus a straightforward implementation would require size-2𝑁 FFTs. The cost of the negacyclic FFT on real input data can be reduced using two techniques.
The fact that the FFT computes on complex numbers offers the first opportunity for optimization. Since the input polynomials are purely real and have an imaginary component equal to zero, real-to-complex (r2c) optimized FFTs can be used, which achieve roughly a factor-of-two improvement in speed and memory usage [21]. This is the approach taken by the TFHE and FHEW software libraries, which compute size-2𝑁 r2c FFTs.
A second possible optimization is that negacyclic FFTs, which would have a period and size of 2𝑁, can be computed instead as a regular FFT with period and size 𝑁 by using a “twisting” pre-processing step [2]. During twisting, the coefficients of the input polynomial 𝑎 are multiplied with the powers of the 2𝑁-th root of unity 𝜓 = 𝜔_{2𝑁}:
â = (𝑎[0], 𝜓𝑎[1], . . . , 𝜓^{𝑁−1}𝑎[𝑁−1]). (5)
After twisting, one can perform multiplication using a regular cyclic FFT on â, halving the required FFT size to 𝑁.
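The twisting identity is easy to verify numerically: twist both inputs by powers of 𝜓, take a size-𝑁 cyclic convolution (here via a naive O(N²) transform standing in for the FFT), and untwist. A sketch with illustrative function names:

```python
import cmath

def dft(v, sign):
    # Naive cyclic DFT (sign=-1) and unscaled inverse (sign=+1); FFT stand-in.
    m = len(v)
    return [sum(v[i] * cmath.exp(sign * 2j * cmath.pi * k * i / m)
                for i in range(m)) for k in range(m)]

def negacyclic_mul_twist(a, b):
    # Multiply polynomials mod X^N + 1 via psi-twisting and a size-N cyclic FFT.
    n = len(a)
    psi = cmath.exp(1j * cmath.pi / n)          # 2N-th root of unity
    at = [a[i] * psi ** i for i in range(n)]    # twisting, Eq. (5)
    bt = [b[i] * psi ** i for i in range(n)]
    fc = [x * y for x, y in zip(dft(at, -1), dft(bt, -1))]
    ct = [x / n for x in dft(fc, +1)]           # cyclic convolution of twists
    return [(ct[i] * psi ** -i).real for i in range(n)]

# matches the schoolbook product mod X^4 + 1:
# (1 + 2X + 3X^2 + 4X^3)(5 + 6X + 7X^2 + 8X^3) = -56 - 36X + 2X^2 + 60X^3
out = [round(v) for v in negacyclic_mul_twist([1, 2, 3, 4], [5, 6, 7, 8])]
```

The sign flip on wrapped terms appears automatically, because 𝜓^(𝑚+𝑁) = −𝜓^𝑚.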
While both optimizations are well-known individually, it is less
straightforward to combine them. Intuitively, the twisting step is
incompatible with the r2c optimization, because it will make the
polynomial complex.
We use a third, but not-so-well-known, technique from NuFHE [45] based on the tangent FFT [6]. The crux of this method is to “fold” polynomial coefficients 𝑎[𝑖] and 𝑎[𝑖+𝑁/2] into a complex number 𝑎[𝑖] + 𝑗·𝑎[𝑖+𝑁/2] before applying the twisting step and the subsequent cyclic size-𝑁/2 FFT. This quarters the size of the required FFT compared to the original naive size-2𝑁 FFT. We adopt this technique in FPT and use FFTs of size 𝑁/2 = 256 and 𝑁/2 = 512 for Parameter Sets I and II (Table 1), respectively.
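The folding trick can likewise be checked numerically. The sketch below (naive DFTs in place of the streaming FFT kernels; function names are illustrative, not NuFHE's API) folds each real input into 𝑁/2 complex points, twists, and multiplies through a size-𝑁/2 cyclic transform:

```python
import cmath

def dft(v, sign):
    # Naive cyclic transform; sign selects the exponent direction.
    m = len(v)
    return [sum(v[i] * cmath.exp(sign * 2j * cmath.pi * k * i / m)
                for i in range(m)) for k in range(m)]

def fwd(a):
    # Fold a[i], a[i+N/2] into one complex point, twist by the 2N-th root psi,
    # then a cyclic size-N/2 transform: a negacyclic FFT of the real input.
    n = len(a)
    psi = cmath.exp(1j * cmath.pi / n)
    t = [(a[i] + 1j * a[i + n // 2]) * psi ** i for i in range(n // 2)]
    return dft(t, +1)

def negacyclic_mul_fold(a, b):
    n = len(a)
    psi = cmath.exp(1j * cmath.pi / n)
    fc = [x * y for x, y in zip(fwd(a), fwd(b))]   # pointwise product
    t = [x / (n // 2) for x in dft(fc, -1)]        # inverse transform
    c = [0.0] * n
    for i in range(n // 2):                        # untwist and unfold
        u = t[i] * psi ** -i
        c[i], c[i + n // 2] = u.real, u.imag
    return c

# same product mod X^4 + 1 as before, now through a size-N/2 = 2 transform:
out = [round(v) for v in negacyclic_mul_fold([1, 2, 3, 4], [5, 6, 7, 8])]
```

Only 𝑁/2 transform points carry all the information, because for a real input the negacyclic spectrum at conjugate evaluation points is redundant.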
3 FPT MICROARCHITECTURE
In this section, we discuss FPT’s microarchitecture. First, we describe how FPT’s architecture is designed as a streaming processor targeting maximum throughput. Next, we detail a batch bootstrapping technique, which significantly reduces FPT’s on-chip caches and off-chip bandwidth. Finally, we present balanced implementations of the various computational stages, which enable 100% utilization of the arithmetic units during FPT’s bootstrapping operation.
3.1 Streaming Processor
FHE accelerators for second-generation schemes have mostly been built after a classical CPU architecture [25, 36, 52]. They include a control unit that executes an instruction set, together with a set of arithmetic Processing Elements (PEs) that support different operations, e.g. ciphertext multiplication, key-switching, or bootstrapping. Different operations utilize different PEs, requiring careful profiling of FHE programs to balance PE relative throughputs and utilization [37, 53].
These accelerators are often memory-bound during bootstrapping, and in order to keep a high utilization level of PEs, an increasing focus is spent on optimizing the memory hierarchy, often including a multi-layer on-chip memory hierarchy with a large ciphertext register file at the lowest level.
FPT challenges this established classical CPU approach to FHE bootstrapping acceleration, and instead adopts a microarchitecture that is inspired by streaming Digital Signal Processors (DSPs). Data flows naturally through FPT’s wide and directly cascaded computational stages, with simplified hard-wired routing paths and without complicated control logic. During FPT’s bootstrapping operation, utilization of arithmetic units is 100%.
As illustrated in Fig. 2, FPT defines only a single fixed PE, the CMUX PE, and instantiates only a single instance of this PE with wide datapaths and massive throughput. Taking advantage of the regular structure of TFHE’s PBS, consisting of 𝑛 repeated CMUX iterations, this single high-throughput PE suffices to run PBS to completion. The CMUX PE computes a single PBS CMUX iteration, after which its datapath output hard-wires back into its datapath input.
Internally, the CMUX PE computes a fixed sequence of monomial multiplication, gadget decomposition, and polynomial multiply-add operations of the external product. Rather than dividing the CMUX into sub-PEs that are sequenced to run from a register file, FPT builds the CMUX with directly cascaded computational stages. Stages are throughput-balanced in the most conceivably simple way: each stage operates at the same throughput and processes a number of polynomial coefficients per clock cycle that we call the streaming width. Stages are interconnected in a simple fixed pipeline with static latency, avoiding complicated control logic and simplifying routing paths.
FPT is built to achieve maximum PBS throughput. As a general trend that we will detail later (Fig. 3b), the Throughput/Area (TP/A) of computational stages increases together with the streaming width. This motivates FPT to instantiate only a single wide CMUX PE with a high streaming width, as opposed to many CMUX PEs with smaller streaming widths.
In summary, FPT’s CMUX architecture enables massive PBS throughput by more closely resembling the architecture of a streaming Digital Signal Processor (DSP), rather than the classical CPU architecture employed by prior FHE processors.
3.2 Batch Bootstrapping
TFHE bootstrapping requires two major inputs: the input ciphertext coefficients 𝑎_1, . . . , 𝑎_𝑛 and the bootstrapping keys 𝐵𝐾_1, . . . , 𝐵𝐾_𝑛. Each iteration of the CMUX PE requires one element of both. The ciphertext coefficients 𝑎_𝑖 are relatively small in size and are therefore easy to accommodate. In contrast, a bootstrapping key coefficient 𝐵𝐾_𝑖 ∈ T_𝑁[𝑋]^((𝑘+1)𝑙×(𝑘+1)) is a large matrix of up to tens of kBs.
Since the full 𝐵𝐾 is typically too large to fit entirely on-chip, the 𝐵𝐾_𝑖 must be loaded from off-chip memory for every iteration. However, at high CMUX throughput levels, the required bandwidth for 𝐵𝐾_𝑖 could easily exceed 1.0 TB/s. This is larger even than the bandwidth of HBM, and thus poses a memory bottleneck.
We propose a method, termed batch bootstrapping, to amortize loading the bootstrapping key for each iteration. The result is that FPT can operate entirely compute-bound, with modest off-chip bandwidth and small on-chip caches. In contrast, prior FHE processors that supported bootstrapping of second-generation schemes were often bottlenecked by the required memory bandwidth [37, 52]. In fact, a recent architectural analysis of bootstrapping [16] found that it exhibits low arithmetic intensity and requires large caches. Their conclusion was that FHE processors only benefit marginally from bespoke high-throughput arithmetic units. With our design, we show that the situation can be very different for TFHE’s PBS.
In FPT, we solve the memory bottleneck problem as follows. First, due to internal pipelining, the latency of the CMUX will be much larger than its throughput. To operate at peak throughput, FPT processes multiple ciphertexts to keep its CMUX pipeline stages full. Next, we enforce that the different ciphertexts processed concurrently in the CMUX’s pipeline stages arrive in a single batch of size 𝑏, encrypted under the same 𝐵𝐾. This ensures that these
Figure 2: FPT’s microarchitecture. FPT instantiates only a single PE, the CMUX PE. The CMUX is built with wide, directly cascaded datapaths, targeting massive throughput. In light grey are illustrated two throughput-balanced architectures for the external product (with 𝑘 = 1, 𝑙 = 2): dotproduct-unrolled (left) and FFT-unrolled (right). Host-FPGA communication includes three different interfaces: an input ciphertext FIFO, a ping-pong bootstrapping-key buffer, and a test polynomial 𝐹 SRAM.
ciphertexts are at the same CMUX iteration, and as a result, all require the exact same input coefficient 𝐵𝐾_𝑖.
Batch bootstrapping then proceeds as follows. We instantiate a simple BRAM ping-pong buffer that holds two coefficients of 𝐵𝐾. The CMUX reads 𝐵𝐾_𝑖 from one half with the required bandwidth of 1.0 TB/s, while the off-chip memory fills 𝐵𝐾_{𝑖+1} inside the other half with a bandwidth of 1.0/𝑏 TB/s. In a technique similar to C-slow retiming [40], we can arbitrarily increase the batch size 𝑏 by introducing more pipeline registers within the CMUX, without throughput penalty. With a batch size of 𝑏 = 16, the required bandwidth can already be supplied by DDR4 instead of HBM.
Our simple but crucial batch bootstrapping technique exploits locality of reference to decouple the on-chip bandwidth from the off-chip bandwidth. As a result, in our architecture, TFHE’s PBS is entirely compute-bound with only kB-size caches, no larger than the size of two coefficients of the bootstrapping key.
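The bandwidth amortization can be illustrated with back-of-the-envelope arithmetic. The 8-byte FFT-domain coefficient width below is our own assumption for illustration, not FPT's exact storage format:

```python
# Parameter Set I (Table 1); coeff_bytes = 8 is an assumed element width.
n, N, k, l = 586, 512, 2, 2
coeff_bytes = 8
bk_i = (k + 1) * l * (k + 1) * N * coeff_bytes  # one TGGSW matrix BK_i, bytes

pbs_per_sec = 1 / 35e-6                         # 1 PBS / 35 us
onchip_bw = n * bk_i * pbs_per_sec              # BK read bandwidth, bytes/s

batch = 16
offchip_bw = onchip_bw / batch                  # amortized over one batch
```

Under these assumptions, 𝐵𝐾_𝑖 is roughly 72 kB ("tens of kBs"), the on-chip read stream exceeds 1.0 TB/s, and batching divides the off-chip requirement by exactly 𝑏.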
3.3 Balancing the External Product
The external product (⊡), computing a vector-matrix negacyclic polynomial product, represents the bulk of the CMUX logic. As discussed before, the polynomial multiplications are performed using an FFT, and thus the (⊡) operations include forward and inverse negacyclic FFT computations, and pointwise dot-products with 𝐵𝐾_𝑖 (the bootstrapping key 𝐵𝐾_𝑖 is already in the FFT domain).
In a streaming architecture, it is important to balance the throughputs of processing elements, which is not trivial as the external product includes (𝑘+1)𝑙 forward FFTs, but only (𝑘+1) inverse FFT operations. We explore two different throughput-balanced architectures for the external product, as shown in light grey in Fig. 2: a dotproduct-unrolled architecture (left) and an FFT-unrolled architecture (right).
The dotproduct-unrolled architecture (left) represents the more obvious choice for parallelism, where we instantiate 𝑙 times more FFT kernels compared to IFFT kernels. With the FFT-unrolled architecture on the right, we make a more unconventional choice: we balance throughputs by instantiating the FFT with 𝑙 times the streaming width of the IFFT. These two architectural trade-offs can be understood as exploiting different types of “loop unrolling” inside the external product. On the left, we first loop-unroll the dot-product before unrolling the FFT, while on the right, we loop-unroll the FFT maximally.
The drawback of the FFT-unrolled architecture is that it is more complex than the dotproduct-unrolled one. First, multiply-add operations must be replaced by MACs, since polynomial coefficients that must be added are now spaced temporally over different clock cycles. Second, the inverse FFT can only start processing once a full MAC has been completed, requiring a Parallel-In Serial-Out (PISO) block that double-buffers the MAC output and matches throughputs. Third and most importantly, FFT blocks can be challenging to unroll and implement for arbitrary throughputs, and supporting two FFT blocks with differing throughputs requires non-negligible extra engineering effort.
The main advantage of the more unconventional FFT-unrolled architecture is that it features fewer FFT kernels that can therefore utilize higher streaming widths. As we will detail in the next section, this favors the general (and often-neglected) trend of pipelined FFTs, which typically feature significantly higher TP/A as the streaming width increases. At the most extreme end, a fully parallel FFT is a circuit with only constant multiplications and fixed routing paths, featuring up to 300% more throughput per DSP or per LUT on our target FPGA (Fig. 3b). FPT alleviates the extra engineering effort and extra complexity of the FFT-unrolled architecture by extending and optimizing an existing FFT generator tool to support negacyclic FFTs.
3.4 Streaming Negacyclic FFTs
State-of-the-art FHE processors have implemented mostly iterative FFTs or NTTs that process polynomials in multiple passes [1, 25, 41, 49]. In these architectures, it can be difficult to support arbitrary throughputs, as memory conflicts arise when each pass requires data at different strides. Instead, FPT instantiates pipelined FFTs that naturally support a streaming architecture. Pipelined FFT architectures consist of log(𝑁) stages that are connected in series. The main advantage of these architectures is that they process a continuous flow of data, which lends itself well to a fully streaming external product design.
There are many pipelined FFT architectures that target high throughput and support arbitrary streaming widths, and we refer to [22] for a recent survey. Generally, pipelined FFTs cascade two types of units: first, the well-known butterflies with complex twiddle factor multipliers, and, second, shuffling circuits that compute stride permutations. Pipelined FFTs feature a large design space, with different possible overall architectures, area/precision trade-offs in computing twiddle factor “rotations”, varying radix structures that determine which twiddle factors appear at which stages, and more. As such, they are an excellent target for tool-generated circuits, and we follow this approach for FPT.
Several FFT generator tools have been proposed in the literature. Some IP cores do not offer the massive parallelism and arbitrary streaming widths that we target for FPT [30, 61]. At the other end of the spectrum, a recent generator [24] built on top of FloPoCo [17] can only generate fully-parallel FFTs, instead of supporting arbitrary streaming widths. We synthesized at different streaming widths the High-Level Synthesis (HLS) Super Sample Rate (SSR) FFTs included in the Vitis DSP libraries of Xilinx [60], but found that they are outperformed by the RTL Verilog FFTs generated by the Spiral FFT IP Core generator [43]. Unfortunately, Spiral is not open-source and offers only a web interface towards its generated RTL [42].
Eventually, we settled on SGen [54–56] as the FFT generator tool that provided the necessary configurability, extensibility, and performance we targeted for FPT. SGen is an open-source generator implemented in Scala and employs concepts introduced in Spiral. It generates arbitrary-streaming-width FFTs through four Intermediate Representations (IRs) with different levels of optimization: an algorithm-level representation SPL, a streaming-block-level representation Streaming-SPL, an acyclic streaming IR, and an RTL-level IR. Apart from the streaming width, SGen features a configurable FFT point size, radix, and hardware arithmetic representations such as fixed-point, IEEE754 floating-point, or FloPoCo floating-point.
Most importantly, SGen is fully open-source and extensible, which we make heavy use of to generate streaming FFTs for FPT. First, we have extended SGen with operators for the forward and inverse twisting step, necessary to support negacyclic FFTs (Section 2.3). Next, we have implemented a set of optimizations aimed at higher precision and better TP/A. In this category, first, we have extended SGen with radix-2^𝑘 structures [23, 32], finding that radix-2^4 FFTs are on average 10% smaller than SGen-generated radix-4 or radix-16 FFTs. Second, we replace schoolbook complex multiplication in SGen, requiring 4 real multiplies and 2 real additions, with a variant of Karatsuba multiplication that is sometimes attributed to Gauss:
𝑋 + 𝑗𝑌 = (𝐴 + 𝑗𝐵)·(𝐶 + 𝑗𝐷)
𝑍 = 𝐶(𝐴 − 𝐵)
𝑋 = (𝐶 − 𝐷)𝐵 + 𝑍
𝑌 = (𝐶 + 𝐷)𝐴 − 𝑍 (6)
By pre-computing 𝐶 − 𝐷 and 𝐶 + 𝐷 for the constant twiddle factors, this multiplication requires only 3 real multiplies and 3 adds, saving scarce FPGA DSP units.
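Eq. (6) is easy to sanity-check in software. In the sketch below, 𝐶 − 𝐷 and 𝐶 + 𝐷 are computed inline, whereas the hardware precomputes them for the constant twiddles:

```python
def gauss_cmul(a, b, c, d):
    # (a + jb)(c + jd) with 3 real multiplies instead of 4 (Eq. 6).
    z = c * (a - b)
    x = (c - d) * b + z   # real part: ac - bd
    y = (c + d) * a - z   # imaginary part: ad + bc
    return x, y

# agrees with the 4-multiply schoolbook rule (ac - bd, ad + bc)
```

Expanding the three products confirms the identity: 𝑋 = 𝐶𝐵 − 𝐷𝐵 + 𝐶𝐴 − 𝐶𝐵 = 𝐴𝐶 − 𝐵𝐷 and 𝑌 = 𝐶𝐴 + 𝐷𝐴 − 𝐶𝐴 + 𝐶𝐵 = 𝐴𝐷 + 𝐵𝐶.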
Third, we decouple the twiddle bit-width from the input bit-width. On one hand, this allows us to take advantage of the asymmetric 27×18 multipliers found in FPGA DSP blocks, while at the same time, it has been found that twiddles can be quantized with approximately four fewer bits without affecting output noise [8, 14].
Finally, as data grows throughout the FFT stages, it must initially be padded with zeros to prevent overflows. We have extended SGen with a scaling schedule that instead divides the data by two whenever the most-significant bit must grow. Since the least-significant bits have mostly accumulated noise [59], scaling increases the precision for a fixed input bit-width. Adding a scaling schedule allows us, on average, to use FFTs with 2-bit smaller fixed-point intermediate variables while meeting the same precision targets, which proves crucial to efficiently map multipliers to DSP units, as will be detailed later in Section 4, Fig. 4.
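The effect of a scaling schedule can be illustrated with a toy radix-2 pass (illustrative only; the real schedule is derived per FFT stage inside our SGen extension): whenever the next stage would grow the MSB past the fixed width, the data is halved, dropping a noise-dominated LSB instead of reserving a zero-padded MSB up front.

```python
def scale_if_needed(vals, width):
    """Halve the block when the next radix-2 stage (which can grow
    magnitudes by up to 2x) would overflow a signed `width`-bit word."""
    limit = 1 << (width - 1)
    if 2 * max(abs(v) for v in vals) >= limit:
        return [v >> 1 for v in vals], 1   # drop one noisy LSB
    return vals, 0

# Toy: three radix-2 butterfly stages over 14-bit signed data.
width, shifts = 14, 0
vals = [5000, -6000, 3000, 4500]
for _ in range(3):
    vals, s = scale_if_needed(vals, width)
    shifts += s
    vals = [vals[0] + vals[1], vals[0] - vals[1],
            vals[2] + vals[3], vals[2] - vals[3]]
assert all(abs(v) < (1 << (width - 1)) for v in vals)  # never overflowed
```

In this toy the decision is data-dependent; in the generated hardware the schedule is fixed per stage at elaboration time, so each shift is a free wiring change rather than a runtime shifter.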
Figure 3a illustrates the resource usage of negacyclic size-256 FFTs produced by our optimized variant of SGen at different streaming widths. To quantify our improvements over SGen, we also add cyclic FFTs both with and without our introduced changes to the tool. Our changes result in significantly fewer logic resources: over 60% fewer DSP blocks are utilized while keeping LUT counts comparable. As DSP blocks are the main limiting resource for FPT (Table 3), our optimizations are a key enabler to building FPT with high streaming widths.
Figure 3b illustrates our main motivation to propose the FFT-unrolled architecture for the external product. We plot the relative throughput per area unit (DSPs or LUTs) of tool-generated FFTs for different streaming widths. The trend is clear: FFTs with higher streaming widths feature up to 300% more throughput per DSP or per LUT. Intuitively, as the streaming width increases, FFTs can take more advantage of the native strengths of hardware circuits. First, shuffling circuits with MUXes and storage blocks are replaced with fixed routing paths. Second, twiddle-factor multipliers can be specialized to the specific set of twiddles they need to handle, taking advantage of optimized algorithms for Single- or Multiple-Constant Multiplication (SCM, MCM).
3.5 Other operations
Compared to the external product and its streaming FFTs, the remainder of the CMUX (Fig. 2, dark grey) represents mostly simple circuitry: additions, subtractions, gadget decomposition, and monomial multiplication. Whereas the first three can be streamed straightforwardly, monomial multiplication requires special treatment.
Monomial multiplication multiplies the accumulator ACC with the ciphertext-dependent monomial X^⌊2N·a_i/q⌉. Its effect is to rotate the polynomials of ACC by ⌊2N·a_i/q⌉, and additionally to negate those coefficients that wrap around. First, we truncate 2N·a_i/q already in software to limit host-FPGA bandwidth. Next, an efficient architecture for monomial multiplication is a coefficient-wise barrel shifter in log(N) stages.
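A bit-accurate software model of that barrel shifter (a sketch, not FPT's RTL) conditionally rotates by 2^j in stage j and negates the coefficients that wrap, matching direct multiplication by X^r mod X^N + 1:

```python
def negacyclic_rotate(a, r):
    """Reference: coefficients of a(X) * X^r mod X^N + 1."""
    n = len(a)
    out = [0] * n
    for i, c in enumerate(a):
        j = (i + r) % (2 * n)
        out[j % n] += c if j < n else -c
    return out

def barrel_rotate(a, r):
    """log2(N) barrel-shifter stages; stage j rotates by 2^j when bit j
    of r is set, negating the wrapped coefficients. Bit log2(N) of r
    (i.e. r >= N) is a plain negation, since X^N = -1."""
    n = len(a)
    r %= 2 * n
    if r >= n:
        a, r = [-c for c in a], r - n
    for j in range(n.bit_length() - 1):
        if (r >> j) & 1:
            s = 1 << j
            # Python's negative indexing gives the wrap-around a[(i-s) mod n]
            a = [-a[i - s] if i < s else a[i - s] for i in range(n)]
    return a

assert all(barrel_rotate(list(range(8)), r) == negacyclic_rotate(list(range(8)), r)
           for r in range(16))
```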
Figure 3: FPGA resource utilization (a) and throughput / resource utilization (b) of a size-256 FFT at different streaming widths. At iso-precision, SGen FFTs use 31-bit intermediate variables without a scaling schedule, and our FFTs use 29-bit intermediate variables with scaling.
To stream this operation, we define two streaming approaches: coefficient-wise streaming and bit-wise streaming. In coefficient-wise streaming, different coefficients of a polynomial are spaced temporally over different clock cycles. In bit-wise streaming, all coefficients arrive in parallel within the same clock cycle, but we instead divide different bit chunks of each coefficient over different clock cycles. One can then make a simple observation: a rotation is a difficult permutation to stream coefficient-wise, as it must interchange coefficients that are spaced over different clock cycles, but it is a simple operation to stream bit-wise, as we must simply rotate all the individual bit chunks. We therefore add stream-reordering blocks that switch a polynomial from coefficient-wise streaming to bit-wise streaming and vice versa. At the same time, we merge the stream-reordering with the folding operation of the negacyclic FFT, which packs coefficients a[i] and a[i + N/2]. The reordering block can be implemented at full throughput either in a R/W memory block or with a simple series of registers and MUXes.
Signed gadget decomposition involves taking unsigned 32-bit coefficients and decomposing them into l signed coefficients of β bits. In hardware, this involves a simple reinterpretation of the bits and a conditional subtraction. We merge this logic at the output of monomial multiplication to take advantage of LUT packing. In the bit-wise streamed representation, these operations must track the propagating carries in flip-flops.
Gadget decomposition is approximate, e.g., for Parameter Set I, l·β = 16-bit < 32-bit. Contrary to software implementations, FPT employs a CMUX datapath that is natively adjusted to approximate gadget decomposition. We prematurely discard bits that would later be rounded off, allowing us to stick to a native 16-bit datapath rather than growing back to 32 bits outside of the external product.
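For illustration, a software model of approximate signed decomposition is given below. It assumes l = 2 digits of β = 8 bits, one split consistent with the l·β = 16 quoted above; the rounding and digit re-centering mirror the conditional-subtraction hardware:

```python
def gadget_decompose(c, l=2, beta=8, q_bits=32):
    """Round a q_bits coefficient to l signed base-2^beta digits
    (most significant first); the low q_bits - l*beta bits are
    rounded off, making the decomposition approximate."""
    drop = q_bits - l * beta
    c = (c + (1 << (drop - 1))) >> drop      # round away the dropped bits
    digits = []
    for _ in range(l):
        d = c & ((1 << beta) - 1)
        c >>= beta
        if d >= 1 << (beta - 1):             # re-center into [-2^(beta-1), 2^(beta-1))
            d -= 1 << beta
            c += 1                           # carry into the next digit
        digits.append(d)
    return digits[::-1]

# Reconstruction matches modulo 2^32 up to half the rounding step.
c = 0x89ABCDEF
recon = sum(d << (32 - 8 * (i + 1)) for i, d in enumerate(gadget_decompose(c)))
err = (recon - c + (1 << 31)) % (1 << 32) - (1 << 31)   # centered mod-2^32 error
assert abs(err) <= 1 << 15
```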
4 COMPACT FIXED-POINT REPRESENTATION
FFT calculations involve irrational (complex) numbers, and approximation errors arise when those numbers are represented with finite precision during computation. However, if enough precision is used, implementations of TFHE tolerate these approximation errors. More specifically, one typically aims for the total approximation error to be lower than the noise inherently present in the FHE calculations.
On a CPU, the typical method is to use floating-point numbers with single or double precision. This is efficient due to the integration of an existing Floating-Point Unit (FPU), and it is therefore the typical representation of choice for software designers. CPU and GPU implementations of TFHE have been restricted to double-precision floating-point FFTs, because single-precision FFTs were found to introduce too much noise to guarantee successful decryption of bootstrapped ciphertexts [11].
In dedicated hardware implementations, FPUs are not natively available and are costly to include. To simplify the implementation and the analysis of the approximation error, some prior implementations opted to change the scheme to work with a prime modulus instead of a power-of-two modulus [45, 62], allowing the use of exact NTTs instead of approximate FFTs for polynomial multiplication. The downside of this approach is that one needs to include costly modular reduction units.
FPT is the first TFHE accelerator to instead utilize fixed-point calculations, which avoids the costly implementation of FPUs or modular reduction units. Moreover, instead of instantiating very large fixed-point calculations to guarantee sufficient accuracy, we conduct an in-depth analysis that optimizes the fixed-point bitwidth to be just large enough so that the approximation noise is smaller than the inherent TFHE noise. FPT's optimized approach, in which there is no need for a costly FPU or modular reduction unit, allows a leaner and more efficient design, at the cost of a one-time engineering effort to find the optimal parameters.
The potential effect of our fixed-point analysis on the area usage of our implementation is illustrated in Figure 4. In this figure, we
Figure 4: Relative FPGA LUT and DSP utilization of a size-256 FFT for various intermediate-variable bitwidths.
plotted the LUT and DSP usage of a size-256 FFT as a function of the bit width of the intermediate variables. The plot gives relative numbers compared to the resource use at bitwidth 53 (loosely corresponding to the significand precision of IEEE 754 double-precision floating-point). As illustrated, reducing the bitwidth of the intermediate variables can result in a large reduction of the resource utilization, with only 20% of the LUT and DSP usage for bitwidths below 24.
Reducing the bitwidth of intermediate variables relies on two parts: the location of the most significant bit and the location of the least significant bit. We will first look at our strategy to set the MSB position of intermediate variables, and then focus on the LSB.
4.1 Setting the MSBs
The location of the most significant bit is important to avoid overflows. If an overflow occurs, the intermediate variable will be completely distorted and thus the result of the calculation will be unusable. Two strategies can be adopted to deal with overflows: a worst-case approach, where one chooses parameters to avoid any overflow, or an average-case approach, where one allows overflow with sufficiently low probability.
Avoiding any overflow comes at a significant enlargement of the parameters and thus at a significant cost, which is why we adopt the strategy of allowing overflows with a maximal overflow probability of P_of = 2^-64. To determine the ideal MSB position, we measure the variance and then assume a Gaussian distribution to calculate the overflow probability. For a given MSB position p_MSB and standard deviation σ, the probability of overflow is:

P_of = P[ |χ| > 2^p_MSB / 2 | χ ← N(0, σ) ]    (7)
     = 1 − erf( 2^p_MSB / (2√2 · σ) ).         (8)
Using this equation, we determine the lowest p_MSB that fulfills the maximal overflow probability of P_of = 2^-64 for each intermediate variable.
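Under the Gaussian assumption, the search for the lowest p_MSB is a one-liner around the complementary error function (a sketch; the measured σ per intermediate variable is an input obtained from simulation):

```python
import math

def min_msb_position(sigma, log2_pof=-64):
    """Smallest p_MSB with P[|x| > 2^p_MSB / 2] <= 2^log2_pof for
    x ~ N(0, sigma). Uses erfc = 1 - erf, which stays accurate where
    1 - erf(t) would round to zero in double precision."""
    p = 0
    while math.erfc(2.0 ** p / (2 * math.sqrt(2) * sigma)) > 2.0 ** log2_pof:
        p += 1
    return p

# A variable measured at sigma = 2^10 needs its MSB at position 15:
# the threshold 2^15 / 2 lies 16 sigma out, comfortably below 2^-64,
# while at p_MSB = 14 the tail probability is only ~2^-50.
assert min_msb_position(2.0 ** 10) == 15
```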
Parameter Set   (I)                    (II)
BK              FixedPoint26( 7, 19)   FixedPoint27( 8, 19)
FFT             FixedPoint29(15, 14)   FixedPoint30(18, 12)
IFFT            FixedPoint29(23,  6)   FixedPoint30(27,  3)
Table 2: Fixed-point data representations used by intermediate variables, in the format FixedPoint_width(integerBits, fractionalBits).
4.2 Setting the LSBs
The position of the least significant bits influences the approximation noise that is introduced during the calculations. This approximation noise can be tolerated up to a certain level. More specifically, the approximation noise should be small enough that the combination of the approximation noise and the inherent TFHE noise still leads to a correct bootstrap with high probability. We divide the total acceptable noise, for which we use the theoretical noise bounds of [11], into two equal parts for the approximation noise and the inherent noise, thus allowing our approximation noise to be at most half the total acceptable noise.
In our design, we focus on three main parameters: the intermediate variable widths during the forward and inverse FFT calculations, and the bitwidth of the coefficients of the bootstrapping key BK. We assume the noise introduced due to each parameter is independent (as each parameter comes from a separate block in our design), which means that the variance of the total noise is equal to the sum of the variances of each noise source (σ²_tot = σ²_FFT + σ²_IFFT + σ²_BK). We then limit the noise variance due to each noise source to 1/3 of the total noise variance.
To find optimal fixed-point parameter values, we perform a parameter sweep by setting the parameters to very high widths (in our example 53), resulting in very low noise, and then iteratively reducing one parameter until it hits the noise threshold while keeping the other parameters at high widths. The result of this experiment can be seen in Fig. 5, and our final fixed-point parameters are listed in Table 2. Note that we give the IFFT data representation before outputs are scaled by 1/N.
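The sweep itself is simple to express. Below is a sketch in which `measure_noise_var` stands in for the actual noise measurement on the generated design; the toy model (variance quadrupling per removed bit) is purely hypothetical, chosen only so the example lands on a plausible width:

```python
def sweep_width(measure_noise_var, param, widths, threshold, floor=8):
    """Lower one parameter's width, with the others pinned at maximum
    precision, and return the smallest width whose measured noise
    variance still meets its share of the budget."""
    w = dict(widths)
    while w[param] > floor:
        w[param] -= 1
        if measure_noise_var(w) > threshold:
            w[param] += 1        # step back to the last passing width
            break
    return w[param]

# Hypothetical model: each removed bit quadruples the noise variance.
model = lambda w: 2.0 ** (-2 * w["fft"])
start = {"fft": 53, "ifft": 53, "bk": 53}
assert sweep_width(model, "fft", start, threshold=2.0 ** -58) == 29
```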
4.3 Related and Future Work
One prior implementation proposing a custom hardware format for TFHE's FFTs is MATCHA [33], which proposes to use (integer) Dyadic-Value-Quantized Twiddle Factors (DVQTFs). Our fixed-point parameter analysis improves on MATCHA's in two key ways.
First, MATCHA only considers the bitwidth of twiddle factors, and sets a uniform bitwidth (either 38-bit or 64-bit) that is employed throughout their external product calculations. Our analysis instead shows that different intermediate variables can profit from different fixed-point representations, allowing for an overall smaller resource utilization (Fig. 5, Table 2). Moreover, our analysis allows us to quantize BK smaller than other parameters, limiting both on-chip BK_i buffers and off-chip bandwidth.
Second, in MATCHA, instead of measuring the noise variance, the authors conduct 10^8 tests for a parameter set to check that there are no decryption failures at the end of bootstrapping. The downside of this approach is that it becomes expensive when multiple
Figure 5: Output approximation noise versus the number of
fractional bits for the representation of the bootstrapping
key and intermediate variables during the forward and in-
verse FFT.
parameters have to be set. Furthermore, this methodology does not give exact values of the failure probability, as one only has the information that no errors were found in 10^8 tests. Our approach of measuring the approximation noise and matching it with the theoretical noise bounds provides for a more rigorous and lean design.
Finally, we note that there are other intermediate variables that could be optimized, for example, the widths of the twiddle factors in the FFT calculations. We heuristically set them to the width of the intermediate variables minus 4, which gave a good balance between failure probability and cost, as also explained in Section 3.4. Interesting future work could include a full search over all possible parameters, which could result in improved fixed-point parameters over our heuristic approach.
5 IMPLEMENTATION
We implemented FPT for a Xilinx Alveo U280 datacenter accelerator FPGA featuring 1.3M LUTs, 2.6M FFs, 9024 DSPs, and 41 MB of on-chip SRAM. For both parameter sets, we employ our FFT-unrolled architecture with a forward FFT streaming width of 128 complex coefficients per clock cycle and an IFFT streaming width of 128/l = 64 complex coefficients per clock cycle. For Parameter Set II with N = 1024, we have also implemented the dotproduct-unrolled architecture with (k+1)·l = 4 forward FFT kernels and (k+1) = 2 IFFT kernels, both of uniform streaming width 32. At this datapoint, providing iso-throughput with the FFT-unrolled architecture, we found that the dotproduct-unrolled architecture incurs 10% more average DSP and LUT usage, and we therefore do not evaluate it further.
Our FFT-unrolled architectures feature massive throughput, completing one CMUX every (256/128)·(k+1)·l = 12 clock cycles for Parameter Set I, and every (512/128)·(k+1)·l = 16 clock cycles for Parameter Set II. The latency of the CMUX is larger: 156 cycles for Parameter Set I and 224 cycles for Parameter Set II. In both cases, we operate at peak throughput by filling the CMUX pipeline with a batch of ciphertexts, of sizes b = 156/12 = 13 and b = 224/16 = 14, respectively.
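These figures follow directly from the streaming widths; the helper below reproduces the numbers quoted above, with (k+1)·l = 6 for Parameter Set I and 4 for Parameter Set II, as implied by the cycle counts:

```python
import math

def cmux_cycles(fft_size, stream_width, kp1_times_l):
    """Initiation interval of the CMUX: the forward FFT streams
    fft_size points at stream_width points/cycle, once for each of
    the (k+1)*l decomposed polynomials."""
    return (fft_size // stream_width) * kp1_times_l

def batch_size(cmux_latency, cycles):
    """Ciphertexts kept in flight to keep the CMUX pipeline full."""
    return math.ceil(cmux_latency / cycles)

# Parameter Set I: size-256 FFT, 156-cycle CMUX latency.
assert cmux_cycles(256, 128, 6) == 12 and batch_size(156, 12) == 13
# Parameter Set II: size-512 FFT, 224-cycle CMUX latency.
assert cmux_cycles(512, 128, 4) == 16 and batch_size(224, 16) == 14
```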
5.1 External I/O
The Alveo U280 includes three different host-FPGA memory interfaces: 32 GB of DDR4, 8 GB of HBM accessed through 32 Pseudo-Channels (PCs), and 24 MB of PLRAM. PBS also requires three host-side inputs: a batch of b input ciphertexts c_in, the long-term bootstrapping key BK, and the test polynomial LUT F to evaluate over the ciphertext.
For the long-term bootstrapping key, we note that it is not absolutely necessary to instantiate a ping-pong BK_i buffer, as discussed in Section 3.2, on our target Alveo U280 FPGA. For our parameter sets and fixed-point-trimmed BK bitwidths, the full BK measures approximately 15 MB and fits entirely in a combination of the on-chip BRAM and URAM. Nevertheless, we instantiate a small ping-pong BK_i cache as a proof of concept. This requires an on-chip ping-pong buffer of only 2/n of the full size of BK, allowing our architecture to remain compute-bound on platforms with less on-chip SRAM, such as smaller FPGAs or heavily memory-trimmed ASICs. Moreover, our technique ensures that our architecture scales to new TFHE algorithms or related schemes like FHEW that increase the size of the bootstrapping key.
For our batch sizes b, the required BK bandwidth is only tens of GB/s, which we provide by splitting BK over a limited number of HBM PCs (0-7), each providing 14 GB/s of bandwidth. The input and output ciphertext batches are small and require negligible bandwidth; we allocate them in a single HBM PC. Each HBM channel is served by a separate AXI master on the PL side; these masters are R/W for the ciphertexts and read-only for BK. For the test polynomial LUT F, we allocate an on-chip RAM that can store a configurable number of test polynomials. Each input ciphertext is tagged with an index of the LUT to apply and, correspondingly, the test polynomial F to select from the RAM as input to the first CMUX iteration. LUTs depend on the specific FHE program, are typically limited in number, and do not change often. For example, bootstrapped Boolean gates require only a single LUT. As such, we keep the RAM small, and we share the same HBM PC and AXI master that is used by the input and output ciphertexts.
5.2 Xilinx Run Time Kernel
FPT is accessible from the host as a Xilinx Run Time (XRT) RTL kernel and is managed through XRT API calls. FPT's CMUX pipeline features 100% utilization during a single ciphertext-batch bootstrap and does not require complex kernel overlapping to reach peak throughput. To ensure that there are no pipeline bubbles between the bootstrapping of different batches, we allow early pre-fetching of the next ciphertext batch into an on-chip FIFO. As such, we build FPT to support the Vitis ap_ctrl_chain kernel block-level control protocol, which permits overlapping kernel executions and allows FPT to queue the base HBM addresses of future ciphertext batches.
                 LUT        FF         DSP        BRAM
                 (40% av.)  (35% av.)  (61% av.)  (25% av.)
FPT (I)          526K       916K       5494       505
  CMUX           384K       707K       5494       310
    MAC (384×)    97K       114K       2304       310
    FFT256,128   159K       366K       2126         0
    IFFT256,64    97K       192K       1064         0
                 (46% av.)  (39% av.)  (66% av.)  (20% av.)
FPT (II)         595K       1024K      5980       412
  CMUX           458K       827K       5980       215
    MAC (256×)    66K        79K       1536       215
    FFT512,128   222K       449K       2958         0
    IFFT512,64   130K       255K       1486         0
Table 3: FPT hardware resource utilization breakdown for Parameter Sets I and II. DSP blocks are the main limiting resource, with up to 66% of the available DSP blocks utilized by FPT.
5.3 Fixed-point Streaming Design in Chisel
While the outer host-FPGA communication logic of FPT is implemented in SystemVerilog, we use Chisel [4] – an open-source HDL embedded in Scala – to construct the inner streaming CMUX kernel. Like SystemVerilog, Chisel is a full-fledged HDL with direct constructs to describe synthesizable combinational and sequential logic, and not a High-Level Synthesis (HLS) language. Our motivation to select Chisel over SystemVerilog for the CMUX is that it makes the full capabilities of the Scala language available to describe circuit generators. We make heavy use of object-oriented and functional programming tools to describe our CMUX streaming architecture for a configurable streaming width, and in both realizations shown in Fig. 2. Moreover, Chisel has a rich type system that is further supported by external libraries. In FPT, the existing DspComplex[FixedPoint] is the main hardware datatype that we use within our architecture. Building on existing FixedPoint test infrastructure that we extended for FPT, our experiments in Section 4 are run directly on the Chisel-generated Verilog rather than on an intermediate fixed-point software model.
6 EVALUATION AND COMPARISON
6.1 Resource Utilization
FPT is implemented using Xilinx Vivado 2022.2 and packaged as an XRT kernel using Vitis 2022.2, targeting a clock frequency of 200 MHz. Table 3 presents a resource utilization breakdown of FPT for both Parameter Sets I and II. In both cases, DSP blocks are the main limiting resource that prevents increasing to the next available streaming width, with up to 66% of the available DSP blocks utilized by FPT. Note that whereas Fig. 2 presented our ping-pong BK buffer as a monolithic memory block, it is physically split into many smaller memory blocks that are placed inside the MAC units that consume them.
6.2 PBS Benchmarks
Table 4 compares FPT quantitatively with a number of prior TFHE baselines. For our CPU baseline, we benchmark single-core PBS in CONCRETE [12] on an Intel Xeon Silver 4208 CPU at 2.1 GHz. A recent ASIC baseline is provided by MATCHA [33], which presents emulations of a 36.96 mm² ASIC in a 16nm PTM process technology. As FPGA baseline, we include a recent architecture of Ye et al. [62], which was developed concurrently with our work and significantly improves on the prior baseline of Gener et al. [26]. We refer to this architecture by the author initials YKP, and we also include in our comparison YKP's benchmarks of cuFHE [15], a GPU-based implementation benchmarked on an NVIDIA GeForce RTX 3090 GPU at 1.7 GHz.
The main design goal of FPT is PBS throughput. Table 4 illustrates the massive PBS throughput that is enabled through FPT's streaming architecture: 937× more than CONCRETE, 7.1× more than YKP, and 2.5× more than MATCHA or cuFHE.
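As a consistency check on those factors, using the Table 4 throughputs and taking CONCRETE's single-core throughput as one PBS per 33 ms:

```python
fpt_tp = {"I": 28.4, "II": 25.0}     # PBS/ms, from Table 4
concrete_tp = 1 / 33                 # single-core CONCRETE, 33 ms/PBS
ykp_tp, matcha_tp = 3.5, 10.0        # best YKP configuration, MATCHA

assert round(fpt_tp["I"] / concrete_tp) == 937
assert round(fpt_tp["II"] / ykp_tp, 1) == 7.1
assert round(fpt_tp["II"] / matcha_tp, 1) == 2.5
```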
In FPT's current instantiation, we did not optimize for latency. As the PCIe and AXI latencies of communicating the input and output ciphertext batches are negligible, FPT's PBS latency is mostly determined by its CMUX pipeline depth. In this work, we kept the CMUX pipeline depth large, fitting b ciphertexts and enabling small off-chip bandwidth through our batched bootstrapping technique. Lower-latency implementations of FPT can opt to decrease the CMUX pipeline depth, requiring either more off-chip bandwidth to load BK or caching the full BK on-chip. Nevertheless, FPT's latency even in its current instantiation is competitive with MATCHA. We note that FPT is estimated at 99 W total on-chip power (FPGA and HBM), offering a similar TP/W as MATCHA (40 W) and significantly more than cuFHE (>200 W) or YKP (50 W).
6.3 Related Work
Qualitatively, FPT makes different design choices than either YKP or MATCHA. MATCHA is built after the classical CPU approach to FHE accelerators. It includes a set of TGGSW clusters with external product cores that operate from a register file. As a result, MATCHA is bottlenecked by data movement and cache memory access conflicts.
YKP is an HLS-based architecture that redefines TFHE to use NTTs, breaking compatibility with existing TFHE libraries like CONCRETE and disabling the fixed-point optimizations of FPT. At the architectural level, YKP includes some concepts also employed by FPT. Similar to FPT, they include a pipelined implementation of the CMUX that processes multiple ciphertext instances. However, unlike FPT, which builds a single streaming CMUX PE with a large and configurable streaming width, YKP implements and instantiates multiple smaller CMUX PEs with inferior TP/A. Each CMUX pipeline instance in YKP includes an SRAM that stores a coefficient of BK_i. However, unlike FPT, where these SRAMs are loaded from off-chip memory in ping-pong fashion, YKP loads coefficients from DRAM only after a full coefficient has been consumed. This limits the number of CMUX PEs they can instantiate to the available off-chip memory bandwidth, whereas FPT's design choices make it entirely compute-bound.
               Parameter Set  Platform (LUT / FF / DSP / BRAM)  Clock (MHz)  Latency (ms)  TP (PBS/ms)
FPT            (I)            526K / 916K / 5494 / 17.5Mb       200          0.48          28.4
               (II)           595K / 1024K / 5980 / 14.5Mb      200          0.58          25.0
YKP [62]       (II)           842K / 662K / 7202 / 338Mb        180          3.76          3.5
               (II)           442K / 342K / 6910 / 409Mb        180          1.88          2.7
MATCHA [33]    (II)           36.96mm², 16nm PTM                2000         0.2           10
CONCRETE [12]  (I)            Intel Xeon Silver 4208            2100         33            0.03
               (II)           Intel Xeon Silver 4208            2100         32            0.03
cuFHE [15]     (II)           NVIDIA GeForce RTX 3090           1700         9.34          9.6
Table 4: Comparison of TFHE PBS on a variety of platforms
Both MATCHA and YKP focus on an algorithmic technique called bootstrapping key unrolling. This technique unrolls m iterations of the Blind Rotation loop (Algorithm 1, Line 2), requiring an (exponentially) more expensive CMUX equation and a larger BK, but reducing the total number of iterations from n to n/m. In FPT's design spirit of maximum throughput, bootstrapping key unrolling is a bad trade-off. Already at m = 2, the adapted CMUX requires 3× more external products and 3× larger bootstrapping keys for only 2× fewer iterations, resulting in inherently smaller overall PBS TP/A. Bootstrapping key unrolling is essential to extract parallelism for designs like MATCHA and YKP, which have many smaller functional units with inferior TP/A. FPT, with its FFT-unrolled architecture and large streaming width, finds ample parallelism and larger TP/A within a single CMUX.
For completeness, we note that both MATCHA and YKP include key-switching as an operation of PBS. Key-switching includes coefficient-wise multiplication of a TLWE ciphertext with a key-switching key. We opted not to include key-switching in FPT, because different FHE programs may choose to key-switch either before or after PBS [11]. Nevertheless, key-switching is an operation with much lower throughput requirements than the CMUX [62]. In FPT, key-switching of the output ciphertext can be supported without throughput penalty (but with slightly increased latency) by instantiating a few integer multipliers on the AXI write-back path.
7 CONCLUSION
In this paper, we introduced FPT, an accelerator for the Torus Fully Homomorphic Encryption (TFHE) scheme. In contrast to previous FHE architectures, our design follows a streaming approach with high throughput and low control overhead. Owing to its batched design and balanced streaming architecture, our accelerator is the first FHE bootstrapping implementation that is compute-bound rather than memory-bound, with small data caches and 100% utilization of the arithmetic units. Instead of using an NTT or a floating-point FFT, FPT achieves a significant throughput increase by utilizing fixed-point FFTs that are up to 80% smaller in area, with compact and optimized variable representations. In the end, FPT achieves a TFHE bootstrapping throughput of 28.4 bootstrappings per millisecond, which is 937× faster than CPU implementations, 7.1× faster than a concurrent FPGA implementation, and 2.5× faster than state-of-the-art ASIC and GPU designs.
ACKNOWLEDGMENTS
This work was supported in part by CyberSecurity Research Flanders with reference number VR20192203, the Research Council KU Leuven (C16/15/058), the Horizon 2020 ERC Advanced Grant (101020005 Belfort), and the AMD Xilinx University Program through the donation of a Xilinx Alveo U280 datacenter accelerator card. Michiel Van Beirendonck is funded by FWO as Strategic Basic (SB) PhD fellow (project number 1SD5621N). Jan-Pieter D'Anvers is funded by FWO (Research Foundation – Flanders) as junior postdoctoral fellow (contract number 133185).
Finally, the authors would like to thank Wouter Legiest for experimenting with a variety of FFT generator tools.
REFERENCES
[1]
Rashmi Agrawal, Leo de Castro, Guowei Yang, Chiraag Juvekar, Rabia Tugce
Yazicigil, Anantha P. Chandrakasan, Vinod Vaikuntanathan, and Ajay Joshi.
2022. FAB: An FPGA-based Accelerator for Bootstrappable Fully Homomorphic
Encryption. CoRR abs/2207.11872 (2022). https://doi.org/10.48550/arXiv.2207.
11872 arXiv:2207.11872
[2]
Alfred V. Aho, John E. Hopcroft, and Jerey D. Ullman. 1974. The Design and
Analysis of Computer Algorithms. Addison-Wesley.
[3]
Michael Armbrust, Armando Fox, Rean Grith, Anthony D. Joseph, Randy H.
Katz, Andy Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica,
and Matei Zaharia. 2010. A view of cloud computing. Commun. ACM 53, 4 (2010),
50–58. https://doi.org/10.1145/1721654.1721672
[4]
Jonathan Bachrach, Huy Vo, Brian C. Richards, Yunsup Lee, Andrew Water-
man, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. 2012. Chisel:
constructing hardware in a Scala embedded language. In The 49th Annual Design
Automation Conference 2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012,
Patrick Groeneveld, Donatella Sciuto, and Soha Hassoun (Eds.). ACM, 1216–1225.
https://doi.org/10.1145/2228360.2228584
[5]
Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun, and Khin Mi Mi Aung.
2018. High-Performance FV Somewhat Homomorphic Encryption on GPUs: An
Implementation using CUDA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2
(2018), 70–95. https://doi.org/10.13154/tches.v2018.i2.70-95
[6]
Daniel J. Bernstein. 2007. The Tangent FFT. In Applied Algebra, Algebraic Al-
gorithms and Error-Correcting Codes, 17th International Symposium, AAECC-17,
Bangalore, India, December 16-20, 2007, Proceedings (Lecture Notes in Computer
Science, Vol. 4851), Serdar Boztas and Hsiao-feng Lu (Eds.). Springer, 291–300.
https://doi.org/10.1007/978-3- 540-77224- 8_34
[7]
Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully
Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory
6, 3 (2014), 13:1–13:36. https://doi.org/10.1145/2633600
[8]
Wei-Hsin Chang and Truong Q. Nguyen. 2008. On the Fixed-Point Accuracy
Analysis of FFT Algorithms. IEEE Trans. Signal Process. 56, 10-1 (2008), 4673–4682.
https://doi.org/10.1109/TSP.2008.924637
Van Beirendonck et al.
[9]
Jung Hee Cheon, Andrey Kim, Miran Kim, and Yong Soo Song. 2017. Homo-
morphic Encryption for Arithmetic of Approximate Numbers. In Advances in
Cryptology - ASIACRYPT 2017 - 23rd International Conference on the Theory and Ap-
plications of Cryptology and Information Security, Hong Kong, China, December 3-7,
2017, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 10624), Tsuyoshi
Takagi and Thomas Peyrin (Eds.). Springer, 409–437. https://doi.org/10.1007/978-
3-319- 70694-8_15
[10]
Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2016.
Faster Fully Homomorphic Encryption: Bootstrapping in Less Than 0.1 Seconds.
In Advances in Cryptology - ASIACRYPT 2016 - 22nd International Conference
on the Theory and Application of Cryptology and Information Security, Hanoi,
Vietnam, December 4-8, 2016, Proceedings, Part I (Lecture Notes in Computer Science,
Vol. 10031), Jung Hee Cheon and Tsuyoshi Takagi (Eds.). 3–33. https://doi.org/
10.1007/978-3- 662-53887- 6_1
[11]
Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020.
TFHE: Fast Fully Homomorphic Encryption Over the Torus. J. Cryptol. 33, 1
(2020), 34–91. https://doi.org/10.1007/s00145-019- 09319-x
[12]
Ilaria Chillotti, Marc Joye, Damien Ligier, Jean-Baptiste Orla, and Samuel Tap.
2020. CONCRETE: Concrete operates on ciphertexts rapidly by extending TfhE.
In WAHC 2020–8th Workshop on Encrypted Computing & Applied Homomorphic
Cryptography, Vol. 15.
[13]
Ilaria Chillotti, Marc Joye, and Pascal Paillier. 2021. Programmable Bootstrapping
Enables Ecient Homomorphic Inference of Deep Neural Networks. In Cyber
Security Cryptography and Machine Learning - 5th International Symposium,
CSCML 2021, Be’er Sheva, Israel, July 8-9, 2021, Proceedings (Lecture Notes in
Computer Science, Vol. 12716), Shlomi Dolev, Oded Margalit, Benny Pinkas, and
Alexander A. Schwarzmann (Eds.). Springer, 1–19. https://doi.org/10.1007/978-
3-030- 78086-9_1
[14]
Ainhoa Cortés, Igone Vélez, Ibon Zalbide, Andoni Irizar, and Juan F. Sevillano.
2008. An FFT Core for DVB-T/DVB-H Receivers. VLSI Design 2008 (2008),
610420:1–610420:9. https://doi.org/10.1155/2008/610420
[15]
Wei Dai and Berk Sunar. 2015. cuHE: A Homomorphic Encryption Accelerator
Library. In Cryptography and Information Security in the Balkans - Second Inter-
national Conference, BalkanCryptSec 2015, Koper, Slovenia, September 3-4, 2015,
Revised Selected Papers (Lecture Notes in Computer Science, Vol. 9540), Enes Pasalic
and Lars R. Knudsen (Eds.). Springer, 169–186. https://doi.org/10.1007/978-3-
319-29172- 7_11
[16]
Leo de Castro, Rashmi Agrawal, Rabia Tugce Yazicigil, Anantha P. Chandrakasan,
Vinod Vaikuntanathan, Chiraag Juvekar, and Ajay Joshi. 2021. Does Fully Homo-
morphic Encryption Need Compute Acceleration? CoRR abs/2112.06396 (2021).
arXiv:2112.06396 https://arxiv.org/abs/2112.06396
[17]
Florent de Dinechin and Bogdan Pasca. 2011. Designing Custom Arithmetic
Data Paths with FloPoCo. IEEE Des. Test Comput. 28, 4 (2011), 18–27. https:
//doi.org/10.1109/MDT.2011.44
[18]
Yarkin Doröz, Erdinç Öztürk, and Berk Sunar. 2015. Accelerating Fully Homo-
morphic Encryption in Hardware. IEEE Trans. Computers 64, 6 (2015), 1509–1521.
https://doi.org/10.1109/TC.2014.2345388
[19] Léo Ducas and Daniele Micciancio. 2015. FHEW: Bootstrapping Homomorphic Encryption in Less Than a Second. In Advances in Cryptology - EUROCRYPT 2015 - 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April 26-30, 2015, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 9056), Elisabeth Oswald and Marc Fischlin (Eds.). Springer, 617–640. https://doi.org/10.1007/978-3-662-46800-5_24
[20] Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptol. ePrint Arch. (2012), 144. http://eprint.iacr.org/2012/144
[21] M. Frigo and S.G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. https://doi.org/10.1109/JPROC.2004.840301
[22] Mario Garrido. 2021. A Survey on Pipelined FFT Hardware Architectures. Journal of Signal Processing Systems (06 Jul 2021). https://doi.org/10.1007/s11265-021-01655-1
[23] Mario Garrido, Jesús Grajal, Miguel A. Sánchez Marcos, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (2013), 23–32. https://doi.org/10.1109/TVLSI.2011.2178275
[24] Mario Garrido, Konrad Möller, and Martin Kumm. 2019. World’s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s. IEEE Trans. Circuits Syst. I Regul. Pap. 66-I, 4 (2019), 1507–1516. https://doi.org/10.1109/TCSI.2018.2886626
[25] Robin Geelen, Michiel Van Beirendonck, Hilder V. L. Pereira, Brian Huffman, Tynan McAuley, Ben Selfridge, Daniel Wagner, Georgios Dimou, Ingrid Verbauwhede, Frederik Vercauteren, and David W. Archer. 2022. BASALISC: Flexible Asynchronous Hardware Accelerator for Fully Homomorphic Encryption. CoRR abs/2205.14017 (2022). https://doi.org/10.48550/arXiv.2205.14017 arXiv:2205.14017
[26] Serhan Gener, Parker Newton, Daniel Tan, Silas Richelson, Guy Lemieux, and Philip Brisk. 2021. An FPGA-based Programmable Vector Engine for Fast Fully Homomorphic Encryption over the Torus. In SPSL: Secure and Private Systems for Machine Learning (ISCA Workshop).
[27] Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, Michael Mitzenmacher (Ed.). ACM, 169–178. https://doi.org/10.1145/1536414.1536440
[28] Craig Gentry and Shai Halevi. 2011. Implementing Gentry’s Fully-Homomorphic Encryption Scheme. In Advances in Cryptology - EUROCRYPT 2011 - 30th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings (Lecture Notes in Computer Science, Vol. 6632), Kenneth G. Paterson (Ed.). Springer, 129–148. https://doi.org/10.1007/978-3-642-20465-4_9
[29] Craig Gentry, Amit Sahai, and Brent Waters. 2013. Homomorphic Encryption from Learning with Errors: Conceptually-Simpler, Asymptotically-Faster, Attribute-Based. In Advances in Cryptology - CRYPTO 2013 - 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (Lecture Notes in Computer Science, Vol. 8042), Ran Canetti and Juan A. Garay (Eds.). Springer, 75–92. https://doi.org/10.1007/978-3-642-40041-4_5
[30] LLC Gisselquist Technology. [n. d.]. A Generic Pipelined FFT Core Generator. https://github.com/ZipCPU/dblclockfft.
[31] Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. 2019. Logistic Regression on Homomorphic Encrypted Data at Scale. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 9466–9471. https://doi.org/10.1609/aaai.v33i01.33019466
[32] Shousheng He and Mats Torkelson. 1996. A New Approach to Pipeline FFT Processor. In Proceedings of IPPS ’96, The 10th International Parallel Processing Symposium, April 15-19, 1996, Honolulu, Hawaii, USA. IEEE Computer Society, 766–770. https://doi.org/10.1109/IPPS.1996.508145
[33] Lei Jiang, Qian Lou, and Nrushad Joshi. 2022. MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus. In DAC ’22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10 - 14, 2022, Rob Oshana (Ed.). ACM, 235–240. https://doi.org/10.1145/3489517.3530435
[34] Marc Joye. 2022. SoK: Fully Homomorphic Encryption over the [Discretized] Torus. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 4 (2022), 661–692. https://doi.org/10.46586/tches.v2022.i4.661-692
[35] Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 4 (2021), 114–148. https://doi.org/10.46586/tches.v2021.i4.114-148
[36] Jongmin Kim, Gwangho Lee, Sangpyo Kim, Gina Sohn, Minsoo Rhu, John Kim, and Jung Ho Ahn. 2022. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse. In 55th IEEE/ACM International Symposium on Microarchitecture, MICRO 2022, Chicago, IL, USA, October 1-5, 2022. IEEE, 1237–1254. https://doi.org/10.1109/MICRO56248.2022.00086
[37] Sangpyo Kim, Jongmin Kim, Michael Jaemin Kim, Wonkyung Jung, John Kim, Minsoo Rhu, and Jung Ho Ahn. 2022. BTS: an accelerator for bootstrappable fully homomorphic encryption. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 711–725. https://doi.org/10.1145/3470496.3527415
[38] Igor Kononenko. 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Medicine 23, 1 (2001), 89–109. https://doi.org/10.1016/S0933-3657(01)00077-X
[39] Joon-Woo Lee, HyungChul Kang, Yongwoo Lee, Woosuk Choi, Jieun Eom, Maxim Deryabin, Eunsang Lee, Junghyun Lee, Donghoon Yoo, Young-Sik Kim, and Jong-Seon No. 2022. Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network. IEEE Access 10 (2022), 30039–30054. https://doi.org/10.1109/ACCESS.2022.3159694
[40] Charles E. Leiserson and James B. Saxe. 1991. Retiming Synchronous Circuitry. Algorithmica 6, 1 (1991), 5–35. https://doi.org/10.1007/BF01759032
[41] Ahmet Can Mert, Aikata, Sunmin Kwon, Youngsam Shin, Donghoon Yoo, Yongwoo Lee, and Sujoy Sinha Roy. 2022. Medha: Microcoded Hardware Accelerator for computing on Encrypted Data. CoRR abs/2210.05476 (2022). https://doi.org/10.48550/arXiv.2210.05476 arXiv:2210.05476
[42] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. [n. d.]. Spiral DFT/FFT IP Core Generator. https://www.spiral.net/hardware/dftgen.html.
[43] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. 2012. Computer Generation of Hardware for Linear Digital Signal Processing Transforms. ACM Trans. Design Autom. Electr. Syst. 17, 2 (2012), 15:1–15:33. https://doi.org/10.1145/2159542.2159547
[44] Mohammed Nabeel, Deepraj Soni, Mohammed Ashraf, Mizan Abraha Gebremichael, Homer Gamil, Eduardo Chielle, Ramesh Karri, Mihai Sanduleanu, and Michail Maniatakos. 2022. CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution. CoRR abs/2204.08742 (2022). https://doi.org/10.48550/arXiv.2204.08742 arXiv:2204.08742
[45] NuCypher. [n. d.]. NuFHE, a GPU-powered Torus FHE implementation. https://github.com/nucypher/nufhe/.
[46] Nicolas Papernot, Patrick D. McDaniel, Arunesh Sinha, and Michael P. Wellman. 2018. SoK: Security and Privacy in Machine Learning. In 2018 IEEE European Symposium on Security and Privacy, EuroS&P 2018, London, United Kingdom, April 24-26, 2018. IEEE, 399–414. https://doi.org/10.1109/EuroSP.2018.00035
[47] Thomas Pöppelmann, Michael Naehrig, Andrew Putnam, and Adrián Macías. 2015. Accelerating Homomorphic Evaluation on Reconfigurable Hardware. In Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9293), Tim Güneysu and Helena Handschuh (Eds.). Springer, 143–163. https://doi.org/10.1007/978-3-662-48324-4_8
[48] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria E. Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and S. S. Iyengar. 2019. A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv. 51, 5 (2019), 92:1–92:36. https://doi.org/10.1145/3234150
[49] M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. 2020. HEAX: An Architecture for Computing on Encrypted Data. In ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1295–1309. https://doi.org/10.1145/3373376.3378523
[50] Ronald L. Rivest, Len Adleman, and Michael L. Dertouzos. 1978. On data banks and privacy homomorphisms. Foundations of Secure Computation 4, 11 (1978), 169–180.
[51] Sujoy Sinha Roy, Furkan Turan, Kimmo Järvinen, Frederik Vercauteren, and Ingrid Verbauwhede. 2019. FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data. In 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, DC, USA, February 16-20, 2019. IEEE, 387–398. https://doi.org/10.1109/HPCA.2019.00052
[52] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald G. Dreslinski, Christopher Peikert, and Daniel Sánchez. 2021. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. In MICRO ’21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18-22, 2021. ACM, 238–252. https://doi.org/10.1145/3466752.3480070
[53] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Nathan Manohar, Nicholas Genise, Srinivas Devadas, Karim Eldefrawy, Chris Peikert, and Daniel Sánchez. 2022. CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 173–187. https://doi.org/10.1145/3470496.3527393
[54] François Serre and Markus Püschel. 2018. A DSL-Based FFT Hardware Generator in Scala. In 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018. IEEE Computer Society, 315–322. https://doi.org/10.1109/FPL.2018.00060
[55] François Serre and Markus Püschel. 2019. DSL-Based Modular IP Core Generators: Example FFT and Related Structures. In 26th IEEE Symposium on Computer Arithmetic, ARITH 2019, Kyoto, Japan, June 10-12, 2019, Naofumi Takagi, Sylvie Boldo, and Martin Langhammer (Eds.). IEEE, 190–191. https://doi.org/10.1109/ARITH.2019.00043
[56] François Serre and Markus Püschel. 2020. DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks. ACM Trans. Reconfigurable Technol. Syst. 13, 1 (2020), 1:1–1:23. https://doi.org/10.1145/3359754
[57] Furkan Turan, Sujoy Sinha Roy, and Ingrid Verbauwhede. 2020. HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA. IEEE Trans. Computers 69, 8 (2020), 1185–1196. https://doi.org/10.1109/TC.2020.2988765
[58] Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, and Berk Sunar. 2012. Accelerating fully homomorphic encryption using GPU. In IEEE Conference on High Performance Extreme Computing, HPEC 2012, Waltham, MA, USA, September 10-12, 2012. IEEE, 1–5. https://doi.org/10.1109/HPEC.2012.6408660
[59] P. Welch. 1969. A fixed-point fast Fourier transform error analysis. IEEE Transactions on Audio and Electroacoustics 17, 2 (1969), 151–157. https://doi.org/10.1109/TAU.1969.1162035
[60] Xilinx. [n. d.]. Vitis DSP Library. https://xilinx.github.io/Vitis_Libraries/dsp.
[61] Xilinx. 2022. Fast Fourier Transform v9.1. LogiCORE IP Product Guide. PG109.
[62] Tian Ye, Rajgopal Kannan, and Viktor K. Prasanna. 2022. FPGA Acceleration of Fully Homomorphic Encryption over the Torus. In IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, September 19-23, 2022. IEEE, 1–7. https://doi.org/10.1109/HPEC55821.2022.9926381
[63] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. 2020. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access 8 (2020), 58443–58469. https://doi.org/10.1109/ACCESS.2020.2983149
[64] Zama. 2022. Announcing Concrete-core v1.0.0-gamma with GPU acceleration. https://www.zama.ai/post/concrete-core-v1-0-gamma-with-gpu-acceleration