FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic
Encryption
Michiel Van Beirendonck, Jan-Pieter D’Anvers, Ingrid Verbauwhede
{firstname.lastname}@esat.kuleuven.be
COSIC KU Leuven
Leuven, Belgium
ABSTRACT
Fully Homomorphic Encryption is a technique that allows computation on encrypted data. It has the potential to drastically change privacy considerations in the cloud, but high computational and memory overheads are preventing its broad adoption. TFHE is a promising Torus-based FHE scheme that heavily relies on bootstrapping, the noise-removal tool that must be invoked after every encrypted gate computation.
We present FPT, a Fixed-Point FPGA accelerator for TFHE bootstrapping. FPT is the first hardware accelerator to heavily exploit the inherent noise present in FHE calculations. Instead of double or single-precision floating-point arithmetic, it implements TFHE bootstrapping entirely with approximate fixed-point arithmetic. Using an in-depth analysis of noise propagation in bootstrapping FFT computations, FPT is able to use noise-trimmed fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs.
FPT’s microarchitecture is built as a streaming processor inspired by traditional streaming DSPs: it instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. We explore different throughput-balanced compositions of streaming kernels with a user-configurable streaming width in order to construct a full bootstrapping pipeline. FPT’s streaming approach allows 100% utilization of arithmetic units and requires only small bootstrapping key caches, enabling an entirely compute-bound bootstrapping throughput of 1 BS / 35 µs. This is in stark contrast to the established classical CPU approach to FHE bootstrapping acceleration, which tends to be heavily memory- and bandwidth-constrained.
FPT is fully implemented and evaluated as a bootstrapping FPGA kernel for an Alveo U280 datacenter accelerator card. FPT achieves almost three orders of magnitude higher bootstrapping throughput than existing CPU-based implementations, and 2.5× higher throughput compared to recent ASIC emulation experiments.
1 INTRODUCTION AND MOTIVATION
Machine Learning (ML), driven by the availability of an abundance of data, has seen rapid advances in recent years [48], leading to new applications from autonomous driving [63] to medical diagnosis [38]. In many applications, ML models are developed by one party, who makes them available to users as a cloud service [3]. The deployment of such applications comes with the risk of privacy breaches, where user data might be leaked, and of IP theft, where users steal the ML model from the developing party [46].
The “silver bullet” solution to prevent the leakage of this data is to encrypt it with Fully Homomorphic Encryption (FHE) [27, 50], a technique that allows one to compute on encrypted data.
Figure 1: FHE allows outsourced computation on data that remains encrypted. The cloud receives encrypted data on which it can compute and the public key (green), but does not receive the secret decryption key (red). The cloud can run computations on the data, but only the client can finally decrypt and obtain the result. Cloud instances with FPGAs enable custom hardware accelerators and have the potential to drastically speed up FHE computations.
Fig. 1 illustrates a possible application of FHE to protect user data
in an ML environment. In this scenario, a client wants to use an
online-server-based ML service, without leaking any sensitive data.
To this end, the client encrypts their data with FHE, before sending
it to the cloud. The cloud service then computes an FHE program
on the encrypted data without obtaining any information about
the input and sends the (still encrypted) result back to the client.
Only the client can nally decrypt and obtain the result.
The drawback of FHE is that it is at the moment still orders of magnitude slower than unencrypted calculation. The first algorithm to calculate an encrypted AND gate took up to 30 minutes to finish [28]. FHE schemes and algorithms have seen significant improvements in recent years; e.g., the recent TFHE scheme computes encrypted AND gates in only 13 ms [10, 11] on a CPU. However, even with these improvements, it is not uncommon to still see slowdown factors of 10,000× compared to calculations on unencrypted data [13, 31, 39], which currently still prevents practical deployment of FHE in many applications.
To work around the speed limitations of FHE, designers have shifted their focus from general-purpose CPUs to more dedicated hardware implementations. Of these dedicated implementations, GPU-based FHE accelerators are the easiest to develop, but they typically provide only modest speedups [5, 15, 35, 58]. At the other end of the spectrum, ASIC emulations in advanced technology nodes promise better FHE acceleration [25, 36, 37, 52, 53]. However, it can take years for these ASICs to be fabricated and become available [44], and they are typically specialized for a limited range of parameter sets. Finally, FPGA-based implementations can be developed more quickly than ASIC implementations, are flexible to change parameter sets, and can be readily deployed in FPGA-equipped cloud instances while boasting large speedups. As a result, they have been a popular target for FHE acceleration [1, 18, 41, 47, 49, 51, 57].
arXiv:2211.13696v1 [cs.CR] 24 Nov 2022
One costly operation in FHE calculations is bootstrapping. All currently available FHE schemes have an inherent noise that increases with each operation. After a certain number of operations, this noise must be reduced to allow further calculations, which is done using the so-called bootstrapping procedure.
Second-generation FHE schemes BFV [20], BGV [7], and CKKS [9] have been the main focus of prior hardware accelerators. These schemes require bootstrapping only after a certain number of operations. For these schemes, bootstrapping is a complex algorithm that requires large data caches [16] and exhibits low arithmetic intensity, and essentially all prior architectures that support bootstrapping have hit the off-chip memory-bandwidth wall [37, 52].
Third-generation schemes like FHEW [19] and its successor Torus FHE (TFHE) [10, 11] have revisited the bootstrapping approach, making it cheaper but inherently linked to homomorphic calculations. In these schemes, most of the homomorphic operations require an immediate bootstrap of the ciphertext. Moreover, bootstrapping in TFHE is a versatile tool that can additionally be “programmed” with an arbitrary function applied to the ciphertext, e.g. non-linear activation functions in ML neural networks [13]. This approach is called Programmable Bootstrapping (PBS), and it constitutes the main cost of TFHE homomorphic calculations. Taking up to 99% of an encrypted gate computation, PBS is a prime target for high-throughput hardware acceleration of TFHE.
In this work, we propose FPT, an FPGA-based accelerator for TFHE Programmable Bootstrapping. FPT achieves a significant speedup over the previous state-of-the-art, which is attributable to two major contributions:

• FPT’s microarchitecture is built as a streaming processor, challenging the established classical CPU approach to FHE bootstrapping accelerators. Inspired by traditional streaming DSPs, FPT instantiates high-throughput computational stages that are directly cascaded, with simplified control logic and routing networks. FPT’s streaming approach allows 100% utilization of arithmetic units during bootstrapping, including tool-generated high-radix and heavily optimized negacyclic FFT units with user-configurable streaming widths. Our streaming architecture is discussed in Section 3.

• FPT (Fixed-Point TFHE) is the first hardware accelerator to extensively optimize the representation of intermediate variables. TFHE PBS is dominated by FFT calculations, which work on irrational (complex) numbers and need to be implemented with sufficient accuracy. Instead of using double floating-point arithmetic or large integers as in previous works, FPT implements PBS entirely with compact fixed-point arithmetic. We analyze in depth the noise due to the compact fixed-point representation that we use inside PBS, and we match it to the noise that is natively present in FHE. Through this analysis, FPT is able to use fixed-point representations that are up to 50% smaller than prior implementations using floating-point or integer FFTs. In turn, these 50% smaller fixed-point representations enable up to 80% smaller FFT kernels. Our fixed-point analysis is discussed in Section 4.
FPT shows, for the rst time, that PBS can remain entirely compute-
bound with only small bootstrapping key data caches. FPT achieves
a massive PBS throughput of 1 PBS / 35
𝜇
s, which requires only mod-
est o-chip memory bandwidth, and is entirely bound by the logic
resources on our target Xilinx Alveo U280 FPGA. This represents
almost three orders of magnitude speedup over the popular TFHE
software library CONCRETE [
12
] on an Intel Xeon Silver 4208 CPU
at 2.1 GHz, a factor 7.1
×
speedup over a concurrently-developed
FPGA architecture [
62
], and a factor 2.5
×
speedup over recent 16nm
ASIC emulation experiments [33].
2 BACKGROUND
This section gives an intuitive idea of the workings of TFHE, with a focus on the Programmable Bootstrapping step that is accelerated by FPT. We refer the reader to [10, 11, 34] for a more in-depth overview of TFHE.
2.1 Torus Fully Homomorphic Encryption
Torus Fully Homomorphic Encryption (TFHE) is a homomorphic encryption scheme based on the Learning With Errors (LWE) problem. It operates on elements that are defined over the real torus $\mathbb{T} = \mathbb{R}/\mathbb{Z}$, i.e. the set $[0, 1)$ of real numbers modulo 1. In practice, Torus elements are discretized as 32-bit or 64-bit integers.
A TFHE ciphertext can be constructed by combining three elements: a secret vector $s$ with $n$ coefficients following a uniform binary distribution $s \leftarrow \mathcal{U}(\mathbb{B}^n)$, a public vector $a \leftarrow \mathcal{U}(\mathbb{T}^n)$ sampled from a uniform distribution, and a small error $e \leftarrow \chi$ from a small distribution $\chi(\mathbb{T})$. A message $\mu \in \mathbb{T}$ can be encrypted as a tuple $c = (a,\, b = a \cdot s + e + \mu) \in \mathbb{T}^{n+1}$. Using the secret $s$, one can decrypt the ciphertext back into (a noisy version of) the message by computing $b - a \cdot s = \mu + e$. This type of ciphertext is called a Torus LWE (TLWE) ciphertext.
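The encryption and decryption equations above can be sketched in a few lines over the discretized torus. This is a toy illustration only: the noise width is chosen for demonstration and is not a secure parameter choice.

```python
# Toy TLWE encrypt/decrypt over the torus discretized as 32-bit integers.
# Illustrative sketch only: the noise width below is an assumption for
# demonstration, NOT a secure parameter from Table 1.
import numpy as np

Q = 1 << 32                      # torus [0,1) scaled to 32-bit integers
n = 586                          # TLWE dimension (Parameter Set I)
rng = np.random.default_rng(0)

def encrypt(mu, s):
    a = rng.integers(0, Q, n, dtype=np.uint64)   # uniform public vector
    e = int(rng.normal(0, 2**20)) % Q            # small Gaussian error (toy width)
    b = (int(a @ s) + e + mu) % Q                # b = a.s + e + mu
    return a, b

def decrypt(c, s):
    a, b = c
    return (b - int(a @ s)) % Q                  # = mu + e (noisy message)

s = rng.integers(0, 2, n, dtype=np.uint64)       # uniform binary secret
mu = Q // 4                                      # message 1/4 on the torus
noisy = decrypt(encrypt(mu, s), s)
# the message sits in the top bits; rounding removes the small error e
assert round(noisy / (Q // 4)) % 4 == 1
```

The final rounding step is exactly why a bounded amount of noise is tolerable: decryption recovers $\mu + e$, and as long as $e$ stays below the message quantization step, rounding recovers $\mu$.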
TFHE additionally describes two variant ciphertexts. First, a generalized version (TGLWE), where $e$ and $\mu$ are polynomials in $\mathbb{T}_N[X] = \mathbb{T}[X]/(X^N + 1)$, and where $a$ and $s$ are vectors of polynomials of the form $\mathbb{T}_N[X]^k$. The TGLWE ciphertext is then similarly formed as a tuple: $c = (a,\, b = a \cdot s + e + \mu) \in \mathbb{T}_N[X]^{k+1}$. The second variant is a generalized version of a GSW [29] ciphertext (TGGSW), which is essentially a matrix where each row is a TGLWE ciphertext: $c \in \mathbb{T}_N[X]^{(k+1)l \times (k+1)}$.
The reason for dening TGLWE and TGGSW ciphertexts is that
they permit a homomorphic multiplication:
TGLWE(𝜇1)TGGSW(𝜇2)=TGLWE(𝜇1·𝜇2),
known as the External Product (
). First, it decomposes each of the
polynomials in the TGLWE ciphertext into
𝑙
polynomials of
𝛽
bits,
an operation termed gadget decomposition. Next, the decomposed
TGLWE ciphertext and TGGSW are multiplied in a
(𝑘+
1
)𝑙
vector
times
(𝑘+
1
)𝑙× (𝑘+
1
)
-matrix product where the elements of this
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
vector and matrix are polynomials in
T𝑁[𝑋]
. The output is again a
TGLWE ciphertext encrypting 𝜇1·𝜇2.
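The gadget decomposition step can be illustrated on a single 32-bit torus element. The signed-digit convention below is one common choice in TFHE implementations, not necessarily FPT's exact one; the parameter names follow Table 1 (Set I).

```python
# Sketch of gadget decomposition: split a 32-bit torus element into
# l signed base-2^beta digits taken from its most-significant bits.
# One common signed-digit convention; an illustrative assumption, not
# FPT's exact implementation.
Q_BITS, beta, l = 32, 8, 2

def decompose(x):
    shift = Q_BITS - l * beta
    x = (x + (1 << (shift - 1))) >> shift    # round away the discarded low bits
    digits = []
    for _ in range(l):
        d = x & ((1 << beta) - 1)
        x >>= beta
        if d >= 1 << (beta - 1):             # centre digit into [-2^(beta-1), 2^(beta-1))
            d -= 1 << beta
            x += 1                           # carry into the next digit
        digits.append(d)
    return digits[::-1]                      # most-significant digit first

def recompose(digits):
    acc = 0
    for i, d in enumerate(digits):
        acc += d << (Q_BITS - (i + 1) * beta)
    return acc % (1 << Q_BITS)

x = 0xDEADBEEF
# recomposition matches x up to the rounding of the discarded low bits
assert abs(recompose(decompose(x)) - x) <= 1 << (Q_BITS - l * beta - 1)
```

Keeping only $l \cdot \beta$ most-significant bits is what makes the external product cheap; the rounding error it introduces is absorbed by the ciphertext noise budget.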
2.2 Programmable Bootstrapping
The main goal of bootstrapping is to reduce the noise in the ciphertext. One way to reduce the ciphertext noise would be to decrypt the ciphertext, after which the noise can be suppressed, but this would not be secure. Bootstrapping does in essence decrypt the ciphertext, but for security reasons this operation is performed homomorphically, inside the encrypted domain. This means that one wants to homomorphically compute $b - a \cdot s = e + \mu$, and more specifically, as it is “programmable” bootstrapping, one wants to additionally compute a function $f(\mu)$ on the data.
To achieve this programmable bootstrapping, one first sets a “test” polynomial $F = \sum_{i=0}^{N-1} f(i) \cdot X^i \in \mathbb{T}_N[X]$ that encodes $N$ relevant values of the function $f$. This polynomial is then rotated by $b - a \cdot s$ positions by calculating $F \cdot X^{-(b - a \cdot s)}$, after which the output of the function can be found in the first position of the resulting polynomial. However, all of these calculations should be done without revealing the value of $s$.
The high-level idea of how to achieve this is to first rewrite the above expression as follows:
$$F \cdot X^{-(b - a \cdot s)} = F \cdot X^{-b} \cdot \prod_{i=1}^{n} X^{a_i \cdot s_i}. \quad (1)$$
This expression can be calculated iteratively. Starting with the polynomial $ACC = F \cdot X^{-b}$, one iteratively calculates:
$$ACC \leftarrow ACC \cdot X^{a_i \cdot s_i}, \quad (2)$$
which can be further rewritten, using the fact that $s_i$ is either zero or one, to:
$$ACC \leftarrow (ACC \cdot X^{a_i} - ACC) \cdot s_i + ACC. \quad (3)$$
However, as we cannot reveal $s_i$, we encode the $s_i$ value in a TGGSW ciphertext $BK_i$, and the $ACC$ value in a TGLWE ciphertext, after which the expression becomes:
$$ACC \leftarrow (ACC \cdot X^{a_i} - ACC) \boxdot BK_i + ACC, \quad (4)$$
using the homomorphic multiplication operation $\boxdot$. Eq. (4) homomorphically multiplexes on the secret value $s_i$, and is known as the Controlled MUX (CMUX).
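The rewrite from Eq. (2) to Eq. (3) relies only on $s_i$ being 0 or 1, and can be sanity-checked in plaintext. The sketch below (purely illustrative, no encryption) implements monomial multiplication as a negacyclic rotation, using $X^N = -1$:

```python
# Plaintext check of the CMUX identity: for s in {0, 1},
#   ACC * X^(a*s) == (ACC * X^a - ACC) * s + ACC.
# Monomial multiplication mod X^N + 1 is a negacyclic rotation.
import numpy as np

N = 8

def mul_monomial(p, t):
    """Compute p(X) * X^t mod X^N + 1 (negacyclic rotation by t)."""
    t %= 2 * N
    sign = 1
    if t >= N:                  # X^N = -1
        t -= N
        sign = -1
    if t == 0:
        return sign * p
    return sign * np.concatenate([-p[N - t:], p[:N - t]])

acc = np.arange(1, N + 1)
a_i = 3
for s_i in (0, 1):
    lhs = mul_monomial(acc, a_i * s_i)                  # Eq. (2)
    rhs = (mul_monomial(acc, a_i) - acc) * s_i + acc    # Eq. (3)
    assert np.array_equal(lhs, rhs)
```

Eq. (4) computes exactly this mux form, but with $s_i$ hidden inside $BK_i$ and the select implemented by the external product.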
Collectively, the dierent TGGSW ciphertexts
𝐵𝐾1, . . . , 𝐵𝐾𝑛
, each
encrypting one secret coecient
𝑠1,·· · , 𝑠𝑛
, are known as the boot-
strapping key. The result of the operations described above is a
TGLWE accumulator
𝐴𝐶𝐶
which is “blindly” rotated with a secret
amount of
𝑏𝑎·𝑠
positions, from which the output TLWE cipher-
text can be straightforwardly extracted. The computations during
PBS are given in Algorithm 1.
FPT implements two parameter sets of TFHE, given in Table 1. Parameter Set I is a parameter set used by the CONCRETE Boolean library with 128-bit security [12]. Parameter Set II is a 110-bit security parameter set that has previously been employed for benchmarking purposes, allowing a direct comparison of FPT with prior work [11, 64].
Algorithm 1: TFHE’s Programmable Bootstrapping
Input: TLWE ciphertext $c_{in} = (a_1, \ldots, a_n, b) \in \mathbb{T}^{n+1}$
Input: TGGSW bootstrapping key $BK = (BK_1, \ldots, BK_n) \in \mathbb{T}_N[X]^{n \times (k+1)l \times (k+1)}$
Input: TGLWE test polynomial LUT $F \in \mathbb{T}_N[X]^{(k+1)}$
Output: TLWE ciphertext $c_{out} \in \mathbb{T}^{kN+1}$
1: $ACC \leftarrow F \cdot X^{-b}$ /* Test Polynomial LUT */
2: for $i \leftarrow 1$ to $n$ do /* Blind Rotation */
3:   $ACC \leftarrow (ACC \cdot X^{\lfloor 2N a_i / q \rceil} - ACC) \boxdot BK_i + ACC$ /* CMUX */
4: end
5: return $c_{out} = \mathrm{SampleExtract}(ACC)$
Parameter Set              (I)    (II)
TLWE dimension $n$         586    500
TGLWE dimension $k$        2      1
Polynomial size $N$        512    1024
Decomp. base log $\beta$   8      10
Decomp. level $l$          2      2

Table 1: Parameter Sets: (I) is a parameter set used by the CONCRETE Boolean library [12] with 128-bit security. (II) is a 110-bit security parameter set popular for benchmarking purposes [11, 64].
2.3 FFT polynomial multiplications
As can be seen in Algorithm 1, TFHE programmable bootstrapping mainly consists of the iterative calculation of the external product $\boxdot$, which is a vector-matrix multiplication where the elements are large polynomials of order $N$. Bootstrapping is therefore dominated by the calculation of the polynomial multiplications.

A schoolbook approach to polynomial multiplication would result in a computational complexity of $O(N^2)$. However, utilizing the convolution theorem, the FFT can be used to compute these polynomial multiplications in time $O(N \log(N))$, as the multiplication of polynomials modulo $X^N - 1$ corresponds to a cyclic convolution of the input vectors. FHE schemes, however, need polynomial multiplications modulo $X^N + 1$, requiring negacyclic FFTs to compute negative-wrapped convolutions. This negacyclic convolution has a period of $2N$, and thus a straightforward implementation would require size-$2N$ FFTs. The cost of the negacyclic FFT on real input data can be reduced using two techniques.
The fact that the FFT computes on complex numbers offers the first opportunity for optimization. Since the input polynomials are purely real and have an imaginary component equal to zero, real-to-complex (r2c) optimized FFTs can be used, which achieve roughly a factor of two improvement in speed and memory usage [21]. This is the approach taken by the TFHE and FHEW software libraries, which compute size-$2N$ r2c FFTs.
A second possible optimization is that negacyclic FFTs, which would have a period and size of $2N$, can instead be computed as a regular FFT with period and size $N$ by using a “twisting” pre-processing step [2]. During twisting, the coefficients of the input polynomial $a$ are multiplied with the powers of the $2N$-th root of unity $\psi = \omega_{2N}$:
$$\hat{a} = (a[0],\, \psi a[1],\, \ldots,\, \psi^{N-1} a[N-1]). \quad (5)$$
After twisting, one can perform multiplication using a regular cyclic FFT on $\hat{a}$, halving the required FFT size to $N$.
While both optimizations are well-known individually, it is less straightforward to combine them. Intuitively, the twisting step is incompatible with the r2c optimization, because it makes the polynomial complex.

We use a third, but not-so-well-known, technique from NuFHE [45] based on the tangent FFT [6]. The crux of this method is to “fold” polynomial coefficients $a[i]$ and $a[i + N/2]$ into a complex number $a[i] + j\,a[i + N/2]$ before applying the twisting step and a subsequent cyclic size-$N/2$ FFT. This quarters the size of the FFT required from the original naive size-$2N$ FFT. We adopt this technique in FPT and use FFTs of size $N/2 = 256$ and $N/2 = 512$ for Parameter Sets I and II (Table 1), respectively.
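The fold-then-twist technique can be sketched with numpy and checked against a schoolbook negacyclic product. The folding maps $a(X) \in \mathbb{R}[X]/(X^N+1)$ to $t(X) \in \mathbb{C}[X]/(X^{N/2} - j)$, and twisting by $\delta = e^{j\pi/N}$ (so that $\delta^{N/2} = j$) turns the product in that ring into a plain cyclic convolution of size $N/2$:

```python
# Negacyclic polynomial multiplication via the fold + twist technique of
# Section 2.3: fold t[i] = a[i] + j*a[i+N/2], twist by the 2N-th root of
# unity, and use a size-N/2 complex cyclic FFT. Checked against schoolbook.
import numpy as np

def negacyclic_mul_folded(a, b):
    N = len(a)
    h = N // 2
    twist = np.exp(1j * np.pi * np.arange(h) / N)    # delta^i, delta^(N/2) = j
    ta = (a[:h] + 1j * a[h:]) * twist                # fold, then twist
    tb = (b[:h] + 1j * b[h:]) * twist
    r = np.fft.ifft(np.fft.fft(ta) * np.fft.fft(tb)) / twist
    return np.concatenate([r.real, r.imag])          # unfold the result

def negacyclic_mul_schoolbook(a, b):
    N = len(a)
    full = np.concatenate([np.convolve(a, b), [0.0]])  # pad to length 2N
    return full[:N] - full[N:]                         # X^(N+k) = -X^k

rng = np.random.default_rng(1)
N = 512                                  # Parameter Set I: FFT size N/2 = 256
a, b = rng.standard_normal(N), rng.standard_normal(N)
assert np.allclose(negacyclic_mul_folded(a, b), negacyclic_mul_schoolbook(a, b))
```

Note that the FFT here operates on only $N/2 = 256$ complex points, a quarter of the naive size-$2N$ transform, which is exactly the saving FPT exploits in hardware.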
3 FPT MICROARCHITECTURE
In this section, we discuss FPT’s microarchitecture. First, we describe how FPT’s architecture is designed as a streaming processor targeting maximum throughput. Next, we detail a batch bootstrapping technique, which significantly reduces FPT’s on-chip caches and off-chip bandwidth. Finally, we present balanced implementations of the various computational stages, which enable 100% utilization of the arithmetic units during FPT’s bootstrapping operation.
3.1 Streaming Processor
FHE accelerators for second-generation schemes have mostly been built after a classical CPU architecture [25, 36, 52]. They include a control unit that executes an instruction set, together with a set of arithmetic Processing Elements (PEs) that support different operations, e.g. ciphertext multiplication, key-switching, or bootstrapping. Different operations utilize different PEs, requiring careful profiling of FHE programs to balance relative PE throughputs and utilization [37, 53].

These accelerators are often memory-bound during bootstrapping, and in order to keep a high utilization level of PEs, an increasing focus is spent on optimizing the memory hierarchy, often including a multi-layer on-chip memory hierarchy with a large ciphertext register file at the lowest level.
FPT challenges this established classical CPU approach to FHE bootstrapping acceleration, and instead adopts a microarchitecture that is inspired by streaming Digital Signal Processors (DSPs). Data flows naturally through FPT’s wide and directly cascaded computational stages, with simplified hard-wired routing paths and without complicated control logic. During FPT’s bootstrapping operation, utilization of arithmetic units is 100%.

As illustrated in Fig. 2, FPT defines only a single fixed PE, the CMUX PE, and instantiates only a single instance of this PE with wide datapaths and massive throughput. Taking advantage of the regular structure of TFHE’s PBS, consisting of $n$ repeated CMUX iterations, this single high-throughput PE suffices to run PBS to completion. The CMUX PE computes a single PBS CMUX iteration, after which its datapath output hard-wires back into its datapath input.
Internally, the CMUX PE computes a fixed sequence of monomial multiplication, gadget decomposition, and polynomial multiply-add operations of the external product. Rather than dividing the CMUX into sub-PEs that are sequenced to run from a register file, FPT builds the CMUX with directly cascaded computational stages. Stages are throughput-balanced in the most conceivably simple way: each stage operates at the same throughput and processes a number of polynomial coefficients per clock cycle that we call the streaming width. Stages are interconnected in a simple fixed pipeline with static latency, avoiding complicated control logic and simplifying routing paths.
FPT is built to achieve maximum PBS throughput. As a general trend that we will detail later (Fig. 3b), the Throughput/Area (TP/A) of computational stages increases together with the streaming width. This motivates FPT to instantiate only a single wide CMUX PE with a high streaming width, as opposed to many CMUX PEs with smaller streaming widths.
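A back-of-envelope model shows how streaming width translates into bootstrapping time. The clock frequency and streaming width below are illustrative assumptions, not FPT's reported configuration; only $n$, $N$, $k$ (Table 1, Set I) and the roughly 35 µs target come from the text.

```python
# Back-of-envelope throughput model for one fully-utilized streaming CMUX PE.
# ASSUMPTIONS: streaming width w and clock f are illustrative choices, not
# FPT's published configuration. n, N, k are Parameter Set I from Table 1.
n, N, k = 586, 512, 2
w = 128                        # assumed streaming width (coefficients/cycle)
f = 200e6                      # assumed clock frequency (Hz)

coeffs_per_cmux = N * (k + 1)              # coefficients of one TGLWE accumulator
cycles_per_pbs = n * coeffs_per_cmux / w   # batching keeps the pipeline full
t_pbs = cycles_per_pbs / f
print(f"{t_pbs * 1e6:.1f} us per bootstrap")   # lands near the 35 us figure
```

The model makes the scaling explicit: doubling the streaming width halves the bootstrap time, provided the pipeline stays full, which is what the batching technique of Section 3.2 guarantees.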
In summary, FPT’s CMUX architecture enables massive PBS throughput by more closely resembling the architecture of a streaming Digital Signal Processor (DSP) than the classical CPU architecture employed by prior FHE processors.
3.2 Batch Bootstrapping
TFHE bootstrapping requires two major inputs: the input ciphertext coefficients $a_1, \ldots, a_n$ and the bootstrapping keys $BK_1, \ldots, BK_n$. Each iteration of the CMUX PE requires one element of both. The ciphertext coefficients $a_i$ are relatively small in size and are therefore easy to accommodate. In contrast, a bootstrapping key coefficient $BK_i \in \mathbb{T}_N[X]^{(k+1)l \times (k+1)}$ is a large matrix of up to tens of kBs. Since the full $BK$ is typically too large to fit entirely on-chip, the $BK_i$ must be loaded from off-chip memory for every iteration. However, at high CMUX throughput levels, the required bandwidth for $BK_i$ could easily exceed 1.0 TB/s. This is larger even than the bandwidth of HBM, and thus poses a memory bottleneck.
We propose a method, termed batch bootstrapping, to amortize loading the bootstrapping key for each iteration. The result is that FPT can operate entirely compute-bound, with modest off-chip bandwidth and small on-chip caches. In contrast, prior FHE processors that supported bootstrapping of second-generation schemes were often bottlenecked by the required memory bandwidth [37, 52]. In fact, a recent architectural analysis of bootstrapping [16] found that it exhibits low arithmetic intensity and requires large caches. Its conclusion was that FHE processors benefit only marginally from bespoke high-throughput arithmetic units. With our design, we show that the situation can be very different for TFHE’s PBS.
In FPT, we solve the memory bottleneck problem as follows. First, due to internal pipelining, the latency of the CMUX will be much larger than its throughput. To operate at peak throughput, FPT processes multiple ciphertexts to keep its CMUX pipeline stages full. Next, we enforce that the different ciphertexts processed concurrently in the CMUX’s pipeline stages arrive in a single batch of size $b$, encrypted under the same $BK$. This ensures that these ciphertexts are at the same CMUX iteration and, as a result, all require the exact same input coefficient $BK_i$.

Figure 2: FPT’s microarchitecture. FPT instantiates only a single PE, the CMUX PE. The CMUX is built with wide, directly cascaded datapaths, targeting massive throughput. In light grey are illustrated two throughput-balanced architectures for the external product (with $k = 1$, $l = 2$): dotproduct-unrolled (left) and FFT-unrolled (right). Host-FPGA communication includes three different interfaces: an input ciphertext FIFO, a ping-pong bootstrapping-key buffer, and a test polynomial $F$ SRAM.
Batch bootstrapping then proceeds as follows. We instantiate a simple BRAM ping-pong buffer that holds two coefficients of $BK$. The CMUX reads $BK_i$ from one half with the required bandwidth of 1.0 TB/s, while the off-chip memory fills $BK_{i+1}$ inside the other half with a bandwidth of $1.0/b$ TB/s. In a technique similar to C-slow retiming [40], we can arbitrarily increase the batch size $b$ by introducing more pipeline registers within the CMUX, without throughput penalty. With a batch size of $b = 16$, the required bandwidth can already be supplied by DDR4 instead of HBM.
Our simple but crucial batch bootstrapping technique exploits locality of reference to decouple the on-chip bandwidth from the off-chip bandwidth. As a result, in our architecture, TFHE’s PBS is entirely compute-bound with only kB-sized caches, no larger than the size of two coefficients of the bootstrapping key.
3.3 Balancing the External Product
The external product ($\boxdot$), computing a vector-matrix negacyclic polynomial product, represents the bulk of the CMUX logic. As discussed before, the polynomial multiplications are performed using an FFT, and thus the $\boxdot$ operations include forward and inverse negacyclic FFT computations, and pointwise dot-products with $BK_i$ (the bootstrapping key $BK_i$ is already in the FFT domain).
In a streaming architecture, it is important to balance the throughputs of processing elements, which is not trivial as the external product includes $(k+1)l$ forward FFTs, but only $(k+1)$ inverse FFT operations. We explore two different throughput-balanced architectures for the external product, shown in light grey in Fig. 2: a dotproduct-unrolled architecture (left) and an FFT-unrolled architecture (right).
The dotproduct-unrolled architecture (left) represents the more obvious choice for parallelism, where we instantiate $l$ times more FFT kernels than IFFT kernels. With the FFT-unrolled architecture on the right, we make a more unconventional choice: we balance throughputs by instantiating the FFT with $l$ times the streaming width of the IFFT. These two architectural trade-offs can be understood as exploiting different types of “loop unrolling” inside the external product. On the left, we first loop-unroll the dot-product before unrolling the FFT, while on the right, we loop-unroll the FFT maximally.
The drawback of the FFT-unrolled architecture is that it is more complex than the dotproduct-unrolled one. First, multiply-add operations must be replaced by MACs, since polynomial coefficients that must be added are now spaced temporally over different clock cycles. Second, the inverse FFT can only start processing once a full MAC has been completed, requiring a Parallel-In Serial-Out (PISO) block that double-buffers the MAC output and matches throughputs. Third and most importantly, FFT blocks can be challenging to unroll and implement for arbitrary throughputs, and supporting two FFT blocks with differing throughputs requires non-negligible extra engineering effort.
The main advantage of the more unconventional FFT-unrolled architecture is that it features fewer FFT kernels, which can therefore utilize higher streaming widths. As we will detail in the next section, this favors the general (and often-neglected) trend of pipelined FFTs, which typically feature significantly higher TP/A as the streaming width increases. At the most extreme end, a fully parallel FFT is a circuit with only constant multiplications and fixed routing paths, featuring up to 300% more throughput per DSP or per LUT on our target FPGA (Fig. 3b). FPT alleviates the extra engineering effort and extra complexity of the FFT-unrolled architecture by extending and optimizing an existing FFT generator tool to support negacyclic FFTs.
3.4 Streaming Negacyclic FFTs
State-of-the-art FHE processors have mostly implemented iterative FFTs or NTTs that process polynomials in multiple passes [1, 25, 41, 49]. In these architectures, it can be difficult to support arbitrary throughputs, as memory conflicts arise when each pass requires data at different strides. Instead, FPT instantiates pipelined FFTs that naturally support a streaming architecture. Pipelined FFT architectures consist of $\log(N)$ stages that are connected in series. The main advantage of these architectures is that they process a continuous flow of data, which lends itself well to a fully streaming external product design.
There are many pipelined FFT architectures that target high throughput and support arbitrary streaming widths; we refer to [22] for a recent survey. Generally, pipelined FFTs cascade two types of units: first, the well-known butterflies with complex twiddle-factor multipliers, and, second, shuffling circuits that compute stride permutations. Pipelined FFTs feature a large design space, with different possible overall architectures, area/precision trade-offs in computing twiddle factor “rotations”, varying radix structures that determine which twiddle factors appear at which stages, and more. As such, they are an excellent target for tool-generated circuits, and we follow this approach for FPT.
Several FFT generator tools have been proposed in the literature. Some IP cores do not offer the massive parallelism and arbitrary streaming widths that we target for FPT [30, 61]. At the other end of the spectrum, a recent generator [24] built on top of FloPoCo [17] can only generate fully-parallel FFTs, instead of supporting arbitrary streaming widths. We synthesized, at different streaming widths, the High-Level Synthesis (HLS) Super Sample Rate (SSR) FFTs included in the Vitis DSP libraries of Xilinx [60], but found that they are outperformed by the RTL Verilog FFTs generated by the Spiral FFT IP Core generator [43]. Unfortunately, Spiral is not open-source and offers only a web interface to its generated RTL [42].
Eventually, we settled on SGen [54–56] as the FFT generator tool that provided the necessary configurability, extensibility, and performance we targeted for FPT. SGen is an open-source generator implemented in Scala that employs concepts introduced in Spiral. It generates arbitrary-streaming-width FFTs through four Intermediate Representations (IRs) with different levels of optimization: an algorithm-level representation SPL, a streaming-block-level representation Streaming-SPL, an acyclic streaming IR, and an RTL-level IR. Apart from the streaming width, SGen features a configurable FFT point size, radix, and hardware arithmetic representations such as fixed-point, IEEE 754 floating-point, or FloPoCo floating-point.
Most importantly, SGen is fully open-source and extensible,
which we make heavy use of to generate streaming FFTs for FPT.
First, we have extended SGen with operators for the forward and inverse twisting step, necessary to support negacyclic FFTs (Section 2.3). Next, we have implemented a set of optimizations aimed at higher precision and better TP/A. In this category, first, we have extended SGen with radix-2^k structures [23, 32], finding that radix-2^4 FFTs are on average 10% smaller than SGen-generated radix-4 or radix-16 FFTs. Second, we replace the schoolbook complex multiplication in SGen, requiring 4 real multiplies and 2 real additions, with a variant of Karatsuba multiplication that is sometimes attributed to Gauss:
X + jY = (A + jB) · (C + jD)
Z = C(A − B)
X = (C − D)B + Z
Y = (C + D)A − Z    (6)
By pre-computing C − D and C + D for the constant twiddle factors, this multiplication requires only 3 real multiplies and 3 adds, saving scarce FPGA DSP units.
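As a standalone sanity check (our illustrative sketch, not FPT code), Eq. (6) can be verified numerically against the schoolbook complex product:

```python
# Numeric check of Eq. (6): Gauss's 3-multiplication form of the complex
# product. For constant twiddles, c - d and c + d are precomputed offline,
# leaving 3 real multiplies and 3 real additions per complex multiplication.

def gauss_complex_mul(a, b, c, d):
    """(a + jb) * (c + jd) -> (x, y) with x + jy the product."""
    z = c * (a - b)
    x = (c - d) * b + z
    y = (c + d) * a - z
    return x, y

ref = complex(1, 2) * complex(3, 4)
assert gauss_complex_mul(1.0, 2.0, 3.0, 4.0) == (ref.real, ref.imag)
```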
Third, we decouple the twiddle bit-width from the input bit-width. This allows us to take advantage of the asymmetric 27×18 multipliers found in FPGA DSP blocks, while, at the same time, it has been found that twiddles can be quantized with approximately four fewer bits without affecting output noise [8, 14].
Finally, as data grows throughout the FFT stages, it must initially be padded with zeros to prevent overflows. We have extended SGen with a scaling schedule that instead divides the data by two whenever the most-significant bit must grow. Since the least-significant bits have mostly accumulated noise [59], scaling increases the precision for a fixed input bit-width. Adding a scaling schedule allows us, on average, to use FFTs with 2-bit-smaller fixed-point intermediate variables while meeting the same precision targets, which proves crucial to efficiently map multipliers to DSP units, as will be detailed later in Section 4, Fig. 4.
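The idea behind the scaling schedule can be illustrated with a toy fixed-point model (ours, not the actual FFT datapath): a tree of pairwise additions where each stage divides by two instead of growing an MSB, discarding only LSBs that carry mostly noise. The sketch computes an average instead of a growing sum, but the width behavior is the point:

```python
import random

def staged_sums(x, frac_bits, scale):
    """Pairwise-add `x` down to one value in fixed point with `frac_bits`
    fractional bits. If `scale`, divide by two after each stage (so the
    result is the average and the word length never grows); otherwise the
    sum needs one extra MSB per stage."""
    q = 1 << frac_bits
    v = [round(t * q) for t in x]           # quantize inputs
    while len(v) > 1:
        v = [a + b for a, b in zip(v[::2], v[1::2])]
        if scale:
            v = [(t + 1) >> 1 for t in v]   # round-to-nearest divide-by-2
    return v[0] / q

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(256)]
exact_avg = sum(x) / len(x)
# 8 stages of halving lose well under one bit of precision each:
assert abs(staged_sums(x, 14, True) - exact_avg) < 2 ** -12
```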
Figure 3a illustrates the resource usage of negacyclic size-256 FFTs produced by our optimized variant of SGen at different streaming widths. To quantify our improvements over SGen, we also add cyclic FFTs both with and without our introduced changes to the tool. Our changes result in significantly fewer logic resources: over 60% fewer DSP blocks are utilized while keeping LUTs comparable. As DSP blocks are the main limiting resource for FPT (Table 3), our optimizations are a key enabler to building FPT with high streaming widths.
Figure 3b illustrates our main motivation to propose the FFT-unrolled architecture for the external product. We plot the relative throughput per area unit (DSPs or LUTs) of tool-generated FFTs for different streaming widths. The trend is clear: FFTs with higher streaming widths feature up to 300% more throughput per DSP or per LUT. Intuitively, as the streaming width increases, FFTs can take more advantage of the native strengths of hardware circuits. First, shuffling circuits with MUXes and storage blocks are replaced with fixed routing paths. Second, twiddle factor multipliers can be specialized to the specific set of twiddles they need to handle, taking advantage of optimized algorithms for Single- or Multiple-Constant Multiplication (SCM, MCM).
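The specialization to constant twiddles can be sketched in a simplistic model (ours; production SCM/MCM algorithms go further, with signed-digit recoding and adder sharing): a constant multiplier reduces to a small shift-and-add network derived from the constant's nonzero bits.

```python
# Simplistic single-constant-multiplication (SCM) model: a multiplier by a
# known constant becomes a shift-and-add network, with no general-purpose
# multiplier needed.

def shift_add_terms(c):
    """Positions of the nonzero bits of a positive constant c."""
    return [i for i in range(c.bit_length()) if (c >> i) & 1]

def const_mul(x, c):
    """Multiply by c using only shifts and adds."""
    return sum(x << i for i in shift_add_terms(c))

assert shift_add_terms(45) == [0, 2, 3, 5]   # 45 = 1 + 4 + 8 + 32
assert const_mul(7, 45) == 7 * 45
```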
3.5 Other operations
Compared to the external product and its streaming FFTs, the remainder of the CMUX (Fig. 2, dark grey) represents mostly simple circuitry: additions, subtractions, gadget decomposition, and monomial multiplication. Whereas the first three can be streamed straightforwardly, monomial multiplication requires special treatment.
Monomial multiplication multiplies the accumulator ACC with the ciphertext-dependent monomial X^(2N·a_i/q). Its effect is to rotate the polynomials of ACC by 2N·a_i/q, and additionally negate those coefficients that wrap around. First, we truncate 2N·a_i/q already in software to limit host-FPGA bandwidth. Next, an efficient architecture for monomial multiplication is a coefficient-wise barrel shifter in log(N) stages.
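A software model of this barrel shifter (our illustrative sketch, not FPT's RTL) multiplies a(X) by X^r in Z[X]/(X^N + 1) using one conditional stage per bit of r:

```python
# Negacyclic barrel shifter: log2(N) conditional fixed-rotation stages,
# plus one negation stage for the top bit of r (X^N = -1, so the rotation
# amount r is taken modulo 2N).

def negacyclic_rotate(a, k):
    """Multiply a(X) by X^k for 0 <= k < N: rotate up by k positions and
    negate the k coefficients that wrap around."""
    n = len(a)
    return [-a[i - k + n] if i < k else a[i - k] for i in range(n)]

def barrel_shifter(a, r):
    n = len(a)
    r %= 2 * n
    if r >= n:                      # X^N = -1: full negation
        a, r = [-t for t in a], r - n
    shift = 1
    while shift < n:                # stages rotate by 1, 2, 4, ..., N/2
        if r & shift:
            a = negacyclic_rotate(a, shift)
        shift <<= 1
    return a

# Cross-check against r repeated multiplications by X.
a = list(range(1, 9))               # N = 8
b = a
for _ in range(11):
    b = [-b[-1]] + b[:-1]           # multiply by X once
assert barrel_shifter(a, 11) == b
```

In hardware, each stage is fixed wiring plus conditional sign flips, so the only logic cost is the per-stage MUX.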
FPT: a Fixed-Point Accelerator for Torus Fully Homomorphic Encryption
Figure 3: FPGA resource utilization (a) and throughput / resource utilization (b) of a size-256 FFT at different streaming widths. At iso-precision, SGen FFTs use 31-bit intermediate variables without a scaling schedule, and our FFTs use 29-bit intermediate variables with scaling.
To stream this operation, we define two streaming approaches: coefficient-wise streaming and bit-wise streaming. In coefficient-wise streaming, different coefficients of a polynomial are spaced temporally over different clock cycles. In bit-wise streaming, all coefficients arrive in parallel within the same clock cycle, but we instead divide different bit chunks of each coefficient over different clock cycles. One can then make a simple observation: a rotation is a difficult permutation to stream coefficient-wise, as it must interchange coefficients that are spaced over different clock cycles, but it is a simple operation to stream bit-wise, as we must simply rotate all the individual bit chunks. We therefore add stream-reordering blocks that switch a polynomial from coefficient-wise streaming to bit-wise streaming and vice versa. At the same time, we merge the stream-reordering with the folding operation of the negacyclic FFT, which packs coefficients a[i] and a[i + N/2]. The reordering block can be implemented at full throughput either in a R/W memory block or with a simple series of registers and MUXes.
Signed gadget decomposition involves taking unsigned 32-bit coefficients and decomposing them into l signed coefficients of β bits. In hardware, this involves a simple reinterpretation of the bits and conditional subtraction. We merge this logic at the output of monomial multiplication to take advantage of LUT packing. In the bit-wise streamed representation, these operations must track the propagating carries in flip-flops.
Gadget decomposition is approximate, e.g., for Parameter Set I, l·β = 16 bits < 32 bits. Contrary to software implementations, FPT employs a CMUX datapath that is natively adjusted to approximate gadget decomposition. We prematurely discard bits that would later be rounded, allowing us to stick to a native 16-bit datapath rather than growing back to 32 bits outside of the external product.
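A bit-level model of this approximate signed decomposition (our sketch for the Parameter Set I shape, l = 2 and β = 8; the actual datapath fuses it with monomial multiplication and keeps the carries in flip-flops) looks as follows:

```python
# Approximate signed gadget decomposition: round the unsigned w-bit
# coefficient to its top l*beta bits, then split into l signed digits in
# [-2^(beta-1), 2^(beta-1)), least-significant digit first.

def gadget_decompose(c, l=2, beta=8, w=32):
    """Returns (digits, carry_out)."""
    drop = w - l * beta
    v = (c + (1 << (drop - 1))) >> drop      # discard the bits that would be rounded away
    mask = (1 << beta) - 1
    digits, carry = [], 0
    for _ in range(l):
        d = (v & mask) + carry
        v >>= beta
        carry = 0
        if d >= 1 << (beta - 1):             # re-center into the signed range
            d -= 1 << beta
            carry = 1
        digits.append(d)
    return digits, carry

digits, carry = gadget_decompose(0x9ABCDEF0)
# The signed digits reconstruct the rounded 16-bit value exactly:
assert sum(d << (8 * i) for i, d in enumerate(digits)) + (carry << 16) \
       == (0x9ABCDEF0 + (1 << 15)) >> 16
assert all(-128 <= d < 128 for d in digits)
```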
4 COMPACT FIXED-POINT REPRESENTATION
FFT calculations involve irrational (complex) numbers, and approximation errors arise when those numbers are represented with finite precision during computation. However, if enough precision is used, implementations of TFHE tolerate these approximation errors. More specifically, one typically aims for the total approximation error to be lower than the noise inherently present in the FHE calculations.
On a CPU, the typical method is to use floating-point numbers with single or double precision. This is efficient due to the integration of an existing Floating-Point Unit (FPU), and it is therefore the typical representation of choice for software designers. CPU and GPU implementations of TFHE have been restricted to double-precision floating-point FFTs because single-precision FFTs were found to introduce too much noise to guarantee successful decryption of bootstrapped ciphertexts [11].
In dedicated hardware implementations, FPUs are not natively available and are costly to include. To simplify the implementation and the analysis of the approximation error, some prior implementations opted to change the scheme to work with a prime modulus instead of a power-of-two modulus [45, 62], allowing the use of exact NTTs instead of approximate FFTs for polynomial multiplication. The downside of this approach is that one needs to include costly modular reduction units.
FPT is the rst TFHE accelerator to instead utilize xed-point
calculations, which avoids the costly implementation of FPUs or
modular reduction units. Moreover, instead of initializing very
large xed-point calculations to guarantee sucient accuracy, we
conduct an in-depth analysis that optimizes the xed-point bitwidth
to be just large enough so that the approximation noise is smaller
than the inherent TFHE noise. FPT’s optimized approach in which
there is no need for a costly FPU or modular reduction unit allows
a more lean and ecient design, coming at the cost of a one-time
engineering eort to nd the optimal parameters.
The potential eect of our xed-point analysis on the area usage
of our implementation is illustrated in Figure 4. In this gure, we
Van Beirendonck et al.
Figure 4: FPGA Relative LUT and DSP utilization of a size-
256 FFT for various intermediate variable bitwidths.
plotted the LUT and DSP usage of a size-256 FFT, in function of
the bit width of the intermediate variables. The plot gives relative
numbers compared to the resource use at bitwidth 53 (loosely corre-
sponding to the signicand precision of IEEE 754 double-precision
oating-point). As illustrated, reducing the bitwidth of the inter-
mediate variables can result in a large reduction of the resource
utilization, with only 20% of the LUT and DSP usage for bitwidths
below 24.
Reduction of the bitwidth of intermediate variables relies on two parts: the location of the most significant bit and the location of the least significant bit. We will first look at our strategy to set the MSB position of intermediate variables, and then focus on the LSB.
4.1 Setting the MSBs
The location of the most significant bit is important to avoid overflows. If an overflow occurs, the intermediate variable will be completely distorted and thus the result of the calculation will be unusable. Two strategies can be adopted to deal with overflows: a worst-case approach, where one chooses parameters to avoid any overflow, or an average-case approach, where one allows overflows with sufficiently low probability.
Avoiding any overow comes at a signicant enlargement of the
parameters and thus at a signicant cost, which is why we adopt
the strategy to avoid overows with a maximal overow probability
of
𝑃𝑜 𝑓 =
2
64
. To determine the ideal MSB position we measure the
variance and then assume a Gaussian distribution to calculate the
overow probability. For a given MSB position
𝑝𝑀𝑆 𝐵
and standard
deviation 𝜎, the probability of overow is:
𝑃𝑜 𝑓 =𝑃[|𝜒|>2𝑝𝑀𝑆 𝐵 /2|𝜒$
N(0, 𝜎 )] (7)
=1erf 2𝑝𝑀𝑆𝐵
22𝜎.(8)
Using this equation we determine the lowest
𝑝𝑀𝑆 𝐵
that fullls the
maximal overow probability of
𝑃𝑜 𝑓 =
2
64
for each intermediate
variable.
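This search can be sketched directly from Eq. (7)-(8); the example σ below is hypothetical, and erfc(x) = 1 − erf(x) is used so the tiny probabilities do not round to zero:

```python
# Pick the smallest MSB position whose overflow probability stays below
# 2^-64, given a measured standard deviation (Section 4.1).
import math

def overflow_prob(p_msb, sigma):
    """P[|X| > 2^p_msb / 2] for X ~ N(0, sigma), i.e. Eq. (8)."""
    return math.erfc(2.0 ** p_msb / (2.0 * math.sqrt(2.0) * sigma))

def min_msb(sigma, p_of=2.0 ** -64):
    """Lowest MSB position meeting the overflow budget."""
    p = 0
    while overflow_prob(p, sigma) > p_of:
        p += 1
    return p

# Example with an assumed measured sigma of 900: the returned position
# meets the budget, while one bit less does not.
p = min_msb(900.0)
assert overflow_prob(p, 900.0) <= 2.0 ** -64 < overflow_prob(p - 1, 900.0)
```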
Parameter Set   (I)                    (II)
BK              FixedPoint26(7, 19)    FixedPoint27(8, 19)
FFT             FixedPoint29(15, 14)   FixedPoint30(18, 12)
IFFT            FixedPoint29(23, 6)    FixedPoint30(27, 3)

Table 2: Fixed-point data representations used by intermediate variables, in the format FixedPointwidth(integerBits, fractionalBits).
4.2 Setting the LSBs
The position of the least significant bits has an influence on the approximation noise that is introduced during the calculations. This approximation noise can be tolerated up to a certain level. More specifically, the approximation noise should be small enough that the combination of the approximation noise and the inherent TFHE noise still leads to a correct bootstrap with high probability. We divide the total acceptable noise, for which we use the theoretical noise bounds of [11], into two equal parts for the approximation noise and the inherent noise, thus allowing our approximation noise to be at most half the total acceptable noise.
In our design, we focus on three main parameters: the intermediate variable widths during the forward and inverse FFT calculations, and the bitwidth of the coefficients of the bootstrapping key BK. We assume the noise introduced due to each parameter is independent (as each parameter comes from a separate block in our design), which means that the variance of the total noise is equal to the sum of the variances of each noise source (σ²_tot = σ²_FFT + σ²_IFFT + σ²_BK). We then limit the noise variance due to each noise source to 1/3 of the total noise variance.
To nd optimal xed-point parameter values, we perform a pa-
rameter sweep by setting the parameters to very high widths (in our
example 53) resulting in very low noise, and then iteratively reduc-
ing one parameter until it hits the noise threshold while keeping the
other parameters at high widths. The result of this experiment can
be seen in Fig. 5, and our nal xed-point parameters are illustrated
in Table 2. Note that we give the IFFT data representation before
outputs are scaled by 1/𝑁.
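The sweep procedure can be sketched as follows (our toy model: the noise source here is plain round-to-nearest quantization and the budget is an arbitrary stand-in, whereas the paper derives its thresholds from the theoretical TFHE noise bounds, one third of the variance per source):

```python
# One-parameter-at-a-time sweep: empirically measure the noise of rounding
# to f fractional bits and lower f until the next reduction would exceed
# the noise budget.
import random

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(10000)]

def noise_std(f):
    """Empirical std of the round-to-nearest error at 2^-f resolution
    (theory: 2^-f / sqrt(12) for uniform rounding error)."""
    q = 2.0 ** -f
    errs = [round(x / q) * q - x for x in xs]
    mean = sum(errs) / len(errs)
    return (sum((e - mean) ** 2 for e in errs) / len(errs)) ** 0.5

def sweep(threshold, start=53):
    """Start from a conservative 53-bit-style width and shave fractional
    bits while the measured noise stays under the threshold."""
    f = start
    while f > 1 and noise_std(f - 1) <= threshold:
        f -= 1
    return f

f = sweep(threshold=1e-6)
assert noise_std(f) <= 1e-6 < noise_std(f - 1)
```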
4.3 Related and Future Work
One prior implementation proposing a custom hardware format for TFHE's FFTs is MATCHA [33], which proposes to use (integer) Dyadic-Value-Quantized Twiddle Factors (DVQTFs). Our fixed-point parameter analysis improves on MATCHA's in two key ways.
First, MATCHA only considers the bitwidth of twiddle factors, and sets a uniform bitwidth (either 38-bit or 64-bit) that is employed throughout their external product calculations. Our analysis instead shows that different intermediate variables can profit from different fixed-point representations, allowing for an overall smaller resource utilization (Fig. 5, Table 2). Moreover, our analysis allows us to quantize BK smaller than other parameters, limiting both on-chip BK_i buffers and off-chip bandwidth.
Second, in MATCHA, instead of measuring the noise variance, the authors conduct 10^8 tests for a parameter set to check that there are no decryption failures at the end of bootstrapping. The downside of this approach is that it becomes expensive when multiple parameters have to be set. Furthermore, this methodology does not give exact values of the failure probability, as one only has the information that no errors were found in 10^8 tests. Our approach of measuring the approximation noise and matching it with the theoretical noise bounds provides for a more rigorous and lean design.

Figure 5: Output approximation noise versus the number of fractional bits for the representation of the bootstrapping key and intermediate variables during the forward and inverse FFT.
Finally, we note that there are other intermediate variables that could be optimized, for example the widths of the twiddle factors in the FFT calculations. We heuristically set them to the width of the intermediate variables minus 4, which gave a good balance between failure probability and cost, as also explained in Section 3.4. Interesting future work could include a full search over all possible parameters, which could result in improved fixed-point parameters over our heuristic approach.
5 IMPLEMENTATION
We implemented FPT for a Xilinx Alveo U280 datacenter accelerator FPGA featuring 1.3M LUTs, 2.6M FFs, 9024 DSPs, and 41 MB of on-chip SRAM. For both parameter sets, we employ our FFT-unrolled architecture with a forward FFT streaming width of 128 complex coefficients per clock cycle, and an IFFT streaming width of 128/l = 64 complex coefficients per clock cycle. For Parameter Set II with N = 1024, we have also implemented the dotproduct-unrolled architecture with (k+1)·l = 4 forward FFT kernels and (k+1) = 2 IFFT kernels, both of uniform streaming width 32. At this datapoint, providing iso-throughput to the FFT-unrolled architecture, we found that the dotproduct-unrolled architecture incurs 10% more average DSP and LUT usage, and we therefore do not evaluate it further.
Our FFT-unrolled architectures feature massive throughput, completing one CMUX every (256/128)·(k+1)·l = 12 clock cycles for Parameter Set I, and every (512/128)·(k+1)·l = 16 clock cycles for Parameter Set II. The latency of the CMUX is larger: 156 cycles for Parameter Set I and 224 cycles for Parameter Set II. In both cases, we operate at peak throughput by filling the CMUX pipeline with a batch of ciphertexts, of sizes b = 156/12 = 13 and b = 224/16 = 14, respectively.
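The scheduling arithmetic above can be reproduced directly. Note that k and l are not stated explicitly in this excerpt; k = 2, l = 2 (Set I) and k = 1, l = 2 (Set II) are inferred from the quoted (k+1)·l products, the MAC counts in Table 3, and the IFFT streaming width of 128/l = 64:

```python
# Initiation interval of the streaming CMUX and the ciphertext batch size
# that fills its pipeline (Section 5 numbers).
import math

def cmux_schedule(fft_size, stream_width, k, l, cmux_latency):
    """Return (cycles per CMUX, ciphertexts in flight)."""
    ii = (fft_size // stream_width) * (k + 1) * l
    batch = math.ceil(cmux_latency / ii)
    return ii, batch

assert cmux_schedule(256, 128, 2, 2, 156) == (12, 13)   # Parameter Set I
assert cmux_schedule(512, 128, 1, 2, 224) == (16, 14)   # Parameter Set II
```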
5.1 External I/O
The Alveo U280 includes three different host-FPGA memory interfaces: 32 GB of DDR4, 8 GB of HBM accessed through 32 Pseudo-Channels (PCs), and 24 MB of PLRAM. PBS also requires three host-side inputs: a batch of b input ciphertexts c_in, the long-term bootstrapping key BK, and the test polynomial LUT F to evaluate over the ciphertext.
For the long-term bootstrapping key, we note that it is not absolutely necessary to instantiate a ping-pong BK_i buffer, as discussed in Section 3.2, on our target Alveo U280 FPGA. For our parameter sets and fixed-point-trimmed BK bitwidths, the full BK measures approximately 15 MB and fits entirely in a combination of the on-chip BRAM and URAM. Nevertheless, we instantiate a small ping-pong BK_i cache as a proof of concept. This requires an on-chip ping-pong buffer of only 2/n of the full size of BK, allowing our architecture to remain compute-bound on architectures with less on-chip SRAM, such as smaller FPGAs or heavily memory-trimmed ASICs. Moreover, our technique ensures that our architecture scales to new TFHE algorithms or related schemes like FHEW that increase the size of the bootstrapping key.
For our batch sizes b, the required BK bandwidth is only tens of GB/s, which we provide by splitting BK over a limited number of HBM PCs 0-7, each providing 14 GB/s of bandwidth. The input and output ciphertext batches are small and require negligible bandwidth, which we allocate in a single HBM PC. Each HBM channel is served by a separate AXI master on the PL side, which is R/W for the ciphertexts and read-only for BK. For the test polynomial LUT F, we allocate an on-chip RAM that can store a configurable number of test polynomials. Each input ciphertext is tagged with an index of the LUT to apply, and correspondingly the test polynomial F to select from the RAM as input to the first CMUX iteration. LUTs depend on the specific FHE program, are typically limited in number, and do not change often. For example, bootstrapped Boolean gates require only a single LUT. As such, we keep the RAM small, and we share the same HBM PC and AXI master that is used by the input and output ciphertexts.
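A back-of-the-envelope check of the "tens of GB/s" claim, under our modeling assumption that the full ~15 MB bootstrapping key streams on-chip exactly once per ciphertext batch:

```python
# BK bandwidth estimate: the full key is read once per batch, so the
# required off-chip bandwidth scales with batches per second.

def bk_bandwidth_gbps(bk_bytes, pbs_per_ms, batch_size):
    batches_per_second = pbs_per_ms * 1000.0 / batch_size
    return bk_bytes * batches_per_second / 1e9

bw = bk_bandwidth_gbps(15e6, 28.4, 13)   # Parameter Set I figures
assert 30.0 < bw < 35.0                  # ~33 GB/s, well under 8 PCs x 14 GB/s
```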
5.2 Xilinx Run Time Kernel
FPT is accessible from the host as a Xilinx Run Time (XRT) RTL kernel and managed through XRT API calls. FPT's CMUX pipeline features 100% utilization during a single ciphertext-batch bootstrap, and does not require complex kernel overlapping to reach peak throughput. To ensure that there are no pipeline bubbles between the bootstrapping of different batches, we allow early pre-fetching of the next ciphertext batch into an on-chip FIFO. As such, we build FPT to support the Vitis ap_ctrl_chain kernel block-level control protocol, which permits overlapping kernel executions and allows FPT to queue future ciphertext-batch base HBM addresses.
                 LUT        FF         DSP    BRAM
                 (40% av.)  (35% av.)  (61% av.)  (25% av.)
FPT (I)          526K       916K       5494   505
  CMUX           384K       707K       5494   310
    MAC (384×)   97K        114K       2304   310
    FFT256,128   159K       366K       2126   0
    IFFT256,64   97K        192K       1064   0
                 (46% av.)  (39% av.)  (66% av.)  (20% av.)
FPT (II)         595K       1024K      5980   412
  CMUX           458K       827K       5980   215
    MAC (256×)   66K        79K        1536   215
    FFT512,128   222K       449K       2958   0
    IFFT512,64   130K       255K       1486   0

Table 3: FPT hardware resource utilization breakdown for Parameter Sets I and II. DSP blocks are the main limiting resource, with up to 66% of available FPGA resources utilized by FPT.
5.3 Fixed-point Streaming Design in Chisel
While the outer host-FPGA communication logic of FPT is implemented in SystemVerilog, we use Chisel [4], an open-source HDL embedded in Scala, to construct the inner streaming CMUX kernel. Like SystemVerilog, Chisel is a full-fledged HDL with direct constructs to describe synthesizable combinational and sequential logic, and not a High-Level Synthesis (HLS) language. Our motivation to select Chisel over SystemVerilog for the CMUX is that it makes the full capabilities of the Scala language available to describe circuit generators. We make heavy use of object-oriented and functional programming tools to describe our CMUX streaming architecture for a configurable streaming width, and in both realizations shown in Fig. 2. Moreover, Chisel has a rich type system that is further supported by external libraries. In FPT, the existing DspComplex[FixedPoint] is the main hardware datatype that we use within our architecture. Building on existing FixedPoint test infrastructure that we extended for FPT, our experiments in Section 4 are run directly on the Chisel-generated Verilog rather than an intermediate fixed-point software model.
6 EVALUATION AND COMPARISON
6.1 Resource Utilization
FPT is implemented using Xilinx Vivado 2022.2 and packaged as an XRT kernel using Vitis 2022.2, targeting a clock frequency of 200 MHz. Table 3 presents a resource utilization breakdown of FPT for both Parameter Sets I and II. In both cases, DSP blocks are the main limiting resource that prevents increasing to the next available streaming width, with up to 66% of available DSP blocks utilized by FPT. Note that whereas Fig. 2 presented our ping-pong BK buffer as a monolithic memory block, it is physically split into many smaller memory blocks that are placed inside the MAC units that consume them.
6.2 PBS Benchmarks
Table 4 compares FPT quantitatively with a number of prior TFHE baselines. For our CPU baseline, we benchmark single-core PBS in CONCRETE [12] on an Intel Xeon Silver 4208 CPU at 2.1 GHz. A recent ASIC baseline is provided by MATCHA [33], which presents emulations of a 36.96 mm² ASIC in a 16nm PTM process technology. As FPGA baseline, we include a recent architecture of Ye et al. [62], which was developed concurrently with our work and significantly improves the prior baseline of Gener et al. [26]. We refer to this architecture by the author initials YKP, and we also include YKP's benchmarks of cuFHE [15], a GPU-based implementation benchmarked on an NVIDIA GeForce RTX 3090 GPU at 1.7 GHz, in our comparison.
The main design goal of FPT is PBS throughput. Table 4 illustrates the massive PBS throughput that is enabled through FPT's streaming architecture: 937× more than CONCRETE, 7.1× more than YKP, and 2.5× more than MATCHA or cuFHE.
In FPT's current instantiation, we did not optimize for latency. As the PCIe and AXI latencies of communicating the input and output ciphertext batches are negligible, FPT's PBS latency is mostly determined by its CMUX pipeline depth. In this work, we kept the CMUX pipeline depth large, fitting b ciphertexts and enabling small off-chip bandwidth through our batched bootstrapping technique. Lower-latency implementations of FPT can opt to decrease the CMUX pipeline depth, requiring either more off-chip bandwidth to load BK or caching the full BK on-chip. Nevertheless, FPT's latency even in its current instantiation is competitive with MATCHA. We note that FPT is estimated at 99 W total on-chip power (FPGA and HBM), offering a similar TP/W as MATCHA (40 W) and significantly more than cuFHE (>200 W) or YKP (50 W).
6.3 Related Work
Qualitatively, FPT makes different design choices than either YKP or MATCHA. MATCHA is built after the classical CPU approach to FHE accelerators. It includes a set of TGGSW clusters with external product cores that operate from a register file. As one result, MATCHA is bottlenecked by data movement and cache memory access conflicts.
YKP is an HLS-based architecture that redefines TFHE to use the NTT, breaking compatibility with existing TFHE libraries like CONCRETE and disabling the fixed-point optimizations of FPT. At the architectural level, YKP includes some concepts also employed by FPT. Similar to FPT, they include a pipelined implementation of the CMUX that processes multiple ciphertext instances. However, unlike FPT, which builds a single streaming CMUX PE with a large and configurable streaming width, YKP implements and instantiates multiple smaller CMUX PEs with inferior TP/A. Each CMUX pipeline instance in YKP includes an SRAM that stores a coefficient of BK_i. However, unlike FPT, where these SRAMs are loaded from off-chip memory in ping-pong fashion, YKP loads coefficients from DRAM only after a full coefficient has been consumed. This limits the number of CMUX PEs they can instantiate to what the off-chip memory bandwidth supports, whereas FPT's design choices make it entirely compute-bound.
               Param. Set  Platform / Resources (LUT / FF / DSP / BRAM)  Clock (MHz)  Latency (ms)  TP (PBS/ms)
FPT            (I)         526K / 916K / 5494 / 17.5Mb                   200          0.48          28.4
               (II)        595K / 1024K / 5980 / 14.5Mb                  200          0.58          25.0
YKP [62]       (II)        842K / 662K / 7202 / 338Mb                    180          3.76          3.5
               (II)        442K / 342K / 6910 / 409Mb                    180          1.88          2.7
MATCHA [33]    (II)        36.96mm² ASIC, 16nm PTM                       2000         0.2           10
CONCRETE [12]  (I)         Intel Xeon Silver 4208                        2100         33            0.03
               (II)        Intel Xeon Silver 4208                        2100         32            0.03
cuFHE [15]     (II)        NVIDIA GeForce RTX 3090                       1700         9.34          9.6

Table 4: Comparison of TFHE PBS on a variety of platforms.
Both MATCHA and YKP focus on an algorithmic technique called bootstrapping key unrolling. This technique unrolls m iterations of the Blind Rotation loop (Algorithm 1, Line 2), requiring an (exponentially) more expensive CMUX equation and a larger BK, but reducing the total number of iterations from n to n/m. In FPT's design spirit of maximum throughput, bootstrapping key unrolling is a bad trade-off. Already at m = 2, the adapted CMUX requires 3× more external products and 3× larger bootstrapping keys for only 2× fewer iterations, resulting in inherently smaller overall PBS TP/A. Bootstrapping key unrolling is essential to extract parallelism for designs like MATCHA and YKP, which have many smaller functional units with inferior TP/A. FPT, with its FFT-unrolled architecture and large streaming width, finds ample parallelism and larger TP/A within a single CMUX.
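The trade-off can be captured in a small cost model (ours, for k = 1, matching the m = 2 figures quoted in the text):

```python
# Bootstrapping-key unrolling cost model for k = 1: an unrolled CMUX over
# m key bits needs 2^m - 1 external products (and key elements), while the
# blind-rotation iteration count only drops by a factor m.

def unroll_cost(m):
    """(products per unrolled CMUX, total external products relative to
    the non-unrolled m = 1 loop)."""
    products = 2 ** m - 1
    return products, products / m

assert unroll_cost(1) == (1, 1.0)
assert unroll_cost(2) == (3, 1.5)   # 1.5x total work for 2x fewer iterations
```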
For completeness, we note that both MATCHA and YKP include key-switching as an operation of PBS. Key-switching includes coefficient-wise multiplication of a TLWE ciphertext with a key-switching key. We opted not to include key-switching in FPT, because different FHE programs may choose to key-switch either before or after PBS [11]. Nevertheless, key-switching is an operation with much lower throughput requirements than the CMUX [62]. In FPT, key-switching of the output ciphertext can be supported without throughput penalty (but with slightly increased latency) by instantiating a few integer multipliers on the AXI write-back path.
7 CONCLUSION
In this paper, we introduced FPT, an accelerator for the Torus Fully Homomorphic Encryption (TFHE) scheme. In contrast to previous FHE architectures, our design follows a streaming approach with high throughput and low control overhead. Owing to a batched design and balanced streaming architecture, our accelerator is the first FHE bootstrapping implementation that is compute-bound and not memory-bound, with small data caches and 100% utilization of the arithmetic units. Instead of using an NTT or floating-point FFT, FPT achieves a significant throughput increase by utilizing up to 80% area-reduced fixed-point FFTs with compact and optimized variable representations. In the end, FPT achieves a TFHE bootstrapping throughput of 28.4 bootstrappings per millisecond, which is 937× faster than CPU implementations, 7.1× faster than a concurrent FPGA implementation, and 2.5× faster than state-of-the-art ASIC and GPU designs.
ACKNOWLEDGMENTS
This work was supported in part by CyberSecurity Research Flanders with reference number VR20192203, the Research Council KU Leuven (C16/15/058), the Horizon 2020 ERC Advanced Grant (101020005 Belfort), and the AMD Xilinx University Program through the donation of a Xilinx Alveo U280 datacenter accelerator card. Michiel Van Beirendonck is funded by FWO as Strategic Basic (SB) PhD fellow (project number 1SD5621N). Jan-Pieter D'Anvers is funded by FWO (Research Foundation Flanders) as junior postdoctoral fellow (contract number 133185).
Finally, the authors would like to thank Wouter Legiest for experimenting with a variety of FFT generator tools.
REFERENCES
[1]
Rashmi Agrawal, Leo de Castro, Guowei Yang, Chiraag Juvekar, Rabia Tugce
Yazicigil, Anantha P. Chandrakasan, Vinod Vaikuntanathan, and Ajay Joshi.
2022. FAB: An FPGA-based Accelerator for Bootstrappable Fully Homomorphic
Encryption. CoRR abs/2207.11872 (2022). https://doi.org/10.48550/arXiv.2207.
11872 arXiv:2207.11872
[2]
Alfred V. Aho, John E. Hopcroft, and Jerey D. Ullman. 1974. The Design and
Analysis of Computer Algorithms. Addison-Wesley.
[3]
Michael Armbrust, Armando Fox, Rean Grith, Anthony D. Joseph, Randy H.
Katz, Andy Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica,
and Matei Zaharia. 2010. A view of cloud computing. Commun. ACM 53, 4 (2010),
50–58. https://doi.org/10.1145/1721654.1721672
[4]
Jonathan Bachrach, Huy Vo, Brian C. Richards, Yunsup Lee, Andrew Water-
man, Rimas Avizienis, John Wawrzynek, and Krste Asanovic. 2012. Chisel:
constructing hardware in a Scala embedded language. In The 49th Annual Design
Automation Conference 2012, DAC ’12, San Francisco, CA, USA, June 3-7, 2012,
Patrick Groeneveld, Donatella Sciuto, and Soha Hassoun (Eds.). ACM, 1216–1225.
https://doi.org/10.1145/2228360.2228584
[5]
Ahmad Al Badawi, Bharadwaj Veeravalli, Chan Fook Mun, and Khin Mi Mi Aung.
2018. High-Performance FV Somewhat Homomorphic Encryption on GPUs: An
Implementation using CUDA. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2018, 2
(2018), 70–95. https://doi.org/10.13154/tches.v2018.i2.70-95
[6]
Daniel J. Bernstein. 2007. The Tangent FFT. In Applied Algebra, Algebraic Al-
gorithms and Error-Correcting Codes, 17th International Symposium, AAECC-17,
Bangalore, India, December 16-20, 2007, Proceedings (Lecture Notes in Computer
Science, Vol. 4851), Serdar Boztas and Hsiao-feng Lu (Eds.). Springer, 291–300.
https://doi.org/10.1007/978-3- 540-77224- 8_34
[7]
Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. 2014. (Leveled) Fully
Homomorphic Encryption without Bootstrapping. ACM Trans. Comput. Theory
6, 3 (2014), 13:1–13:36. https://doi.org/10.1145/2633600
[8]
Wei-Hsin Chang and Truong Q. Nguyen. 2008. On the Fixed-Point Accuracy
Analysis of FFT Algorithms. IEEE Trans. Signal Process. 56, 10-1 (2008), 4673–4682.
https://doi.org/10.1109/TSP.2008.924637
Van Beirendonck et al.
[9] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yong Soo Song. 2017. Homomorphic Encryption for Arithmetic of Approximate Numbers. In Advances in Cryptology - ASIACRYPT 2017 - 23rd International Conference on the Theory and Applications of Cryptology and Information Security, Hong Kong, China, December 3-7, 2017, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 10624), Tsuyoshi Takagi and Thomas Peyrin (Eds.). Springer, 409–437. https://doi.org/10.1007/978-3-319-70694-8_15
[10] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2016. Faster Fully Homomorphic Encryption: Bootstrapping in Less Than 0.1 Seconds. In Advances in Cryptology - ASIACRYPT 2016 - 22nd International Conference on the Theory and Application of Cryptology and Information Security, Hanoi, Vietnam, December 4-8, 2016, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 10031), Jung Hee Cheon and Tsuyoshi Takagi (Eds.). 3–33. https://doi.org/10.1007/978-3-662-53887-6_1
[11] Ilaria Chillotti, Nicolas Gama, Mariya Georgieva, and Malika Izabachène. 2020. TFHE: Fast Fully Homomorphic Encryption Over the Torus. J. Cryptol. 33, 1 (2020), 34–91. https://doi.org/10.1007/s00145-019-09319-x
[12] Ilaria Chillotti, Marc Joye, Damien Ligier, Jean-Baptiste Orfila, and Samuel Tap. 2020. CONCRETE: Concrete operates on ciphertexts rapidly by extending TfhE. In WAHC 2020 - 8th Workshop on Encrypted Computing & Applied Homomorphic Cryptography, Vol. 15.
[13] Ilaria Chillotti, Marc Joye, and Pascal Paillier. 2021. Programmable Bootstrapping Enables Efficient Homomorphic Inference of Deep Neural Networks. In Cyber Security Cryptography and Machine Learning - 5th International Symposium, CSCML 2021, Be’er Sheva, Israel, July 8-9, 2021, Proceedings (Lecture Notes in Computer Science, Vol. 12716), Shlomi Dolev, Oded Margalit, Benny Pinkas, and Alexander A. Schwarzmann (Eds.). Springer, 1–19. https://doi.org/10.1007/978-3-030-78086-9_1
[14] Ainhoa Cortés, Igone Vélez, Ibon Zalbide, Andoni Irizar, and Juan F. Sevillano. 2008. An FFT Core for DVB-T/DVB-H Receivers. VLSI Design 2008 (2008), 610420:1–610420:9. https://doi.org/10.1155/2008/610420
[15] Wei Dai and Berk Sunar. 2015. cuHE: A Homomorphic Encryption Accelerator Library. In Cryptography and Information Security in the Balkans - Second International Conference, BalkanCryptSec 2015, Koper, Slovenia, September 3-4, 2015, Revised Selected Papers (Lecture Notes in Computer Science, Vol. 9540), Enes Pasalic and Lars R. Knudsen (Eds.). Springer, 169–186. https://doi.org/10.1007/978-3-319-29172-7_11
[16] Leo de Castro, Rashmi Agrawal, Rabia Tugce Yazicigil, Anantha P. Chandrakasan, Vinod Vaikuntanathan, Chiraag Juvekar, and Ajay Joshi. 2021. Does Fully Homomorphic Encryption Need Compute Acceleration? CoRR abs/2112.06396 (2021). arXiv:2112.06396 https://arxiv.org/abs/2112.06396
[17] Florent de Dinechin and Bogdan Pasca. 2011. Designing Custom Arithmetic Data Paths with FloPoCo. IEEE Des. Test Comput. 28, 4 (2011), 18–27. https://doi.org/10.1109/MDT.2011.44
[18] Yarkin Doröz, Erdinç Öztürk, and Berk Sunar. 2015. Accelerating Fully Homomorphic Encryption in Hardware. IEEE Trans. Computers 64, 6 (2015), 1509–1521. https://doi.org/10.1109/TC.2014.2345388
[19] Léo Ducas and Daniele Micciancio. 2015. FHEW: Bootstrapping Homomorphic Encryption in Less Than a Second. In Advances in Cryptology - EUROCRYPT 2015 - 34th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Sofia, Bulgaria, April 26-30, 2015, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 9056), Elisabeth Oswald and Marc Fischlin (Eds.). Springer, 617–640. https://doi.org/10.1007/978-3-662-46800-5_24
[20] Junfeng Fan and Frederik Vercauteren. 2012. Somewhat Practical Fully Homomorphic Encryption. IACR Cryptol. ePrint Arch. (2012), 144. http://eprint.iacr.org/2012/144
[21] M. Frigo and S. G. Johnson. 2005. The Design and Implementation of FFTW3. Proc. IEEE 93, 2 (2005), 216–231. https://doi.org/10.1109/JPROC.2004.840301
[22] Mario Garrido. 2021. A Survey on Pipelined FFT Hardware Architectures. Journal of Signal Processing Systems (06 Jul 2021). https://doi.org/10.1007/s11265-021-01655-1
[23] Mario Garrido, Jesús Grajal, Miguel A. Sánchez Marcos, and Oscar Gustafsson. 2013. Pipelined Radix-2^k Feedforward FFT Architectures. IEEE Trans. Very Large Scale Integr. Syst. 21, 1 (2013), 23–32. https://doi.org/10.1109/TVLSI.2011.2178275
[24] Mario Garrido, Konrad Möller, and Martin Kumm. 2019. World’s Fastest FFT Architectures: Breaking the Barrier of 100 GS/s. IEEE Trans. Circuits Syst. I Regul. Pap. 66-I, 4 (2019), 1507–1516. https://doi.org/10.1109/TCSI.2018.2886626
[25] Robin Geelen, Michiel Van Beirendonck, Hilder V. L. Pereira, Brian Huffman, Tynan McAuley, Ben Selfridge, Daniel Wagner, Georgios Dimou, Ingrid Verbauwhede, Frederik Vercauteren, and David W. Archer. 2022. BASALISC: Flexible Asynchronous Hardware Accelerator for Fully Homomorphic Encryption. CoRR abs/2205.14017 (2022). https://doi.org/10.48550/arXiv.2205.14017 arXiv:2205.14017
[26] Serhan Gener, Parker Newton, Daniel Tan, Silas Richelson, Guy Lemieux, and Philip Brisk. 2021. An FPGA-based Programmable Vector Engine for Fast Fully Homomorphic Encryption over the Torus. In SPSL: Secure and Private Systems for Machine Learning (ISCA Workshop).
[27] Craig Gentry. 2009. Fully homomorphic encryption using ideal lattices. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing, STOC 2009, Bethesda, MD, USA, May 31 - June 2, 2009, Michael Mitzenmacher (Ed.). ACM, 169–178. https://doi.org/10.1145/1536414.1536440
[28] Craig Gentry and Shai Halevi. 2011. Implementing Gentry’s Fully-Homomorphic Encryption Scheme. In Advances in Cryptology - EUROCRYPT 2011 - 30th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tallinn, Estonia, May 15-19, 2011. Proceedings (Lecture Notes in Computer Science, Vol. 6632), Kenneth G. Paterson (Ed.). Springer, 129–148. https://doi.org/10.1007/978-3-642-20465-4_9
[29] Craig Gentry, Amit Sahai, and Brent Waters. 2013. Homomorphic Encryption from Learning with Errors: Conceptually-Simpler, Asymptotically-Faster, Attribute-Based. In Advances in Cryptology - CRYPTO 2013 - 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (Lecture Notes in Computer Science, Vol. 8042), Ran Canetti and Juan A. Garay (Eds.). Springer, 75–92. https://doi.org/10.1007/978-3-642-40041-4_5
[30] Gisselquist Technology, LLC. [n. d.]. A Generic Pipelined FFT Core Generator. https://github.com/ZipCPU/dblclockfft
[31] Kyoohyung Han, Seungwan Hong, Jung Hee Cheon, and Daejun Park. 2019. Logistic Regression on Homomorphic Encrypted Data at Scale. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 9466–9471. https://doi.org/10.1609/aaai.v33i01.33019466
[32] Shousheng He and Mats Torkelson. 1996. A New Approach to Pipeline FFT Processor. In Proceedings of IPPS ’96, The 10th International Parallel Processing Symposium, April 15-19, 1996, Honolulu, Hawaii, USA. IEEE Computer Society, 766–770. https://doi.org/10.1109/IPPS.1996.508145
[33] Lei Jiang, Qian Lou, and Nrushad Joshi. 2022. MATCHA: a fast and energy-efficient accelerator for fully homomorphic encryption over the torus. In DAC ’22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10 - 14, 2022, Rob Oshana (Ed.). ACM, 235–240. https://doi.org/10.1145/3489517.3530435
[34] Marc Joye. 2022. SoK: Fully Homomorphic Encryption over the [Discretized] Torus. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2022, 4 (2022), 661–692. https://doi.org/10.46586/tches.v2022.i4.661-692
[35] Wonkyung Jung, Sangpyo Kim, Jung Ho Ahn, Jung Hee Cheon, and Younho Lee. 2021. Over 100x Faster Bootstrapping in Fully Homomorphic Encryption through Memory-centric Optimization with GPUs. IACR Trans. Cryptogr. Hardw. Embed. Syst. 2021, 4 (2021), 114–148. https://doi.org/10.46586/tches.v2021.i4.114-148
[36] Jongmin Kim, Gwangho Lee, Sangpyo Kim, Gina Sohn, Minsoo Rhu, John Kim, and Jung Ho Ahn. 2022. ARK: Fully Homomorphic Encryption Accelerator with Runtime Data Generation and Inter-Operation Key Reuse. In 55th IEEE/ACM International Symposium on Microarchitecture, MICRO 2022, Chicago, IL, USA, October 1-5, 2022. IEEE, 1237–1254. https://doi.org/10.1109/MICRO56248.2022.00086
[37] Sangpyo Kim, Jongmin Kim, Michael Jaemin Kim, Wonkyung Jung, John Kim, Minsoo Rhu, and Jung Ho Ahn. 2022. BTS: an accelerator for bootstrappable fully homomorphic encryption. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 711–725. https://doi.org/10.1145/3470496.3527415
[38] Igor Kononenko. 2001. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Medicine 23, 1 (2001), 89–109. https://doi.org/10.1016/S0933-3657(01)00077-X
[39] Joon-Woo Lee, HyungChul Kang, Yongwoo Lee, Woosuk Choi, Jieun Eom, Maxim Deryabin, Eunsang Lee, Junghyun Lee, Donghoon Yoo, Young-Sik Kim, and Jong-Seon No. 2022. Privacy-Preserving Machine Learning With Fully Homomorphic Encryption for Deep Neural Network. IEEE Access 10 (2022), 30039–30054. https://doi.org/10.1109/ACCESS.2022.3159694
[40] Charles E. Leiserson and James B. Saxe. 1991. Retiming Synchronous Circuitry. Algorithmica 6, 1 (1991), 5–35. https://doi.org/10.1007/BF01759032
[41] Ahmet Can Mert, Aikata, Sunmin Kwon, Youngsam Shin, Donghoon Yoo, Yongwoo Lee, and Sujoy Sinha Roy. 2022. Medha: Microcoded Hardware Accelerator for Computing on Encrypted Data. CoRR abs/2210.05476 (2022). https://doi.org/10.48550/arXiv.2210.05476 arXiv:2210.05476
[42] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. [n. d.]. Spiral DFT/FFT IP Core Generator. https://www.spiral.net/hardware/dftgen.html
[43] Peter A. Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. 2012. Computer Generation of Hardware for Linear Digital Signal Processing Transforms. ACM Trans. Design Autom. Electr. Syst. 17, 2 (2012), 15:1–15:33. https://doi.org/10.1145/2159542.2159547
[44] Mohammed Nabeel, Deepraj Soni, Mohammed Ashraf, Mizan Abraha Gebremichael, Homer Gamil, Eduardo Chielle, Ramesh Karri, Mihai Sanduleanu, and Michail Maniatakos. 2022. CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution. CoRR abs/2204.08742 (2022). https://doi.org/10.48550/arXiv.2204.08742 arXiv:2204.08742
[45] NuCypher. [n. d.]. NuFHE, a GPU-powered Torus FHE implementation. https://github.com/nucypher/nufhe/
[46] Nicolas Papernot, Patrick D. McDaniel, Arunesh Sinha, and Michael P. Wellman. 2018. SoK: Security and Privacy in Machine Learning. In 2018 IEEE European Symposium on Security and Privacy, EuroS&P 2018, London, United Kingdom, April 24-26, 2018. IEEE, 399–414. https://doi.org/10.1109/EuroSP.2018.00035
[47] Thomas Pöppelmann, Michael Naehrig, Andrew Putnam, and Adrián Macías. 2015. Accelerating Homomorphic Evaluation on Reconfigurable Hardware. In Cryptographic Hardware and Embedded Systems - CHES 2015 - 17th International Workshop, Saint-Malo, France, September 13-16, 2015, Proceedings (Lecture Notes in Computer Science, Vol. 9293), Tim Güneysu and Helena Handschuh (Eds.). Springer, 143–163. https://doi.org/10.1007/978-3-662-48324-4_8
[48] Samira Pouyanfar, Saad Sadiq, Yilin Yan, Haiman Tian, Yudong Tao, Maria E. Presa Reyes, Mei-Ling Shyu, Shu-Ching Chen, and S. S. Iyengar. 2019. A Survey on Deep Learning: Algorithms, Techniques, and Applications. ACM Comput. Surv. 51, 5 (2019), 92:1–92:36. https://doi.org/10.1145/3234150
[49] M. Sadegh Riazi, Kim Laine, Blake Pelton, and Wei Dai. 2020. HEAX: An Architecture for Computing on Encrypted Data. In ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020, James R. Larus, Luis Ceze, and Karin Strauss (Eds.). ACM, 1295–1309. https://doi.org/10.1145/3373376.3378523
[50] Ronald L. Rivest, Len Adleman, Michael L. Dertouzos, et al. 1978. On data banks and privacy homomorphisms. Foundations of Secure Computation 4, 11 (1978), 169–180.
[51] Sujoy Sinha Roy, Furkan Turan, Kimmo Järvinen, Frederik Vercauteren, and Ingrid Verbauwhede. 2019. FPGA-Based High-Performance Parallel Architecture for Homomorphic Computing on Encrypted Data. In 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, DC, USA, February 16-20, 2019. IEEE, 387–398. https://doi.org/10.1109/HPCA.2019.00052
[52] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Srinivas Devadas, Ronald G. Dreslinski, Christopher Peikert, and Daniel Sánchez. 2021. F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. In MICRO ’21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual Event, Greece, October 18-22, 2021. ACM, 238–252. https://doi.org/10.1145/3466752.3480070
[53] Nikola Samardzic, Axel Feldmann, Aleksandar Krastev, Nathan Manohar, Nicholas Genise, Srinivas Devadas, Karim Eldefrawy, Chris Peikert, and Daniel Sánchez. 2022. CraterLake: a hardware accelerator for efficient unbounded computation on encrypted data. In ISCA ’22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18 - 22, 2022, Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 173–187. https://doi.org/10.1145/3470496.3527393
[54] François Serre and Markus Püschel. 2018. A DSL-Based FFT Hardware Generator in Scala. In 28th International Conference on Field Programmable Logic and Applications, FPL 2018, Dublin, Ireland, August 27-31, 2018. IEEE Computer Society, 315–322. https://doi.org/10.1109/FPL.2018.00060
[55] François Serre and Markus Püschel. 2019. DSL-Based Modular IP Core Generators: Example FFT and Related Structures. In 26th IEEE Symposium on Computer Arithmetic, ARITH 2019, Kyoto, Japan, June 10-12, 2019, Naofumi Takagi, Sylvie Boldo, and Martin Langhammer (Eds.). IEEE, 190–191. https://doi.org/10.1109/ARITH.2019.00043
[56] François Serre and Markus Püschel. 2020. DSL-Based Hardware Generation with Scala: Example Fast Fourier Transforms and Sorting Networks. ACM Trans. Reconfigurable Technol. Syst. 13, 1 (2020), 1:1–1:23. https://doi.org/10.1145/3359754
[57] Furkan Turan, Sujoy Sinha Roy, and Ingrid Verbauwhede. 2020. HEAWS: An Accelerator for Homomorphic Encryption on the Amazon AWS FPGA. IEEE Trans. Computers 69, 8 (2020), 1185–1196. https://doi.org/10.1109/TC.2020.2988765
[58] Wei Wang, Yin Hu, Lianmu Chen, Xinming Huang, and Berk Sunar. 2012. Accelerating fully homomorphic encryption using GPU. In IEEE Conference on High Performance Extreme Computing, HPEC 2012, Waltham, MA, USA, September 10-12, 2012. IEEE, 1–5. https://doi.org/10.1109/HPEC.2012.6408660
[59] P. Welch. 1969. A fixed-point fast Fourier transform error analysis. IEEE Transactions on Audio and Electroacoustics 17, 2 (1969), 151–157. https://doi.org/10.1109/TAU.1969.1162035
[60] Xilinx. [n. d.]. Vitis DSP Library. https://xilinx.github.io/Vitis_Libraries/dsp
[61] Xilinx. 2022. Fast Fourier Transform v9.1. LogiCORE IP Product Guide. PG109.
[62] Tian Ye, Rajgopal Kannan, and Viktor K. Prasanna. 2022. FPGA Acceleration of Fully Homomorphic Encryption over the Torus. In IEEE High Performance Extreme Computing Conference, HPEC 2022, Waltham, MA, USA, September 19-23, 2022. IEEE, 1–7. https://doi.org/10.1109/HPEC55821.2022.9926381
[63] Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. 2020. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access 8 (2020), 58443–58469. https://doi.org/10.1109/ACCESS.2020.2983149
[64] Zama. 2022. Announcing Concrete-core v1.0.0-gamma with GPU acceleration. https://www.zama.ai/post/concrete-core-v1-0-gamma-with-gpu-acceleration