KaLi: A Crystal for Post-Quantum Security using
Kyber and Dilithium
Aikata Aikata, Ahmet Can Mert, Malik Imran, Samuel Pagliarini, Sujoy Sinha Roy
Abstract—Quantum computers pose a threat to the security of communications over the internet. This imminent risk has led to the standardization of cryptographic schemes for protection in a post-quantum scenario. We present a design methodology for future implementations of such algorithms. It is demonstrated using the NIST-selected digital signature scheme CRYSTALS-Dilithium and key encapsulation scheme CRYSTALS-Kyber. A unified architecture, KaLi, is proposed that can perform key generation, encapsulation, decapsulation, signature generation, and signature verification for all the security levels of CRYSTALS-Dilithium and CRYSTALS-Kyber. A unified yet flexible polynomial arithmetic unit is designed that can process Kyber operations twice as fast as Dilithium operations. Efficient memory management is proposed to achieve optimal latency. KaLi is explicitly tailored for ASIC platforms using multiple clock domains. On ASIC 28nm/65nm technology, it occupies 0.263/1.107 mm$^2$ and achieves a clock frequency of 2GHz/560MHz for the fast clock used for the memory unit. On a Xilinx Zynq UltraScale+ ZCU102 FPGA, the proposed architecture uses 23,277 LUTs, 9,758 DFFs, 4 DSPs, and 24 BRAMs at a 270 MHz clock frequency. KaLi performs better than the standalone implementations of either of the two schemes. This is the first work to provide a unified design in hardware for both schemes.
Index Terms—CRYSTALS-Dilithium, CRYSTALS-Kyber,
Cryptoprocessor, NIST PQC Standardized
I. INTRODUCTION
COMMUNICATION over the internet forms the backbone of the digitalized world. Every communication packet passes through various insecure channels and untrusted servers before reaching the destination. Data and communication leaks in the past led to the development of public key cryptographic (PKC) schemes to ensure end-to-end security and privacy of communication. These schemes rely on hard problems such as the discrete logarithm and integer factorization. In 1994, Peter Shor [1] proposed an algorithm that lets a sufficiently powerful quantum computer solve them in polynomial (realistic) time, thus breaking the classical PKC schemes. Since then, the past eighteen years have witnessed a giant leap in the development of quantum computers. In 2019, Google claimed quantum supremacy by developing a 53-qubit quantum computer, Sycamore [2]. Sycamore could solve a task in 200 seconds which would take a classical computer 10,000 years. Various labs across the world have developed even stronger quantum computers [3]. This raises the existential question of whether our communication packets containing emails, passwords, etc., are already insecure. The answer is yes. Even though the quantum computers built until now are not strong enough to break classical public key cryptography, emails and passwords sent now can be stored and decrypted later.
Aikata Aikata, Ahmet Can Mert, and Sujoy Sinha Roy are affiliated with the Institute of Applied Information Processing and Communications, Graz University of Technology, Graz, Austria. Their work was supported in part by Semiconductor Research Corporation (SRC) and the State Government of Styria, Austria, Department Zukunftsfonds Steiermark. {aikata, ahmet.mert, sujoy.sinharoy}@iaik.tugraz.at
Malik Imran and Samuel Pagliarini are with the Centre for Hardware Security, Tallinn University of Technology, Tallinn, Estonia. Their work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund. It was also partially supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 952252 (SAFEST). {malik.imran, samuel.pagliarini}@taltech.ee
This inevitable breach of security paved the way for the development of post-quantum secure PKC schemes based on hard problems that remain hard in a post-quantum scenario. Several standardization efforts were launched to select the best PKC candidates for digital signature and key-encapsulation algorithms [4]. Key encapsulation schemes allow the communicating parties to agree on the same key securely, which can then be used for symmetric key-based encryption and decryption of messages, thus ensuring the security and privacy of the communications. A digital signature scheme allows the receiver to verify the authenticity of the messages. Both these schemes will replace the classical PKC schemes in various applications, like the TLS networking protocol.
These standardizations have now concluded, and the industry is starting to gear up toward implementing the standardized candidates. After finalizing the implementations, a transition phase will start for all the devices to switch from classical to post-quantum secure PKC schemes [5]. This transition will not only take years but also lead to a large amount of wastage in terms of chips and hardware resources that become obsolete. However, now that we know that change is inevitable, and that what we believe to be secure now might again be broken in the next 10-20 years, there is an urgent need for a design methodology for future implementations to prevent loss of time and hardware resources.
This work proposes a design methodology that covers three vital aspects for future implementations of PKC algorithms. The first is the need for a unified design. A majority of PKC applications require both digital signature and key encapsulation schemes. Therefore, design decisions should help unify the two algorithms to save area via resource sharing. Secondly, the design must be compact. The new PKC schemes require much larger memory and logic units to store and process the keys. If we do not attempt to make these designs compact, many resource-constrained CPUs that were designed for classical PKC schemes will be rendered inoperable. The final aspect is agility/flexibility. The architecture design should consider the ever-evolving nature of these algorithms. This will not only help prevent the huge wastage of hardware resources but also enable a smooth transition.
To exhibit the practical advantages of this design methodology, we take the NIST-finalized lattice-based digital signature scheme CRYSTALS-Dilithium [6] and key-encapsulation scheme CRYSTALS-Kyber [7]. We unify the extensive building blocks of these two CRYSTALS schemes and call the resultant architecture KaLi. The design choices for KaLi favor reducing area over improving performance. As a step toward agility, KaLi is modeled as an instruction-set cryptoprocessor. From here on, we will refer to CRYSTALS-Dilithium as Dilithium, and CRYSTALS-Kyber as Kyber.
Prior works in the literature propose hardware implementations of PKC schemes. Most of them focus on efficient standalone implementations [8]–[23]. Real-life applications would require both types of schemes. Therefore, these works fail to provide complete area and timing results for implementations that make the communication post-quantum secure. The authors in [24], [25] present hardware/software (HW/SW) co-designs for Dilithium and Kyber. Since they keep some part of the design in software, they do not provide a good estimate for hardware-only architectures. There is a need for a unified implementation of these two types of schemes completely in hardware to get better performance. We show how KaLi follows the proposed design methodology and performs better than the state-of-the-art.
Our contributions can be summarized as follows:
1) Polynomial multiplication is the most computation-intensive operation in the Dilithium signature scheme and the Kyber encapsulation scheme. We propose a compact polynomial multiplier architecture that works optimally for the two cryptographic algorithms. Dilithium has a 23-bit prime modulus, whereas Kyber has a 12-bit prime modulus. A unified polynomial arithmetic unit is designed for both Dilithium and Kyber to save time and area. This unit has a 24-bit datapath. The core operations (addition, subtraction, multiplication, and modular reduction) are made flexible to process either two sets of 12-bit Kyber coefficients or one set of 23-bit Dilithium coefficients. This, in combination with efficient memory management, enables performing arithmetic operations for Kyber twice as fast as for Dilithium.
2) We customized the Keccak-based SHA-SHAKE and pseudo-random number generation unit to make an efficient sampling unit for both Dilithium and Kyber. The samplers for both schemes are unified and added into the Keccak block to prevent redundant writing and reading of pseudo-random numbers. The remaining primitive building blocks of Dilithium and Kyber are designed discretely while ensuring low area consumption, simplicity, and flexibility. The proposed arithmetic units altogether form the unified cryptoprocessor KaLi. It can perform key generation, encapsulation, decapsulation, signature generation, and signature verification operations for all security levels of Dilithium and Kyber. This is the first work that implements a unified cryptoprocessor for Kyber and Dilithium solely in hardware.
3) We propose an instruction set architecture for flexibility. The instructions are divided into two sets, and KaLi can run instructions from these two sets in parallel, thus improving the latency while keeping the area consumption low. This leads to a 35% reduction in run-time.
4) KaLi is engineered separately for the ASIC platform to reduce area overhead. It uses two clock domains, where the memory unit works at a higher clock frequency than the logic unit. This allowed us to use single-port memory instead of the dual-port memory used in the FPGA implementation, thus reducing the area consumption.
The paper is organized as follows. Section II provides
a high-level overview of Dilithium and Kyber. The major
contributions of the paper are described in Section III. It
includes the design methodology for implementing the PQC
schemes and implementation details. In Section IV, we give
the results and compare them with the existing works in
the literature, and add benchmarking estimates. Section V
concludes our paper.
II. PRELIMINARIES
Kyber and Dilithium are part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS) and were recently selected for standardization by the American National Institute of Standards and Technology (NIST). Kyber’s security relies on the hardness of solving learning-with-errors in module lattices (MLWE), while Dilithium’s security is based on the MLWE and Short Integer Solution (SIS) problems. The polynomials and algebraic operations are assumed to be over the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. For Dilithium, $n = 256$ and $q = 8380417 = 2^{23} - 2^{13} + 1$, and for Kyber, $n = 256$ and $q = 3329 = 2^{12} - 3 \cdot 2^{8} + 1$. Next, we give a brief overview of these schemes and their building blocks.
A. Dilithium
This digital signature scheme has three main algorithms: key
generation, signature generation, and signature verification.
The sender generates a public and secret key using the key
generation algorithm. Then he uses his private key to sign
a message using the signature generation algorithm. The
receiver can verify the signature using the sender’s public key
and signature verification algorithm. The signature generation
algorithm continues to generate a signature until a valid
signature is generated. For a signature to be valid, a set of
constraints have to be satisfied to ensure that the signature does
not bear any similarity with the message. Readers may refer
to [26] for the original specification of Dilithium. Dilithium
has three variants for NIST security levels 2, 3, and 5. Several
building blocks used by these algorithms are explained below.
• Polynomial generation: SHAKE-128 is used to generate the polynomials of the public matrix $\mathbf{A} \in R_q^{k \times \ell}$ by expanding the seed $\rho \in \{0,1\}^{256}$ along with 16-bit nonce values. The secret polynomial vectors $\mathbf{s}_1$ and $\mathbf{s}_2$, with $(\mathbf{s}_1, \mathbf{s}_2) \in S_\eta^{\ell} \times S_\eta^{k}$, are generated using SHAKE-256. For each polynomial, the seed $\varsigma$ and a 16-bit nonce are fed to SHAKE-256 and passed through rejection sampling in the range $\{-\eta, \eta\}$. The two types of generation are named ExpandA() and ExpandS(). ExpandMask() is used to generate a polynomial vector in the range $[0, 2\gamma_1 - 1]$ using a rejection sampler. SampleInBall() is used during signature generation and verification to generate a polynomial with only $\tau$ coefficients set to $+1$ or $-1$ and the remaining coefficients set to 0.
• Polynomial Arithmetic: Polynomial multiplications are performed using the Number Theoretic Transform (NTT) method. The addition and subtraction operations are coefficient-wise linear operations.
• Hash functions: SHAKE-256 is used to make a collision-resistant hash function, CRH().
• Power2Round: The function Power2Round$_q$() takes an element $r = r_1 \cdot 2^d + r_0$ and returns $r_0$ and $r_1$, where $r_0 = r \bmod^{\pm} 2^d$ and $r_1 = (r - r_0)/2^d$.
• Decompose and other related functions: Let $\alpha$ be a divisor of $q - 1$. The function Decompose$_q$() is defined in the same way as Power2Round$_q$() with $\alpha$ replacing $2^d$. HighBits$_q$()/LowBits$_q$() return $r_1$/$r_0$ from the output of Decompose$_q$(). MakeHint$_q$() uses HighBits$_q$() to produce a hint $\mathbf{h}$. UseHint$_q$() uses the hint $\mathbf{h}$ produced by MakeHint$_q$() to recover the high bits.
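The rounding helpers in the list above can be illustrated with a minimal Python sketch. The function names, the helper `centered_mod`, and the default $d = 13$ are illustrative choices rather than part of the text; the special handling of $r - r_0 = q - 1$ in `decompose` follows the Dilithium specification and is not spelled out in the summary above.

```python
Q = 8380417  # Dilithium prime, 2^23 - 2^13 + 1
D = 13       # assumed number of dropped low bits


def centered_mod(r, m):
    """Return r mod± m, the representative in (-m/2, m/2]."""
    r0 = r % m
    if r0 > m // 2:
        r0 -= m
    return r0


def power2round(r, d=D, q=Q):
    """Split r into (r1, r0) with r = r1 * 2^d + r0."""
    r %= q
    r0 = centered_mod(r, 1 << d)
    return (r - r0) >> d, r0


def decompose(r, alpha, q=Q):
    """Same split with alpha in place of 2^d (alpha divides q - 1)."""
    r %= q
    r0 = centered_mod(r, alpha)
    if r - r0 == q - 1:
        # Corner case taken from the Dilithium specification.
        return 0, r0 - 1
    return (r - r0) // alpha, r0


def high_bits(r, alpha, q=Q):
    return decompose(r, alpha, q)[0]


def low_bits(r, alpha, q=Q):
    return decompose(r, alpha, q)[1]
```

In both functions the returned pair satisfies $r \equiv r_1 \cdot m + r_0 \pmod{q}$, with $m = 2^d$ or $m = \alpha$.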
B. Kyber
Kyber is an IND-CCA2-secure key encapsulation scheme. It has three principal algorithms: key generation, encapsulation, and decapsulation. The receiver generates a public and secret key using the key generation algorithm and broadcasts the public key. When the sender wishes to send a message, he/she can encapsulate it using the receiver’s public key through the encapsulation algorithm. The receiver can then decapsulate it using her/his secret key through the decapsulation algorithm. Three variants of Kyber, Kyber-512, Kyber-768, and Kyber-1024, are provided for NIST security levels 1, 3, and 5, respectively. The variants differ in module dimensions and coefficient distributions. Readers may refer to [7] for the detailed specifications of Kyber. Kyber has the following internal routines:
• Pseudorandom functions: Kyber uses PRF (SHAKE-256) and XOF (SHAKE-128) to generate the pseudo-random numbers for polynomial coefficients.
• Hash functions: Kyber provides functions $H$ and $G$, instantiated with SHA3-256 and SHA3-512, respectively, for hashing.
• Key-derivation function (KDF): It is instantiated using SHAKE-256 in Kyber.
• Polynomial Arithmetic: Kyber uses a new NTT-based polynomial multiplication method. Polynomial addition and subtraction are also supported.
• Samplers: Uniform sampling (Parse) is used to generate the public polynomials, and Binomial sampling (CBD) is used to generate secret and error polynomials.
• Encode/Decode: These modules are used to serialize/deserialize the polynomials to/from byte arrays.
• Compress/Decompress: They are used to reduce the size of the ciphertext by discarding low-order bits. They are defined on an element $x \in \mathbb{Z}_q$ as $\lceil (2^d/q) \cdot x \rfloor \pmod{2^d}$ and $\lceil (q/2^d) \cdot x \rfloor$, respectively, where $d < \lceil \log_2(q) \rceil$. The value $x'$ such that $x' = \mathrm{Decompress}(\mathrm{Compress}(x, d), d)$ is an element close to $x$.
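As a rough illustration of the Compress/Decompress pair described above, the following Python sketch evaluates both roundings in integer arithmetic. The function names and the round-half-up integer formulation are assumptions made for this example; the text only gives the mathematical definitions.

```python
Q = 3329  # Kyber prime


def compress(x, d, q=Q):
    """round((2^d / q) * x) mod 2^d, computed with integers only."""
    return (((x << d) + q // 2) // q) % (1 << d)


def decompress(y, d, q=Q):
    """round((q / 2^d) * y), computed with integers only."""
    return (y * q + (1 << (d - 1))) >> d


def roundtrip_error(x, d, q=Q):
    """Centered distance between x and Decompress(Compress(x, d), d)."""
    diff = (decompress(compress(x, d, q), d, q) - x) % q
    return diff - q if diff > q // 2 else diff
```

The helper `roundtrip_error` makes the "element close to $x$" statement measurable: the smaller $d$ is, the more low-order information is discarded and the larger the error can become.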
C. NTT-based Polynomial Multiplication
Polynomial multiplication of $(n-1)$-degree polynomials has been the focus of works on PQC implementations. Most implementations use the traditional NTT-based multiplication technique, while others show how methods like schoolbook $O(n^2)$, Karatsuba $O(n^{1.59})$, etc., can be used. NTT-based multiplication has a time complexity of $O(n \log n)$. The designers of Dilithium and Kyber select polynomials in the ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$, where the modulus $q$ is an NTT-friendly prime, thus making it easier to use the fast NTT-based multiplication method.
The forward NTT transform converts an $(n-1)$-degree polynomial (coefficient representation) to $n$ degree-0 polynomials (value representation). Then two polynomials in their value-representation form (NTT domain) can be multiplied coefficient-wise to get the multiplied values in the NTT domain. If we need the polynomial in coefficient representation again, a backward NTT transform (INTT) is used. The conversion to and from the NTT domain has a time complexity of $O(n \log n)$. Coefficient-wise multiplication has a time complexity of $O(n)$. Thus, the total time complexity is $O(n \log n)$. Various algorithms exist in the literature to facilitate these transformations. The most used ones are the Cooley-Tukey (Algorithm 1) transform for NTT and Gentleman-Sande for INTT. For more information on NTT/INTT, refer to [27].
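The transform, pointwise-multiply, inverse-transform flow described above can be sketched in Python as follows. This is a software illustration for a complete NTT with the Dilithium modulus, not a model of the hardware. The helper names, the brute-force search for the root of unity, and the schoolbook reference used for checking are all assumptions of this example.

```python
Q = 8380417  # Dilithium modulus, q ≡ 1 (mod 2n)
N = 256


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def find_psi(n, q):
    """Search for a primitive 2n-th root of unity modulo the prime q."""
    assert (q - 1) % (2 * n) == 0
    for h in range(2, q):
        psi = pow(h, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:  # psi^n = -1, so the order is exactly 2n
            return psi
    raise ValueError("no primitive 2n-th root of unity found")


def twiddle_table(root, n, q):
    """Powers of `root` in bit-reversed order (the table g of Algorithm 1)."""
    bits = n.bit_length() - 1
    return [pow(root, bit_reverse(i, bits), q) for i in range(n)]


def ntt(x, g, q):
    """Forward transform with Cooley-Tukey butterflies."""
    x = list(x)
    n = len(x)
    t, m = n // 2, 1
    while m < n:
        k = 0
        for i in range(m):
            s = g[m + i]
            for j in range(k, k + t):
                u = x[j]
                v = x[j + t] * s % q
                x[j] = (u + v) % q
                x[j + t] = (u - v) % q
            k += 2 * t
        t //= 2
        m *= 2
    return x


def intt(x, g_inv, q):
    """Inverse transform with Gentleman-Sande butterflies."""
    x = list(x)
    n = len(x)
    t, m = 1, n
    while m > 1:
        h = m // 2
        k = 0
        for i in range(h):
            s = g_inv[h + i]
            for j in range(k, k + t):
                u, v = x[j], x[j + t]
                x[j] = (u + v) % q
                x[j + t] = (u - v) * s % q
            k += 2 * t
        t *= 2
        m = h
    n_inv = pow(n, -1, q)  # requires Python 3.8+
    return [c * n_inv % q for c in x]


def ntt_multiply(a, b, q=Q):
    """Multiply a and b in Z_q[x]/(x^n + 1) via the NTT domain."""
    n = len(a)
    psi = find_psi(n, q)
    g = twiddle_table(psi, n, q)
    g_inv = twiddle_table(pow(psi, -1, q), n, q)
    c_hat = [u * v % q for u, v in zip(ntt(a, g, q), ntt(b, g, q))]
    return intt(c_hat, g_inv, q)


def schoolbook_multiply(a, b, q=Q):
    """Reference O(n^2) product in Z_q[x]/(x^n + 1)."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            sign = 1 if i + j < n else -1  # x^n wraps around to -1
            c[(i + j) % n] = (c[(i + j) % n] + sign * a[i] * b[j]) % q
    return c
```

The forward transform leaves its output in bit-reversed order and the inverse transform consumes that order, so no explicit reordering step is needed between them.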
Next, we discuss the major optimizations made to realize
the design methodology in the context of Dilithium and Kyber.
III. PROPOSED UNIFIED HARDWARE ARCHITECTURE
The first and foremost goal is to unify the digital signature scheme and the key-encapsulation scheme. While doing this, it is important to ensure that the design is compact and flexible. Unification follows a straightforward three-step approach. First, we must identify the most area and time-consuming building blocks, the Giants. This is because unifying the low area and time-consuming building blocks (the Dwarves) will not reduce the area consumption significantly, and will instead limit the flexibility of the design. The next step is to find the algorithmic synergies between the Giants of the two schemes. The final step is to discern if some of the Dwarves which are dependent on the Giants can be unified with the Giants to reduce both area and time consumption.
Algorithm 1 The Cooley-Tukey NTT Algorithm [28]
In: An n-element vector x = [x_0, ..., x_{n-1}] where x_i ∈ [0, q − 1]
In: n (power of 2), modulus q (q ≡ 1 (mod 2n))
In: g (precomputed table of 2n-th roots of unity, bit-reversed order)
Out: x ← NTT(x)
1: t ← n/2; m ← 1
2: while (m < n) do
3:   k ← 0
4:   for (i ← 0; i < m; i ← i + 1) do
5:     S ← g[m + i]
6:     for (j ← k; j < k + t; j ← j + 1) do
7:       U ← x[j]                      ▷ Butterfly starts
8:       V ← x[j + t] · S (mod q)
9:       x[j] ← U + V (mod q)
10:      x[j + t] ← U − V (mod q)      ▷ Butterfly ends
11:    end for
12:    k ← k + 2t
13:  end for
14:  t ← t/2; m ← 2m
15: end while
16: return x
Fig. 1. High-level architecture of KaLi
A high-level view of the proposed architecture KaLi
is given in Fig. 1. The Keccak-based SHA-SHAKE unit
and polynomial arithmetic unit are the two Giants in both
schemes. The remaining building blocks are deemed as
Dwarves since they comprise only 20% of the total area
consumption. Unifying the Keccak-based SHA-SHAKE unit
is relatively easy since we can use a common Keccak core for
both schemes. Therefore in this section, we will discuss how
we efficiently unified the polynomial arithmetic unit. We will
also discuss how we efficiently manage the memory for the
two schemes. Another facet of the work, the optimization for
ASIC platforms, is also presented. We utilized multiple clock
domains and boosted the memory bandwidth budget on ASIC
platforms to reduce area consumption.
A. The colossal Giant: Polynomial Arithmetic Unit
The Polynomial Arithmetic Unit performs polynomial addition, subtraction, and multiplication. Polynomial addition and subtraction are simple coefficient-wise operations and hence cheap. Polynomial multiplication is rather complex, and it is what makes the polynomial arithmetic unit a Giant. Both schemes perform this using NTT, as discussed in Section II-C. Although the two schemes use NTT-based polynomial multiplication units, there are many differences between the two schemes that make their NTT units quite distinct.
1) A clash of the Giants: The differences between the presumed similar NTTs
The first distinction between the NTTs used by the two schemes lies in the algorithm itself. The NTT-based polynomial multiplication method used in Dilithium requires the existence of a $2n$-th root of unity, which mandates $q \equiv 1 \pmod{2n}$. Accordingly, Dilithium uses a complete-NTT. After a complete-NTT transform of an $(n-1)$-degree polynomial, we get $n$ polynomials of degree 0. In [29], Lyubashevsky et al. propose a new method for NTT-based polynomial multiplication that requires only $q \equiv 1 \pmod{n}$, without pre-processing and post-processing operations. This technique is adopted by Kyber, whose 12-bit prime modulus does not have a $2n$-th root of unity. Therefore, Kyber has to use an incomplete-NTT. An incomplete-NTT gives us $n/2$ polynomials of degree 1. These polynomials cannot be multiplied coefficient-wise. For the incomplete-NTT, the multiplication of two degree-1 polynomials is performed in the ring $\mathbb{Z}_q[x]/(x^2 - \omega^i)$, where $\omega$ is the $n$-th root of unity and $i$ depends on the index of the coefficients. For the details, readers may follow the original Kyber specifications [7] or related prior works in the literature [30].
Along with this, the two schemes also differ in datapath design. Dilithium has a 23-bit prime modulus, while Kyber has a 12-bit prime modulus. Therefore, while Kyber requires 12-bit adder/subtracter/multiplier units, Dilithium requires 23-bit ones. Designing a datapath for one of them and using it for the other would lead to over- or under-saturation.
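The root-of-unity conditions above can be checked numerically. The short Python sketch below relies on the standard fact that, for a prime $q$, a primitive root of unity of a given order exists in $\mathbb{Z}_q$ exactly when that order divides $q - 1$. The function names are illustrative and not taken from the paper.

```python
Q_DILITHIUM = 8380417
Q_KYBER = 3329
N = 256


def root_of_unity_exists(order, q):
    """True if Z_q (q prime) contains a primitive root of unity of this order."""
    return (q - 1) % order == 0


def ntt_layers(n, q):
    """Count how many radix-2 splitting layers of x^n + 1 are possible mod q.

    Layer L (counting from 1) needs a primitive 2^(L+1)-th root of unity.
    """
    layers = 0
    while (1 << layers) < n and root_of_unity_exists(1 << (layers + 2), q):
        layers += 1
    return layers


def residual_length(n, q):
    """Number of coefficients in each polynomial left after the transform."""
    return n >> ntt_layers(n, q)
```

With these parameters the sketch gives eight layers and single-coefficient results for Dilithium, and seven layers leaving two-coefficient (degree-1) polynomials for Kyber, which matches the complete and incomplete transforms described above.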
Next, we will discuss in detail how we achieved a unified polynomial multiplication unit with full utilization. The unit of interest here is the butterfly unit (BFU). Each BFU performs dyadic addition, subtraction, and multiplication on the two input coefficients. The results are reduced modulo $q$. This is shown by steps 7-10 in Algorithm 1. Since modular multiplication is the most expensive operation, we will first discuss how we unify this unit. Then, we will discuss how, with a few more changes, the entire BFU can be consolidated.
2) Flexible fusion of Modular Multiplier Unit
As discussed above, if we naively use the 23-bit Dilithium polynomial multiplier unit for Kyber, it will always be undersaturated, as half of it will be unused. If we instead use a 12-bit Kyber unit for Dilithium, it will not only require extra control logic but also slow down Dilithium’s NTT. Therefore, we need a solution using a 23-bit Dilithium unit that does not lead to undersaturation. The modular multiplier unit has two parts: (i) an integer multiplier and (ii) a modular reduction unit. We propose an algorithm (Algorithm 2) to make the integer multiplication unit designed for Dilithium flexible for Kyber. It performs two 23-bit × 12-bit integer multiplications. The results are added for Dilithium and concatenated for Kyber. This algorithm gives us one multiplied coefficient in the case of Dilithium and two multiplied coefficients in the case of Kyber.
The modular multiplier unit, designed to support modular multiplication using both primes, uses two DSP units of Xilinx FPGAs. The hardware architecture of the re-configurable integer multiplier is shown in Fig. 2. The datapath depends on the scheme type and is heavily pipelined. We used the internal registers of the DSP units to synchronize the two DSP unit outputs and achieve a high clock frequency. Now we need to design a modular reduction unit accordingly.
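A bit-level software model can make the add-versus-concatenate behaviour of Algorithm 2 concrete. The Python sketch below mirrors the algorithm's steps; the names `flexible_multiply` and `pack12` are made up for this illustration, and the model ignores pipelining and the DSP mapping.

```python
MASK12 = 0xFFF


def pack12(hi, lo):
    """Place two 12-bit Kyber coefficients into one 24-bit operand."""
    return ((hi & MASK12) << 12) | (lo & MASK12)


def flexible_multiply(a, b, sel):
    """Model of Algorithm 2 (sel = 0 for Dilithium, sel = 1 for Kyber).

    Dilithium: returns the full product a * b.
    Kyber: returns {a_hi * b_hi, a_lo * b_lo} as one 48-bit value.
    """
    a_hi, a_lo = (a >> 12) & MASK12, a & MASK12
    d0 = (b >> 12) & MASK12 if sel else b
    m0 = d0 * a_hi                       # first partial multiplication
    m1 = (m0 << 24) if sel else (m0 << 12)
    d1 = (b & MASK12) if sel else b
    return d1 * a_lo + m1                # second multiplication, then add
```

In Kyber mode the low product is below $2^{24}$, so adding the high product shifted by 24 bits never disturbs it and the addition acts as a concatenation.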
Fig. 2. Flexible yet compact integer multiplier. The red lines show control signals, and the black lines show data movement. The two DSP units used are highlighted in purple.
Algorithm 2 Integer Multiplication Algorithm
In: a, b ∈ Z_8380417 or a[23:12], b[23:12], a[11:0], b[11:0] ∈ Z_3329
In: sel ∈ {0, 1} (0 for Dilithium and 1 for Kyber)
Out: d = a · b or d = {a[23:12] · b[23:12], a[11:0] · b[11:0]}
1: d0 = (sel) ? b[23:12] : b
2: m0 = d0 · a[23:12]
3: m1 = (sel) ? (m0 ≪ 24) : (m0 ≪ 12)
4: d1 = (sel) ? b[11:0] : b
5: d = d1 · a[11:0] + m1
6: return d
3) Versatile Modular Reduction Unit
The naive solution is to design separate modular reduction units for the two primes. It would require one modular reduction unit for the Dilithium prime and two reduction units for the Kyber prime, which would result in extra hardware costs. To avoid this, we propose a unified modular reduction unit. Both the Dilithium ($2^{23} - 2^{13} + 1$) and Kyber ($2^{12} - 2^{9} - 2^{8} + 1$) primes have a pseudo-Mersenne structure. For the Dilithium prime, we followed the method described in [31], which uses the congruence $2^{23} \equiv 2^{13} - 1$ recursively. Using this congruence, we can reduce a 46-bit integer $d \pmod{2^{23} - 2^{13} + 1}$ to the integer $2^{13} \cdot d[45:23] + d[22:0] - d[45:23]$, which consists of the addition/subtraction of 36-bit and 23-bit partial results. If we apply this operation recursively, we obtain $d \equiv d[22:0] + (d[32:23] + d[42:33] + d[45:43]) \cdot 2^{13} - d[45:23] - d[45:33] - d[45:43]$. Similarly, for Kyber, we followed the add-shift-based method proposed in [30], which generates partial results using the congruences $2^{12} \equiv 2^{9} + 2^{8} - 1$ and $2^{11} \equiv -2^{10} - 2^{8} - 1$ recursively.
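The folded expression for the Dilithium prime can be checked in software. The Python sketch below evaluates the partial results from the bit slices of a 46-bit input and compares against a plain modulo operation; the final correction is written as simple loops here, which is an assumption of this example and not a description of the hardware correction logic. The helper names are illustrative.

```python
Q_D = (1 << 23) - (1 << 13) + 1   # 8380417
Q_K = (1 << 12) - (1 << 9) - (1 << 8) + 1   # 3329


def bits(d, hi, lo):
    """Return the bit slice d[hi:lo] as an integer."""
    return (d >> lo) & ((1 << (hi - lo + 1)) - 1)


def reduce_dilithium(d):
    """Reduce a 46-bit integer modulo Q_D using 2^23 ≡ 2^13 - 1."""
    assert 0 <= d < (1 << 46)
    positive = bits(d, 22, 0) + (
        (bits(d, 32, 23) + bits(d, 42, 33) + bits(d, 45, 43)) << 13
    )
    negative = bits(d, 45, 23) + bits(d, 45, 33) + bits(d, 45, 43)
    r = positive - negative
    # Final correction into [0, Q_D); modelled with loops for simplicity.
    while r < 0:
        r += Q_D
    while r >= Q_D:
        r -= Q_D
    return r


def kyber_congruences_hold():
    """Check the two power-of-two congruences used for the Kyber prime."""
    first = (1 << 12) % Q_K == ((1 << 9) + (1 << 8) - 1) % Q_K
    second = (1 << 11) % Q_K == (-(1 << 10) - (1 << 8) - 1) % Q_K
    return first and second
```

Only additions, subtractions, and shifts by 13 appear in `reduce_dilithium`, which is what makes the pseudo-Mersenne structure attractive for hardware.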
Summing all partial results using carry-propagate adders (CPAs) would result in either a very long carry chain or multiple pipeline stages. In order to avoid long carry chains and pipeline delays, we used carry-save adders (CSAs) along with a CPA. The proposed unified modular reduction unit is shown in Fig. 3, where the boxes ‘D’ and ‘K’ represent the partial result generation circuits for the Dilithium and Kyber primes, respectively. All the subtraction operations are converted into additions by taking the 2’s complements of the partial results. In Fig. 3, each number inside a box represents a bit index of the input integer (0 to 45 for Dilithium and 0 to 22 for Kyber). The white and brown (terracotta) boxes represent the normal and negated bits. When the 2’s complement operation is applied to the partial results, extra plus ones, along with sign extensions, come into the picture. These are represented with blue circles.
After adding all of these partial results, we also perform a final correction which brings the resulting integer from the range (q, 3q) to the range [0, q). The proposed modular reduction unit can either perform one reduction for the Dilithium prime or two reductions for the Kyber prime. The latency of the modular reduction unit is two cycles, and it is fully pipelined.
Fig. 3. Unified modular reduction unit for Dilithium and Kyber primes.
Fig. 4. Compact butterfly unit (BFU) with flexibility for both Dilithium and Kyber. The red and blue lines show control signals, and the black lines show data movement.
4) Coalesced datapath for the Butterfly Unit
Now that we have unified the modular multiplication unit, we propose a unified BFU (Fig. 4), designed using re-configurable arithmetic units. It can perform one butterfly operation for Dilithium and two butterfly operations for Kyber using the same datapath. All the arithmetic units are made re-configurable to work for both schemes. The new re-configurable adder and subtractor units are shown in Fig. 5. The idea is to divide each 24-bit adder/subtractor into two small 12-bit parts and select the proper input signals based on the scheme.
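The split-adder idea can be modelled in a few lines of Python. In this sketch the carry from the lower 12-bit half is either forwarded (Dilithium) or replaced by zero (Kyber), after which a conditional subtraction brings each result below the modulus. The function names and the exact placement of the conditional subtraction are assumptions for illustration; the actual circuit is the one shown in Fig. 5.

```python
Q_D = 8380417
Q_K = 3329
MASK12 = 0xFFF


def pack12(hi, lo):
    """Place two 12-bit Kyber coefficients into one 24-bit operand."""
    return ((hi & MASK12) << 12) | (lo & MASK12)


def unified_mod_add(a, b, kyber):
    """One 23-bit modular addition, or two independent 12-bit ones."""
    lo = (a & MASK12) + (b & MASK12)
    carry = 0 if kyber else (lo >> 12)   # carry between halves is cut for Kyber
    hi = (a >> 12) + (b >> 12) + carry
    if kyber:
        # Two lanes, each reduced with its own conditional subtraction.
        lo = lo - Q_K if lo >= Q_K else lo
        hi = hi - Q_K if hi >= Q_K else hi
        return pack12(hi, lo)
    total = (hi << 12) | (lo & MASK12)   # equals a + b
    return total - Q_D if total >= Q_D else total
```

The same pattern would apply to the subtractor, with a borrow in place of the carry.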
Fig. 5. Unified adder and subtractor for the butterfly unit. The red lines show control signals, and the black lines show data movement.
Fig. 6. Butterfly feedback unit for Kyber’s NTT-domain polynomial multiplication.
The schoolbook multiplication for Kyber requires five multiplications for multiplying two linear (i.e., degree-one) polynomials. We have two flexible butterfly units that act as four butterfly units for Kyber and allow only four multiplications. One way to perform these five multiplications is to add another DSP multiplier just for the extra multiplication. This unit would not be useful for any other operation. To avoid this extra multiplier, we condense the five multiplications into four using a Karatsuba-like reduction. Then we use these four independent butterfly units as a set of two. The output of the first set is the input to the second set, as shown in Fig. 6. The inputs and outputs for the BFUs are highlighted in blue. The control flow here is separated from the Dilithium polynomial multiplication control flow, for simplicity. The entire flow is pipelined to achieve a high clock frequency.
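The saving from five to four multiplications can be seen in a small Python sketch of the degree-1 product in $\mathbb{Z}_q[x]/(x^2 - \zeta)$, where $\zeta$ stands for the relevant power of the root of unity. The function names are illustrative, and the sketch says nothing about how the four products are scheduled onto the butterfly units.

```python
Q = 3329  # Kyber prime


def basemul_schoolbook(a, b, zeta, q=Q):
    """(a0 + a1 x)(b0 + b1 x) mod (x^2 - zeta), five multiplications."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 + (a1 * b1 % q) * zeta) % q   # 3 multiplications
    c1 = (a0 * b1 + a1 * b0) % q                # 2 multiplications
    return c0, c1


def basemul_karatsuba(a, b, zeta, q=Q):
    """Same product with four multiplications (Karatsuba-like)."""
    a0, a1 = a
    b0, b1 = b
    p0 = a0 * b0 % q
    p1 = a1 * b1 % q
    p2 = (a0 + a1) * (b0 + b1) % q   # replaces a0*b1 and a1*b0
    c0 = (p0 + p1 * zeta) % q        # fourth multiplication
    c1 = (p2 - p0 - p1) % q
    return c0, c1
```

The cross term is recovered as $(a_0 + a_1)(b_0 + b_1) - a_0 b_0 - a_1 b_1$, trading one multiplication for extra additions and subtractions.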
The coefficient consumption during NTT/INTT is shown in
Fig. 7. Owing to the flexible datapath and efficient memory
arrangement (discussed in the next subsection), the Kyber
NTT coefficients can be consumed faster. This enables full
utilization of the datapath. The complete polynomial arithmetic
unit consumes 3,487 LUTs, 1,918 FFs, 4 DSPs, and 1 BRAM.
The BRAM is used to store the powers of roots-of-unity
(twiddle factors) required during NTT/INTT operation. In the
next section, we will discuss the efficient memory arrangement
designed to optimally feed the polynomial arithmetic unit.
B. Memory Arrangement
The polynomial arithmetic unit is designed to consume the Kyber coefficients twice as fast as the Dilithium coefficients. This requires that the memory unit feeds it at the same rate; otherwise, these unifications will not improve the performance. Dilithium coefficients are 23-bit, and we have designed the NTT/INTT unit using two butterfly cores. Each of the cores requires exclusive access to the read/write port of a memory. Therefore, we split the memory into two blocks, each storing two Dilithium coefficients per address. For Kyber, each memory block stores four 12-bit Kyber coefficients per address. Fig. 8 shows the storage of Kyber polynomials in one 64-bit word of memory. One Dilithium polynomial coefficient occupies the space of two of these coefficients, thus requiring twice the amount of storage. The arrangement also ensures that the two coefficients required during NTT/INTT are always stored across different BRAMs. Fig. 8 shows an example of the coefficient storage during Kyber’s incomplete-NTT iterations for a 16-coefficient polynomial. Next, we will discuss how we used multiple clock domains to reduce area consumption in ASIC platforms.
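The per-address layout can be illustrated with a small packing sketch in Python. It assumes that a memory word is divided into four 12-bit slots and that a Dilithium coefficient takes two adjacent slots; the slot order and the use of only the low 48 bits of the word are assumptions of this example, since the text does not specify them.

```python
SLOT_BITS = 12
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_kyber(coeffs):
    """Pack four 12-bit Kyber coefficients into one memory word."""
    assert len(coeffs) == 4
    word = 0
    for i, c in enumerate(coeffs):
        word |= (c & SLOT_MASK) << (SLOT_BITS * i)
    return word


def unpack_kyber(word):
    """Recover the four 12-bit slots of a memory word."""
    return [(word >> (SLOT_BITS * i)) & SLOT_MASK for i in range(4)]


def pack_dilithium(coeffs):
    """Pack two Dilithium coefficients, each spanning two 12-bit slots."""
    assert len(coeffs) == 2
    return coeffs[0] | (coeffs[1] << (2 * SLOT_BITS))


def unpack_dilithium(word):
    """Recover the two 24-bit slots of a memory word."""
    mask = (1 << (2 * SLOT_BITS)) - 1
    return [word & mask, (word >> (2 * SLOT_BITS)) & mask]
```

Reading one such word therefore delivers either two Dilithium coefficients or four Kyber coefficients, which is the factor of two in consumption rate that the arithmetic unit expects.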
C. Multi-clock domains: Customization for ASIC platforms
The memory organization discussed above has two sets of BRAMs to feed the two BFUs. These BRAMs are used by all the remaining building blocks as well. They are implemented using dual-port BRAMs on FPGA. In ASIC, dual-port RAMs consume more area than single-port RAMs. Therefore, to reduce the area consumption, we replace the dual-port RAMs with single-port RAMs that work at a clock frequency twice as fast as the rest of the design. Using two different sources for the two clocks leads to an asynchronous setting. This creates metastability problems due to clock-domain crossing. To avoid these problems, we decided to keep the clocking synchronous and generate the slow clock (clock logic) from the fast clock (clock mem).
Fig. 9 describes the handshake between memory and logic.
A wrapper is provided to process the simultaneous reads
and writes to the memories. The read operation is given
preference over the write operation to ensure data is valid
when the building blocks fetch it and avoid any issues due
to clock glitches. The read latency is three clock cycles, and
all the building blocks are tailored accordingly. This design
helps reduce the area for ASIC designs. Note that a similar
modification would not change the FPGA area consumption and
would instead cause timing problems, since the memory would have to run at a high
clock frequency. Therefore, this adaptation specifically targets
ASIC platforms.
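The idea of emulating a dual-port memory with a single-port memory clocked twice as fast can be sketched with a small behavioural model in Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, the three-cycle read latency and the actual handshake signals of Fig. 9 are not modelled, and the only property shown is that a read and a write issued in the same slow cycle are served in two fast cycles with the read going first.

```python
class SinglePortRam:
    """Single-port memory: one access (read or write) per fast-clock cycle."""

    def __init__(self, depth):
        self.cells = [0] * depth
        self.fast_cycles = 0  # number of fast-clock accesses performed

    def access(self, addr, write_data=None):
        self.fast_cycles += 1
        if write_data is None:
            return self.cells[addr]
        self.cells[addr] = write_data
        return None


class DualPortWrapper:
    """Serves one read and one write per slow cycle using two fast cycles."""

    def __init__(self, depth):
        self.ram = SinglePortRam(depth)

    def slow_cycle(self, read_addr=None, write=None):
        read_value = None
        if read_addr is not None:      # the read is served first
            read_value = self.ram.access(read_addr)
        if write is not None:          # the write uses the second fast cycle
            addr, data = write
            self.ram.access(addr, write_data=data)
        return read_value


mem = DualPortWrapper(depth=16)
mem.slow_cycle(write=(3, 111))
old_value = mem.slow_cycle(read_addr=3, write=(3, 222))  # read sees 111
new_value = mem.slow_cycle(read_addr=3)                  # read sees 222
```

Because the read is served before the write, a building block that reads and writes the same address in one slow cycle observes the previously stored value.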
So far, we have discussed the major contributions of this work.
Next, we briefly discuss how we efficiently implement
the remaining building blocks. We start with the rejection
samplers used in both schemes. These are the Giant-dependent
Dwarves that can help reduce area and time consumption
without compromising the flexibility of the design.
D. The Giant and the Dwarves: Keccak-based SHA-SHAKE
unit and the rejection samplers
Dilithium requires SHAKE-128 and SHAKE-256 for
pseudo-random number generation and hashing. Kyber requires
SHA3-256 and SHA3-512 for hashing and SHAKE-128
and SHAKE-256 for KDF and pseudo-random number
generation. These different Keccak-based functions are implemented
as modes of the same Keccak output. Therefore, we
can use the same Keccak instance for all these modes. Both
schemes employ different sampling for the generation of secret
and error polynomials. While some of these fully consume the
Keccak output, the remaining have to keep track of the leftover
bits.
Fig. 7. Timeline showing the unified butterfly unit processing Dilithium and Kyber coefficients.
Fig. 8. Storage of coefficients during Kyber’s NTT for 16-coefficient
polynomial. [Panels for L=16, L=8, L=4, and L=2, each showing coefficients 0 to 31 laid out across BRAM0 and BRAM1.]
Fig. 9. The data read-and-write handshake between memory and logic unit
We combine the rejection sampler with the Keccak unit
using a book-keeping approach similar to [31]. It improves
the performance of the sampling operation, as we do not need
to store and then read the Keccak output in between. The base
implementation of Keccak follows a high-speed and parallel
directive. The control and datapath are modified to work with the
rejection samplers, since the control flow depends on coefficients passing the
rejection constraints. The complete Keccak unit consumes
12,326 LUTs and 3,560 FFs.
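To make the interplay between the Keccak modes and rejection sampling concrete, here is a minimal software sketch in Python using the standard `hashlib` module, which exposes SHA3-256, SHA3-512, SHAKE-128, and SHAKE-256. The function `sample_uniform` follows the uniform sampling of Kyber, where every three bytes of SHAKE-128 output yield two 12-bit candidates and a candidate is kept only if it is smaller than q = 3329. The function name, the seed, and the block-wise squeezing strategy are assumptions of this sketch; unlike the hardware described above, the software version materialises the Keccak output before sampling from it.

```python
import hashlib

KYBER_Q = 3329

# The four Keccak-based modes mentioned in the text, as exposed by hashlib.
KECCAK_MODES = {
    "SHA3-256": hashlib.sha3_256,
    "SHA3-512": hashlib.sha3_512,
    "SHAKE-128": hashlib.shake_128,
    "SHAKE-256": hashlib.shake_256,
}


def sample_uniform(seed: bytes, n: int = 256) -> list:
    """Rejection-sample n coefficients in [0, q) from a SHAKE-128 stream."""
    coeffs = []
    nbytes = 504  # start with three 168-byte SHAKE-128 blocks
    while len(coeffs) < n:
        stream = hashlib.shake_128(seed).digest(nbytes)
        coeffs = []
        for i in range(0, len(stream) - 2, 3):
            b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
            d1 = b0 | ((b1 & 0x0F) << 8)  # first 12-bit candidate
            d2 = (b1 >> 4) | (b2 << 4)    # second 12-bit candidate
            for d in (d1, d2):
                if d < KYBER_Q and len(coeffs) < n:
                    coeffs.append(d)
        nbytes += 168  # squeeze one more block if too many were rejected
    return coeffs


poly = sample_uniform(b"example seed")
```

Since the number of accepted candidates varies with the seed, the amount of Keccak output needed is not known in advance, which is why the sampler and the Keccak control have to be coupled.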
We have unified all the Giants, so now we will discuss the
optimized implementation techniques for the Dwarves.
E. Optimizations for the Dwarves
Making a design compact while keeping it agile increases
the life and usability of KaLi on the FPGA and ASIC
platforms. However, this comes with a series of challenges.
We must ensure that keeping the design agile and flexible does
not come at a huge price in terms of area. Similarly,
making the design compact should not degrade its
performance. We now discuss how to make certain building blocks
of the two schemes compact while maintaining flexibility.
1) Compress/Decompress Unit
The decompress unit performs division by a power of two and
a rounding operation, which is trivial to implement in hardware.
On the other hand, the compress operation requires division
by $q$ and rounding. Some works in the literature use Barrett
reduction and division algorithms to perform the compress
operation. We decide to use sufficient precision and convert
the division by $q$ into multiplication and shift operations.
The proposed multiplication-based compress algorithm
is shown in Algorithm 3. The input is the Kyber coefficient
$x$ and the type of compression required $d$. The compressed
coefficient $y$ is returned as the output.
Algorithm 3 The Proposed Compression Algorithm
In: $x \in \mathbb{Z}_{3329}$, $d \in \{1, 4, 5, 10, 11\}$
Out: $y = \lceil (2^d/q) \cdot x \rfloor$
1: switch $d$ do
2: case 1: $t = (10079 \cdot x)$; $y = (t \gg 24) + (t[23] \gg 23)$
3: case 4: $t = (315 \cdot x)$; $y = (t \gg 16) + (t[15] \gg 15)$
4: case 5: $t = (630 \cdot x)$; $y = (t \gg 16) + (t[15] \gg 15)$
5: case 10: $t = (5160669 \cdot x)$; $y = (t \gg 24) + (t[23] \gg 23)$
6: case 11: $t = (10321339 \cdot x)$; $y = (t \gg 24) + (t[23] \gg 23)$
7: end switch
8: return $y \pmod{2^d}$
Fig. 10. Architecture of the Compress/Decompress unit. The red lines show
control signals, and the black lines show data movement.
Since the multiplications are by constant values, we implement
these operations using an add-and-shift technique utilizing
the LUTs. Fig. 10 shows the hardware architecture of this
multiplication unit, used for computing the $t$ values in Algorithm 3.
This unit is unified and works for both compress and
decompress operations. The control flow depends on the
type of compression or decompression required.
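The multiply-and-shift compression of Algorithm 3 can be sketched in a few lines of Python. The constants and shift amounts are those listed in the algorithm; reading the rounding term as "add the bit of $t$ just below the shift position" is our interpretation of the notation, and the reference function (exact integer rounding of $2^d x / q$) is added only for cross-checking. Function names are hypothetical.

```python
KYBER_Q = 3329

# d -> (constant multiplier, shift amount), as listed in Algorithm 3
COMPRESS_PARAMS = {
    1: (10079, 24),
    4: (315, 16),
    5: (630, 16),
    10: (5160669, 24),
    11: (10321339, 24),
}


def compress(x: int, d: int) -> int:
    """Multiply-and-shift compression of a coefficient x in [0, q)."""
    mult, shift = COMPRESS_PARAMS[d]
    t = mult * x
    # Truncate, then add the bit just below the cut to round to nearest.
    y = (t >> shift) + ((t >> (shift - 1)) & 1)
    return y % (1 << d)


def compress_reference(x: int, d: int) -> int:
    """Exact round(2^d * x / q) mod 2^d using integer division by q."""
    return (((x << d) + KYBER_Q // 2) // KYBER_Q) % (1 << d)


example = compress(1000, 10)
```

Each multiplier is approximately $2^{s+d}/q$ for the shift amount $s$, so the product followed by a right shift approximates the division by $q$. The two functions can be compared exhaustively over all 3329 inputs for each supported $d$.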
2) Encode/Decode Unit
Encode and decode units perform coefficient-to-byte and
byte-to-coefficient conversions, for all security levels of the
Kyber scheme. We used a similar idea as proposed in [32]
which uses a 32-bit interface. Our architecture uses a 64-bit
interface and thus the proposed encode unit uses a 104-bit
buffer. It can encode 1-bit, 4-bit, 5-bit, 10-bit, and 11-bit long
coefficients. The decode unit can decode 64-bit inputs into
1-bit, 4-bit, 5-bit, 10-bit, and 11-bit long coefficients using a
72-bit buffer.
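The coefficient-to-byte conversion can be illustrated with a small bit-buffer sketch in Python. It is a software analogue of the buffered approach described above, not the authors' circuit: the buffer here is an unbounded Python integer rather than a 104-bit or 72-bit register, output is produced one byte at a time rather than over a 64-bit interface, and the least-significant-bit-first packing order and function names are assumptions of the sketch.

```python
def encode(coeffs, d):
    """Pack d-bit coefficients into bytes, least-significant bit first."""
    buffer, bits, out = 0, 0, bytearray()
    for c in coeffs:
        buffer |= (c & ((1 << d) - 1)) << bits  # append d new bits
        bits += d
        while bits >= 8:                        # drain full bytes
            out.append(buffer & 0xFF)
            buffer >>= 8
            bits -= 8
    if bits:                                    # flush a partial last byte
        out.append(buffer & 0xFF)
    return bytes(out)


def decode(data, d, count):
    """Unpack `count` d-bit coefficients from bytes."""
    buffer, bits, coeffs = 0, 0, []
    for byte in data:
        buffer |= byte << bits                  # append 8 new bits
        bits += 8
        while bits >= d and len(coeffs) < count:
            coeffs.append(buffer & ((1 << d) - 1))
            buffer >>= d
            bits -= d
    return coeffs


sample = [(7 * i) % 1024 for i in range(256)]
packed = encode(sample, 10)
```

For 256 coefficients of $d$ bits each, the encoded output is $32d$ bytes long, and the buffer only has to hold the bits left over between writes.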
3) Pack/Unpack unit
Similar to Kyber, Dilithium requires coefficient-to-byte and
byte-to-coefficient conversions for various coefficient sizes.
Pack and unpack units perform these conversions for all
security levels of Dilithium for coefficient sizes 3, 4, 6, 10, 13,
18, and 20 bits. We again followed the idea proposed in [32]
for the pack and unpack units.
The remaining blocks of both schemes are different and
unifying them would not save any area and instead complicate
the control logic and reduce the flexibility of the design. These
building blocks do not require any DSP units and comprise
simple bit-wise packing, unpacking, or addition/subtraction
operations. They are implemented as individual blocks and
they occupy only 18% of the cryptoprocessor’s area.
F. Instruction set cryptoprocessor
We made the building blocks compact while ensuring flex-
ibility, but this is insufficient. What happens if, in two years,
the Keccak pseudo-random number generation and hashing
unit are obsolete? Do we then need to redesign the entire
cryptoprocessor? To counter this and increase agility as well
as flexibility, we design an instruction set architecture (ISA),
where each instruction is a building block required by the
cryptographic schemes. A simple program controller runs the
cryptographic protocols by executing the necessary instruc-
tions, manages the synchronization of parallel instructions, and
avoids back-and-forth CPU-Cryptoprocessor communication.
Note that the program controller is not a ‘control processor’,
and it does not contain arithmetic circuits to process operand
data. It simply decodes an instruction and then activates the
corresponding module inside the cryptoprocessor. It consumes
only 8% of the cryptoprocessor’s area. The instructions and
the corresponding hardware modules are listed in Table I.
G. Running the Giants and the Dwarves in parallel
Our goal was to make the unified design compact and agile.
However, does this mean we have to pay an equal price
in terms of performance? To some extent, this is correct.
However, we should continue to ponder on some methods
that could boost the performance without increasing the area
consumption. One such way is to run the Giant instructions
in parallel to each other or to multiple Dwarf instructions, as
shown in [31]. We make sure that two Giants, the Keccak
unit and the polynomial arithmetic unit, can always run in
parallel to cancel each other’s run-time. It leads to a reduction
of 35% in the total run-time. Similar to [31], we define two
instruction sets (S-1 and S-2), as shown in Table I. Every
instruction belonging to the first set can be run in parallel with
any instruction that belongs to the second set. The instruction
opcodes for each instruction are shown in the INS column
of Table I. Following the design methodology, we design the
unified cryptoprocessor, KaLi, as shown in Fig. 1.
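The effect of running an S-1 instruction in parallel with an S-2 instruction can be illustrated with a toy timing model in Python. Everything in it is hypothetical: the cycle counts are made up for the example and are not taken from the tables, and the explicit `SYNC` entry is a modelling assumption standing in for whatever synchronization the program controller actually performs.

```python
def run_time(program, parallel=True):
    """Total cycles for a list of (set_name, cycles) entries.

    set_name is "S-1", "S-2", or "SYNC" (wait until both sets are idle).
    With parallel=False every instruction waits for the previous one.
    """
    busy = {"S-1": 0, "S-2": 0}  # cycle at which each set becomes free
    for set_name, cycles in program:
        if set_name == "SYNC" or not parallel:
            now = max(busy.values())
            busy = {"S-1": now, "S-2": now}
        if set_name != "SYNC":
            busy[set_name] += cycles
    return max(busy.values())


# Hypothetical program: one instruction from each set overlaps with the other.
program = [
    ("S-1", 100), ("S-2", 80),
    ("SYNC", 0),
    ("S-1", 50), ("S-2", 60),
]
serial_cycles = run_time(program, parallel=False)
parallel_cycles = run_time(program, parallel=True)
```

In this toy example the overlapped schedule finishes when the longer instruction of each pair completes, so the shorter instruction's run-time is hidden.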
IV. RESULTS
In this section, we present the performance and area results
of KaLi. The proposed architecture is described in Verilog. It
is synthesized and implemented for Zynq Ultrascale+ ZCU102
with a performance-optimized strategy using Vivado 2019.1
tool and achieves 270 MHz clock frequency on FPGA. The
proposed architecture is also implemented with 65nm and
28nm ASIC technologies using the Cadence Genus tool. On
65nm/28nm ASIC technology, it achieves 280 MHz/1 GHz for
the slow clock (in logic units), and 560 MHz/2 GHz for the
fast clock (in memory units).
TABLE I
AREA OF KALI ON THE ZYNQ ULTRASCALE+ ZCU102 FPGA PLATFORM.
ALL SECURITY LEVELS OF DILITHIUM AND KYBER ARE SUPPORTED.
Unit S-1 S-2 INS LUT DFF DSP BRAM
Comp.Core 21K 9.2K 4 21
Dilithium (D)
Decompose - 1 474 338 0 0
Pow2Round - 1 55 84 0 0
MakeHint - 2 61 124 0 0
UseHint - 3 565 433 0 0
EncodeH - 4 202 233 0 0
Pack - 2 582 181 0 0
Unpack - 3 315 182 0 0
SampleInBall - 5 505 285 0 0
Refresh - 4 8 7 0 0
VerifyEq. - 6 13 76 0 0
Kyber (K)
Encode - 7 517 190 0 0
Decode - 5 237 180 0 0
Com./Decom. - 6/7 272 376 0 0
Verify - 8 102 216 0 0
CMOV - 9 20 120 0 0
COPY - 10 15 120 0 0
D+K
Memory - 11 268 12 0 20
Keccak - 8-18 12K 3.5K 0 0
Multiplier - 12-16 3.5K 2K 4 1
Prog.Contr. 2K 296 0 3
Total 23K 9.7K 4 24
TABLE II
PERFORMANCE RESULTS FOR DILITHIUM AND KYBER-KEM IN FPGA
Operation       Dilithium-2        Dilithium-3        Dilithium-5
                Kyber-512          Kyber-768          Kyber-1024
                Cycle    µs        Cycle    µs        Cycle    µs
Dil.Gen         14,594   54.05     23,619   87.48     39,737   147.17
Dil.Sign_pre    7,883    29.2      9,640    35.7      12,943   46.27
Dil.Sign        21,812   80.79     36,643   135.72    53,965   199.87
Dil.Sign_post   1,967    7.23      2,463    9.12      3,271    12.12
Dil.Verify      15,423   57.12     26,124   96.76     46,671   172.86
Kyb.Keygen      3,395    12.6      6,291    23.2      9,089    33.7
Kyb.Encaps      4,956    18.4      7,862    29.11     11,351   42.04
Kyb.Decaps      6,807    25.21     11,291   41.82     13,905   51.5
A. Area and Performance Results
Table I presents the detailed utilization of each building
block in KaLi for the UltraScale+ ZCU102 platform. The proposed
cryptoprocessor uses 23,347 LUTs (8.4%), 9,798 DFFs
(1.7%), 4 DSPs (0.1%), and 24 BRAMs (2.6%). On ASIC,
KaLi consumes 1.107 mm$^2$ (769.04 KGE) in 65nm technology,
and 0.263 mm$^2$ (747.81 KGE) in 28nm technology.
Table II presents the cycle count and latency (in µs) for
the operations of Dilithium and Kyber. With 270 MHz clock
frequency in the FPGA, the CCA-secure key generation,
encapsulation and decapsulation operations for Kyber-768 take
23.2, 29.11, and 41.82 µs, respectively. For the best-case sce-
nario, where a valid signature is generated after the first loop
iteration [31], the key generation, signature generation, and
signature verification operations for Dilithium-3 take 87.48,
179.91, and 96.76 µs, respectively. The ASIC implementation
with 65nm/28nm technology (with 560 MHz/2 GHz clock
frequency for the memory unit) can perform the operations
for Kyber-768 and Dilithium-3 in 22.07/6.18, 27.59/7.73, and
39.62/11.09 µs, and 82.87/23.2, 171.03/47.89, and 91.66/25.67
µs, respectively. Next, we compare these results with the
existing works in the literature.
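The latencies in Table II follow from the cycle counts and the clock frequency. As a small worked example in Python (the helper name is hypothetical), dividing a cycle count by the frequency in MHz gives the latency in microseconds; the last printed digit in the table may differ slightly from this calculation because of rounding.

```python
def latency_us(cycles: int, freq_mhz: float) -> float:
    """Latency in microseconds for a cycle count at a clock given in MHz."""
    return cycles / freq_mhz


FPGA_FREQ_MHZ = 270  # FPGA clock frequency reported in the text

# Cycle counts taken from Table II.
dil3_keygen_us = latency_us(23619, FPGA_FREQ_MHZ)    # Dilithium-3 key generation
kyb768_decaps_us = latency_us(11291, FPGA_FREQ_MHZ)  # Kyber-768 decapsulation
```

The same helper can be applied to any other row of Table II.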
B. Comparison with unified designs in literature
In [24], the authors design a unified architecture for
Dilithium and Kyber. They present both HW/SW co-design
as well as HW results for Kyber while keeping some parts
of Dilithium in the software. Their NTT unit occupies 25,674
LUTs, 3,137 DFFs, 64 DSPs, and 6 BRAMs on a Xilinx Artix-7
FPGA. The NTT unit alone occupies more LUT and DSP
units than our entire design. On ASIC, it occupies 697 KGE
on 28nm technology [24] which is very close to our unified
cryptoprocessor’s 747 KGE area consumption. Their implementation
shows similar performance for Kyber even though
they target a high-speed design of Kyber in hardware and use
32 butterfly units for NTT, making their NTT unit 8× faster
than KaLi. For Dilithium, KaLi shows 10× better results.
The energy consumption of KaLi is also approximately half
of their design for both Kyber and Dilithium.
TABLE III
COMPARISON TABLE FOR DILITHIUM-3 FPGA IMPLEMENTATIONS
Ref.       Plat.   Performance       Freq.   Resources (LUT/
                   (in µs)           (MHz)   FF/DSP/BRAM)
[9]†       Zynq    -/8.8K/9.9K       100     2.6K/-/-/-
[10]a      US+V    51.9/-/-          350     54.1K/25.2K/182/15
[10]b,d    US+V    -/63.1/-          333     68.4K/86.2K/965/145
[10]c      US+V    -/-/95.1          158     61.7K/34.9K/316/18
[11]d      Ar.-7   229/0.3K/0.2K     145     30.9K/11.3K/45/21
[11]e      Ar.-7   229/0.85K/0.2K    145     30.9K/11.3K/45/21
[6]d       Ar.-7   60/0.12K/63.8     96.9    30K/10.34K/10/11
[6]e       Ar.-7   60/0.46K/63.8     96.9    30K/10.34K/10/11
[12]d      US+V    32/63/39          145     55.9K/28.4K/16/29
[12]e      US+V    32/193/39         145     55.9K/28.4K/16/29
[31]d,f    US+Z    114.7/237/127.6   200     18.5K/9.3K/4/24
KaLi d,f   US+Z    82.8/171.3/96.7   270     23K/9.7K/4/24
a: Implements K. Gen. b: Implements Sign. c: Implements Verify. d: Reports
best-case scenario. e: Reports average-case scenario. f: Supports multiple
schemes. †: HW/SW co-design. US+V/Z refers to Virtex/Zynq US+ platforms.
TABLE IV
COMPARISON TABLE FOR DILITHIUM-3 ASIC IMPLEMENTATIONS
Ref.       Tech.   Perf.*   Freq.     Area    SRAM    Energy
           (nm)    (µs)     (MHz)     (KGE)   (KB)    (µJ)
[25]†      40      18,266   72        106     40.25   88.89
[24]a      28      747      540       697     24.75   62.39
[31]a,b    65      182.3    400       854     34.82   -
KaLi a,b   65      262.7    280/560   769     34.82   117.9
KaLi a,b   28      73.55    1K/2K     747     34.82   27
*: Performance/Energy is measured as total time/energy for signature generation
and verification (key generation can be done offline). †: HW/SW co-design.
a: Supports multiple schemes. b: Reports best-case scenario.
To the best of our knowledge, no work exists in the literature
that unifies Dilithium and Kyber solely in hardware. Therefore,
next, we compare our work with standalone implementations
of Dilithium or Kyber in hardware.
C. Comparison with Dilithium-only designs in literature
Comparison with FPGA-based implementations:
Different works in the literature use different FPGA plat-
forms. Hence, drawing a one-to-one comparison between
works is not always feasible. When we started the hardware
implementation of the proposed architecture, we chose Ultra-
scale+ as the platform and thereafter pipelined the building
blocks for achieving around 300 MHz frequency on this plat-
form. Several works in the literature optimized their designs
for Artix-7 or other FPGAs. While optimizing the critical
paths of architecture for meeting the desired clock frequency is
heavily dependent on the technology of the platform, area requirements
(LUT/DSP/FF/BRAM) do not change significantly
across FPGA technologies. In Table III, we present the FPGA
implementation results of Dilithium-3 from the literature.
Zhou et al. [9] present an HW/SW co-design and they only
implement the polynomial arithmetic unit in hardware. Thus,
they consume less area but report an inferior performance.
Ricci et al. [10] provide separate designs for each of the
Dilithium variants. These designs in total occupy 9× more
area compared to our design and still perform only as well as our
design for signature verification. For signature generation,
their implementation shows only a 3× improvement. Thus, our
design gives a much better area-time trade-off result.
TABLE V
COMPARISON TABLE FOR KYBER-1024 FPGA IMPLEMENTATIONS
Ref.     Platform    Performance*   Freq.   Resources (LUT/
                     (in µs)        (MHz)   FF/DSP/BRAM)
[21]‡    Cortex-M4   33,850         100     -/-/-/-
[25]†    Artix-7     18,560         25      15K/3K/11/14
[18]†    Zynq        -              -       24K/11K/21/32
[20]†    Artix-7     85,559         59      2K/2K/5/34
[13]     Virtex-7    1,260          192     133K/-/548/202
[14]     Artix-7     154            161     7K/5K/2/3
[15]     Artix-7     63             210     12K/12K/8/15
[16]     Artix-7     56             185     13K/12K/16/16
[17]     Artix-7     286            112     16K/6K/12/17
[17]     Virtex-7    205            156     16K/6K/12/17
[22]     US+Z        23.5           450     11.6K/12K/8/8.5
[23]     US+Z        3.4 (Encap)    450     18.4K/13.7K/2/0
[23]     US+Z        4.1 (Decap)    450     15.9K/12.9K/2/0
KaLi a   US+Z        93             270     23K/9.7K/4/24
*: Performance is measured as the total time for encapsulation and decapsulation
(key generation can be done offline). a: Supports multiple schemes. †: HW/SW
co-design. ‡: SW design. US+Z refers to Zynq Ultrascale+ platforms.
The authors in [6], [11], [12] present Dilithium implementa-
tions, which consume much more area compared to our design.
Note that across technologies, the area consumption does not
change notably. A lower frequency in [6], [11] can be justified
by the use of Artix-7 FPGA, which is technologically inferior
to our Ultrascale+ platform. A limitation of [6] is that it uses
a segmented pipeline and hence, an inflexible data path for
Dilithium. The implementation in [12] uses a better platform
than ours, consumes 3×more area (LUT+FF), and achieves
a speed-up of only 2.5×(Sign+Verify). In [31], the authors
present a unified cryptoprocessor for Dilithium and Saber [33].
Their area is almost comparable to ours, considering the
difference between Kyber and Saber. We achieve a higher
clock frequency and report 1.4×better performance.
Comparison with ASIC-based implementations: Table IV
gives the comparison of implementation results for Dilithium-3
on ASIC platforms. Banerjee et al. [25] present ASIC results
for HW/SW co-design of Dilithium with Round 2 parameters.
KaLi outperforms them significantly in terms of performance.
Our hardware-only design gives 45× better performance at
the cost of only 7.5× more area. KaLi consumes almost the
same area as reported in [24] but gives a 10× and 2.8× better
performance with 28nm and 65nm technology, respectively.
[31] reports a higher number of logic gates than our design.
KaLi sets new records for energy consumption, in both 28nm
and 65nm technologies.
Thus, our FPGA and ASIC models are the most compact
compared to all the existing Dilithium implementations.
TABLE VI
COMPARISON TABLE FOR KYBER-1024 ASIC IMPLEMENTATIONS
Ref.      Tech.   Perf.*      Freq.     Area    SRAM    Energy
          (nm)    (µs)        (MHz)     (KGE)   (KB)    (µJ)
[25]†     40      6,444       72        106     40.25   36.06
[18]†     65      18,444      45        170     465     307.68
[19]†     28      727         300       979     12      19.57
[17]      65      160         200       104     190     -
[24]†,a   28      206         540       697     24.75   16.24
[24]a     28      22/17.7b    540       623     36.75   -
KaLi a    65      90.2        280/560   769     34.82   40.48
KaLi a    28      25.26       1K/2K     747     34.82   9.27
*: Performance/Energy is measured as the total time/energy for encapsulation
and decapsulation (key generation can be done offline). †: HW/SW co-design.
a: Supports multiple schemes. b: Depending on the type of schedule.
D. Comparison with Kyber-only designs in literature
Comparison with FPGA-based implementations: Table V
gives the comparison of implementation results for Kyber-
1024 on FPGA platform. Banerjee et al. [25] present an
HW/SW co-design for Kyber. KaLi surpasses their perfor-
mance results on both platforms, at the cost of some area.
Observe that KaLi gives better results compared to software
only [21] as well as HW/SW co-designs [18], [20], [25]. The
hardware-only designs [13]–[17] target Artix-7 or Virtex-7
FPGAs. KaLi consumes significantly less area than
[13]. The authors of [15]–[17], [22] target a high-speed Kyber
implementation and therefore achieve better performance and
frequency. In [23], the authors present separate results for
all the Kyber variants, and present individual encapsulation
and decapsulation architectures (unlike KaLi, which combines all
operations in a single architecture). Their design goal is also
high-speed, and their standalone implementation of Kyber-
1024 encapsulation consumes more area than KaLi. Note that
the area of our design is determined by Dilithium and not
by Kyber. Therefore, even though the results show that we
consume a very high area, we only consume the bare minimum
and give almost the best performance results on the FPGA
platform.
Comparison with ASIC-based implementations: Table VI
gives the comparison of implementation results for Kyber-
1024 on ASIC platform. On ASIC platform, KaLi consumes
the same area as reported in [24] but gives a 2.3/8.1×better
performance under 65nm/28nm technology. In fact, we surpass
all existing designs [17]–[19], [25] in terms of performance.
However, compared to some of the designs, we use more area.
We note again that Kyber is the recessive
scheme of the two, and therefore our area is higher when
compared to Kyber-only implementations. KaLi consumes the
least energy of all designs in the respective technologies.
We have now established that KaLi transcends all the state-of-the-art
works that exist in the literature, which shows that the
proposed design methodology yields better results. Next, we
discuss the aspect of application benchmarking.
E. Application-Benchmarking and Impact
Several works, for example, [34], [35], present application
benchmarking using the existing libraries. The authors in
[34] provide results for TLS protocol using the mbed TLS
library and use Kyber for KEM and SPHINCS+ for digital
signature. Runtimes are reported for Raspberry Pi 3 Model B+,
ESP32-PICO-KIT V4, Fieldbus Option Card, and LPC11U68
LPCXpresso. Compared to these works, KaLi’s FPGA imple-
mentation shows 85×, 1349×, 6190×, and 23809×speedups,
respectively, noting that the ASIC models of KaLi will further
improve the timings.
In [35], the authors evaluate PQ TLS 1.3, which is a
post-quantum variant of TLS version 1.3. It supports Round
3 parameters for both Kyber and Dilithium, along with
other schemes. They use ARM Cortex-M4 embedded plat-
form NUCLEO-F439ZI, with and without hardware accel-
eration. These boards can reach a maximum frequency of
180MHz. Compared to their results for Kyber’s decapsulation,
we achieve a speedup of 131×if we run our design at
180MHz. The authors report that replacing RSA+ECDHE with
Dilithium3+Kyber5 in TLS handshake increased the runtime
by 64%. KaLi can help bridge this gap. Thus, replacing these
devices with KaLi would give significant speedup. KaLi only
occupies 8% of the available resources on the Zynq US+
FPGA board, implying the ability to run twelve such unified
cores in parallel, further improving the speedup.
There are several data center and network security appli-
ances where high-performance SIMD processors (e.g., In-
tel/AMD with AVX) are too expensive to deploy or extremely
constrained (or passively powered) devices are too slow to
use. There are commercial cryptoprocessors that target such
applications. For example, NXP’s C29x family of crypto
coprocessors [36] (which are battery-powered) use dedicated
hardware acceleration for speeding up the RSA and elliptic
curve-based public-key cryptographic computations. It com-
putes up to 32K RSA2048 public-key operations per second.
The more constrained OPTIGA™ TPM SLB 9672 platform from
Infineon [37] has hardware acceleration for RSA-4096. They
use the same RSA engine for both public-key signature and
key agreement.
Our unified Kyber+Dilithium coprocessor performs faster
than the public-key engines of [36] and [37] and, at the same
time, requires only 0.263 mm$^2$ area in a 28nm node. When a
smaller area is required by an application, some of the design
parameters (e.g., number of NTT cores, Keccak’s data-width,
etc.) can be tuned accordingly to meet the area budget at the
cost of speed. The proposed design techniques and architecture
will be useful to replace the classical public-key cryptography
used in conventional cryptoprocessors with post-quantum key
agreement and signature.
V. CONCLUSIONS AND FUTURE WORK
Post-quantum key encapsulation and digital signature algo-
rithms are required for securing communication. In this paper,
we presented a design methodology for efficient and compact
hardware implementation of both post-quantum key encapsu-
lation and digital signature algorithms in a unified cryptopro-
cessor architecture. Following the proposed methodology, we
designed and implemented the first unified cryptoprocessor
architecture KaLi that can perform all the cryptographic
protocol operations of the Dilithium signature and Kyber key
encapsulation algorithms for all the security levels. Architec-
tural optimizations in the data path of the cryptoprocessor
were performed to reduce the cycle count and improve the
clock frequency. Experimental evaluation of KaLi on FPGA
and ASIC platforms showed that KaLi outperforms all the
existing implementations. Therefore, the design of KaLi is a
significant step towards making post-quantum cryptography
compact and agile on hardware platforms. The proposed
design methodology can be customized to meet different
application-specific constraints and requirements.
The hardware implementation presented in this paper is
resistant to timing attacks but does not incorporate any
countermeasure, for example, masking against more powerful
side-channel attacks. Side-channel protection of the unified
cryptoprocessor architecture will require significant research
and is considered future work. There are several works in the
literature on masking Kyber [38], [39]. However, at the time
of writing this paper, the authors are not aware of any reported
masked implementation of the NIST standardized version
of Dilithium. How to design an ‘optimal and unified’ side-
channel protection mechanism for a unified hardware imple-
mentation of Kyber and Dilithium is an interesting topic that
needs to be researched. Furthermore, researching protection
techniques against fault injection-based attacks will be very
important due to the vast deployment of these cryptographic
schemes in various embedded devices.
REFERENCES
[1] P. W. Shor, “Polynomial-time algorithms for prime factorization and
discrete logarithms on a quantum computer,” SIAM J. Comput., vol. 26,
no. 5, pp. 1484–1509, Oct. 1997.
[2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends,
R. Biswas, S. Boixo, and many more, “Quantum supremacy using a
programmable superconducting processor,” Nature, 2019,
https://doi.org/10.1038/s41586-019-1666-5.
[3] M. Gong, S. Wang, C. Zha, M.-C. Chen, H.-L. Huang, Y. Wu, Q. Zhu,
Y. Zhao, S. Li, S. Guo, H. Qian, et al., “Quantum walks on
a programmable two-dimensional 62-qubit superconducting processor,”
Science, vol. 372, no. 6545, pp. 948–952, 2021.
[4] “Post-quantum cryptography - call for proposals,” 2017. [Online].
Available: https://csrc.nist.gov/projects/post-quantum-cryptography
[5] D. Joseph, R. Misoczki, M. Manzano, J. Tricot, F. D. Pinuaga,
O. Lacombe, S. Leichenauer, J. Hidary, P. Venables, and R. Hansen,
“Transitioning organizations to post-quantum cryptography,” Nature,
vol. 605, no. 7909, pp. 237–243, 2022.
[6] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin,
S. Wei, and L. Liu, “A compact and high-performance hardware
architecture for crystals-dilithium,” IACR Trans. Cryptogr. Hardw. Embed.
Syst., vol. 2022, no. 1, pp. 270–295, 2022.
[7] P. Schwabe, R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint,
V. Lyubashevsky, J. M. Schanck, G. Seiler, and D. Stehlé,
“CRYSTALS-KYBER,” Proposal to NIST PQC Standardization, 2021,
https://csrc.nist.gov/Projects/post-quantum-cryptography/round-3-submissions.
[8] S. S. Roy and A. Basso, “High-speed instruction-set coprocessor for
lattice-based key encapsulation mechanism: Saber in hardware,” IACR
Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020, no. 4, pp. 443–466, 2020.
[9] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. R. Choo, “A
software/hardware co-design of crystals-dilithium signature scheme,” ACM
Trans. Reconfigurable Technol. Syst., vol. 14, no. 2, Jun. 2021.
[10] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik,
P. Dzurenda, and P. Dobias, “Implementing crystals-dilithium signature
scheme on fpgas,” in The 16th International Conference on Availability,
Reliability and Security, ser. ARES 2021. New York, NY, USA:
Association for Computing Machinery, 2021.
[11] G. Land, P. Sasdrich, and T. Güneysu, “A hard crystal - implementing
dilithium on reconfigurable hardware,” IACR Cryptol. ePrint Arch., vol.
2021, p. 355, 2021.
[12] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware
implementation of crystals-dilithium,” Crypto. ePrint Arch., Report
2021/1451, 2021.
[13] Y. Huang, M. Huang, Z. Lei, and J. Wu, A pure hardware implemen-
tation of CRYSTALS-KYBER PQC algorithm through resource reuse,
IEICE Electron. Express, vol. 17, no. 17, p. 20200234, 2020.
[14] Y. Xing and S. Li, “A compact hardware implementation of cca-secure
key exchange mechanism CRYSTALS-KYBER on FPGA, IACR Trans.
Cryptogr. Hardw. Embed. Syst., vol. 2021, no. 2, pp. 328–356, 2021.
[15] V. B. Dang, F. Farahmand, M. Andrzejczak, K. Mohajerani, D. T.
Nguyen, and K. Gaj, “Implementation and benchmarking of round
2 candidates in the NIST post-quantum cryptography standardization
process using hardware and software/hardware co-design approaches,
IACR Cryptol. ePrint Arch., p. 795, 2020.
[16] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “High-speed
ntt-based polynomial multiplication accelerator for crystals-kyber post-
quantum cryptography, IACR Cryptol. ePrint Arch., p. 563, 2021.
[17] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “Instruction-
set accelerated implementation of crystals-kyber, IEEE Trans. Circuits
Syst. I Regul. Pap., vol. 68, no. 11, pp. 4648–4659, 2021.
[18] T. Fritzmann, G. Sigl, and J. Sep´
ulveda, “RISQ-V: tightly coupled RISC-
V accelerators for post-quantum cryptography, IACR Trans. Cryptogr.
Hardw. Embed. Syst., vol. 2020, no. 4, pp. 239–280, 2020.
[19] G. Xin, J. Han, T. Yin, Y. Zhou, J. Yang, X. Cheng, and X. Zeng,
“VPQC: A domain-specific vector processor for post-quantum cryptog-
raphy based on RISC-V architecture,” IEEE Trans. Circuits Syst. I Regul.
Pap., vol. 67-I, no. 8, pp. 2672–2684, 2020.
[20] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA
extensions for finite field arithmetic accelerating kyber and newhope
on RISC-V,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020,
no. 3, pp. 219–242, 2020.
[21] L. Botros, M. J. Kannwischer, and P. Schwabe, “Memory-efficient high-
speed implementation of kyber on cortex-m4, in Progress in Cryptology
- AFRICACRYPT 2019, vol. 11627. Springer, 2019, pp. 209–228.
[22] V. B. Dang, K. Mohajerani, and K. Gaj, “High-speed hardware architec-
tures and FPGA benchmarking of crystals-kyber, ntru, and saber,” IACR
Cryptol. ePrint Arch., p. 1508, 2021.
[23] Z. Ni, A. Khalid, D. Kundi, M. O’Neill, and W. Liu, “Efficient pipelining
exploration for A high-performance crystals-kyber accelerator, IACR
Cryptol. ePrint Arch., p. 1093, 2022.
[24] Y. Zhao, R. Xie, G. Xin, and J. Han, “A high-performance domain-
specific processor with matrix extension of RISC-V for module-lwe
applications,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 69, no. 7,
pp. 2871–2884, 2022.
[25] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A
configurable crypto-processor for post-quantum lattice-based protocols
(extended version), IACR Cryptol. ePrint Arch., p. 1140, 2019.
[26] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe,
G. Seiler, and D. Stehl´
e, “CRYSTALS-Dilithium, Proposal to
NIST PQC Standardization, Round3, 2021, https://csrc.nist.gov/Projects/
post-quantum- cryptography/round-3- submissions.
[27] D. Sprenkels, “The Kyber/Dilithium NTT,
https://dsprenkels.com/ntt.html.
[28] M. Scott, “A note on the implementation of the number theoretic
transform,” in Cryptography and Coding - 16th IMA International
Conference, IMACC 2017. Springer, 2017.
[29] V. Lyubashevsky and G. Seiler, “NTTRU: Truly Fast NTRU Using
NTT,” IACR Trans. on CHES, vol. 2019, no. 3, pp. 180–201, May 2019.
[30] F. Yaman, A. C. Mert, E. ¨
Ozt¨
urk, and E. Savas¸, “A hardware accelerator
for polynomial multiplication operation of crystals-kyber pqc scheme,”
in 2021 Design, Automation & Test in Europe Conference & Exhibition
(DATE). IEEE, 2021, pp. 1020–1025.
[31] Aikata, A. C. Mert, D. Jacquemin, A. Das, D. Matthews, S. Ghosh, and
S. S. Roy, “A unified cryptoprocessor for lattice-based signature and
key-exchange,” Cryptology ePrint Archive, Report 2021/1461, 2021.
[32] Y. Xing and S. Li, “A compact hardware implementation of CCA-secure
key exchange mechanism CRYSTALS-Kyber on FPGA,” IACR Transactions on
Cryptographic Hardware and Embedded Systems, pp. 328–356, 2021.
[33] J.-P. D’Anvers, A. Karmakar, S. S. Roy, F. Vercauteren, J. M. B.
Mera, M. V. Beirendonck, and A. Basso, “SABER,” Proposal to
NIST PQC Standardization, Round 3, 2021,
https://csrc.nist.gov/Projects/post-quantum-cryptography/round-3-submissions.
[34] K. Bürstinghaus-Steinbach, C. Krauß, R. Niederhagen, and M. Schneider,
“Post-quantum TLS on embedded systems: Integrating and evaluating
Kyber and SPHINCS+ with mbed TLS,” in ASIA CCS ’20: The
15th ACM Asia Conference on Computer and Communications Security,
2020. ACM, 2020, pp. 841–852.
[35] T. George, J. Li, A. P. Fournaris, R. K. Zhao, A. Sakzad, and R. Ste-
infeld, “Performance evaluation of post-quantum TLS 1.3 on embedded
systems,” IACR Cryptol. ePrint Arch., p. 1553, 2021.
[36] “NXP’s C29x family of crypto coprocessors.” [Online]. Available:
https://www.nxp.com/docs/en/fact-sheet/C29XFAMFS.pdf
[37] “OPTIGA™ TPM SLB 9672 from Infineon.” [Online]. Available:
https://www.infineon.com/cms/en/about-infineon/press/market-news/2022/INFCSS202202-051.html
[38] J. W. Bos, M. Gourjon, J. Renes, T. Schneider, and C. van Vredendaal,
“Masking Kyber: First- and higher-order implementations,” IACR Trans.
Cryptogr. Hardw. Embed. Syst., vol. 2021, pp. 173–214, 2021.
[39] T. Fritzmann, M. Van Beirendonck, D. Basu Roy, P. Karl, T. Schamberger,
I. Verbauwhede, and G. Sigl, “Masked accelerators and instruction
set extensions for post-quantum cryptography,” IACR Transactions
on Cryptographic Hardware and Embedded Systems, vol. 2022, no. 1,
pp. 414–460, Nov. 2021.
Aikata obtained her Bachelor of Technology degree from
IIT Bhilai, India, in 2020 and her master’s degree from Graz
University of Technology, Austria, in 2022. She is currently
a PhD student at the Institute of Applied Information Processing
and Communications, Graz University of Technology. Her
research interests include lattice-based cryptography and
hardware design.
Ahmet Can Mert received his PhD degree in electron-
ics engineering from Sabanci University, Turkey in 2021.
Currently, he is working as a postdoctoral researcher at
the Institute of Applied Information Processing and Com-
munications, Graz University of Technology, Austria. His
research interests include homomorphic encryption, lattice-based
cryptography, and hardware design.
Malik Imran received his bachelor’s and master’s degrees
from Pakistan in 2011 and 2015, respectively. He is now
with the Center for Hardware Security, Tallinn University
of Technology (TalTech), Tallinn, Estonia, as a doctoral student.
Before joining TalTech, Malik contributed to different
research labs for efficient hardware accelerators for intrusion
detection systems and asymmetric cryptography.
Samuel Pagliarini (M’14) received the PhD degree from
Telecom ParisTech, Paris, France, in 2013. He has held
research positions with the University of Bristol, Bristol,
UK, and with Carnegie Mellon University, Pittsburgh, PA,
USA. He is currently a Professor with Tallinn University
of Technology (TalTech) in Tallinn, Estonia, where he leads
the Centre for Hardware Security.
Sujoy Sinha Roy is an Assistant Professor of cryptographic
engineering at IAIK, Graz University of Technology. He is
a co-designer of “Saber,” which is a finalist key encapsulation
mechanism (KEM) candidate in NIST’s Post-Quantum
Cryptography Standardization Project. He is interested in
the implementation aspects of cryptography.
... In [47], the authors present an implementation of Dilithium and Kyber on a Xilinx Ultrascale+ ZCU102 platform. With a performance-optimized strategy using the Vivado 2019.1 tool, they achieve a 270 MHz clock frequency on the FPGA. ...
... As the authors report the latency in µs, we converted the latency into executions per second for comparability and included the results in Fig. 9 in dark red. The hardware implementation presented in [47] outperforms all software-based solutions presented in this work. Comparing Table 1 with Table 2 demonstrates that the implementation of Dilithium [47] yields more executions per second with respect to the implementation of Falcon presented in [32]. ...
Commercially available quantum computers are expected to reshape the world in the near future. They are said to break conventional cryptographic security mechanisms that are deeply embedded in today’s communication. Symmetric cryptography, such as AES, will withstand quantum attacks as long as the key sizes are doubled compared to today’s key lengths. Asymmetric cryptographic procedures, e.g. RSA, however, are broken. It is therefore necessary to change the way we assure our privacy by adopting and moving towards post-quantum cryptography (PQC) principles. In this work, we benchmark three PQC algorithms, Falcon, Dilithium, and Kyber. Moreover, we present an implementation of a PQC stack consisting of the algorithms Dilithium/Kyber and Falcon/Kyber which use hardware accelerators for some key functions and evaluate their performance and resource utilization. Regarding a classic server-client model, the computational load of the Dilithium/Kyber stack is distributed more equally among server and client. The stack Falcon/Kyber biases the computational challenges towards the server, hence relieving the client of performing costly operations. We found that Dilithium’s advantage over Falcon is that Dilithium’s execution is faster while the workload to be performed is distributed equally among client and server, whereas Falcon’s advantage over Dilithium lies within the small signature sizes and the unequally distributed computational tasks. In a client-server model with a performance-limited client (i.e. Internet-of-Things - IoT - environments) Falcon could prove useful, for it constrains the computationally hard tasks to the server and leaves a minimal workload to the client. Furthermore, Falcon requires smaller bandwidth, making it a strong candidate for deep-edge or IoT applications.
... Existing accelerators for Dilithium all employ an iterative NTT architecture, which helps save hardware resources. In studies [23], [24], a pair of butterfly cores are utilized for lightweight Dilithium implementations. [25] proposed a butterfly core architecture using DSPs for all arithmetic operations within the butterfly computation. ...
... Finally, polynomial multiplication based on NWC-NTT can be performed as follows: [47]. In the case of Dilithium with a large q, recent studies often utilize the property $2^{23} \equiv 2^{13} - 1 \pmod{q}$ to design intricate combinational logic circuits for modulus q = 8380417 [23]-[25], [48], [49]. In [26], Pham et al. utilized a variation of the Barrett method combined with the $2^{23} \equiv 2^{13} - 1 \pmod{q}$ property to execute the modular reduction for Dilithium efficiently. ...
... Therefore, we apply the property $2^{23} \equiv 2^{13} - 1 \pmod{q}$ to reduce the higher-order bits of the product. Specifically, we obtain ($2^{23}$ ...
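The congruence quoted in these snippets follows from the Dilithium modulus having the form q = 2^23 - 2^13 + 1. The following is a minimal software sketch of how that property lets the high bits of a product be folded back into the low bits without a division. It is an illustration only, not the circuit of any cited work; the names `reduce_mod_q` and `MASK23` are hypothetical.

```python
Q = 8380417  # Dilithium modulus; note Q == 2**23 - 2**13 + 1
MASK23 = (1 << 23) - 1


def reduce_mod_q(x: int) -> int:
    """Reduce a non-negative integer modulo Q without a division."""
    if x < 0:
        raise ValueError("x must be non-negative")
    # Fold: x = hi * 2**23 + lo is congruent to hi * (2**13 - 1) + lo (mod Q).
    # Each fold strictly shrinks x, so the loop terminates.
    while x >> 23:
        hi, lo = x >> 23, x & MASK23
        x = (hi << 13) - hi + lo
    # Now x < 2**23 < 2 * Q, so one conditional subtraction is enough.
    return x - Q if x >= Q else x


if __name__ == "__main__":
    a, b = 8380416, 4190208  # arbitrary operands below Q
    assert reduce_mod_q(a * b) == (a * b) % Q
    print(reduce_mod_q(a * b))
```

In hardware, the loop would typically be unrolled into a fixed number of shift-and-add stages, since the input width (at most 46 bits for a product of two reduced coefficients) is known in advance.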
The efficiency of polynomial multiplication execution majorly impacts the performance of lattice-based post-quantum cryptosystems. In this research, we propose a high-speed hardware architecture to accelerate polynomial multiplication based on the Number Theoretic Transform (NTT) in CRYSTALS-Kyber and CRYSTALS-Dilithium. We design a Digital Signal Processing (DSP) architecture for modular multiplication in butterfly and Point-Wise Multiplication (PWM) operations. Our method reduces the critical path delay of an n-bit multiplier to that of a (2n-2)-bit adder, optimizing both area and speed. These dedicated DSPs are employed in butterfly and PWM operations, completely eliminating the pre-process and post-process of NTT transforms. Furthermore, we introduce a novel unified pipelined architecture for the NTT and Inverse NTT (INTT) transformations of Kyber and Dilithium, with corresponding high-speed (Radix-2) and ultra high-speed (Radix-4) versions. Lastly, we construct a complete hardware accelerator for polynomial matrix-vector multiplication in Kyber. The Field-Programmable Gate Array (FPGA) implementation results have proven that our designs have significantly improved execution time by 3.4×–9.6× for the NTT transforms in Dilithium and 1.36×–34.16× for Kyber polynomial multiplication, compared to previous studies reported to date. Additionally, the hardware footprint results indicate that our proposed architectures exhibit superior hardware performance in Area-Time-Product (ATP), corresponding to a 44%–96% improvement. The proposed architectures are efficient and well-suited for high-performance lattice-based cryptography systems.
... Some studies aimed at maximizing performance while others focused on minimizing power or area consumption or achieving a balance in area-time tradeoff. As a result, these implementations were broadly categorized as high-performance [9], [10] or lightweight [12], [15], although the distinction is not always clear-cut as some designs may aim for specific tradeoff directions. For instance, a high-efficiency design [11] might primarily focus on achieving high performance while keeping resource utilization at a comparable level. ...
... Lightweight implementations aim to minimize hardware utilization, allowing them to fit on even the smallest FPGA platforms. However, current FPGA platforms support a large amount of hardware resources, as evidenced by the unified architecture for CRYSTALS-Kyber and Dilithium on the UltraScale+ ZCU102 [15]. This cryptoprocessor utilizes only a small portion of the platform's hardware resources, utilizing less than 10% of available resources. ...
... The memory-based or iterative architecture has the advantage of requiring fewer hardware resources, but it is limited by memory port constraints, which restrict the number of butterfly units (BUs) that can be used to enhance NTT speed. This is evident in works of [9], [12], [15], which are limited to using a maximum of 4 BUs. Additionally, the memory-based architecture requires a complex control module to calculate the NTT layer-by-layer, which increases the critical path. ...
The rapid advancement of powerful quantum computers poses a significant security risk to current public-key cryptosystems, which heavily rely on the computational complexity of problems such as discrete logarithms and integer factorization. As a result, CRYSTALS-Dilithium, a lattice-based digital signature scheme with the potential to be an alternative algorithm that can withstand both quantum and classical attacks, has been standardized as ML-DSA after the NIST Post-Quantum Cryptography competition. While prior studies have proposed hardware designs to accelerate this cryptosystem, there is room for further optimization in the tradeoff between performance and hardware consumption. This paper addresses these limitations by presenting an efficient low-latency hardware architecture for ML-DSA, leveraging optimized timing schedules for its three main algorithms. The hardware implementation enables runtime switching between the main operations in ML-DSA at various security levels. We design flexible arithmetic and hash modules tailored for ML-DSA, the most time-consuming submodules and key determinants of the scheme implementation. Combined with efficient operation scheduling to maximize the utilized time of submodules, our design achieves the best latency among FPGA-based implementations, outperforming state-of-the-art works by 1.27~2.58× in terms of the area-time tradeoff metric. Therefore, the proposed hardware architecture demonstrates its practical applicability for digital signature cryptosystems in the post-quantum era.
CRYSTALS-Kyber, as the only public key encryption (PKE) algorithm selected by the National Institute of Standards and Technology (NIST) in the third round, is considered one of the most promising post-quantum cryptography (PQC) schemes. Lattice-based cryptography uses complex discrete algorithm problems on lattices to build secure encryption and decryption systems to resist attacks from quantum computing. Performance is an important bottleneck affecting the promotion of post-quantum cryptography. In this paper, we present a High-performance Implementation of Kyber (named HI-Kyber) on NVIDIA GPUs, which can increase the key-exchange performance of Kyber to the million-level. Firstly, we propose a lattice-based PQC implementation architecture based on kernel fusion, which can avoid redundant global-memory access operations. Secondly, we optimize and implement the core operations of CRYSTALS-Kyber, including Number Theoretic Transform (NTT), inverse NTT (INTT), pointwise multiplication, etc. Especially for the calculation bottleneck NTT operation, three novel methods are proposed to explore extreme performance: the sliced layer merging (SLM), the sliced depth-first search (SDFS-NTT) and the entire depth-first search (EDFS-NTT), which achieve a speedup of 7.5%, 28.5%, and 41.6% compared to the native implementation. Thirdly, we conduct comprehensive performance experiments with different parallel dimensions based on the above optimization. Finally, our key exchange performance reaches 1,664 kops/s. Specifically, based on the same platform, our HI-Kyber is 3.52× that of the GPU implementation based on the same instruction set and 1.78× that of the state-of-the-art one based on AI-accelerated tensor core.
Lattice-Based Cryptography (LBC) schemes, like CRYSTALS-Kyber and CRYSTALS-Dilithium, have been selected to be standardized in the NIST Post-Quantum Cryptography standard. However, implementing these schemes in resource-constrained Internet-of-Things (IoT) devices is challenging, considering efficiency, power consumption, area overhead, and flexibility to support various operations and parameter settings. Some existing ASIC designs that prioritize lower power and area cannot achieve optimal performance efficiency, which makes them impractical for battery-powered devices. Custom hardware accelerators in prior co-processor and processor designs have limited applications and flexibility, incurring significant area and power overheads for IoT devices. To address these challenges, this paper presents an efficient lattice-based cryptography processor with customized Single-Instruction-Multiple-Data (SIMD) instruction. First, our proposed SIMD architecture supports efficient parallel execution of various polynomial operations in 256-bit mode and acceleration of Keccak in 320-bit mode, both utilizing efficiently reused resources. Additionally, we introduce data shuffling hardware units to resolve data dependencies within SIMD data. To further enhance performance, we design a dual-issue path for memory accesses and corresponding software design methodologies to reduce the impact of data load/store blocking. Through a hardware/software co-design approach, our proposed processor achieves high efficiency in supporting all operations in lattice-based cryptography schemes. Evaluations of Kyber and Dilithium show our proposed processor achieves over 10x speedup compared with the baseline RISC-V processor and over 5x speedup versus ARM Cortex M4 implementations, making it a promising solution for securing IoT communications and storage. Moreover, silicon synthesis results show our design can run at 200 MHz with 2.01 mW for Kyber KEM 512 and 2.13 mW for Dilithium 2, which outperforms state-of-the-art works in terms of PPAP (Performance x Power x Area).
This work presents the first hardware realisation of the Syndrome-Decoding-in-the-Head (SDitH) signature scheme, which is a candidate in the NIST PQC process for standardising post-quantum secure digital signature schemes. SDitH’s hardness is based on conservative code-based assumptions, and it uses the Multi-Party-Computation-in-the-Head (MPCitH) construction. This is the first hardware design of a code-based signature scheme based on traditional decoding problems and only the second for MPCitH constructions, after Picnic. This work presents optimised designs to achieve the best area efficiency, which we evaluate using the Time-Area Product (TAP) metric. This work also proposes a novel hardware architecture by dividing the signature generation algorithm into two phases, namely offline and online phases for optimising the overall clock cycle count. The hardware designs for key generation, signature generation, and signature verification are parameterised for all SDitH parameters, including the NIST security levels, both syndrome decoding base fields (GF256 and GF251), and thus conform to the SDitH specifications. The hardware design further supports secret share splitting, and the hypercube optimisation which can be applied in this and multiple other NIST PQC candidates. This work results in a hardware design with a drastic reduction in clock cycles compared to the optimised AVX2 software implementation, in the range of 2-4x for most operations. Our key generation outperforms software drastically, giving a 11-17x reduction in runtime, despite the significantly faster clock speed. On Artix 7 FPGAs we can perform key generation in 55.1 Kcycles, signature generation in 6.7 Mcycles, and signature verification in 8.6 Mcycles for NIST L1 parameters, which increase for GF251, and for L3 and L5 parameters.
This paper explores the challenges and potential solutions of implementing the recommended upcoming post-quantum cryptography standards (the CRYSTALS-Dilithium and CRYSTALS-Kyber algorithms) on resource constrained devices. The high computational cost of polynomial operations, fundamental to cryptography based on ideal lattices, presents significant challenges in an efficient implementation. This paper proposes a hardware/software co-design strategy using RISC-V extensions to optimize resource utilization and speed up the number-theoretic transformations (NTTs). The primary contributions include a lightweight custom arithmetic logic unit (ALU), integrated into a 4-stage pipeline 32-bit RISC-V processor. This ALU is tailored towards the NTT computations and supports modular arithmetic as well as NTT butterfly operations. Furthermore, an extension to the RISC-V instruction set is introduced, with ten new instructions accessing the custom ALU to perform the necessary operations. The new instructions reduce the cycle count of the Kyber and Dilithium NTTs by more than 80% compared to optimized assembly, while being more lightweight than other works that exist in the literature.
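As an illustration of the butterfly operations such an NTT-oriented ALU computes, the sketch below shows the two standard radix-2 butterflies in plain Python. It is a generic textbook formulation under the Kyber modulus, not the instruction semantics of the cited design; the function names are hypothetical, and `pow(w, -1, Q)` assumes Python 3.8 or later.

```python
Q = 3329  # Kyber modulus


def ct_butterfly(a: int, b: int, w: int, q: int = Q) -> tuple[int, int]:
    """Cooley-Tukey butterfly: (a, b) -> (a + w*b, a - w*b) mod q."""
    t = (w * b) % q
    return (a + t) % q, (a - t) % q


def gs_butterfly(a: int, b: int, w: int, q: int = Q) -> tuple[int, int]:
    """Gentleman-Sande butterfly: (a, b) -> (a + b, (a - b)*w) mod q."""
    return (a + b) % q, ((a - b) * w) % q


if __name__ == "__main__":
    w = 17                 # root of unity used in the Kyber specification
    w_inv = pow(w, -1, Q)  # modular inverse of the twiddle factor
    u, v = ct_butterfly(1234, 567, w)
    x, y = gs_butterfly(u, v, w_inv)
    # The GS butterfly with the inverse twiddle undoes the CT butterfly
    # up to a factor of 2, which an inverse NTT removes in its final scaling.
    assert (x, y) == ((2 * 1234) % Q, (2 * 567) % Q)
```

Any invertible twiddle factor works for this round-trip check; a full NTT applies these butterflies layer by layer with a schedule of powers of the root.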
In the final phase of the post-quantum cryptography standardization effort, the focus has been extended to include the side-channel resistance of the candidates. While some schemes have been already extensively analyzed in this regard, there is no such study yet of the finalist Kyber. In this work, we demonstrate the first completely masked implementation of Kyber which is protected against first- and higher-order attacks. To the best of our knowledge, this results in the first higher-order masked implementation of any post-quantum secure key encapsulation mechanism algorithm. This is realized by introducing two new techniques. First, we propose a higher-order algorithm for the one-bit compression operation. This is based on a masked bit-sliced binary-search that can be applied to prime moduli. Second, we propose a technique which enables one to compare uncompressed masked polynomials with compressed public polynomials. This avoids the costly masking of the ciphertext compression while being able to be instantiated at arbitrary orders. We show performance results for first-, second- and third-order protected implementations on the Arm Cortex-M0+ and Cortex-M4F. Notably, our implementation of first-order masked Kyber decapsulation requires 3.1 million cycles on the Cortex-M4F. This is a factor 3.5 overhead compared to the unprotected optimized implementation in pqm4. We experimentally show that the first-order implementation of our new modules on the Cortex-M0+ is hardened against attacks using 100 000 traces and mechanically verify the security in a fine-grained leakage model using the verification tool scVerif.
Side-channel attacks can break mathematically secure cryptographic systems leading to a major concern in applied cryptography. While the cryptanalysis and security evaluation of Post-Quantum Cryptography (PQC) have already received an increasing research effort, a cost analysis of efficient side-channel countermeasures is still lacking. In this work, we propose a masked HW/SW codesign of the NIST PQC finalists Kyber and Saber, suitable for their different characteristics. Among others, we present a novel masked ciphertext compression algorithm for non-power-of-two moduli. To accelerate linear performance bottlenecks, we developed a generic Number Theoretic Transform (NTT) multiplier, which, in contrast to previously published accelerators, is also efficient and suitable for schemes not based on NTT. For the critical non-linear operations, masked HW accelerators were developed, allowing a secure execution using RISC-V instruction set extensions. With the proposed design, we achieved a cycle count of K:214k/E:298k/D:313k for Kyber and K:233k/E:312k/D:351k for Saber with NIST Level III parameter sets. For the same parameter sets, the masking overhead for the first-order secure decapsulation operation including randomness generation is a factor of 4.48 for Kyber (D:1403k) and 2.60 for Saber (D:915k).
CRYSTALS-Kyber (Kyber) was recently chosen as the first quantum-resistant Key Encapsulation Mechanism (KEM) scheme for standardisation, after three rounds of the PQC competition initiated by the National Institute of Standards and Technology (NIST), which began in 2016 in search of the best quantum-resistant KEMs and digital signatures. Kyber is based on the Module-Learning with Errors (M-LWE) class of Lattice-based Cryptography, which is known to map efficiently onto FPGAs. This work explores several architectural optimizations and proposes a high-performance and area-time (AT) product efficient hardware accelerator for Kyber. The proposed architectural optimizations include inter-module and intra-module pipelining, which are designed and balanced via FIFO-based buffering to ensure maximum parallelisation. The implementation results show that compared to state-of-the-art designs, the proposed architecture delivers 25-51% speedups for Kyber’s three different security levels on Artix-7 and Zynq UltraScale+ devices, and a 50-75% reduction in DSPs at comparable security level. Consequently, the proposed design achieves higher AT product efficiencies of 19-33%.
Transport Layer Security (TLS) constitutes one of the most widely used protocols for securing Internet communications and has also found broad acceptance in the Internet of Things (IoT) domain. As we progress toward a security environment resistant to quantum computer attacks, TLS needs to be transformed to support post-quantum cryptography. However, post-quantum TLS is still not standardised, and its overall performance, especially in resource-constrained, IoT-capable, embedded devices, is not well understood. In this paper, we showcase how TLS 1.3 can be transformed into quantum-safe by modifying the TLS 1.3 architecture in order to accommodate the latest Post-Quantum Cryptography (PQC) algorithms from NIST PQC process. Furthermore, we evaluate the execution time, memory, and bandwidth requirements of this proposed post-quantum variant of TLS 1.3 (PQ TLS 1.3). This is facilitated by integrating the pqm4 and PQClean library implementations of almost all PQC algorithms selected for standardisation by the NIST PQC process, as well as the alternatives to be evaluated in a new round (Round 4). The proposed solution and evaluation focuses on the lower end of resource-constrained embedded devices. Thus, the evaluation is performed on the ARM Cortex-M4 embedded platform NUCLEO-F439ZI that provides 180 MHz clock rate, 2 MB Flash Memory, and 256 KB SRAM. To the authors’ knowledge, this is the first systematic, thorough, and complete timing, memory usage, and network traffic evaluation of PQ TLS 1.3 for all the NIST PQC process selections and upcoming candidate algorithms, that explicitly targets resource-constrained embedded systems.
Post-Quantum Cryptography (PQC) has emerged as a response of the cryptographic community to the danger of attacks performed using quantum computers. All PQC schemes can be implemented in software and hardware using conventional (non-quantum) computing systems. PQC is the biggest revolution in cryptography since the invention of public-key schemes in the mid-1970s. Lattice-based key exchange schemes have emerged as leading candidates in the NIST PQC standardization process due to their relatively short public keys and ciphertexts. This paper presents novel high-speed hardware architectures for four lattice-based Key Encapsulation Mechanisms (KEMs) representing three NIST PQC finalists: NTRU (with two distinct variants, NTRU-HPS and NTRU-HRSS), CRYSTALS-Kyber, and Saber. We benchmark these candidates in terms of their performance and resource utilization in today's FPGAs. Our best architectures outperform the best designs from other groups reported to date in terms of the area-time product by factors ranging from 1.01 to 2.88, depending on the algorithm and security level. Additionally, our study demonstrates that CRYSTALS-Kyber and Saber have very similar hardware performance. Both outperform NTRU in terms of execution time by a factor 36-62 for key generation and 3-7 for decapsulation, assuming the same security level.
We propose design methodologies for building a compact, unified and programmable cryptoprocessor architecture that computes post-quantum key agreement and digital signature. Synergies in the two types of cryptographic primitives are used to make the cryptoprocessor compact. As a case study, the cryptoprocessor architecture has been optimized targeting the signature scheme ‘CRYSTALS-Dilithium’ and the key encapsulation mechanism (KEM) ‘Saber’, both finalists in the NIST's post-quantum cryptography standardization project. The programmable cryptoprocessor executes key generations, encapsulations, decapsulations, signature generations, and signature verifications for all the security levels of Dilithium and Saber. On a Xilinx Ultrascale+ FPGA, the proposed cryptoprocessor consumes 18,406 LUTs, 9,323 FFs, 4 DSPs, and 24 BRAMs. It achieves 200 MHz clock frequency and finishes CCA-secure key-generation/encapsulation/decapsulation operations for LightSaber in 29.6/40.4/ 58.3 $\mu$ s; for Saber in 54.9/69.7/94.9 $\mu$ s; and for FireSaber in 87.6/108.0/139.4 $\mu$ s, respectively. It finishes key-generation/sign/verify operations for Dilithium-2 in 70.9/151.6/75.2 $\mu$ s; for Dilithium-3 in 114.7/237/127.6 $\mu$ s; and for Dilithium-5 in 194.2/342.1/228.9 $\mu$ s, respectively, for the best-case scenario. On UMC 65nm library for ASIC the latency is improved by a factor of two due to a 2× increase in clock frequency.
Quantum computers are expected to break modern public key cryptography owing to Shor’s algorithm. As a result, these cryptosystems need to be replaced by quantum-resistant algorithms, also known as post-quantum cryptography (PQC) algorithms. The PQC research field has flourished over the past two decades, leading to the creation of a large variety of algorithms that are expected to be resistant to quantum attacks. These PQC algorithms are being selected and standardized by several standardization bodies. However, even with the guidance from these important efforts, the danger is not gone: there are billions of old and new devices that need to transition to the PQC suite of algorithms, leading to a multidecade transition process that has to account for aspects such as security, algorithm performance, ease of secure implementation, compliance and more. Here we present an organizational perspective of the PQC transition. We discuss transition timelines, leading strategies to protect systems against quantum attacks, and approaches for combining pre-quantum cryptography with PQC to minimize transition risks. We suggest standards to start experimenting with now and provide a series of other recommendations to allow organizations to achieve a smooth and timely PQC transition. Standards and recommendations for transitioning organizations to quantum-secure cryptographic protocols are outlined, including a discussion of transition timelines and the leading strategies to protect systems against quantum attacks.
The 5G edge computing infrastructure should be empowered with quantum attack resistance by implementing post-quantum cryptography (PQC). Among various PQC schemes, lattice-based cryptography (LBC) based on learning with error (LWE) has attracted much attention because of its performance efficiency and security guarantee. In LWE-based LBCs, the Module-LWE-based schemes gain advantage over the others benefiting from the unique polynomial matrix and vector structure. To provide a high-performance implementation of Module-LWE applications for the edge computing paradigm, we propose a domain-specific processor based on a matrix extension of RISC-V architecture. This custom extension encapsulates the matrix-based ring operations with a high-level functional abstraction. A 2-D systolic array with configurable functionality is proposed to perform matrix-based number theoretic transform (NTT) and other arithmetic operations, achieving high data-level parallelism with support for the variable-sized polynomial matrix and vector structure. As this structure of Module-LWE involves no data dependency between different inner elements, an out-of-order mechanism is further developed to exploit the instruction-level parallelism. We implement the proposed architecture under TSMC 28nm technology. The evaluation results show that our implementation can achieve up to $3.5\times $ and $3.3\times $ improvement in cycle count respectively in Kyber and Dilithium, compared to the state-of-the-art crypto-processor counterparts.
CRYSTALS-Dilithium as a lattice-based digital signature scheme has been selected as a finalist in the Post-Quantum Cryptography (PQC) standardization process of NIST. As part of this selection, a variety of software implementations have been evaluated regarding their performance and memory requirements for platforms like x86 or ARM Cortex-M4. In this work, we present a first set of Field-Programmable Gate Array (FPGA) implementations for the low-end Xilinx Artix-7 platform, evaluating the peculiarities of the scheme in hardware, reflecting all available round-3 parameter sets. As a key component in our analysis, we present results for a specifically adapted Number-Theoretic Transform (NTT) core for the Dilithium cryptosystem, optimizing this component for an optimal Look-Up Table (LUT) and Flip-Flop (FF) utilization by efficient use of special purpose Digital Signal Processors (DSPs). Presenting our results, we aim to shed further light on the performance of lattice-based cryptography in low-cost and high-throughput configurations and their respective potential use-cases in practice. Keywords: FPGA, Dilithium, PQC