
KaLi: A Crystal for Post-Quantum Security Using Kyber and Dilithium

January 2022
IEEE Transactions on Circuits and Systems I Regular Papers PP(99):1-12


DOI:10.1109/TCSI.2022.3219555

Authors:

Aikata ..

Graz University of Technology

Ahmet Can Mert

Graz University of Technology

Malik Imran

Tallinn University of Technology

Samuel Pagliarini

Carnegie Mellon University


Quantum computers pose a threat to the security of communications over the internet. This imminent risk has led to the standardization of cryptographic schemes for protection in a post-quantum scenario. We present a design methodology for future implementations of such algorithms. This is manifested using the NIST selected digital signature scheme CRYSTALS-Dilithium and key encapsulation scheme CRYSTALS-Kyber. A unified architecture, KaLi, is proposed that can perform key generation, encapsulation, decapsulation, signature generation, and signature verification for all the security levels of CRYSTALS-Dilithium and CRYSTALS-Kyber. A unified yet flexible polynomial arithmetic unit is designed that can process Kyber operations twice as fast as Dilithium operations. Efficient memory management is proposed to achieve optimal latency. KaLi is explicitly tailored for ASIC platforms using multiple clock domains. On ASIC 28nm/65nm technology, it occupies 0.263/1.107 mm$^2$ and achieves a clock frequency of 2GHz/560MHz for the fast clock used for the memory unit. On Xilinx Zynq UltraScale+ ZCU102 FPGA, the proposed architecture uses 23,277 LUTs, 9,758 DFFs, 4 DSPs, and 24 BRAMs, at 270 MHz clock frequency. KaLi performs better than the standalone implementations of either of the two schemes. This is the first work to provide a unified design in hardware for both schemes.


KaLi: A Crystal for Post-Quantum Security using

Kyber and Dilithium

Aikata Aikata, Ahmet Can Mert, Malik Imran, Samuel Pagliarini, Sujoy Sinha Roy

Abstract—Quantum computers pose a threat to the security of

communications over the internet. This imminent risk has led to

the standardization of cryptographic schemes for protection in

a post-quantum scenario. We present a design methodology for

future implementations of such algorithms. This is manifested

using the NIST selected digital signature scheme CRYSTALS-

Dilithium and key encapsulation scheme CRYSTALS-Kyber.

A uniﬁed architecture, KaLi, is proposed that can perform

key generation, encapsulation, decapsulation, signature gener-

ation, and signature veriﬁcation for all the security levels of

CRYSTALS-Dilithium, and CRYSTALS-Kyber. A uniﬁed yet

flexible polynomial arithmetic unit is designed that can process

Kyber operations twice as fast as Dilithium operations. Efﬁcient

memory management is proposed to achieve optimal latency.

KaLi is explicitly tailored for ASIC platforms using multiple clock domains. On ASIC 28nm/65nm technology, it occupies 0.263/1.107 mm$^2$ and achieves a clock frequency of 2GHz/560MHz for the fast clock used for the memory unit. On Xilinx Zynq UltraScale+ ZCU102 FPGA, the proposed architecture uses

23,277 LUTs, 9,758 DFFs, 4 DSPs, and 24 BRAMs, at 270

MHz clock frequency. KaLi performs better than the standalone

implementations of either of the two schemes. This is the ﬁrst

work to provide a uniﬁed design in hardware for both schemes.

Index Terms—CRYSTALS-Dilithium, CRYSTALS-Kyber,

Cryptoprocessor, NIST PQC Standardized

I. INTRODUCTION

COMMUNICATION over the internet forms the backbone

of the digitalized world. Every communication packet

passes through various insecure channels and untrusted servers

before reaching the destination. Data and communication leaks

in the past led to the development of public key cryptographic

(PKC) schemes to ensure end-to-end security and privacy of

communication. These schemes use the hard problems of the

discrete logarithm, integer factorization, etc. In 1994, Peter

Shor [1] proposed an algorithm that can help a powerful

quantum computer solve them in polynomial (realistic) time,

thus breaking the classical PKC schemes. Since then, the past eighteen years have witnessed a giant leap in the development of quantum computers. In 2019, Google claimed quantum supremacy by developing a 53-qubit quantum computer, Sycamore [2]. Sycamore could solve a task in 200 seconds which would take a classical computer 10,000 years. Various labs across the world have developed even stronger quantum computers [3]. This raises the existential question of whether our communication packets containing emails, passwords, etc., are already insecure. The answer to this is yes. Even though quantum computers built until now are not strong enough to break classical public key cryptography, emails and passwords sent now can be stored and decrypted later.

Aikata Aikata, Ahmet Can Mert, and Sujoy Sinha Roy are affiliated with the Institute of Applied Information Processing and Communications, Graz University of Technology, Graz, Austria. Their work was supported in part by Semiconductor Research Corporation (SRC) and the State Government of Styria, Austria – Department Zukunftsfonds Steiermark. {aikata, ahmet.mert, sujoy.sinharoy}@iaik.tugraz.at

Malik Imran and Samuel Pagliarini are with the Centre for Hardware Security, Tallinn University of Technology, Tallinn, Estonia. Their work has been partially conducted in the project “ICT programme” which was supported by the European Union through the European Social Fund. It was also partially supported by European Union’s Horizon 2020 research and innovation programme under grant agreement No 952252 (SAFEST). {malik.imran, samuel.pagliarini}@taltech.ee

This inevitable breach of security paved the way for the development of post-quantum secure PKC schemes based on hard problems that remain hard in a post-quantum scenario. Several standardization efforts were launched to select the best PKC candidates for digital signature and key-encapsulation algorithms [4]. Key encapsulation schemes allow the communicating parties to agree on the same key securely, which can then be used for symmetric key-based encryption-decryption of messages, thus ensuring the security and privacy of the communications. A digital signature scheme allows the

receiver to verify the authenticity of the messages. Both these

schemes will replace the classical PKC schemes in various

applications, like the TLS networking protocol.

These standardizations have now concluded, and the in-

dustry is now starting to gear up toward implementing stan-

dardized candidates. After ﬁnalizing the implementations, a

transition phase will start for all the devices to switch from

classical to post-quantum secure PKC schemes [5]. This tran-

sition will not only take years but also lead to a large amount

of wastage in terms of chips and hardware resources which

are now obsolete. However, now that we know that change is

inevitable, and what we believe to be secure now might again

be broken in the next 10-20 years, there is an urgent need for

a design methodology for future implementations to prevent

loss of time and hardware resources.

This work proposes a design methodology that covers

three vital aspects for the future implementations of the PKC

algorithms. The ﬁrst is the need to make a uniﬁed design. A

majority of PKC applications require both digital signature and

key encapsulation schemes. Therefore, the design decisions

should be adapted to help unify the two algorithms for saving

area via resource sharing. Secondly, the design must be com-

pact. The new PKC schemes require much larger memory and

logical units to store and process the keys. If we do not attempt

to make these designs compact, a lot of resource-constrained

CPUs that were designed for classical PKC schemes will be

rendered inoperable. The ﬁnal aspect is agility/ﬂexibility. The

architecture design should consider the ever-evolving nature of

these algorithms. It will not only help prevent the huge wastage

of hardware resources but also enable a smooth transition.

To exhibit the applicative advantages of this design method-

ology, we take the NIST ﬁnalized lattice-based digital signa-

ture scheme CRYSTALS-Dilithium [6] and key-encapsulation

scheme CRYSTALS-Kyber [7]. We unify the extensive building blocks of these schemes and call the resulting architecture of the two CRYSTALS schemes KaLi. The design choices

for KaLi favor reducing area over improving performance. As

a step toward agility, KaLi is modeled as an instruction-set

cryptoprocessor. From here on, we will refer to CRYSTALS-

Dilithium as Dilithium, and CRYSTALS-Kyber as Kyber.

Prior works in literature propose the hardware implemen-

tation of PKC schemes. Most of them focus on standalone

efﬁcient implementations [8]–[23]. The real-life applications

would require both the types of schemes. Therefore, these

works fail to provide complete area and timing results for the

implementations that make the communication post-quantum

secure. The authors in [24], [25] present hardware/software

(HW/SW) co-designs for Dilithium and Kyber. Since they

keep some part of the design in software, it is not sufﬁcient to

provide a good estimate for hardware-only architectures. There

is a need for a uniﬁed implementation of these two types of

schemes completely in hardware to get better performance. We

show how KaLi follows the proposed design methodology

and performs better than the state-of-the-art.

Our contributions can be summarized as follows:

1) Polynomial multiplication is the most computation-

intensive operation in the Dilithium signature scheme

and Kyber encapsulation scheme. We propose a compact

polynomial multiplier architecture that works optimally

for the two cryptographic algorithms. Dilithium has

a 23-bit prime modulus, whereas Kyber has a 12-bit

prime modulus. A uniﬁed polynomial arithmetic unit is

designed for both, Dilithium and Kyber, to save time

and area. This unit has a 24-bit datapath. The core oper-

ations: addition, subtraction, multiplication, and modular

reduction, are made ﬂexible to either process two sets of

12-bit Kyber coefﬁcients or one set of 23-bit Dilithium

coefﬁcients. This, in combination with efﬁcient memory

management, enables performing arithmetic operations

for Kyber twice as fast as Dilithium.

2) We customized the Keccak-based SHA-SHAKE and

pseudo-random number generation unit to make an ef-

ﬁcient sampling unit for both Dilithium and Kyber. The

samplers for both schemes are uniﬁed and added into the

Keccak block to prevent redundant writing and reading

of pseudo-random numbers. The remaining primitive

building blocks of Dilithium and Kyber are designed dis-

cretely while ensuring low area consumption, simplicity

and ﬂexibility. The proposed arithmetic units altogether

form the uniﬁed cryptoprocessor KaLi. It can perform

key generation, encapsulation, decapsulation, signature

generation, and signature veriﬁcation operations for all

security levels of Dilithium and Kyber. This is the

ﬁrst work that implements a uniﬁed cryptoprocessor for

Kyber and Dilithium solely in hardware.

3) We propose an instruction set architecture for ﬂexibility.

The instructions are divided into two sets, and KaLi

can run instructions from these two sets in parallel,

thus improving the latency, while keeping the area

consumption low. This leads to a 35% reduction in run-

time.

4) KaLi is engineered separately for the ASIC platform to

reduce area overhead. It uses two clock domains, where

the memory unit works at a higher clock frequency

than the logic unit. This allowed us to use single port

memory instead of dual port memory used in FPGA

implementation, thus reducing the area consumption.

The paper is organized as follows. Section II provides

a high-level overview of Dilithium and Kyber. The major

contributions of the paper are described in Section III. It

includes the design methodology for implementing the PQC

schemes and implementation details. In Section IV, we give

the results and compare them with the existing works in

the literature, and add benchmarking estimates. Section V

concludes our paper.

II. PRELIMINARIES

Kyber and Dilithium are part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS) and were recently selected for standardization by the American National Institute of Standards and Technology (NIST). Kyber’s security relies on the hardness of solving learning-with-errors in module lattices (MLWE), while Dilithium’s security is based on the MLWE and Short Integer Solution (SIS) problems. The polynomials and algebraic operations are assumed to be over the polynomial ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$. For Dilithium, $n = 256$ and $q = 8380417 = 2^{23} - 2^{13} + 1$, and for Kyber, $n = 256$ and $q = 3329 = 2^{12} - 3 \cdot 2^{8} + 1$. Next, we give a brief overview of these schemes and their building blocks.
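As a quick sanity check of the parameters above, the following minimal Python sketch verifies the stated structure of both primes and tests whether $q - 1$ is divisible by $2n$ (needed for a $2n$-th root of unity) or only by $n$. The helper name `has_root_of_unity` is ours and purely illustrative.

```python
# Parameters as stated in the text.
N = 256
DILITHIUM_Q = 8380417
KYBER_Q = 3329

# Both primes have the structure given in the text.
assert DILITHIUM_Q == 2**23 - 2**13 + 1
assert KYBER_Q == 2**12 - 3 * 2**8 + 1


def has_root_of_unity(q: int, order: int) -> bool:
    """For a prime q, a primitive root of unity of the given order
    exists in Z_q exactly when order divides q - 1."""
    return (q - 1) % order == 0


# Dilithium's prime admits a 2n-th root of unity; Kyber's admits
# only an n-th root of unity.
print(has_root_of_unity(DILITHIUM_Q, 2 * N))  # True
print(has_root_of_unity(KYBER_Q, N))          # True
print(has_root_of_unity(KYBER_Q, 2 * N))      # False
```

This difference in divisibility is why the two schemes end up using different NTT variants.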

A. Dilithium

This digital signature scheme has three main algorithms: key

generation, signature generation, and signature veriﬁcation.

The sender generates a public and secret key using the key

generation algorithm. Then he uses his private key to sign

a message using the signature generation algorithm. The

receiver can verify the signature using the sender’s public key

and signature veriﬁcation algorithm. The signature generation

algorithm continues to generate a signature until a valid

signature is generated. For a signature to be valid, a set of

constraints have to be satisﬁed to ensure that the signature does

not bear any similarity with the message. Readers may refer

to [26] for the original speciﬁcation of Dilithium. Dilithium

has three variants for NIST security levels 2, 3, and 5. Several

building blocks used by these algorithms are explained below.

•Polynomial generation: SHAKE-128 is used to generate the polynomials of the public matrix $\mathbf{A} \in R_q^{k \times \ell}$ by expanding the seed $\rho \in \{0,1\}^{256}$ along with 16-bit nonce values. The secret polynomial vectors $(\mathbf{s}_1, \mathbf{s}_2) \in S_\eta^{\ell} \times S_\eta^{k}$ are generated using SHAKE-256. For each polynomial, the seed $\varsigma$ and a 16-bit nonce are fed to SHAKE-256 and passed through rejection sampling in the range $\{-\eta, \dots, \eta\}$. The two types of generation are named ExpandA() and ExpandS(). ExpandMask() is used to generate a polynomial vector in the range $[0, 2\gamma_1 - 1]$ using a rejection sampler. SampleInBall() is used during signature generation and verification to generate a polynomial with only $\tau$ coefficients set to $+1$ or $-1$ and the remaining coefficients set to 0.

•Polynomial Arithmetic: Polynomial multiplications are

performed using the Number Theoretic Transform (NTT)

method. The addition and subtraction operations are

coefﬁcient-wise linear operations.

•Hash functions: SHAKE-256 is used to make a collision

resistant hash function- CRH().

•Power2Round: The function Power2Round$_q$() takes an element $r = r_1 \cdot 2^d + r_0$ and returns $r_0$ and $r_1$, where $r_0 = r \bmod^{\pm} 2^d$ and $r_1 = (r - r_0)/2^d$.

•Decompose and other related functions: Let $\alpha$ be a divisor of $q - 1$. The function Decompose$_q$() is defined in the same way as Power2Round$_q$(), with $\alpha$ replacing $2^d$. HighBits$_q$()/LowBits$_q$() return $r_1$/$r_0$ from the output of Decompose$_q$(). MakeHint$_q$() uses HighBits$_q$() to produce a hint $\mathbf{h}$. UseHint$_q$() uses the hint $\mathbf{h}$ produced by MakeHint$_q$() to recover the high bits.
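The Power2Round definition above can be illustrated with a small Python sketch. It follows only the formula in the text; the function names and the example value $d = 13$ are our assumptions for illustration, and the sketch does not model Decompose.

```python
Q = 8380417  # Dilithium modulus from the text


def centered_mod(r: int, m: int) -> int:
    """Centered remainder r mod± m, in the interval (-m/2, m/2]."""
    r0 = r % m
    if r0 > m // 2:
        r0 -= m
    return r0


def power2round(r: int, d: int) -> tuple[int, int]:
    """Split r into (r1, r0) with r = r1 * 2^d + r0, where r0 = r mod± 2^d."""
    r = r % Q
    r0 = centered_mod(r, 1 << d)
    r1 = (r - r0) >> d  # exact division: r - r0 is a multiple of 2^d
    return r1, r0


# Example with d = 13 (an illustrative choice of d).
r1, r0 = power2round(1234567, 13)
assert r1 * (1 << 13) + r0 == 1234567
```

The centered remainder keeps $r_0$ small in absolute value.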

B. Kyber

Kyber is an IND-CCA2-secure key encapsulation scheme. It

has three principal algorithms: key generation, encapsulation,

and decapsulation. The receiver generates a public and secret

key using the key generation algorithm and broadcasts the

public key. When the sender wishes to send a message, he/she can encapsulate it using the receiver’s public key through the encapsulation algorithm. The receiver can then decapsulate it using her/his secret key through the decapsulation algorithm.

Three variants of Kyber, Kyber-512, Kyber-768, and Kyber-

1024 are provided for NIST Security levels 1, 3, and 5,

respectively. The variants differ in module dimensions and

coefﬁcient distributions. Readers may refer to [7] for the

detailed speciﬁcations of Kyber. Kyber has the following

internal routines:

•Pseudorandom functions: Kyber uses PRF (SHAKE-

256) and XOF (SHAKE-128) to generate the pseudo-

random numbers for polynomial coefﬁcients.

•Hash functions: Kyber provides functions $H$ and $G$ for SHA3-256 and SHA3-512, respectively, for hashing.

•Key-derivation function (KDF): It is instantiated using

SHAKE-256 in Kyber.

•Polynomial Arithmetic: Kyber uses a new NTT-based polynomial multiplication method. Polynomial addition and subtraction are also supported.

•Samplers: Uniform sampling (Parse) is used to generate

the public polynomials, and Binomial sampling (CBD) is

used to generate secret and error polynomials.

•Encode/Decode: These modules are used to serial-

ize/deserialize the polynomials to/from byte arrays.

•Compress/Decompress: They are used to reduce the size of the ciphertext by discarding low-order bits. They are defined on an element $x \in \mathbb{Z}_q$ as $\lceil (2^d/q) \cdot x \rfloor \bmod 2^d$ and $\lceil (q/2^d) \cdot x \rfloor$, respectively, where $d < \lceil \log_2(q) \rceil$. The value $x'$ such that $x' = \mathrm{Decompress}(\mathrm{Compress}(x, d), d)$ is an element close to $x$.
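The Compress/Decompress formulas above can be sketched with integer-only arithmetic as follows. This is a software illustration of the definitions, not the hardware unit. The function names are ours, and rounding to nearest with ties rounded up is assumed for $\lceil \cdot \rfloor$.

```python
Q = 3329  # Kyber modulus from the text


def compress(x: int, d: int) -> int:
    """round((2^d / q) * x) mod 2^d, using integers only."""
    return (((x << d) + Q // 2) // Q) % (1 << d)


def decompress(y: int, d: int) -> int:
    """round((q / 2^d) * y), using integers only."""
    return (y * Q + (1 << (d - 1))) >> d


def centered_distance(a: int, b: int) -> int:
    """Absolute value of (a - b) reduced to the centered range mod q."""
    diff = (a - b) % Q
    return min(diff, Q - diff)


# Round trip for one coefficient with d = 10 (illustrative choice of d).
x = 1000
x_prime = decompress(compress(x, 10), 10)
print(x, x_prime, centered_distance(x, x_prime))
```

The round trip does not return $x$ exactly. The error stays within about $q/2^{d+1}$, which is what "an element close to $x$" refers to.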

C. NTT-based Polynomial Multiplication

Polynomial multiplication of degree-$(n-1)$ polynomials has been the focus of many works on PQC implementations. Most implementations use the traditional NTT-based multiplication technique, while others show how methods like schoolbook $O(n^2)$, Karatsuba $O(n^{1.59})$, etc., can be used. NTT-based multiplication has a time complexity of $O(n \log n)$. The designers of Dilithium and Kyber select polynomials in the ring $R_q = \mathbb{Z}_q[x]/(x^n + 1)$, where the modulus $q$ is an NTT-friendly prime, thus making it easier to use the fast NTT-based multiplication method.

The forward NTT converts a degree-$(n-1)$ polynomial (coefficient representation) into $n$ degree-0 polynomials (value representation). Two polynomials in their value-representation form (NTT domain) can then be multiplied coefficient-wise to get the product in the NTT domain. If we need the polynomial in coefficient representation again, a backward NTT (INTT) is used. The conversion to and from the NTT domain has a time complexity of $O(n \log n)$, and coefficient-wise multiplication has a time complexity of $O(n)$, giving a total time complexity of $O(n \log n)$. Various algorithms exist in the literature to facilitate these transformations. The most used ones are the Cooley-Tukey transform (Algorithm 1) for NTT and the Gentleman-Sande transform for INTT. For more information on NTT/INTT, refer to [27].
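For illustration, the Cooley-Tukey transform of Algorithm 1 can be sketched in Python on toy parameters. The values $n = 8$, $q = 17$, and $\psi = 3$ (a primitive $2n$-th root of unity modulo 17) are our assumptions, chosen only to keep the example small. The helper names are hypothetical, and the output is left in bit-reversed order.

```python
def bit_reverse(i: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def make_table(n: int, q: int, psi: int) -> list[int]:
    """Powers of the 2n-th root of unity psi, stored in bit-reversed order."""
    bits = n.bit_length() - 1
    return [pow(psi, bit_reverse(i, bits), q) for i in range(n)]


def ntt(x: list[int], q: int, g: list[int]) -> list[int]:
    """In-place style Cooley-Tukey NTT over Z_q[x]/(x^n + 1)."""
    x = list(x)
    n = len(x)
    t, m = n // 2, 1
    while m < n:
        k = 0
        for i in range(m):
            s = g[m + i]
            for j in range(k, k + t):
                u = x[j]                  # butterfly starts
                v = x[j + t] * s % q
                x[j] = (u + v) % q
                x[j + t] = (u - v) % q    # butterfly ends
            k += 2 * t
        t //= 2
        m *= 2
    return x


# Toy parameters (assumptions): q = 17 satisfies q ≡ 1 (mod 2n) for n = 8.
n, q, psi = 8, 17, 3
a = [1, 2, 3, 4, 5, 6, 7, 8]
print(ntt(a, q, make_table(n, q, psi)))
```

The transformed values are the evaluations of the input polynomial at the odd powers of $\psi$, which are the roots of $x^n + 1$ modulo $q$.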

Next, we discuss the major optimizations made to realize

the design methodology in the context of Dilithium and Kyber.

III. PROPOSED UNIFIED HARDWARE ARCHITECTURE

The ﬁrst and foremost goal is to unify the digital signature

scheme and the key-encapsulation scheme. While doing this, it

is important to ensure that the design is compact and ﬂexible.

Uniﬁcation has a very straightforward three-step approach.

First, we must identify the most area- and time-consuming building blocks, the Giants. This is because unifying the low area- and time-consuming building blocks (the Dwarves) will not reduce the area consumption significantly, and will instead limit the flexibility of the design. The next step is to find the algorithmic synergies between the Giants of the two schemes. The final step is to discern if some of the Dwarves which are dependent on the Giants can be unified with the Giants to reduce both area and time consumption.

Algorithm 1 The Cooley-Tukey NTT Algorithm [28]

In: An $n$-element vector $x = [x_0, \dots, x_{n-1}]$ where $x_i \in [0, q-1]$

In: $n$ (power of 2), modulus $q$ ($q \equiv 1 \pmod{2n}$)

In: $g$ (precomputed table of $2n$-th roots of unity, bit-reversed order)

Out: $x \leftarrow \mathrm{NTT}(x)$

1: $t \leftarrow n/2$; $m \leftarrow 1$

2: while ($m < n$) do

3:   $k \leftarrow 0$

4:   for ($i \leftarrow 0$; $i < m$; $i \leftarrow i + 1$) do

5:     $S \leftarrow g[m + i]$

6:     for ($j \leftarrow k$; $j < k + t$; $j \leftarrow j + 1$) do

7:       $U \leftarrow x[j]$ ▷ Butterfly starts

8:       $V \leftarrow x[j + t] \cdot S \pmod q$

9:       $x[j] \leftarrow U + V \pmod q$

10:       $x[j + t] \leftarrow U - V \pmod q$ ▷ Butterfly ends

11:     end for

12:     $k \leftarrow k + 2t$

13:   end for

14:   $t \leftarrow t/2$; $m \leftarrow 2m$

15: end while

16: return $x$

Fig. 1. High-level architecture of KaLi

A high-level view of the proposed architecture KaLi

is given in Fig. 1. The Keccak-based SHA-SHAKE unit

and polynomial arithmetic unit are the two Giants in both

schemes. The remaining building blocks are deemed as

Dwarves since they comprise only 20% of the total area

consumption. Unifying the Keccak-based SHA-SHAKE unit

is relatively easy since we can use a common Keccak core for

both schemes. Therefore in this section, we will discuss how

we efﬁciently uniﬁed the polynomial arithmetic unit. We will

also discuss how we efﬁciently manage the memory for the

two schemes. Another facet of the work, the optimization for

ASIC platforms, is also presented. We utilized multiple clock

domains and boosted the memory bandwidth budget on ASIC

platforms to reduce area consumption.

A. The colossal Giant: Polynomial Arithmetic Unit

The Polynomial Arithmetic Unit performs polynomial ad-

dition, subtraction, and multiplication. Polynomial addition

and subtraction are simple coefﬁcient-wise operations, hence

cheap. Polynomial multiplication is rather complex, and it

is what makes the polynomial arithmetic unit a Giant. Both

schemes perform this using NTT, as discussed in Section II-C.

Although the two schemes use NTT-based polynomial multi-

plication units, there are many differences between the two

schemes that make their NTT units quite distinct.

1) A clash of the Giants: The differences between the

presumed similar NTTs

The first distinction between the NTTs used by the two schemes lies in the algorithm itself. The NTT-based polynomial multiplication method used in Dilithium requires the existence of a $2n$-th root of unity, which mandates $q \equiv 1 \pmod{2n}$. Accordingly, Dilithium uses a complete-NTT. After a complete-NTT transform of a degree-$(n-1)$ polynomial, we get $n$ polynomials of degree 0. In [29], Lyubashevsky et al. propose a new method for NTT-based polynomial multiplication that requires only $q \equiv 1 \pmod{n}$, without pre-processing and post-processing operations. This technique is adopted by Kyber, whose 12-bit prime modulus does not have a $2n$-th root of unity. Therefore, Kyber has to use an incomplete-NTT. An incomplete-NTT gives us $n/2$ polynomials of degree 1. These polynomials cannot be multiplied coefficient-wise. With the incomplete-NTT, the multiplication of two degree-1 polynomials is performed in the ring $\mathbb{Z}_q[x]/(x^2 - \omega^i)$, where $\omega$ is the $n$-th root of unity and $i$ depends on the index of the coefficients. For the details, readers may follow the original Kyber specification [7] or related prior works in the literature [30]. Along with this, the two schemes also differ in datapath design. Dilithium has a 23-bit prime modulus, while Kyber has a 12-bit prime modulus. Therefore, while Kyber requires 12-bit adder/subtractor/multiplier units, Dilithium requires 23-bit ones. Designing a datapath for one of them and using it for the other one would lead to over- or under-saturation.

Next, we will discuss in detail how we achieved a uniﬁed

polynomial multiplication unit with full utilization. The unit of

interest here is the butterﬂy unit (BFU). Each BFU performs

dyadic addition, subtraction, and multiplication on the two input coefficients. The results are reduced modulo $q$. This is shown by steps 7-10 in Algorithm 1. Since modular multiplication is the most expensive operation, we will discuss

how we unify this unit. Then, we will discuss how with a few

more changes, the entire BFU can be consolidated.

2) Flexible fusion of Modular Multiplier Unit

As discussed above, if we naively use the 23-bit Dilithium

polynomial multiplier unit for Kyber, then it will always be

undersaturated as half of it will be unused. Instead, if we aim

to use a 12-bit Kyber unit for Dilithium, it will not only require extra control logic but also slow down Dilithium’s NTT. Therefore,

we need to ﬁnd a solution using a 23-bit Dilithium unit that

does not lead to undersaturation. The modular multiplier unit

has two parts: (i)integer multiplier and (ii)modular reduction

unit. We propose an algorithm (Algorithm 2) to make the

integer multiplication unit designed for Dilithium ﬂexible for

Kyber. It performs two 23-bit×12-bit integer multiplications.

The result is added for Dilithium and concatenated for Kyber.

This algorithm gives us one multiplied coefﬁcient in the case

of Dilithium and two multiplied coefﬁcients in the case of

Kyber.

The modular multiplier unit, designed to support modular

multiplication using both primes, uses two DSP units of Xil-

inx FPGAs. The hardware architecture of the re-conﬁgurable

integer multiplier is shown in Fig. 2. The datapath depends

on the scheme type and is heavily pipelined. We used internal

registers of DSP units to synchronize two DSP unit outputs

and achieve a high clock frequency. Now we need to design

a modular reduction unit accordingly.
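The behavior of Algorithm 2 can be modeled in software with a few lines of Python. This is a bit-level sketch under our own naming (`flexible_multiply`, `MASK12`), not a description of the pipelined DSP datapath: it only mirrors the select, multiply, shift, and add steps of the algorithm.

```python
MASK12 = 0xFFF  # selects one 12-bit half of a 24-bit operand


def flexible_multiply(a: int, b: int, sel: int) -> int:
    """Software model of Algorithm 2.

    sel = 0 (Dilithium): returns the full product a * b.
    sel = 1 (Kyber): a and b each pack two 12-bit coefficients; returns
    the two lane-wise products concatenated as {hi * hi, lo * lo}.
    """
    a_hi, a_lo = (a >> 12) & MASK12, a & MASK12
    d0 = (b >> 12) & MASK12 if sel else b      # step 1
    m0 = d0 * a_hi                             # step 2
    m1 = (m0 << 24) if sel else (m0 << 12)     # step 3
    d1 = (b & MASK12) if sel else b            # step 4
    return d1 * a_lo + m1                      # step 5


# Dilithium mode: one 23-bit x 23-bit product.
assert flexible_multiply(8380416, 1234567, 0) == 8380416 * 1234567

# Kyber mode: two independent 12-bit products in one pass.
a = (3000 << 12) | 25
b = (17 << 12) | 3328
d = flexible_multiply(a, b, 1)
assert d >> 24 == 3000 * 17 and d & 0xFFFFFF == 25 * 3328
```

In Kyber mode each lane product is below $3329^2 < 2^{24}$, so shifting the upper product by 24 bits concatenates the two results without overlap.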

Fig. 2. Flexible yet compact integer multiplier. The red lines show control signals, and the black lines show data movement. The two DSP units used are highlighted in purple.

Algorithm 2 Integer Multiplication Algorithm

In: $a, b \in \mathbb{Z}_{8380417}$ or $a[23:12], b[23:12], a[11:0], b[11:0] \in \mathbb{Z}_{3329}$

In: $sel \in \{0, 1\}$ (0 for Dilithium and 1 for Kyber)

Out: $d = a \cdot b$ or $d = \{a[23:12] \cdot b[23:12],\ a[11:0] \cdot b[11:0]\}$

1: $d_0 = (sel)\ ?\ b[23:12] : b$

2: $m_0 = d_0 \cdot a[23:12]$

3: $m_1 = (sel)\ ?\ (m_0 \ll 24) : (m_0 \ll 12)$

4: $d_1 = (sel)\ ?\ b[11:0] : b$

5: $d = d_1 \cdot a[11:0] + m_1$

6: return $d$

3) Versatile Modular Reduction Unit

The naive solution is to design separate modular reduction units for the two primes. It would require one modular reduction unit for the Dilithium prime and two reduction units for the Kyber prime, which will result in extra hardware costs. To avoid this, we propose a unified modular reduction unit. Both the Dilithium ($2^{23} - 2^{13} + 1$) and Kyber ($2^{12} - 2^{9} - 2^{8} + 1$) primes have a pseudo-Mersenne structure. For the Dilithium prime, we followed the method described in [31], which uses the congruence $2^{23} \equiv 2^{13} - 1$ recursively. Using this congruence, we can reduce a 46-bit integer $d \pmod{2^{23} - 2^{13} + 1}$ to the integer $2^{13} d[45:23] + d[22:0] - d[45:23]$, which consists of the addition/subtraction of 36-bit and 23-bit partial results. If we apply this operation recursively, we obtain $d \equiv d[22:0] + (d[32:23] + d[42:33] + d[45:43]) \cdot 2^{13} - d[45:23] - d[45:33] - d[45:43]$. Similarly, for Kyber, we followed the add-shift-based method proposed in [30], which generates partial results using the congruences $2^{12} \equiv 2^{9} + 2^{8} - 1$ and $2^{11} \equiv -2^{10} - 2^{8} - 1$ recursively.
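The folding identities used for both primes can be checked with a short Python sketch. The Dilithium part evaluates the recursive expression from the text term by term. The Kyber part is a simplified sequential fold using only $2^{12} \equiv 2^{9} + 2^{8} - 1$, which is our own software stand-in and not the add-shift circuit of [30]. All function names are illustrative.

```python
DILITHIUM_Q = 8380417
KYBER_Q = 3329


def bits(d: int, hi: int, lo: int) -> int:
    """Extract the bit slice d[hi:lo], both ends inclusive."""
    return (d >> lo) & ((1 << (hi - lo + 1)) - 1)


def reduce_dilithium(d: int) -> int:
    """Reduce a 46-bit integer using 2^23 ≡ 2^13 - 1 applied recursively."""
    pos = bits(d, 22, 0) + ((bits(d, 32, 23) + bits(d, 42, 33) + bits(d, 45, 43)) << 13)
    neg = bits(d, 45, 23) + bits(d, 45, 33) + bits(d, 45, 43)
    r = pos - neg
    # Final correction into [0, q).
    while r < 0:
        r += DILITHIUM_Q
    while r >= DILITHIUM_Q:
        r -= DILITHIUM_Q
    return r


def reduce_kyber(d: int) -> int:
    """Reduce a 24-bit integer by repeatedly folding with 2^12 ≡ 767."""
    while d >> 12:
        d = (d & 0xFFF) + (d >> 12) * 767  # 767 = 2^9 + 2^8 - 1
    return d - KYBER_Q if d >= KYBER_Q else d


print(reduce_dilithium((DILITHIUM_Q - 1) ** 2))  # 1, since (q - 1)^2 ≡ 1
print(reduce_kyber((KYBER_Q - 1) ** 2))          # 1, since (q - 1)^2 ≡ 1
```

In hardware, the partial results are summed with adder trees rather than loops. The loops here only stand in for the final correction step.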

Summing all partial results using carry propagate adders

(CPAs) will result in either a very long carry chain or multiple

pipeline stages. In order to avoid long carry chain and pipeline

delays, we used carry-save adders (CSAs) along with CPA.

The proposed uniﬁed modular reduction unit is shown in

Fig. 3, where the boxes ’D’ and ’K’ represent the partial

result generation circuits for Dilithium and Kyber primes,

respectively. All the subtraction operations are converted into

additions by taking the 2’s complements of partial results. In

Fig. 3, each number inside a box represents a bit index of the

input integer (0 to 45 for Dilithium and 0 to 22 for Kyber). The

white and brown (terracotta) boxes represent the normal and

negated bits. When 2’s complements operation is applied to

the partial results, extra plus ones, along with sign extensions,

come into the picture. These are represented with blue circles.

After adding all of these partial results, we also perform a ﬁ-

nal correction which brings the resulting integer from the range

(−q, 3q)to the range [0, q). The proposed modular reduction

unit can either perform one reduction for the Dilithium prime

or two reductions for the Kyber prime. The latency of the modular reduction unit is two cycles and it is fully pipelined.

Fig. 3. Unified modular reduction unit for Dilithium and Kyber primes.

Fig. 4. Compact butterfly unit (BFU) with flexibility for both Dilithium and Kyber. The red and blue lines show control signals, and the black lines show data movement.

4) Coalesced datapath for the Butterﬂy Unit

Now that we have uniﬁed the modular multiplication unit,

we propose a uniﬁed BFU (Fig. 4). It can perform one butterﬂy

operation for Dilithium and two butterﬂy operations for Kyber

using the same datapath. All the arithmetic units are made re-

conﬁgurable to work for both schemes. New re-conﬁgurable

adder and subtractor units are shown in Fig. 5. The idea is

to divide each 24-bit adder/subtractor into two small 12-bit

parts and select proper input signals based on the scheme.

The complete uniﬁed butterﬂy unit, designed using the re-

conﬁgurable arithmetic units, is shown in Fig. 4.

Fig. 5. Unified adder and subtractor for the butterfly unit. The red lines show control signals, and the black lines show data movement.

Fig. 6. Butterfly feedback unit for Kyber’s NTT-domain polynomial multiplication.

The schoolbook multiplication for Kyber requires ﬁve mul-

tiplications for multiplying two linear (i.e., degree one) poly-

nomials. We have two ﬂexible butterﬂy units that act as four

butterﬂy units for Kyber and allow four multiplications only.

One way to perform these ﬁve multiplications is to add another

DSP multiplier just for the extra multiplication. This unit will

not be useful for any other operation. To avoid this extra

multiplier, we condense ﬁve multiplications into four using

Karatsuba-like reduction. Then we use these four independent

butterﬂy units as a set of two. The output of the ﬁrst set is

the input to the second set as shown in Fig. 6. The inputs and

outputs for the BFUs are highlighted in blue. The control ﬂow

here is separated from the Dilithium polynomial multiplication

control ﬂow, for simplicity. The entire ﬂow is pipelined to

achieve a high clock frequency.
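The saving from five multiplications to four can be illustrated in Python for one degree-1 product in $\mathbb{Z}_q[x]/(x^2 - \zeta)$. This is a sketch of the standard Karatsuba-style rearrangement, not the exact schedule of Fig. 6. The function names and the example value $\zeta = 17$ are assumptions made for illustration.

```python
Q = 3329  # Kyber modulus


def basemul_schoolbook(a, b, zeta):
    """(a0 + a1 x)(b0 + b1 x) mod (x^2 - zeta), using five multiplications."""
    a0, a1 = a
    b0, b1 = b
    c0 = (a0 * b0 + (a1 * b1 % Q) * zeta) % Q  # a0*b0, a1*b1, (a1*b1)*zeta
    c1 = (a0 * b1 + a1 * b0) % Q               # a0*b1, a1*b0
    return c0, c1


def basemul_karatsuba(a, b, zeta):
    """Same product with four multiplications.

    The first three products are independent; the fourth consumes the
    result of a1*b1, which matches the idea of feeding the output of one
    set of multipliers into the next.
    """
    a0, a1 = a
    b0, b1 = b
    p0 = a0 * b0 % Q
    p1 = a1 * b1 % Q
    p2 = (a0 + a1) * (b0 + b1) % Q
    p3 = p1 * zeta % Q
    return (p0 + p3) % Q, (p2 - p0 - p1) % Q


a, b, zeta = (1234, 3000), (77, 3328), 17
assert basemul_schoolbook(a, b, zeta) == basemul_karatsuba(a, b, zeta)
```

The price of the saved multiplication is two extra additions and two subtractions. The operand sums $a_0 + a_1$ and $b_0 + b_1$ can also be one bit wider than a reduced coefficient.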

The coefﬁcient consumption during NTT/INTT is shown in

Fig. 7. Owing to the ﬂexible datapath and efﬁcient memory

arrangement (discussed in the next subsection), the Kyber

NTT coefﬁcients can be consumed faster. This enables full

utilization of the datapath. The complete polynomial arithmetic

unit consumes 3,487 LUTs, 1,918 FFs, 4 DSPs, and 1 BRAM.

The BRAM is used to store the powers of roots-of-unity

(twiddle factors) required during NTT/INTT operation. In the

next section, we will discuss the efﬁcient memory arrangement

designed to optimally feed the polynomial arithmetic unit.

B. Memory Arrangement

The polynomial arithmetic unit is designed to consume the

Kyber coefﬁcients twice as fast as the Dilithium coefﬁcients.

It requires that the memory unit feeds it at the same rate,

otherwise, making these uniﬁcations will not help improve the

performance. Dilithium coefﬁcients are 23-bit, and we have

designed the NTT/INTT unit using two butterﬂy cores. Each

of the cores requires exclusive access to the read/write port of a

memory. Therefore, we split the memory into two blocks, each

storing two Dilithium coefﬁcients per address. For Kyber, each

memory block stores four 12-bit Kyber coefﬁcients. Fig. 8

shows the storage of Kyber polynomials in one 64-bit word

of memory. One Dilithium polynomial coefficient occupies

the space of two Kyber coefficients, thus requiring twice the amount of

storage. This arrangement also ensures that the two coefficients required during

NTT/INTT are always stored across different BRAMs. Fig. 8

shows an example of the coefficient storage during Kyber’s

incomplete-NTT iterations for a 16-coefficient polynomial.

Next, we will discuss how we used multiple clock domains

to reduce area consumption in ASIC platforms.
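The word layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: the slot order (coefficient 0 in the least-significant bits) and the 24-bit slot used for each 23-bit Dilithium coefficient are assumptions made for the example, and the helper names are hypothetical.

```python
KYBER_BITS = 12              # width of one Kyber coefficient
DILITHIUM_BITS = 23          # width of one Dilithium coefficient
SLOT_BITS = 2 * KYBER_BITS   # assumed: one Dilithium slot spans two Kyber slots

def pack_kyber(coeffs):
    """Pack four 12-bit Kyber coefficients into one memory word."""
    assert len(coeffs) == 4
    word = 0
    for i, c in enumerate(coeffs):
        assert 0 <= c < (1 << KYBER_BITS)
        word |= c << (i * KYBER_BITS)
    return word

def unpack_kyber(word):
    """Recover the four Kyber coefficients from one memory word."""
    mask = (1 << KYBER_BITS) - 1
    return [(word >> (i * KYBER_BITS)) & mask for i in range(4)]

def pack_dilithium(coeffs):
    """Pack two 23-bit Dilithium coefficients into one memory word."""
    assert len(coeffs) == 2
    word = 0
    for i, c in enumerate(coeffs):
        assert 0 <= c < (1 << DILITHIUM_BITS)
        word |= c << (i * SLOT_BITS)
    return word

def unpack_dilithium(word):
    """Recover the two Dilithium coefficients from one memory word."""
    mask = (1 << DILITHIUM_BITS) - 1
    return [(word >> (i * SLOT_BITS)) & mask for i in range(2)]
```

With this layout, one address delivers four Kyber coefficients or two Dilithium coefficients, which matches the two-to-one consumption rate of the polynomial arithmetic unit.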

C. Multi-clock domains: Customization for ASIC platforms

The memory organization discussed above has two sets

of BRAMs to feed the two BFUs. These BRAMs are used

by all the remaining building blocks as well. On FPGA, the memory is built

using dual-port BRAMs. In ASIC, dual-port RAMs

consume more area than single-port RAMs. Therefore, to

reduce the area consumption, we decide to replace dual-

port RAMs with single-port RAMs, which work at a clock

frequency twice as fast as the rest of the design. Using two

different sources for the two clocks leads to an asynchronous

setting. This creates meta-stability problems due to clock-

domain crossing. To avoid these problems, we decided to

keep the clocking synchronous and generate the slow clock

(clock logic) using the fast clock (clock mem).

Fig. 9 describes the handshake between memory and logic.

A wrapper is provided to process the simultaneous reads

and writes to the memories. The read operation is given

preference over the write operation to ensure data is valid

when the building blocks fetch it and avoid any issues due

to clock glitches. The read latency is three clock cycles, and

all the building blocks are tailored accordingly. This design

helps reduce the area for ASIC designs. Note that a similar

modiﬁcation will not change the FPGA area consumption and

instead cause timing problems running the memory at a high

clock frequency. Therefore, this adaptation speciﬁcally targets

ASIC platforms.
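The idea of replacing a dual-port RAM with a single-port RAM at twice the clock frequency can be sketched with a small behavioural model. This is a toy illustration only: the class and method names are hypothetical, and the three-cycle read latency and the wrapper handshake of Fig. 9 are not modelled.

```python
class DoublePumpedRam:
    """Toy model: a single-port RAM clocked twice per logic-clock cycle."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.fast_cycles = 0  # elapsed cycles of the fast (memory) clock

    def logic_cycle(self, read_addr=None, write=None):
        """Serve at most one read and one write within one slow cycle.

        The first fast cycle is the read slot and the second is the write
        slot, so a read and a write to the same address in one slow cycle
        return the old value (the read is given preference).
        """
        data_out = None
        # Fast cycle 1: the single port performs the read, if any.
        if read_addr is not None:
            data_out = self.mem[read_addr]
        self.fast_cycles += 1
        # Fast cycle 2: the same port performs the write, if any.
        if write is not None:
            addr, value = write
            self.mem[addr] = value
        self.fast_cycles += 1
        return data_out
```

From the point of view of the logic clock, the model behaves like a memory with one read port and one write port, although only one physical port is used.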

So far, we have discussed the major contributions of the work.

Next, we briefly discuss how we efficiently implement

the remaining building blocks. We start with the rejection

samplers used in both schemes. These are the Giant-dependent

Dwarves that can help reduce the area and time consumption

without compromising the flexibility of the design.

D. The Giant and the Dwarves: Keccak-based SHA-SHAKE

unit and the rejection samplers

Fig. 7. Timeline showing the unified butterfly unit processing Dilithium and Kyber coefficients.

Fig. 8. Storage of coefficients in BRAM0 and BRAM1 during Kyber’s NTT for a 16-coefficient polynomial (panels for L=16, L=8, L=4, and L=2).

Fig. 9. The data read-and-write handshake between memory and logic unit.

Dilithium requires SHAKE-128 and SHAKE-256 for

pseudo-random number generation and hashing. Kyber requires

SHA3-256 and SHA3-512 for hashing and SHAKE-128 and SHAKE-256

for KDF and pseudo-random number

generation. These different Keccak-based functions are imple-

mented as modes of the same Keccak output. Therefore, we

can use the same Keccak instance for all these modes. Both

schemes employ different sampling for the generation of secret

and error polynomials. While some of these fully consume the

Keccak output, the remaining have to keep track of the leftover

bits.

We combine the rejection sampler with the Keccak unit

using a book-keeping approach similar to [31]. It improves

the performance of the sampling operation, as we do not need

to store and then read the Keccak output in between. The base

implementation of Keccak follows a high-speed and parallel

directive. The control and datapath are modified to work for

rejection samplers, since the sampling flow depends on coefficients passing the

rejection constraints. The complete Keccak unit consumes

12,326 LUTs and 3,560 FFs.
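To make the rejection step concrete, the following minimal sketch samples Kyber coefficients from SHAKE-128 output in software. It follows the usual Kyber parsing rule (two 12-bit candidates per three bytes, accepted if smaller than q). The function name is hypothetical, and unlike the hardware unit, which consumes the Keccak output on the fly, this sketch simply requests a longer digest when too few candidates are accepted.

```python
import hashlib

Q = 3329  # Kyber modulus

def sample_uniform(seed: bytes, n: int = 256, q: int = Q):
    """Rejection-sample n coefficients in [0, q) from SHAKE-128(seed)."""
    out_len = 3 * n  # always a multiple of 3 bytes
    while True:
        stream = hashlib.shake_128(seed).digest(out_len)
        coeffs = []
        for i in range(0, out_len, 3):
            b0, b1, b2 = stream[i], stream[i + 1], stream[i + 2]
            # Two 12-bit candidates are formed from every three bytes.
            d1 = b0 | ((b1 & 0x0F) << 8)
            d2 = (b1 >> 4) | (b2 << 4)
            for d in (d1, d2):
                # Candidates that fail the constraint d < q are rejected.
                if d < q and len(coeffs) < n:
                    coeffs.append(d)
        if len(coeffs) == n:
            return coeffs
        out_len *= 2  # not enough accepted candidates: squeeze more output
```

Because the number of accepted candidates per Keccak block varies, the sampler has to track how much output it has consumed, which is the book-keeping the combined unit performs in hardware.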

We have uniﬁed all the Giants, so now we will discuss the

optimized implementation techniques for the Dwarves.

E. Optimizations for the Dwarves

Making a design compact while keeping it agile increases

the life and usability of KaLi on the FPGA and ASIC

platforms. However, this comes with a series of challenges.

We must ensure that for keeping the design agile/ﬂexible, we

do not pay a huge price in terms of area. Similarly, while

making the design compact, the performance should not get

worse. We now discuss how to make certain building blocks

of the two schemes compact, while maintaining ﬂexibility.

1) Compress/Decompress Unit

The decompress unit performs division by power-of-two and

rounding operation which is trivial to implement in hardware.

On the other hand, the compress operation requires division

by q and rounding. Some works in the literature use Barrett

reduction and division algorithms to perform the compress

operation. We decide to use sufficient precision and convert

the division by q into multiplication and shift operations.

The proposed multiplication-based compress algorithm

is shown in Algorithm 3. The input is the Kyber coefficient

x and the type of compression required d. The compressed

coefficient y is returned as the output.

Algorithm 3 The Proposed Compression Algorithm

In: x ∈ Z_3329, d ∈ {1, 4, 5, 10, 11}

Out: y = ⌈(2^d/q) · x⌋

1: switch d do

2:   case 1: t = (10079 · x); y = (t ≫ 24) + (t[23] ≫ 23)

3:   case 4: t = (315 · x); y = (t ≫ 16) + (t[15] ≫ 15)

4:   case 5: t = (630 · x); y = (t ≫ 16) + (t[15] ≫ 15)

5:   case 10: t = (5160669 · x); y = (t ≫ 24) + (t[23] ≫ 23)

6:   case 11: t = (10321339 · x); y = (t ≫ 24) + (t[23] ≫ 23)

7: end switch

8: return y (mod 2^d)

Fig. 10. Architecture of the Compress/Decompress unit. The red lines show

control signals, and the black lines show data movement.

Since the multiplications are by constant values, we imple-

ment these operations using add and shift technique utilizing

the LUTs. Fig. 10 shows the hardware architecture of this

multiplication unit, used for computing the t values in Algorithm 3.

This unit is unified and works for both compress and

decompress operations. The control ﬂow is dependent on the

type of compression or decompression required.
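Algorithm 3 can be checked against the exact definition of compression in a few lines. The sketch below is a software model under one assumption: the term written as t[23] ≫ 23 (or t[15] ≫ 15) is read as the single bit just below the shift position, which acts as the rounding bit. The function names are hypothetical.

```python
Q = 3329  # Kyber modulus

# d -> (constant multiplier, shift amount), taken from Algorithm 3.
PARAMS = {
    1: (10079, 24),
    4: (315, 16),
    5: (630, 16),
    10: (5160669, 24),
    11: (10321339, 24),
}

def compress_mul_shift(x, d):
    """Compress x to d bits using one constant multiplication and shifts."""
    mult, shift = PARAMS[d]
    t = mult * x
    # Assumed reading of t[shift-1]: the bit below the cut is the rounding bit.
    round_bit = (t >> (shift - 1)) & 1
    return ((t >> shift) + round_bit) % (1 << d)

def compress_reference(x, d, q=Q):
    """Exact round((2^d / q) * x) mod 2^d in integer arithmetic."""
    # floor(v + 1/2) with v = 2^d * x / q; ties cannot occur because q is odd.
    return ((2 * (x << d) + q) // (2 * q)) % (1 << d)
```

Each multiplier is approximately 2^(d + shift) / q, so the product followed by the shift approximates (2^d / q) · x with enough fractional bits for the rounded result to agree with the exact value over the full input range 0 ≤ x < q.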

2) Encode/Decode Unit

Encode and decode units perform coefﬁcient-to-byte and

byte-to-coefﬁcient conversions, for all security levels of the

Kyber scheme. We used a similar idea as proposed in [32]

which uses a 32-bit interface. Our architecture uses a 64-bit

interface and thus the proposed encode unit uses a 104-bit

buffer. It can encode 1-bit, 4-bit, 5-bit, 10-bit, and 11-bit long

coefﬁcients. The decode unit can decode 64-bit inputs into

1-bit, 4-bit, 5-bit, 10-bit, and 11-bit long coefﬁcients using a

72-bit buffer.
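The coefficient-to-byte conversion can be sketched as a bit buffer that is filled with d-bit coefficients and drained in fixed-size chunks. This is a minimal software illustration with hypothetical function names: it drains one byte at a time and uses an unbounded Python integer as the buffer, whereas the hardware unit drains 64-bit words from a fixed-width register.

```python
def encode(coeffs, d):
    """Pack d-bit coefficients into bytes, least-significant bit first."""
    buf = 0      # bit buffer holding not-yet-emitted bits
    nbits = 0    # number of valid bits in the buffer
    out = bytearray()
    for c in coeffs:
        buf |= (c & ((1 << d) - 1)) << nbits
        nbits += d
        while nbits >= 8:        # drain complete bytes
            out.append(buf & 0xFF)
            buf >>= 8
            nbits -= 8
    if nbits:                    # flush a final partial byte, if any
        out.append(buf & 0xFF)
    return bytes(out)

def decode(data, d, n):
    """Unpack n coefficients of d bits each from a byte string."""
    buf = 0
    nbits = 0
    coeffs = []
    mask = (1 << d) - 1
    for byte in data:
        buf |= byte << nbits
        nbits += 8
        while nbits >= d and len(coeffs) < n:  # drain complete coefficients
            coeffs.append(buf & mask)
            buf >>= d
            nbits -= d
    return coeffs
```

A polynomial of 256 coefficients encoded with d bits per coefficient occupies 32·d bytes, so the same routine covers all five coefficient widths.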

3) Pack/Unpack unit

Similar to Kyber, Dilithium requires coefﬁcient-to-byte and

byte-to-coefﬁcient conversions for various coefﬁcient sizes.

Pack and unpack units perform these conversions for all

security levels of Dilithium for coefﬁcient sizes 3, 4, 6, 10, 13,

18, and 20 bits. We again followed the idea proposed in [32]

for the pack and unpack units.

The remaining blocks of both schemes are different and

unifying them would not save any area and instead complicate

the control logic and reduce the ﬂexibility of the design. These

building blocks do not require any DSP units and comprise

simple bit-wise packing, unpacking, or addition/subtraction

operations. They are implemented as individual blocks and

they occupy only 18% of the cryptoprocessor’s area.

F. Instruction set cryptoprocessor

We made the building blocks compact while ensuring ﬂex-

ibility, but this is insufﬁcient. What happens if, in two years,

the Keccak pseudo-random number generation and hashing

unit are obsolete? Do we then need to redesign the entire

cryptoprocessor? To counter this and increase agility as well

as ﬂexibility, we design an instruction set architecture (ISA),

where each instruction is a building block required by the

cryptographic schemes. A simple program controller runs the

cryptographic protocols by executing the necessary instruc-

tions, manages the synchronization of parallel instructions, and

avoids back-and-forth CPU-Cryptoprocessor communication.

Note that the program controller is not a ‘control processor’,

and it does not contain arithmetic circuits to process operand

data. It simply decodes an instruction and then activates the

corresponding module inside the cryptoprocessor. It consumes

only 8% of the cryptoprocessor’s area. The instructions and

the corresponding hardware modules are listed in Table I.

G. Running the Giants and the Dwarves in parallel

Our goal was to make the uniﬁed design compact and agile.

Does this mean we have to pay an equal price

in terms of performance? To some extent, yes.

However, there are methods

that can boost the performance without increasing the area

consumption. One such way is to run the Giant instructions

in parallel to each other or to multiple Dwarf instructions, as

shown in [31]. We make sure that two Giants, the Keccak

unit and the polynomial arithmetic unit, can always run in

parallel to cancel each other’s run-time. It leads to a reduction

of 35% in the total run-time. Similar to [31], we deﬁne two

instruction sets (S-1 and S-2), as shown in Table I. Every

instruction belonging to the ﬁrst set can be run in parallel with

any instruction that belongs to the second set. The instruction

opcodes for each instruction are shown in the INS column

of Table I. Following the design methodology, we design the

unified cryptoprocessor KaLi, as shown in Fig. 1.

IV. RESULTS

In this section, we present the performance and area results

of KaLi. The proposed architecture is described in Verilog. It

is synthesized and implemented for Zynq Ultrascale+ ZCU102

with a performance-optimized strategy using Vivado 2019.1

tool and achieves 270 MHz clock frequency on FPGA. The

proposed architecture is also implemented with 65nm and

28nm ASIC technologies using the Cadence Genus tool. On

65nm/28nm ASIC technology, it achieves 280 MHz/1 GHz for

the slow clock (in logic units), and 560 MHz/2 GHz for the

fast clock (in memory units).

TABLE I

AREA OF KALI ON THE ZYNQ ULTRASCALE+ ZCU102 FPGA PLATFORM. ALL SECURITY LEVELS OF DILITHIUM AND KYBER ARE SUPPORTED.

Unit              S-1   S-2   INS     LUT    DFF    DSP   BRAM
Comp.Core                             21K    9.2K   4     21
Dilithium (D)
  Decompose       ✓     -     1       474    338    0     0
  Pow2Round       -     ✓     1       55     84     0     0
  MakeHint        -     ✓     2       61     124    0     0
  UseHint         -     ✓     3       565    433    0     0
  Encode H        -     ✓     4       202    233    0     0
  Pack            ✓     -     2       582    181    0     0
  Unpack          ✓     -     3       315    182    0     0
  SampleInBall    -     ✓     5       505    285    0     0
  Refresh         ✓     -     4       8      7      0     0
  VerifyEq.       -     ✓     6       13     76     0     0
Kyber (K)
  Encode          -     ✓     7       517    190    0     0
  Decode          ✓     -     5       237    180    0     0
  Com./Decom.     ✓     -     6/7     272    376    0     0
  Verify          -     ✓     8       102    216    0     0
  CMOV            -     ✓     9       20     120    0     0
  COPY            -     ✓     10      15     120    0     0
D+K
  Memory          -     ✓     11      268    12     0     20
  Keccak          ✓     -     8-18    12K    3.5K   0     0
  Multiplier      -     ✓     12-16   3.5K   2K     4     1
Prog.Contr.                           2K     296    0     3
Total                                 23K    9.7K   4     24

TABLE II

PERFORMANCE RESULTS FOR DILITHIUM AND KYBER-KEM IN FPGA

               Dilithium-2 / Kyber-512   Dilithium-3 / Kyber-768   Dilithium-5 / Kyber-1024
Operation      Cycle      µs             Cycle      µs             Cycle      µs
Dil.Gen        14,594     54.05          23,619     87.48          39,737     147.17
Dil.Signpre    7,883      29.2           9,640      35.7           12,943     46.27
Dil.Sign       21,812     80.79          36,643     135.72         53,965     199.87
Dil.Signpost   1,967      7.23           2,463      9.12           3,271      12.12
Dil.Verify     15,423     57.12          26,124     96.76          46,671     172.86
Kyb.Keygen     3,395      12.6           6,291      23.2           9,089      33.7
Kyb.Encaps     4,956      18.4           7,862      29.11          11,351     42.04
Kyb.Decaps     6,807      25.21          11,291     41.82          13,905     51.5

A. Area and Performance Results

Table I presents the detailed utilization of each building

block in KaLi for the UltraScale+ ZCU102 platform. The proposed

cryptoprocessor uses 23,347 LUTs (8.4%), 9,798 DFFs

(1.7%), 4 DSPs (0.1%), and 24 BRAMs (2.6%). On ASIC,

KaLi consumes 1.107 mm² (769.04 KGE) in 65nm technology,

and 0.263 mm² (747.81 KGE) in 28nm technology.

Table II presents the cycle count and latency (in µs) for

the operations of Dilithium and Kyber. With 270 MHz clock

frequency in the FPGA, the CCA-secure key generation,

encapsulation and decapsulation operations for Kyber-768 take

23.2, 29.11, and 41.82 µs, respectively. For the best-case sce-

nario, where a valid signature is generated after the ﬁrst loop

iteration [31], the key generation, signature generation, and

signature veriﬁcation operations for Dilithium-3 take 87.48,

179.91, and 96.76 µs, respectively. The ASIC implementation

with 65nm/28nm technology (with 560 MHz/2 GHz clock

frequency for the memory unit) can perform the operations

for Kyber-768 and Dilithium-3 in 22.07/6.18, 27.59/7.73, and

39.62/11.09 µs, and 82.87/23.2, 171.03/47.89, and 91.66/25.67

µs, respectively. Next, we compare these results with the

existing works in the literature.
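The FPGA latencies in Table II follow from the cycle counts and the 270 MHz clock frequency. A minimal sketch of this conversion is given below; the function name is hypothetical, and the cycle counts are the Dilithium-3 and Kyber-768 entries of Table II.

```python
def cycles_to_us(cycles: int, freq_mhz: float) -> float:
    """Latency in microseconds for a cycle count at a clock given in MHz."""
    return cycles / freq_mhz

FPGA_FREQ_MHZ = 270  # FPGA clock frequency reported for KaLi

# Cycle counts taken from Table II.
dilithium3_keygen_us = cycles_to_us(23_619, FPGA_FREQ_MHZ)
dilithium3_verify_us = cycles_to_us(26_124, FPGA_FREQ_MHZ)
kyber768_decaps_us = cycles_to_us(11_291, FPGA_FREQ_MHZ)
```

Since a frequency in MHz equals the number of cycles per microsecond, dividing the cycle count by the frequency in MHz gives the latency directly in µs.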

B. Comparison with uniﬁed designs in literature

TABLE III

COMPARISON TABLE FOR DILITHIUM-3 FPGA IMPLEMENTATIONS

Ref.        Plat.   Performance (in µs)     Freq.   Resources
                    (KeyGen/Sign/Verify)    (MHz)   (LUT/FF/DSP/BRAM)
[9]†        Zynq    -/8.8K/9.9K             100     2.6K/-/-/-
[10]a       US+V    51.9/-/-                350     54.1K/25.2K/182/15
[10]b,d     US+V    -/63.1/-                333     68.4K/86.2K/965/145
[10]c       US+V    -/-/95.1                158     61.7K/34.9K/316/18
[11]d       Ar.-7   229/0.3K/0.2K           145     30.9K/11.3K/45/21
[11]e       Ar.-7   229/0.85K/0.2K          145     30.9K/11.3K/45/21
[6]d        Ar.-7   60/0.12K/63.8           96.9    30K/10.34K/10/11
[6]e        Ar.-7   60/0.46K/63.8           96.9    30K/10.34K/10/11
[12]d       US+V    32/63/39                145     55.9K/28.4K/16/29
[12]e       US+V    32/193/39               145     55.9K/28.4K/16/29
[31]d,f     US+Z    114.7/237/127.6         200     18.5K/9.3K/4/24
KaLi d,f    US+Z    82.8/171.3/96.7         270     23K/9.7K/4/24

a: Implements K. Gen. b: Implements Sign. c: Implements Verify. d: Reports best-case scenario. e: Reports average-case scenario. f: Supports multiple schemes. †: HW/SW co-design. US+V/Z refers to Virtex/Zynq US+ platforms.

TABLE IV

COMPARISON TABLE FOR DILITHIUM-3 ASIC IMPLEMENTATIONS

Ref.        Tech.   Perf.∗    Freq.     Area    SRAM    Energy∗
            (nm)    (µs)      (MHz)     (KGE)   (KB)    (µJ)
[25]†       40      18,266    72        106     40.25   88.89
[24]†,a     28      747       540       697     24.75   62.39
[31]a,b     65      182.3     400       854     34.82   -
KaLi a,b    65      262.7     280/560   769     34.82   117.9
KaLi a,b    28      73.55     1K/2K     747     34.82   27

∗: Performance/Energy is measured as total time/energy for signature generation and verification (key generation can be done offline). †: HW/SW co-design. a: Supports multiple schemes. b: Reports best-case scenario.

In [24], the authors design a unified architecture for

Dilithium and Kyber. They present both HW/SW co-design

as well as HW results for Kyber while keeping some parts

of Dilithium in the software. Their NTT unit occupies 25,674

LUTs, 3,137 DFFs, 64 DSPs, and 6 BRAMs on a Xilinx Artix-

7 FPGA. The NTT unit alone occupies more LUT and DSP

units than our entire design. On ASIC, it occupies 697 KGE

on 28nm technology [24] which is very close to our uniﬁed

cryptoprocessor’s 747 KGE area consumption. Their imple-

mentation shows similar performance for Kyber even though

they target a high-speed design of Kyber in hardware and use

32 butterfly units for NTT, making their NTT unit 8× faster

than KaLi. For Dilithium, KaLi shows 10× better results.

The energy consumption of KaLi is also approximately half

of their design for both Kyber and Dilithium.

To the best of our knowledge, no work exists in the literature

that uniﬁes Dilithium and Kyber solely in hardware. Therefore,

next, we compare our work with standalone implementations

of Dilithium or Kyber in hardware.

C. Comparison with Dilithium-only designs in literature

Comparison with FPGA-based implementations:

Different works in the literature use different FPGA plat-

forms. Hence, drawing a one-to-one comparison between

works is not always feasible. When we started the hardware

implementation of the proposed architecture, we chose Ultra-

scale+ as the platform and thereafter pipelined the building

blocks for achieving around 300 MHz frequency on this plat-

form. Several works in the literature optimized their designs

for Artix-7 or other FPGAs. While optimizing the critical

paths of architecture for meeting the desired clock frequency is

heavily dependent on the technology of the platform, area requirements

(LUT/DSP/FF/BRAM) do not change significantly

across FPGA technologies. In Table III, we present the FPGA

implementation results of Dilithium-3 from the literature.

TABLE V

COMPARISON TABLE FOR KYBER-1024 FPGA IMPLEMENTATIONS

Ref.      Platform    Performance∗   Freq.   Resources
                      (in µs)        (MHz)   (LUT/FF/DSP/BRAM)
[21]‡     Cortex-M4   33,850         100     -/-/-/-
[25]†     Artix-7     18,560         25      15K/3K/11/14
[18]†     Zynq        -              -       24K/11K/21/32
[20]†     Artix-7     85,559         59      2K/2K/5/34
[13]      Virtex-7    1,260          192     133K/-/548/202
[14]      Artix-7     154            161     7K/5K/2/3
[15]      Artix-7     63             210     12K/12K/8/15
[16]      Artix-7     56             185     13K/12K/16/16
[17]      Artix-7     286            112     16K/6K/12/17
[17]      Virtex-7    205            156     16K/6K/12/17
[22]      US+Z        23.5           450     11.6K/12K/8/8.5
[23]      US+Z        3.4 (Encap)    450     18.4K/13.7K/2/0
[23]      US+Z        4.1 (Decap)    450     15.9K/12.9K/2/0
KaLi a    US+Z        93             270     23K/9.7K/4/24

∗: Performance is measured as the total time for encapsulation and decapsulation (key generation can be done offline). a: Supports multiple schemes. †: HW/SW co-design. ‡: SW design. US+Z refers to Zynq Ultrascale+ platforms.

Zhou et al. [9] present an HW/SW co-design and they only

implement the polynomial arithmetic unit in hardware. Thus,

they consume less area but report an inferior performance.

Ricci et al. [10] provide separate designs for each of the

Dilithium variants. In total, these designs occupy 9× more

area than our design, yet perform only as well as our

design for signature verification. For signature generation,

their implementation shows only a 3× improvement. Thus, our

design gives a much better area-time trade-off.

The authors in [6], [11], [12] present Dilithium implementa-

tions, which consume much more area compared to our design.

Note that across technologies, the area consumption does not

change notably. A lower frequency in [6], [11] can be justiﬁed

by the use of Artix-7 FPGA, which is technologically inferior

to our Ultrascale+ platform. A limitation of [6] is that it uses

a segmented pipeline and hence, an inﬂexible data path for

Dilithium. The implementation in [12] uses a better platform

than ours, consumes 3× more area (LUT+FF), and achieves

a speed-up of only 2.5× (Sign+Verify). In [31], the authors

present a unified cryptoprocessor for Dilithium and Saber [33].

Their area is almost comparable to ours, considering the

difference between Kyber and Saber. We achieve a higher

clock frequency and report 1.4× better performance.

Comparison with ASIC-based implementations: Table IV

gives the comparison of implementation results for Dilithium-3

on ASIC platforms. Banerjee et al. [25] present ASIC results

for an HW/SW co-design of Dilithium with Round 2 parameters.

KaLi outperforms them significantly in terms of performance.

Our hardware-only design gives 45× better performance at

the cost of only 7.5× more area. KaLi consumes almost the

same area as reported in [24] but gives 10× and 2.8× better

performance with 28nm and 65nm technology, respectively.

[31] reports a higher number of logic gates than our design.

KaLi sets new records for energy consumption, in both 28nm

and 65nm technologies.

Thus, our FPGA and ASIC models are the most compact

compared to all the existing Dilithium implementations.

TABLE VI

COMPARISON TABLE FOR KYBER-1024 ASIC IMPLEMENTATIONS

Ref.       Tech.   Perf.∗      Freq.     Area    SRAM    Energy∗
           (nm)    (µs)        (MHz)     (KGE)   (KB)    (µJ)
[25]†      40      6,444       72        106     40.25   36.06
[18]†      65      18,444      45        170     465     307.68
[19]†      28      727         300       979     12      19.57
[17]       65      160         200       104     190     -
[24]†,a    28      206         540       697     24.75   16.24
[24]a      28      22/17.7b    540       623     36.75   -
KaLi a     65      90.2        280/560   769     34.82   40.48
KaLi a     28      25.26       1K/2K     747     34.82   9.27

∗: Performance/Energy is measured as the total time/energy for encapsulation and decapsulation (key generation can be done offline). †: HW/SW co-design. a: Supports multiple schemes. b: Depending on the type of schedule.

D. Comparison with Kyber-only designs in literature

Comparison with FPGA-based implementations: Table V

gives the comparison of implementation results for Kyber-

1024 on FPGA platform. Banerjee et al. [25] present an

HW/SW co-design for Kyber. KaLi surpasses their perfor-

mance results on both platforms, at the cost of some area.

Observe that KaLi gives better results compared to software

only [21] as well as HW/SW co-designs [18], [20], [25]. The

hardware-only designs [13]–[17] target Artix-7 or Virtex-7

FPGAs. Our KaLi consumes signiﬁcantly smaller area than

[13]. Authors in [15]–[17], [22] target a high-speed Kyber

implementation and therefore achieve better performance and

frequency. In [23], the authors present separate results for

all the Kyber variants, and present individual encapsulation

and decapsulation architectures (unlike KaLi, which combines all

operations in a single architecture). Their design goal is also

high-speed, and their standalone implementation of Kyber-

1024 encapsulation consumes more area than KaLi. Note that

the area of our design is determined by Dilithium and not

by Kyber. Therefore, even though the results show that we

consume a very high area, we only consume the bare minimum

and give almost the best performance results on the FPGA

platform.

Comparison with ASIC-based implementations: Table VI

gives the comparison of implementation results for Kyber-

1024 on ASIC platforms. Here, KaLi consumes

almost the same area as reported in [24] but gives a 2.3×/8.1× better

performance under 65nm/28nm technology. In fact, we surpass

all existing designs [17]–[19], [25] in terms of performance.

However, compared to some of the designs, we use more area.

We reiterate that Kyber is the recessive

scheme of the two, so this area is higher than that of

Kyber-only implementations. KaLi consumes the

least energy among the designs in the respective technologies.

We have now established that KaLi transcends the state-of-the-art

works in the literature, which shows that the

proposed design methodology yields better results. Next, we

discuss application benchmarking.

E. Application-Benchmarking and Impact

Several works, for example, [34], [35], present application

benchmarking using the existing libraries. The authors in

[34] provide results for TLS protocol using the mbed TLS

library and use Kyber for KEM and SPHINCS+ for digital

signature. Runtimes are reported for Raspberry Pi 3 Model B+,

ESP32-PICO-KIT V4, Fieldbus Option Card, and LPC11U68

LPCXpresso. Compared to these platforms, KaLi’s FPGA implementation

shows 85×, 1349×, 6190×, and 23809× speedups,

respectively; the ASIC models of KaLi will further

improve the timings.

In [35], the authors evaluate PQ TLS 1.3, which is a

post-quantum variant of TLS version 1.3. It supports Round

3 parameters for both Kyber and Dilithium, along with

other schemes. They use ARM Cortex-M4 embedded plat-

form NUCLEO-F439ZI, with and without hardware accel-

eration. These boards can reach a maximum frequency of

180MHz. Compared to their results for Kyber’s decapsulation,

we achieve a speedup of 131× if we run our design at

180MHz. The authors report that replacing RSA+ECDHE with

Dilithium3+Kyber5 in TLS handshake increased the runtime

by 64%. KaLi can help bridge this gap. Thus, replacing these

devices with KaLi would give signiﬁcant speedup. KaLi only

occupies 8% of the available resources on the Zynq US+

FPGA board, implying the ability to run twelve such uniﬁed

cores in parallel, further improving the speedup.

There are several data center and network security appli-

ances where high-performance SIMD processors (e.g., In-

tel/AMD with AVX) are too expensive to deploy or extremely

constrained (or passively powered) devices are too slow to

use. There are commercial cryptoprocessors that target such

applications. For example, NXP’s C29x family of crypto

coprocessors [36] (which are battery-powered) use dedicated

hardware acceleration for speeding up the RSA and elliptic

curve-based public-key cryptographic computations. It com-

putes up to 32K RSA2048 public-key operations per second.

The more constrained OPTIGA™ TPM SLB 9672 platform from

Infineon [37] has hardware acceleration for RSA-4096. They

use the same RSA engine for both public-key signature and

key agreement.

Our uniﬁed Kyber+Dilithium coprocessor performs faster

than the public-key engines of [36] and [37] and, at the same

time, requires only 0.263 mm² area in a 28nm node. When a

smaller area is required by an application, some of the design

parameters (e.g., number of NTT cores, Keccak’s data-width,

etc.) can be tuned accordingly to meet the area budget at the

cost of speed. The proposed design techniques and architecture

will be useful to replace the classical public-key cryptography

used in conventional cryptoprocessors with post-quantum key

agreement and signature.

V. CONCLUSIONS AND FUTURE WORK

Post-quantum key encapsulation and digital signature algo-

rithms are required for securing communication. In this paper,

we presented a design methodology for efﬁcient and compact

hardware implementation of both post-quantum key encapsu-

lation and digital signature algorithms in a uniﬁed cryptopro-

cessor architecture. Following the proposed methodology, we

designed and implemented the ﬁrst uniﬁed cryptoprocessor

architecture KaLi that can perform all the cryptographic

protocol operations of the Dilithium signature and Kyber key

encapsulation algorithms for all the security levels. Architec-

tural optimizations in the data path of the cryptoprocessor

were performed to reduce the cycle count and improve the

clock frequency. Experimental evaluation of KaLi on FPGA

and ASIC platforms showed that KaLi outperforms all the

existing implementations. Therefore, the design of KaLi is a

signiﬁcant step towards making post-quantum cryptography

compact and agile on hardware platforms. The proposed

design methodology can be customized to meet different

application-speciﬁc constraints and requirements.

The hardware implementation presented in this paper is

resistant to timing attacks but does not incorporate any

countermeasure, for example, masking against more powerful

side-channel attacks. Side-channel protection of the uniﬁed

cryptoprocessor architecture will require signiﬁcant research

and is considered future work. There are several works in the

literature on masking Kyber [38], [39]. However, at the time

of writing this paper, the authors are not aware of any reported

masked implementation of the NIST standardized version

of Dilithium. How to design an ‘optimal and uniﬁed’ side-

channel protection mechanism for a uniﬁed hardware imple-

mentation of Kyber and Dilithium is an interesting topic that

needs to be researched. Furthermore, researching protection

techniques against fault injection-based attacks will be very

important due to the vast deployment of these cryptographic

schemes in various embedded devices.

REFERENCES

[1] P. W. Shor, “Polynomial-time algorithms for prime factorization and

discrete logarithms on a quantum computer,” SIAM J. Comput., vol. 26,

no. 5, pp. 1484–1509, Oct. 1997.

[2] F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends,

R. Biswas, S. Boixo, and many more, “Quantum supremacy using a

programmable superconducting processor,” Nature, 2019,

https://doi.org/10.1038/s41586-019-1666-5.

[3] M. Gong, S. Wang, C. Zha, M.-C. Chen, H.-L. Huang, Y. Wu, Q. Zhu,

Y. Zhao, S. Li, S. Guo, and e. a. Haoran Qian, “Quantum walks on

a programmable two-dimensional 62-qubit superconducting processor,”

Science, vol. 372, no. 6545, pp. 948–952, 2021.

[4] “Post-quantum cryptography- call for proposals,” 2017. [Online].

Available: https://csrc.nist.gov/projects/post-quantum-cryptography

[5] D. Joseph, R. Misoczki, M. Manzano, J. Tricot, F. D. Pinuaga,

O. Lacombe, S. Leichenauer, J. Hidary, P. Venables, and R. Hansen,

“Transitioning organizations to post-quantum cryptography,” Nature,

vol. 605, no. 7909, pp. 237–243, 2022.

[6] C. Zhao, N. Zhang, H. Wang, B. Yang, W. Zhu, Z. Li, M. Zhu, S. Yin,

S. Wei, and L. Liu, “A compact and high-performance hardware archi-

tecture for crystals-dilithium,” IACR Trans. Cryptogr. Hardw. Embed.

Syst., vol. 2022, no. 1, pp. 270–295, 2022.

[7] P. Schwabe, R. Avanzi, J. Bos, L. Ducas, E. Kiltz, T. Lepoint,

V. Lyubashevsky, J. M. Schanck, G. Seiler, and D. Stehle, “CRYSTALS-

KYBER,” Proposal to NIST PQC Standardization, 2021,

https://csrc.nist.gov/Projects/post-quantum-cryptography/round-3-submissions.

[8] S. S. Roy and A. Basso, “High-speed instruction-set coprocessor for

lattice-based key encapsulation mechanism: Saber in hardware,” IACR

Trans. Crypt. Hardw. Embed. Syst., vol. 2020, no. 4, pp. 443–466, 2020.

[9] Z. Zhou, D. He, Z. Liu, M. Luo, and K.-K. R. Choo, “A soft-

ware/hardware co-design of crystals-dilithium signature scheme,” ACM

Trans. Reconﬁgurable Technol. Syst., vol. 14, no. 2, Jun. 2021.

[10] S. Ricci, L. Malina, P. Jedlicka, D. Smékal, J. Hajny, P. Cibik,

P. Dzurenda, and P. Dobias, “Implementing crystals-dilithium signature

scheme on fpgas,” in The 16th International Conference on Availability,

Reliability and Security, ser. ARES 2021. New York, NY, USA:

Association for Computing Machinery, 2021.

[11] G. Land, P. Sasdrich, and T. Güneysu, “A hard crystal - implementing

dilithium on reconﬁgurable hardware,” IACR Cryptol. ePrint Arch., vol.

2021, p. 355, 2021.

[12] L. Beckwith, D. T. Nguyen, and K. Gaj, “High-performance hardware

implementation of crystals-dilithium,” Crypto. ePrint Arch., Report

2021/1451, 2021.

[13] Y. Huang, M. Huang, Z. Lei, and J. Wu, “A pure hardware implemen-

tation of CRYSTALS-KYBER PQC algorithm through resource reuse,”

IEICE Electron. Express, vol. 17, no. 17, p. 20200234, 2020.

[14] Y. Xing and S. Li, “A compact hardware implementation of cca-secure

key exchange mechanism CRYSTALS-KYBER on FPGA,” IACR Trans.

Cryptogr. Hardw. Embed. Syst., vol. 2021, no. 2, pp. 328–356, 2021.

[15] V. B. Dang, F. Farahmand, M. Andrzejczak, K. Mohajerani, D. T.

Nguyen, and K. Gaj, “Implementation and benchmarking of round

2 candidates in the NIST post-quantum cryptography standardization

process using hardware and software/hardware co-design approaches,”

IACR Cryptol. ePrint Arch., p. 795, 2020.

[16] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “High-speed

ntt-based polynomial multiplication accelerator for crystals-kyber post-

quantum cryptography,” IACR Cryptol. ePrint Arch., p. 563, 2021.

[17] M. Bisheh-Niasar, R. Azarderakhsh, and M. M. Kermani, “Instruction-

set accelerated implementation of crystals-kyber,” IEEE Trans. Circuits

Syst. I Regul. Pap., vol. 68, no. 11, pp. 4648–4659, 2021.

[18] T. Fritzmann, G. Sigl, and J. Sepúlveda, “RISQ-V: tightly coupled RISC-V

accelerators for post-quantum cryptography,” IACR Trans. Cryptogr.

Hardw. Embed. Syst., vol. 2020, no. 4, pp. 239–280, 2020.

[19] G. Xin, J. Han, T. Yin, Y. Zhou, J. Yang, X. Cheng, and X. Zeng,

“VPQC: A domain-speciﬁc vector processor for post-quantum cryptog-

raphy based on RISC-V architecture,” IEEE Trans. Circuits Syst. I Regul.

Pap., vol. 67-I, no. 8, pp. 2672–2684, 2020.

[20] E. Alkim, H. Evkan, N. Lahr, R. Niederhagen, and R. Petri, “ISA

extensions for ﬁnite ﬁeld arithmetic accelerating kyber and newhope

on RISC-V,” IACR Trans. Cryptogr. Hardw. Embed. Syst., vol. 2020,

no. 3, pp. 219–242, 2020.

[21] L. Botros, M. J. Kannwischer, and P. Schwabe, “Memory-efﬁcient high-

speed implementation of kyber on cortex-m4,” in Progress in Cryptology

- AFRICACRYPT 2019, vol. 11627. Springer, 2019, pp. 209–228.

[22] V. B. Dang, K. Mohajerani, and K. Gaj, “High-speed hardware architec-

tures and FPGA benchmarking of crystals-kyber, ntru, and saber,” IACR

Cryptol. ePrint Arch., p. 1508, 2021.

[23] Z. Ni, A. Khalid, D. Kundi, M. O’Neill, and W. Liu, “Efﬁcient pipelining

exploration for A high-performance crystals-kyber accelerator,” IACR

Cryptol. ePrint Arch., p. 1093, 2022.

[24] Y. Zhao, R. Xie, G. Xin, and J. Han, “A high-performance domain-

speciﬁc processor with matrix extension of RISC-V for module-lwe

applications,” IEEE Trans. Circuits Syst. I Regul. Pap., vol. 69, no. 7,

pp. 2871–2884, 2022.

[25] U. Banerjee, T. S. Ukyab, and A. P. Chandrakasan, “Sapphire: A

conﬁgurable crypto-processor for post-quantum lattice-based protocols

(extended version),” IACR Cryptol. ePrint Arch., p. 1140, 2019.

[26] S. Bai, L. Ducas, E. Kiltz, T. Lepoint, V. Lyubashevsky, P. Schwabe,

G. Seiler, and D. Stehl´

e, “CRYSTALS-Dilithium,” Proposal to

NIST PQC Standardization, Round3, 2021, https://csrc.nist.gov/Projects/

post-quantum- cryptography/round-3- submissions.

[27] D. Sprenkels, “The Kyber/Dilithium NTT,”

https://dsprenkels.com/ntt.html.

[28] M. Scott, “A note on the implementation of the number theoretic

transform,” in Cryptography and Coding - 16th IMA International

Conference, IMACC 2017. Springer, 2017.

[29] V. Lyubashevsky and G. Seiler, “NTTRU: Truly Fast NTRU Using

NTT,” IACR Trans. on CHES, vol. 2019, no. 3, pp. 180–201, May 2019.

[30] F. Yaman, A. C. Mert, E. ¨

Ozt¨

urk, and E. Savas¸, “A hardware accelerator

for polynomial multiplication operation of crystals-kyber pqc scheme,”

in 2021 Design, Automation & Test in Europe Conference & Exhibition

(DATE). IEEE, 2021, pp. 1020–1025.

[31] Aikata, A. C. Mert, D. Jacquemin, A. Das, D. Matthews, S. Ghosh, and

S. S. Roy, “A uniﬁed cryptoprocessor for lattice-based signature and

key-exchange,” Cryptology ePrint Archive, Report 2021/1461, 2021.

[32] Y. Xing and S. Li, “A compact hardware implementation of cca-secure

key exchange mechanism crystals-kyber on fpga,” IACR Transactions on

Cryptographic Hardware and Embedded Systems, pp. 328–356, 2021.

[33] J.-P. D’Anvers, A. Karmakar, S. S. Roy, F. Vercauteren, J. M. B.

Mera, M. V. Beirendonck, and A. Basso, “SABER,” Proposal to

NIST PQC Standardization, Round3, 2021, https://csrc.nist.gov/Projects/

post-quantum- cryptography/round-3- submissions.

[34] K. B¨

urstinghaus-Steinbach, C. Krauß, R. Niederhagen, and M. Schnei-

der, “Post-quantum TLS on embedded systems: Integrating and evalu-

ating kyber and SPHINCS+ with mbed TLS,” in ASIA CCS ’20: The

15th ACM Asia Conference on Computer and Communications Security,

2020. ACM, 2020, pp. 841–852.

[35] T. George, J. Li, A. P. Fournaris, R. K. Zhao, A. Sakzad, and R. Ste-

infeld, “Performance evaluation of post-quantum TLS 1.3 on embedded

systems,” IACR Cryptol. ePrint Arch., p. 1553, 2021.

[36] “Nxp’s c29x family of crypto coprocessors.” [Online]. Available:

https://www.nxp.com/docs/en/fact-sheet/C29XFAMFS.pdf

[37] “Optiga™ tpm slb 9672 from infenion.” [On-

line]. Available: https://www.inﬁneon.com/cms/en/about-inﬁneon/press/

market-news/2022/INFCSS202202-051.html

[38] J. W. Bos, M. Gourjon, J. Renes, T. Schneider, and C. van Vredendaal,

“Masking kyber: First- and higher-order implementations,” IACR Trans.

Cryptogr. Hardw. Embed. Syst., vol. 2021, pp. 173–214, 2021.

[39] T. Fritzmann, M. Van Beirendonck, D. Basu Roy, P. Karl, T. Scham-

berger, I. Verbauwhede, and G. Sigl, “Masked accelerators and instruc-

tion set extensions for post-quantum cryptography,” IACR Transactions

on Cryptographic Hardware and Embedded Systems, vol. 2022, no. 1,

p. 414–460, Nov. 2021.

Aikata obtained her Bachelor of Technology degree from IIT Bhilai, India, in 2020 and her Master's degree from Graz University of Technology, Austria, in 2022. She is currently a PhD student at the Institute of Applied Information Processing and Communications, Graz University of Technology. Her research interests include lattice-based cryptography and hardware design.

Ahmet Can Mert received his PhD degree in electronics engineering from Sabanci University, Turkey, in 2021. He is currently a postdoctoral researcher at the Institute of Applied Information Processing and Communications, Graz University of Technology, Austria. His research interests include homomorphic encryption, lattice-based cryptography, and hardware design.

Malik Imran received his bachelor's and master's degrees in Pakistan in 2011 and 2015, respectively. He is currently a doctoral student with the Centre for Hardware Security, Tallinn University of Technology (TalTech), Tallinn, Estonia. Before joining TalTech, he worked in several research labs on efficient hardware accelerators for intrusion detection systems and asymmetric cryptography.

Samuel Pagliarini (M'14) received the PhD degree from Telecom ParisTech, Paris, France, in 2013. He has held research positions with the University of Bristol, Bristol, UK, and with Carnegie Mellon University, Pittsburgh, PA, USA. He is currently a Professor with Tallinn University of Technology (TalTech) in Tallinn, Estonia, where he leads the Centre for Hardware Security.

Sujoy Sinha Roy is an Assistant Professor of cryptographic engineering at IAIK, Graz University of Technology. He is a co-designer of "Saber," which is a finalist key encapsulation mechanism (KEM) candidate in NIST's Post-Quantum Cryptography Standardization Project. He is interested in the implementation aspects of cryptography.

