New Circuit Minimization Techniques for
Smaller and Faster AES SBoxes
Alexander Maximov and Patrik Ekdahl
Ericsson Research, Lund, Sweden
{alexander.maximov,patrik.ekdahl}@ericsson.com
Abstract. In this paper we consider various methods and techniques to find the smallest circuit realizing a given linear transformation on n input signals and m output signals, with a constraint of a maximum depth, maxD, of the circuit. Additional requirements may include that input signals can arrive to the circuit with different delays, and output signals may be requested to be ready at a different depth. We apply these methods and also improve previous results in order to find hardware circuits for forward, inverse, and combined AES SBoxes, and for each of them we provide the fastest and smallest combinatorial circuits. Additionally, we propose a novel technique with “floating multiplexers” to minimize the circuit for the combined SBox, where we have two different linear matrices (forward and inverse) combined with multiplexers. The resulting AES SBox solutions are the fastest and smallest to our knowledge.
Keywords: AES SBox · circuit area · circuit depth · multiplexers · linear matrices
1 Introduction
Efficient hardware design of AES SBoxes is a well-known subject. A design targeting the absolute maximum clocking speed would probably use a straightforward table-lookup implementation, which naturally leads to a large area. In many practical situations the area of the cryptographic subsystem is limited, and the designer cannot afford to implement table lookups for the 16 SBoxes involved in an AES round. For these situations, we need to study how to implement an AES SBox with logical gates only, focusing on both area and maximum clocking speed. The maximum clocking speed of a circuit is determined by the critical path or depth of the circuit: the worst-case time it takes to get stable output signals after a change in the input signals.
Another aspect of implementing AES is the possible need for the inverse cipher. Many modes of operation for a block cipher only use the encryption functionality, and hence there is no need for the inverse cipher. In case both the forward and inverse SBoxes are needed, it is often beneficial to combine the two circuits. This is because the main operation of the AES SBox is taking the inverse of a field element, which naturally is its own inverse, and we expect that many gates of the two circuits can be shared.
From a mathematical perspective, the forward AES SBox is defined as the composition of a non-linear function I(g) and an affine function A(g), such that SBox(g) = A(I(g)). The non-linear function I(g) = g^{-1} is the multiplicative inverse of an element g in the finite field GF(2^8) defined by the irreducible polynomial x^8 + x^4 + x^3 + x + 1. We will assume that the reader is familiar with the AES SBox, and refer to [oST01] for a more comprehensive description.
The first step towards a small area implementation was described by Rijmen [Rij00], where results from [IT88] were used. The idea is that the inverse calculation in GF(2^8)
can be reduced to a much simpler inverse calculation in the subfield GF(2^4) by doing a base change to GF((2^4)^2). In 2001, Satoh et al. [SMTM01] took this idea further and reduced the inverse calculation to the subfield GF(2^2). In 2005, Canright [Can05] built on the work of Satoh et al. and investigated the importance of the representation of the subfield, testing many different isomorphisms that led to the smallest area design. This construction is perhaps the most cited and used implementation of an area-constrained combined AES SBox.
In a series of papers, Boyar, Peralta et al. presented some very interesting ideas for both the subfield inverter as well as new heuristics for minimizing the area of logical circuits [BP10a, BP10b, BP12, BFP18]. They derived an inverter over GF(2^4) with depth 4 and a gate count of only 17. The construction in [BP12] is the starting point for this paper.

After Boyar, several other papers followed focusing on low depth implementations [JKL10, NNT+10, UHS+15]. In 2018, two papers by Reyhani et al. [RMTA18a, RMTA18b] presented the best known implementations (up until now) of both the forward SBox as well as the combined SBox. In [LSL+19] the authors present a very nice way to include the depth into Boyar's SLP problem [BMP13]. But the algorithm does not work with multiplexers, and hence cannot be applied to the combined SBox.
As pointed out in [RMTA18a], there are misalignments between researchers in how to present and compare implementations of combinatorial circuits. One way is to simply count the total number of standard gates in the design, and to find the path through the circuit that constitutes the critical path in order to determine and compare the speed. In practice it is much more complicated than that. For this paper, we present both the simple measure using only the number of gates, as well as a Gate Equivalent (GE) number based on the typical area required for each gate compared to the NAND gate: for example, the 2-input NAND gate has GE=1, while the XOR gate has GE=2.33. The relative GE numbers depend on the specific ASIC process technology used, as well as on the drive strength needed from the gate. We have used the GE values obtained from Samsung's STD90/MDL90 0.35 µm 3.3V CMOS technology [Sam00]. A comprehensive discussion on our choices for circuit comparison can be found in Appendix A. Additionally, we propose to count the technological depth of a circuit normalized in terms of the delay of an XOR gate, which makes it possible to compare depths and the speed of various academic results.
The rest of the paper is organized as follows. In Section 2 we introduce the standard hardware architecture for the AES SBox. In Section 3 we describe the fundamental problem we are addressing, together with improvements to previously known techniques for solving it. The new idea of considering “floating multiplexers” is introduced in Section 4, followed by architectural improvements to the AES SBox in Section 5. The results, both theoretical and practical synthesis results, are given in Section 6. The paper ends with some conclusions and acknowledgements in Sections 7 and 8.
2 Preliminaries
We will follow the notation used in both [Can05] and [BP12] when we now construct our tower field representation. The irreducible polynomials, roots, and normal basis can be found in Table 1.

Table 1: Definition of the subfields used to construct GF(2^8).

  Target Field   Irreducible Poly.   Root   Coefficients in Field   Normal Base
  GF(2^2)        x^2 + x + 1         W      GF(2)                   [W, W^2]
  GF(2^4)        x^2 + x + W^2       Z      GF(2^2)                 [Z^2, Z^8]
  GF(2^8)        x^2 + x + WZ        Y      GF(2^4)                 [Y, Y^16]
Following [Can05] and [BP12], we can now derive the expression for inverting a general element A = a_0 Y + a_1 Y^16 in GF(2^8) as

  A^{-1} = (A A^16)^{-1} A^16
         = ((a_0 Y + a_1 Y^16)(a_1 Y + a_0 Y^16))^{-1} (a_1 Y + a_0 Y^16)
         = ((a_0^2 + a_1^2) Y^17 + a_0 a_1 (Y^2 + Y^32))^{-1} (a_1 Y + a_0 Y^16)
         = ((a_0 + a_1)^2 Y^17 + a_0 a_1 (Y + Y^16)^2)^{-1} (a_1 Y + a_0 Y^16)
         = ((a_0 + a_1)^2 WZ + a_0 a_1)^{-1} (a_1 Y + a_0 Y^16).

The element inversion in GF(2^8) can be done over GF(2^4) according to

  T_1 = a_0 + a_1      T_2 = (WZ) T_1^2     T_3 = a_0 a_1     T_4 = T_2 + T_3
  T_5 = T_4^{-1}       T_6 = T_5 a_1        T_7 = T_5 a_0                          (1)
where the result is obtained as A^{-1} = T_6 Y + T_7 Y^16. In these equations we utilize several operations (addition, multiplication, scaling, and squaring) but only two of them are non-linear over GF(2): multiplication and inversion. Furthermore, the standard multiplication operation also contains some linear operations. If we separate all the linear operations from the non-linear ones and combine the former with the linear equations needed to do the base change for the AES SBox input, which is represented in polynomial base using the AES SBox irreducible polynomial x^8 + x^4 + x^3 + x + 1, we end up with an architecture of the SBox according to Figure 1, where we also indicate where the different parts of Equation (1) are calculated.
[Figure 1: Architecture of the forward SBox according to [Can05] and [BP12]. The 8-bit input U enters the Top linear block (base conversion and generation of the linear parts of the inversion), which produces the 22-bit signal Q; the Mul-Sum block produces the 4-bit signal X (T3 and T4), the Inverse over GF(2^4) produces the 4-bit signal Y (T5), and the 2xMul block produces the 18-bit signal N (T6 and T7); the Bottom linear block (base back-conversion and the affine transformation of the AES SBox) produces the 8-bit output R.]
In case we are dealing with the inverse SBox, we naturally need to apply the inverse
affine transform to the top linear matrix instead of the bottom.
This architecture will be our starting point, and we will now provide a set of new or
enhanced algorithms for minimizing both the area and the depth of the two linear top and
bottom matrices.
3 Circuits for binary linear system of equations
In this section, we will recapitulate the known techniques for linear circuit minimization
and propose a few improvements. We start by stating the objectives.
3.1 Basic problem statement
Given a binary matrix M_{m×n} and the maximum allowed depth maxD, find the circuit of depth D ≤ maxD with the minimum number of 2-input XOR gates such that it computes Y = M·X. In other words, given n bits of input X = (x_0 ... x_{n-1}), the circuit should
compute m linear combinations Y = (y_0 ... y_{m-1}). Any circuit realization that implements a given system of linear expressions is called a solution.
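To make the objects concrete, the following small Python sketch (our own illustration, not taken from the paper's tooling) evaluates Y = M·X over GF(2) and counts the XOR gates of the trivial circuit that computes every output independently, i.e. Hamming weight of the row minus one gates per row:

# Illustration (ours): evaluate Y = M*X over GF(2) and count the XOR gates of
# the trivial circuit that computes every output independently.
def apply_matrix(M, X):
    return [sum(mij & xj for mij, xj in zip(row, X)) & 1 for row in M]

def trivial_xor_count(M):
    return sum(max(sum(row) - 1, 0) for row in M)

M = [[1, 1, 0, 1],    # y0 = x0 ^ x1 ^ x3
     [0, 1, 1, 1],    # y1 = x1 ^ x2 ^ x3
     [1, 1, 1, 1]]    # y2 = x0 ^ x1 ^ x2 ^ x3
print(apply_matrix(M, [1, 0, 1, 1]))   # [0, 0, 1]
print(trivial_xor_count(M))            # 7 XOR gates without any sharing

The heuristics below try to reduce this trivial gate count by sharing intermediate XOR results between the output rows.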
The above problem is NP-hard [BMP08], and the literature contains various heuristic approaches that help finding a sub-optimal solution. In all previous work we have studied, the assumption is that all input signals arrive at the same time, and all output signals are “ready” with delays at most maxD. In this paper we extend the original problem with AIR and AOR, defined as follows.
Additional Input Requirement (AIR). The problem may be extended with an additional requirement on the input signals X, such that each input bit x_i arrives with its own delay d_i, in terms of XOR-gate delays. The resulting depth D ≤ maxD then includes input delays. For example, if some input x_i has the delay d_i > maxD then no solution exists. The AIR is useful while deriving the bottom matrix as described in Section 2, since after the non-linear part, the signals entering the bottom matrix will have different delays.
Additional Output Requirement (AOR). The problem may be extended with an additional requirement on the output signals. Each output signal y_i may be required to be “ready” at depth at most e_i ≤ maxD. This is useful when some output signals continue to propagate in the critical path while other signals may be computed with larger delays, but still at most maxD. The AOR is used while deriving the top matrix as described in Section 2, since when we introduce multiplexers for the combined SBox, the output signals of the top matrix will be required to have different delays.
3.2 Cancellation-free heuristics
Cancellation-free heuristics are algorithms that produce linear expressions z = a ⊕ b, where both a and b are Boolean linear expressions in the input variables, and a and b share no common terms. In other words, as we add a and b we will not cancel out any term.
Paar [Paa97] suggested a greedy approach to solving the Basic Problem in 3.1. That solution starts with the matrix M and considers all pairs of columns (i, j) in M. Then a metric is defined (on the pairs of columns) as the number of rows where M_{r,i} = M_{r,j} = 1, i.e., where the input variables x_i and x_j both occur. For the column pair with the highest metric, we form a new variable x_n = x_i ⊕ x_j and add it to the matrix (which now is of size m×(n+1)), and set positions M_{r,i} = M_{r,j} = 0 and M_{r,n+1} = 1.
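A minimal Python sketch of one step of this greedy heuristic (our own illustration; function and variable names are ours) could look as follows, already including the delay rule d_n = max(d_i, d_j) + 1 and the maxD restriction discussed in the AIR paragraph below:

# Sketch (ours) of one step of Paar's cancellation-free heuristic.
# M is a list of m rows over the current columns; depth[i] is the delay of
# column i (inputs start with their arrival delays, per the AIR below).
def paar_step(M, depth, maxD):
    m, n = len(M), len(M[0])
    best, best_pair = -1, None
    for i in range(n):
        for j in range(i + 1, n):
            if max(depth[i], depth[j]) + 1 > maxD:
                continue                      # adding this pair would violate the depth bound
            metric = sum(1 for r in range(m) if M[r][i] and M[r][j])
            if metric > best:
                best, best_pair = metric, (i, j)
    if best_pair is None:
        return None
    i, j = best_pair
    for row in M:                             # rewrite rows in terms of the new signal x_n = x_i ^ x_j
        if row[i] and row[j]:
            row[i] = row[j] = 0
            row.append(1)
        else:
            row.append(0)
    depth.append(max(depth[i], depth[j]) + 1) # delay of the new XOR output
    return i, j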
Canright [Can05] also used this technique but, instead of using the metric function, he performed an exhaustive search over all possible column pairs. This was possible due to the fact that the target matrix in his case was the base conversion matrix only, of size 8×8. As we saw in Section 2, our bottom matrix will be considerably larger, and hence we need to take another approach. We also need to consider the AIR and the AOR.
Satisfying the AIR. When performing the above algorithm we should keep track of the depth of the newly added XOR gates. This is done by having a vector D = (d_0 ... d_{n-1}) with the current depth of all inputs and newly added signals x_i. When the new signal x_n = x_i ⊕ x_j is added, the delay of x_n is trivially d_n = max(d_i, d_j) + 1. We then also restrict the algorithm such that if d_n > maxD then we are not allowed to add x_n as a new input signal. The AIR is hereby satisfied automatically.
Satisfying the AOR. Similarly, when adding a new input variable x_n, we need to check if a solution is theoretically possible. An elegant solution to this is presented in Theorem 2 in [LSL+19], where they calculate the shortest circuit given additional delay constraints.
Probabilistic heuristic approach. Since we cannot perform a full exhaustive search on the bottom matrix due to its size, we need to confine the number of pairs to keep and further evaluate. We have found that keeping the K best candidates (based on the original metric by Paar) and then randomly selecting which one to pick for the next XOR gate is a good strategy. In our simulations, this probabilistic approach gave us much smaller circuits than only considering the best metric candidates. Naturally, the execution time will be too long if we pick a too large K, and conversely picking a too small K decreases the chances of deriving a good circuit. In practice we found that K = 2, ..., 6 is a reasonable number of candidates to keep and try.
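A hedged sketch of this randomized selection (illustration only, assuming the pair metrics have already been computed as in the previous sketch):

import random

# Sketch (ours): keep the K highest-metric column pairs and pick one of them
# at random, instead of always taking the single best pair.
def pick_pair(scored_pairs, K=4):
    # scored_pairs: list of (metric, (i, j)) tuples
    top = sorted(scored_pairs, key=lambda t: t[0], reverse=True)[:K]
    return random.choice(top)[1]

Running the whole heuristic many times with such random choices and keeping the smallest circuit found is what makes the probabilistic approach effective.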
3.3 Cancellation-allowed heuristic
The cancellation-free approaches give sub-optimal results, as was shown by Boyar and Peralta in [BP10a], where they also introduced a new algorithm that allows cancellations. This was later improved by Reyhani et al. in [RMTA18a]. Next, we briefly describe the basic idea of that heuristic.
3.3.1 Basic cancellation-allowed algorithm [BP10a]
Every row of M is an n-bit binary vector. That vector can be seen as an n-bit integer value. We define that integer value as a target point. Thus, the matrix M can be seen as the column vector of m target points. The input signals {x_0, ..., x_{n-1}} can also be represented as integer values x_i = 2^i, for i = 0, ..., n-1.

Let the base set S = {s_0, ..., s_{n-1}} = {1, 2, 4, ..., 2^{n-1}} initially represent the input signals. The key function of the algorithm is the distance function δ_i(S, y_i) that returns the smallest number of XOR gates needed to compute a target point y_i from the set of known points S. The algorithm keeps a vector ∆ = [δ_0, δ_1, ..., δ_{n-1}] which is initially set to the Hamming weight minus one of the rows of M, which would be the number of XOR gates needed without any sharing of intermediate gates.
The algorithm then proceeds by combining two base points s_i and s_j in the base set S, XORing them together to produce a candidate point c = s_i ⊕ s_j. The selection of s_i and s_j is performed by an exhaustive search over all distinct pairs, and then for each candidate point, the sum of the distance vector, Σ δ_i for i ∈ [0, n-1], is calculated. Note that the distance functions δ_i are now computed over the set S ∪ {c}. The pair which gives the smallest distance sum is picked and S is updated as S = S ∪ {c}. In case there is a tie, the algorithm picks the pair that maximizes the Euclidean norm sqrt(Σ δ_i^2), for i ∈ [0, n-1]. If there is a tie after this step too, the authors in [BP10a] investigated different strategies and concluded that all strategies tested performed similarly, and hence a simple random selection can be used. The algorithm then repeats the step of picking two new base points and calculating the distance vector sum, until the distance vector is all-zeros and the targets are all found. In the original description, there is also a notion of “preemptive” choices. A preemptive choice is a candidate point c such that it directly fulfils a target row in the matrix M. If such a candidate is found, it is immediately used as the new point and added to S.
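The following sketch (our illustration) shows one round of this candidate selection, assuming a helper distance(S, y) that returns the smallest number of XORs needed to reach target y from S; computing that distance efficiently is the hard part of the algorithm and is not shown here, and the preemptive step is omitted for brevity:

# Sketch (ours) of one round of the cancellation-allowed heuristic.
# Points are n-bit integers.  `distance(S, y)` is an assumed helper.
def one_round(S, targets, distance):
    base = list(S)
    best = None
    for a in range(len(base)):
        for b in range(a + 1, len(base)):
            c = base[a] ^ base[b]
            if c in S:
                continue
            dist = [distance(S | {c}, y) for y in targets]
            key = (sum(dist), -sum(d * d for d in dist))   # minimize sum; tie-break on larger norm
            if best is None or key < best[0]:
                best = (key, c)
    if best is None:
        return None
    S.add(best[1])                                         # the chosen point becomes a new XOR gate
    return best[1]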
Reyhani et al. [RMTA18a] improved the original algorithm from [BP10a] by directly searching for preemptive candidates in each round and adding them all to the set S before the “real” candidate is added and the distance vector recalculated. They also improved the tie resolution strategy and kept all the candidates that were equally good under the Euclidean norm and recursively tried them all, keeping the one that was best in the next round.
When the maximum depth maxD is a required constraint, the newly proposed algorithm in [LSL+19] can be used. However, in our simulations for bottom matrices, it did not produce better results than the cancellation-free algorithm with randomization factor.
3.4 Exhaustive search methods
In this section we present an algorithm for an efficient exhaustive search of the minimal
circuit. The overall complexity is exponential in the number of input signals, and linear in
the number of output signals. From our experiments we can conclude that this exhaustive
search algorithm can be readily applied to circuits of up to approximately 10 input bits.
3.4.1 Notations and data representation
Using the same integer representation of the rows of M and of the input signals x_i as in Section 3.3.1, we can re-phrase the basic problem statement: given the set of input points x_i, we want to find the sequence of XORs on those points such that we get all the m wanted target points y_i (the rows of the matrix M) with the maximum delay maxD. Input and output points may have different delays d_i and e_i, respectively.
For data structures, we can store a set of 2^n points as either a normal set and/or as a bit-vector. The set makes it possible to loop through the points, while the bit-mask representation is efficient to test set membership.
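A small illustration (ours, not the authors' code) of this dual representation:

# Sketch (ours): a set of points kept both as a Python set (for iteration) and
# as a bit-mask (for constant-time membership tests).
class PointSet:
    def __init__(self):
        self.points = set()          # iterate over members
        self.mask = 0                # bit p is set iff point p is in the set
    def add(self, p):
        self.points.add(p)
        self.mask |= 1 << p
    def __contains__(self, p):
        return (self.mask >> p) & 1 == 1

ps = PointSet()
ps.add(0b0011)          # the point x0 ^ x1
print(0b0011 in ps)     # True
print(0b0110 in ps)     # False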
3.4.2 Basic idea
The proposed exhaustive search algorithm is a recursive algorithm, iterating over the depths, starting at depth 1 and ending at maxD. At each depth D, we try to construct new points from the previous depths, thereby constructing circuits that are of exactly depth D. When all target points are found, we check the number of required XOR gates, keeping track of the smallest solution. We will need the following sets of points:

  known[maxD+1]     the set of known points at a certain depth D;
  ignored[maxD+1]   the set of points that will be ignored at depth D;
  targets           the set of target points;
  candidates        the set of candidate points that can be added to the set known at the current recursion step.
The initial set of known points is x_i, for i = 0 ... n-1, and the set of target points is y_i, for i = 0 ... m-1. AIR is met by initially placing the input point x_i into the known set at depth d_i. AOR is satisfied by setting the point y_i with output delay e_i into the ignore list on all depth levels that are larger than e_i.

We will now explain the steps executed at each depth of the recursion, assuming that we currently are at depth D.
Step 1. Preemptive points. Check the known[D] set to see if any pair can be combined (XOR:ed) to give a target point not yet found. If all targets are found, or if we have reached maxD, we return from this level of the recursion.

Step 2. Collect candidates. Form all possible pairs of points from the known[0..D-1] sets, where at least one of the points is from known[D-1], and XOR the pair to derive a new point. If the derived point is in the set ignored[D] then we skip it, otherwise we add it to the candidate set.

Step 3. In this step we try to add points from the candidate set to the known list, and call the algorithm recursively again. We start by trying to add 1 point and do the recursive call. If that does not solve the target points, we try to add 2 points, and so on until all combinations (or a maximum number of combinations) of the points in the candidate set have been tried.
3.4.3 Ignored points and other optimizations
In Step 2, we check the candidate set against the ignored[D] set, the set of ignored points at depth D. The ignored set is constructed from a set of rules:

Intersection: A candidate point p should be ignored if for all target points w_i we get (w_i & p) ≠ p. This means that the point p covers too many of the input variables, and is not covered by any of the points in the targets set.

Forward Propagation: We can calculate all possible points on each level, starting from the top level D = 0 with n known points and going down to D = maxD. Those points that can never appear at some level d are then included into the ignored[d] set. If some target point w has another desired maximum delay e_i < maxD, then that point on the following depths should be ignored, i.e., we add w to ignored[e_i + 1 .. maxD].

Sum of Direct Inputs: If any of the input signals x_i, x_j give the point p = x_i ⊕ x_j on level d, then all consecutive levels > d must have the point p in the ignored list.

Backward Propagation: As a last check, we can go backwards level by level, starting from d = maxD and ending at level d = 1, and for each allowed (not ignored) point p on the level d we check whether there is still a not-ignored pair a, b at the previous levels (one of a or b must be on the level d-1) such that it gives p = a ⊕ b. If not, then the point p should be added to the ignored[d] set.

Ignore Candidates: We dynamically add a point w to the ignored[d] set if w has been one of the candidates at previous levels < d.
3.5 Remarks
Simulations show that, regarding the search for the minimum solution, the top matrix (with only 8 inputs) can be solved with the exhaustive cancellation-allowed search as in Section 3.4. The bottom matrix (with 18 inputs) is too large for a direct exhaustive search, and we should start with the probabilistic cancellation-free heuristic from Section 3.2, and then use a full exhaustive search for the ending part, when the Hamming weights of the remaining rows become small enough to perform the exhaustive search. This approach gave us the best result.
4 System of linear circuits with multiplexers
Assume we want to find a solution for the combined AES SBox, where the top and the
bottom linear matrices need to be multiplexed based on the SBox direction. This means
that the circuit for the combined linear expressions is basically doubled in size, plus the
set of multiplexers. In this section we will show how to deal with multiplexed systems of
linear expressions. We will show that the MUX and XOR gates can be considered in a
combined way in order to achieve a very compact circuit.
4.1 Floating multiplexers
Consider that for some signal Y we have to compute two linear expressions Y_F and Y_I, for the forward and the inverse SBoxes respectively. Then we apply a multiplexer so that only one of the signals continues as Y. Assume further that the signals Y_F and Y_I share some part of the expression. Then it may be better to push that shared part after the multiplexer, and the resulting solution can be simplified.
For example, let Y_F = X_0 ⊕ X_1 and Y_I = X_0 ⊕ X_2; then normally we would spend 2 XOR gates and 1 multiplexer, so that we get Y = MUX(select, X_0 ⊕ X_1, X_0 ⊕ X_2) with 3 gates. However, we can push the common part X_0 after the multiplexer as follows:

  Y = MUX(select, X_1, X_2) ⊕ X_0,

so that we get a circuit with only 2 gates. In general, one can pick any linear combination ∆ of the input signals and make the substitution

  Y = MUX(select, Y_F, Y_I) = ∆ ⊕ MUX(select, Y_F ⊕ ∆, Y_I ⊕ ∆),

where ∆ is then added to the linear matrix as an additional target signal to compute. If that substitution leads to a shorter circuit then we keep it. We should also choose ∆ such that the overall depth is not increased. Thus, various multiplexers will be “floating” over the depth of the circuit. Signals with ∆ ≠ 0 should have their maximum depth decreased by 1.
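A quick truth-table check (our illustration) confirms the equivalence used in this example; here MUX(s, a, b) returns its first data input when s = 1, as in the rest of the paper:

from itertools import product

# Verify (ours): MUX(s, X0 ^ X1, X0 ^ X2) == MUX(s, X1, X2) ^ X0 for all inputs.
def mux(s, a, b):
    return a if s else b        # s = 1 selects the first data input

print(all(mux(s, x0 ^ x1, x0 ^ x2) == (mux(s, x1, x2) ^ x0)
          for s, x0, x1, x2 in product((0, 1), repeat=4)))   # True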
4.1.1 Metrics and linear expressions to solve
We have n input signals X_1 ... X_n and m output signals Y_1, ..., Y_m, where each Y_i is represented in its most general form as a triple (A_i, B_i, C_i) such that

  Y_i = A_i ⊕ MUX(select, B_i, C_i),

where A_i, B_i, and C_i are linear expressions in the input signals. We are allowed to modify the above expression into (A_i ⊕ ∆_i, B_i ⊕ ∆_i, C_i ⊕ ∆_i) for any ∆_i, since the Boolean function of Y_i will not change.

Let ABC represent the linear matrix that describes all the rows A_i, B_i, and C_i, for i = 0, ..., m, such that ABC × X gives the wanted linear system to realize using a minimal number of gates and a given maxD. By choosing favorable values of ∆_i, one can shrink the number of total gates, since some of the target points of ABC may become equal to each other, and hence ABC can be reduced by at least one row. Also, some of the targets may become 0 or have only one bit set, i.e., they are equal to corresponding input signals. These targets are also removed from the linear system as they are trivial and cost zero gates. After the above reductions we get a system of linear expressions where all rows are distinct and have Hamming weight at least 2. As before, we interpret the rows of ABC as integers, and adding (XORing) a ∆_i to the three rows A_i, B_i, and C_i will change those three target points, but not the resulting Y_i.

Metric. The search for a good combination of ∆s requires a lot of computations and it rapidly becomes infeasible to compute a minimal solution for each selection. Thus, we need to decide on a good metric that allows us to truncate the search space down to promising sets of ∆s. We propose to adopt a metric that is based on the lower bound of the number of gates of a fixed system (when the ∆ values are selected), and define the metric to be the number of rows of the reduced ABC matrix, plus the minimum number of extra gates needed to complete the circuit, such as multiplexers.

In the following we present several heuristic approaches to finding a good set of ∆s while minimizing the metric.
4.1.2 Iterative algorithms to find ∆s: metric → minimize

The techniques below only work for small n, but in our case they are readily applicable to the 8-input top matrix of the AES SBox.

Algorithm-A(k). Select k triplets (A_i, B_i, C_i) and try to find k matching ∆_i values that minimize the metric. If some choice results in a smaller metric, we keep that choice and continue searching with the updated ABC matrix. The algorithm runs in a loop until the metric is not decreasing any more. Algorithm-A(1) is quite fast, and Algorithm-A(2) also has acceptable speed. For larger k it becomes infeasible. Algorithm-A(k) works fine for a very quick/brief analysis of the given system, but the result is quite unstable since for a random set of initial values of ∆_i the resulting metric fluctuates heavily.

Algorithm-B. Unlike Algorithm-A, this algorithm tries to construct a linear system of expressions, starting from an empty set of knowns S and then trying to add new points to S one by one, until all targets of ABC are included in the set S. While testing whether a new candidate c should be added to S, we loop through all (A_i, B_i, C_i) and for each one try to find a ∆_i that minimizes the overall metric. This heuristic algorithm is a lot more stable and gives quite good results.

However, the smallest possible metric does not guarantee that the final solution will have the smallest number of gates, and the number of non-target intermediates needed is unclear. Thus, it would be a good idea to collect a number of promising systems whose metric is the smallest possible, and then try to find the smallest solution amongst them. We will investigate this in the next section.
4.2 New generic heuristic technique for linear systems with floating multiplexers

If we generalize the idea of floating multiplexers, let them float even higher up in the circuit, and also share them more widely, we can achieve better results. In this section we propose a generic heuristic algorithm that finds good circuits for such systems.
4.2.1 Problem statement
We are given an n-bit input signal X, binary matrices MF_{m×n} and MI_{m×n}, binary vectors AF_n, AI_n, BF_m, BI_m, and vectors of delays DX_n and DY_m. We want to find the smallest and shortest solution that computes the m-bit output signal Y:

  YF = MF · (X ⊕ AF),
  YI = MI · (X ⊕ AI),
  Y  = MUX(ZF, YF ⊕ BF, YI ⊕ BI),

where each input signal X_i has an input arrival delay DX_i and each output signal Y_j must have a total delay of at most DY_j. A and B are constant masking vectors for the input and output signals, respectively (NOT-gates). ZF is the mux selector; when ZF = 1 we pick the first (YF = “forward”) output, otherwise the second (YI = “inverse”) output. We also assume there is a complement signal ZI = ZF ⊕ 1 that is also available as an input control signal.
4.2.2 Preliminaries
Similar to our previous notation, we define a “point” to be a tuple of a point value (.p) and a delay (.d):

  point := { .p = [ f (1 bit) | F (n bits) | i (1 bit) | I (n bits) ], .d = Delay },

which is then translated into a 1-bit signal circuit

  signal := MUX(ZF, F·X ⊕ f, I·X ⊕ i),

with a total output delay point.d. I.e., F and I are linear combinations of the n-bit input X, and f and i are negate bits applied to the result in case the selector ZF is “forward” or “inverse”, respectively. The n input points are then represented as:

  input point X_k := { .p = [ AF_k | 2^k | AI_k | 2^k ], .d = DX_k },  for k = 0, ..., n-1,

and the m target points are:

  target point Y_k := { .p = [ BF_k | YF_k | BI_k | YI_k ], .d = DY_k },  for k = 0, ..., m-1.

We should also include the following 4 trivial points in the set of inputs:

  signal ZF := { .p = [1|0|0|0], .d = 0 },    signal 0 := { .p = [0|0|0|0], .d = 0 },
  signal ZI := { .p = [0|0|1|0], .d = 0 },    signal 1 := { .p = [1|0|1|0], .d = 0 }.
Given any two (ordered) points v and w there are at most 6 possible new points that can be generated, based on the following gates:

  MUX(v, w)   := { .p = [ v.f | v.F | w.i | w.I ], .d = Dnew },
  NMUX(v, w)  := { .p = [ v.f ⊕ 1 | v.F | w.i ⊕ 1 | w.I ], .d = Dnew },
  MUX(w, v)   := { .p = [ w.f | w.F | v.i | v.I ], .d = Dnew },
  NMUX(w, v)  := { .p = [ w.f ⊕ 1 | w.F | v.i ⊕ 1 | v.I ], .d = Dnew },
  XOR(v, w)   := { .p = [ w.f ⊕ v.f | w.F ⊕ v.F | w.i ⊕ v.i | w.I ⊕ v.I ], .d = Dnew },
  NXOR(v, w)  := { .p = [ w.f ⊕ v.f ⊕ 1 | w.F ⊕ v.F | w.i ⊕ v.i ⊕ 1 | w.I ⊕ v.I ], .d = Dnew },

where Dnew = max{v.d, w.d} + 1. Note that the inclusion of the 4 trivial points is important, since then we can limit the number of gate types to be considered. For example, a NOT-gate is then implemented as XOR(v, 1), an AND gate with ZF can be implemented as MUX(v, 0), an OR gate with ZI is MUX(v, 1), etc.
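The following Python sketch (our illustration) shows one possible encoding of such points and the gate rules above; the exact bit packing of .p is not fixed by the paper, so the field layout below is our own choice, and the argument-swapped gates MUX(w, v)/NMUX(w, v) are obtained simply by swapping the arguments:

from collections import namedtuple

# Sketch (ours) of the point encoding [f|F|i|I] and the gate rules above.
Point = namedtuple("Point", "f F i I d")     # negate bits f, i; linear parts F, I; delay d

def MUX(v, w):  return Point(v.f,           v.F,       w.i,           w.I,       max(v.d, w.d) + 1)
def NMUX(v, w): return Point(v.f ^ 1,       v.F,       w.i ^ 1,       w.I,       max(v.d, w.d) + 1)
def XOR(v, w):  return Point(v.f ^ w.f,     v.F ^ w.F, v.i ^ w.i,     v.I ^ w.I, max(v.d, w.d) + 1)
def NXOR(v, w): return Point(v.f ^ w.f ^ 1, v.F ^ w.F, v.i ^ w.i ^ 1, v.I ^ w.I, max(v.d, w.d) + 1)

# The four trivial points from the text:
ZF, ZERO = Point(1, 0, 0, 0, 0), Point(0, 0, 0, 0, 0)
ZI, ONE  = Point(0, 0, 1, 0, 0), Point(1, 0, 1, 0, 0)

X3 = Point(0, 1 << 3, 0, 1 << 3, 0)   # input bit X3, zero arrival delay, no masking
print(XOR(X3, ONE))                   # a NOT gate: X3 negated on both the forward and inverse sides
print(MUX(X3, ZERO))                  # an AND with ZF: X3 on the forward side, constant 0 on the inverse side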
4.2.3 The floating multiplexers algorithm
We start with the set S of input points (of size n + 4), and place all target points into the set T. At each step, we compute the set of candidate points C that is generated by applying the above 6 gates to any two points from the set S. Naturally, C should only contain unique points and exclude those already in S. We try to add one candidate point from C to S and compute the distances from S to each of the target points in T. Thereafter we compare metrics to decide which candidate point will be included into S at this step, and start over by calculating the possible candidates. The algorithm stops when the overall distance δ-metric is 0.
The metric consists of several values. The distance δ(S, t_i) is the minimum number of basic gates (the above 6) required to get the target point t_i from the points in S, such that the delay is at most DY_i. Subsection 4.2.5 discusses how to compute δ(S, t_i). The applied metrics and their order of importance are then as follows:

  γ  = (|S| − n − 4) + Σ_{i=0..m-1} δ(S, t_i)                  → min,
  δ  = Σ_{i=0..m-1} ( δ(S, t_i) − (δ(S, t_i) == 1) )           → max,
  τ  = delay of the recent candidate point from C added to S   → min,
  ν² = Σ_{i=0..m-1} ( δ(S, t_i) − (δ(S, t_i) == 1) )²          → max.
The metric γ is the projected number of gates in case there will be no more shared gates; that metric we should definitely minimize. In case there are several candidates that give the same value, we look into the second metric δ.

δ is the sum of distances, excluding distances where only 1 gate is needed. Given the smallest γ, we must maximize δ: the larger δ is, the more opportunities there are to shrink γ. We exclude distances equal to 1 because of the inclusion of the preemptive step that we describe below. When we accept candidates to S one by one as described above, the metrics δ and γ are similar, but they become distinct when we, in the next subsection, introduce a search tree where the size of |S| may differ.

τ selects the candidate having the minimum depth in case the above two metrics show the same values for two candidates. In case there are no maximum depth constraints for target points, this metric is not needed.
ν is the Euclidean norm excluding the preemptive points (similar to δ). This is the last decision metric since it is not a very good predictor: a worse value may give a better result and vice versa. However, if there are two candidates with equal metrics δ, γ, and τ, then the ordering of the two candidates may be done based on ν. An alternative approach in the case of tie-candidates is to choose one of them randomly.

Preemptive points. If some distance δ(S, t_i) = 1 then we accept the point t_i into S immediately, without the search through the candidates C. The inclusion of this step in the algorithm forces us to exclude such points from the metrics δ and ν.

In [RMTA18a] preemptive points were included into the metric, but we believe this was not fully correct. E.g., when two distance vectors {1, 2, ...} and {0, 2, ...} have the same projected gates, they fall into a totally equal situation in terms of possible shared gates, and thus they should result in the same δ. The point with the distance 1 in the above vector will be included into the circuit immediately (preemptive point), and it does not give any advantage over the second choice where we have a point with the distance 0. Therefore, distances with the value 1 should be ignored in δ and ν, but they should be accounted for in the projected gates γ instead.
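As a summary, a small sketch (our illustration) of how the four values can be computed from the current distances, with preemptive distances of 1 excluded from δ and ν² as argued above:

# Sketch (ours) of the metric tuple computed for a candidate, given the
# current distances delta_i.
def metrics(S_size, n, deltas, candidate_delay):
    gamma = (S_size - n - 4) + sum(deltas)            # projected gate count      -> minimize
    delta = sum(d for d in deltas if d != 1)          # shareable distance mass   -> maximize
    tau = candidate_delay                             # depth of the new point    -> minimize
    nu2 = sum(d * d for d in deltas if d != 1)        # squared Euclidean norm    -> maximize
    return gamma, delta, tau, nu2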
4.2.4 Search tree
In addition to the above algorithm, we propose to have a search tree where each node is a set S with metrics. Children of such a node are also nodes, where S′ is derived from S by adding one of the candidate points c ∈ C. Thus, every path from the root node to a leaf represents a sequence of candidate points accepted into the root set S. If, at some point, a leaf has metric δ = 0 then that leaf represents a possible solution path.

We keep a number of children nodes (in our experiments we kept at least 20-50 best children) whose metrics are the best (they may even have different projected gates γ). We also define the maximum depth TD of the search tree (in our experiments we tried TD = 1, ..., 20). When the tree at depth TD is constructed, we then examine the leaves and see where we get the best metric over all leaves at all different branches. Tracking back to the root, we then choose to keep the top branch that leads to the best leaf(s). Other top branches from the root are removed. We then advance the root node to the first child of the selected branch and try to extend the tree's depth again from the remaining leaves, thus keeping the search tree at a constant depth TD.

If, at every depth of the tree, each leaf is extended with an additional 20-50 sub-branches, then the number of leaves will increase exponentially. However, we can apply a truncation algorithm to the leaves before extending the tree to the next depth. We simply keep no more than a certain number of promising leaves that will be expanded to the next depth, and the other, less promising leaves we just remove from the tree (in our experiments the truncation level was up to 400 leaves overall for the whole tree). This type of truncation makes it possible to select the best top branch of the root node by “looking further”, basically at any depth TD. Notably, the complexity does not depend on the depth TD, but it depends on the truncation level.
Truncation strategy. In brief, we keep those leaves with the best metrics, but try to distribute nearly equal leaves among different branches, so that we keep as many diverted solution paths as possible.
4.2.5 Computation of δ(S, ti)
The “heart” and the critical part of the algorithm is the subalgorithm that computes the distances δ(S, t_i), given a fresh S. There are many candidates to test at each step, and there are many branches to track, so we need to make this core algorithm as fast as possible.

Note that the length of a point (.p is an integer) is 2n + 2 bits, plus the delay value. We will ignore the delay (.d) value when doing Boolean operations over two points. Let us set the number of possible points to

  N = 2^{2n+2}.

Let V_k[] be a vector of length N cells, where each cell V_k[p] corresponds to a (2n + 2)-bit point p represented as an integer index, and the value stored in the cell is the minimum delay p.d of that point such that it can be derived from S with exactly k gates.

Set the initial vector V_0 as: ∀p, V_0[p] = p.d if p ∈ S, and V_0[p] = ∞ otherwise. Thereafter, the vector V_{k+1} can be derived from the previously derived vectors V_0 ... V_k by applying the allowed 6 gates to points from some level 0 ≤ l < k (V_l) and the level k−l (V_{k−l}), thus resulting in total l + (k−l) + 1 = k + 1 gates. After a new V_{k+1} is derived, we simply check if it contains new distance values for the targets from T, and we repeat the procedure until all distances δ(S, t_i) for all t_i in T are found. A high-level description of the algorithm is given in Algorithm 1, and in Appendix B.1 we provide a more detailed description alongside multiple computational tricks that can be applied.
description alongside multiple computational tricks that can be made.
Algorithm 1 Algorithm for computing δ(S, t_i)

 1: function Distances(S, T, maxδ) = {δ_i}, i = 0, ..., m−1
 2:   Init δ_i = ∞ for i = 0, ..., m−1
 3:   Init ∀p: V_0[p] = p.d if p ∈ S, otherwise ∞
 4:   Init k = 0
 5:   while true do
 6:     while ∃i : δ_i = ∞ and V_k[t_i] ≤ t_i.d do
 7:       δ_i = k
 8:     if ∀i : δ_i < ∞ then return OK
 9:     if k ≥ maxδ then return FAIL
10:     k ← k + 1
11:     Init ∀p: V_k[p] = ∞
12:     for all points a, b do
13:       for p in {MUX(a,b), NMUX(a,b), MUX(b,a), NMUX(b,a), XOR(a,b), NXOR(a,b)} do
14:         for l ← ⌊k/2⌋ to k−1 do
15:           d ← max(V_{k−l−1}[a], V_l[b]) + 1
16:           V_k[p] ← min(V_k[p], d)
4.2.6 Double and Useless points
“Double” points. When, at some step of the algorithm, we find a candidate point c that is already in S but now has a smaller depth, then the point c is kept in C and tested along with the other candidates. If it turns out that adding c to S gives the best metric, we add it to S. An alternative strategy would be to update the point c.p in S with the lower depth c.d and recalculate the depths of dependent points. However, it is not clear what to do with the parent points that were used to generate the previous c.p in S. We leave this as an open question for further research.

“Useless” points. At the end of the algorithm (when δ = 0), it could happen that S contains points that can be safely excluded while a solution can still be derived. As a final step, we try to remove points from S one by one and test if every target is still reachable from the remaining S under the given depth constraints. In our experience this situation is rare, but it helped to remove 1-2 gates, mainly caused by “double” points.

The above problems with “double” and “useless” points are generic for this class of algorithms where certain depth constraints should be met, and Algorithm 1 in [LSL+19] also falls under this category.
5 Architectural improvements
Most known AES SBox architectures look quite similar, consisting of the Top and Bottom
linear parts, and the middle non-linear part, as previously described in Section 2. In this
section, we take that classic design and propose a number of improvements, along with a
completely new architecture that focuses on low depth solutions.
[Figure 2: Difference between the architectures A and D. Both start from the 8-bit input U and a Top linear block producing the 18-bit signal Q, followed by the Mul-Sum block (4-bit X) and the inversion over GF(2^4) (4-bit Y). Architecture A continues with 2xMul (18-bit N) and a Bottom linear block producing the 8-bit output R, while architecture D computes a 32-bit signal L in parallel and assembles the output R with a 32NAND2 + 8XOR4 stage.]
5.1 Two SBox architectures: Area and Depth

Referring to Figure 2, architecture A (Area) is the classical one that implements designs based on tower and composite fields. It starts with the 8-bit input signal U to the Top linear matrix, which produces a 22-bit signal Q (as in [BP12]). We managed to reduce the number of needed Q-signals to 18, and refactored the multiplication and linear summation block Mul-Sum to 24 gates and depth 3 (see Appendix D.2 for equations). The output from the Mul-Sum block is the 4-bit signal X, which is the input to the inversion over GF(2^4). The output from the inversion, Y, is non-linearly mixed with the Q signals derived in the top matrix, and produces the 18-bit signal N. The final step is the Bottom linear matrix that takes the 18-bit N and linearly derives the output 8-bit signal R. The top and bottom matrices incorporate the SBox's affine transformation, which depends on the direction.

In the new architecture D (Depth) we tried to remove the “irregular” bottom matrix and, as a result, shrink the depth of the circuit as much as possible. The idea behind it is that the bottom matrix only depends on the set of multiplications of the 4-bit signal Y
and some linear combinations of the 8-bit input U. Thus, the result R can be computed as

  R = Y_0·M_0·U ⊕ ... ⊕ Y_3·M_3·U,

where each M_i is an 8×8 matrix representing 8 linear equations on the 8-bit input U, to be scalar multiplied by the Y_i-bit. Those 4×8 linear circuits can be computed as a 32-bit signal L in parallel with the circuit for the 4 bits of Y. The result R is obtained by summing up the four 8-bit sub-results. Therefore, in architecture D we get depth 3 after the inversion step (critical path: MULL and 8XOR4 blocks), instead of depth 5-6 in architecture A. This new architecture D requires a bit more gates, since the assembling bottom circuit needs 56 gates: 32NAND2 + 8XOR4. The reward is the lower depth.
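Functionally, the assembly stage of architecture D can be sketched as follows (our illustration; the real circuit uses 32 NAND2 and 8 XOR4 gates, but the four inversions cancel inside each XOR4, so plain AND/XOR gives the same function):

# Functional sketch (ours) of the architecture D assembly stage:
# R = Y0*(M0*U) ^ Y1*(M1*U) ^ Y2*(M2*U) ^ Y3*(M3*U) over GF(2).
def matvec(M, U):
    # M: 8x8 bit matrix as a list of rows; U: list of 8 bits
    return [sum(m & u for m, u in zip(row, U)) & 1 for row in M]

def arch_d_output(Y, Ms, U):
    # Y: the 4 bits from the GF(2^4) inversion; Ms: four 8x8 bit matrices
    R = [0] * 8
    for yi, Mi in zip(Y, Ms):
        Li = matvec(Mi, U)                        # 8 of the 32 bits of the signal L
        R = [r ^ (yi & l) for r, l in zip(R, Li)]
    return R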
A more detailed sketch of the two architectures is given in Figure 3, which includes the components of the designs, their delays, and the number of gates.
[Figure 3: More details on the architectures A and D, listing for each block its gate count and depth: for architecture A the top matrices FTopA/ITopA/CTopA, the MULX, INV, S0/S1 and MULN blocks, and the bottom matrices FBotA/IBotA/CBotA; for architecture D the top matrices FTopD/ITopD/CTopD, the INV block, and the MULL and 8XOR4 assembly stages.]
5.2 Six different scenarios of MULN
In the MULN block, where the 18-bit N-signals are computed, we need as input the 18-bit Q-signals and the inversion result Y. But we also need the following additional linear combinations of Y:

  Y02 = Y0 ⊕ Y2,  Y13 = Y1 ⊕ Y3,  Y23 = Y2 ⊕ Y3,  Y01 = Y0 ⊕ Y1,  Y00 = Y01 ⊕ Y23;

these correspond to the signals M41-M45 in [BP12]. Thus, the Y vector is actually extended to 9 bits, and the delays of the N bits become different, depending on which of the Y_i is used in the multiplication. For example, in the worst case, the delay of Y00 is +2 compared to the delay of Y1. Thus, the resulting signals N will have different output delays. However, it is possible to compute these 5 additional Ys in parallel with the base signals Y0, ..., Y3. This costs some extra gates, but then the +2 delay can shrink down to +1 or +0. In general one can consider the following 6 scenarios:

  S0. We compute only the base signals Y0, ..., Y3, and the remaining {Y01, Y23, Y02, Y13, Y00} we compute with XORs as above. The delay is +2 but it has the smallest number of gates;
  S1. Compute {Y01, Y23} in parallel; the delay is +1;
  S2. Compute {Y02, Y13} in parallel; the delay is +1;
  S3. Compute {Y00} in parallel; the delay is +1;
  S4. Compute {Y01, Y23, Y02, Y13} in parallel; the delay is +1;
  S5. Compute {Y01, Y23, Y02, Y13, Y00} in parallel; the delay is +0 as there is no signal left to compute afterwards.
In the next subsection we show how to find Boolean expressions for the above scenarios.
5.3 INV. Inversion over GF(2^4)

The inversion formulae are as follows:

  Y0 = X1X2X3 ⊕ X0X2 ⊕ X1X2 ⊕ X2 ⊕ X3,
  Y1 = X0X2X3 ⊕ X0X2 ⊕ X1X2 ⊕ X1X3 ⊕ X3,
  Y2 = X0X1X3 ⊕ X0X2 ⊕ X0X3 ⊕ X0 ⊕ X1,
  Y3 = X0X1X2 ⊕ X0X2 ⊕ X0X3 ⊕ X1X3 ⊕ X1.

In [BP12] they found a circuit of depth 4 and 17 XORs, but we would like to shrink the depth even further by utilizing a wider range of standard gates.

We have adapted the algorithm from Section 4.2 to also find a small solution for the INV block. The idea is simple: each Y_i is a truth table of length 16 bits, based on the 4-bit input X0, ..., X3. We define our “point” to be a 16-bit value. All standard gates (AND, OR, XOR, MUX, NOT), including their negated versions, can be applied to any combination of “known” points (S), and distances to target points T can be computed in a similar manner as before. Using this slightly modified algorithm for floating multiplexers, we found a solution with only 9 gates and depth 3. The results are shown in Equation (2) and Table 2.
  T0 = NAND(X0, X2)     T3 = MUX(X1, X2, 1)     Y1 = MUX(T2, X3, T3)
  T1 = NOR(X1, X3)      T4 = MUX(X3, X0, 1)     Y2 = MUX(X0, T2, X1)
  T2 = XNOR(T0, T1)     Y0 = MUX(X2, T2, X3)    Y3 = MUX(T2, X1, T4)        (2)
Table 2: Refactored INV block and scenarios S0-S5.

                        INV     S0     S1     S2     S3     S4     S5
  Std. area (gates)       9     14     16     17     16     19     19
  Std. depth (gates)      3      5      4      4      4      4      3
  Tech. area (GE)     18.31  29.96  35.30  39.63  36.62  42.29  44.63
  Tech. depth (XORs)   2.31   4.31   3.31   3.77   3.76   3.59   3.11
In our tradeoff circuits we have used scenario S1, as it showed the best results with respect to area and depth. For the bonus circuits, we used S0 as it has the smallest area. For the fast circuit, only the INV formulae are needed. We also derived an alternative circuit for the inversion block without multiplexers; the results and formulae are given in Appendix B.2.
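As a sanity check (our own, not part of the paper), the 9-gate circuit of Equation (2) can be verified against the inversion formulae above by enumerating all 16 inputs; MUX(s, a, b) returns a when s = 1, following the convention used throughout the paper:

from itertools import product

# Verify (ours): the 9-gate circuit of Equation (2) equals the ANF formulae.
def mux(s, a, b):
    return a if s else b

def inv_eq2(X0, X1, X2, X3):
    T0 = 1 ^ (X0 & X2)                 # NAND
    T1 = 1 ^ (X1 | X3)                 # NOR
    T2 = 1 ^ T0 ^ T1                   # XNOR
    T3 = mux(X1, X2, 1)
    T4 = mux(X3, X0, 1)
    return (mux(X2, T2, X3), mux(T2, X3, T3), mux(X0, T2, X1), mux(T2, X1, T4))

def inv_anf(X0, X1, X2, X3):
    Y0 = (X1 & X2 & X3) ^ (X0 & X2) ^ (X1 & X2) ^ X2 ^ X3
    Y1 = (X0 & X2 & X3) ^ (X0 & X2) ^ (X1 & X2) ^ (X1 & X3) ^ X3
    Y2 = (X0 & X1 & X3) ^ (X0 & X2) ^ (X0 & X3) ^ X0 ^ X1
    Y3 = (X0 & X1 & X2) ^ (X0 & X2) ^ (X0 & X3) ^ (X1 & X3) ^ X1
    return (Y0, Y1, Y2, Y3)

print(all(inv_eq2(*x) == inv_anf(*x) for x in product((0, 1), repeat=4)))  # True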
5.4 Additional Transformation Matrices (ATM)
We are solving the top matrices through exhaustive search and the bottom matrices with various heuristic techniques. The way those matrices look naturally influences the final number of gates in the solution. Here we present a simple method to try different top and bottom matrices in order to find the best solution.
Assume that the SBox is a black box and, when excluding the final addition of the constant, it performs the function:

  SBox(x) = x^{-1} · A_{8×8},

where x^{-1} is the inverse element in the Rijndael field GF(2^8), and the matrix A_{8×8} is the affine transformation. In any field of characteristic 2, squaring, square root, and multiplication by a constant are linear functions; thus for a non-trivial choice (α, β) we have:

  Z(x) = (α · x^{2^β})^{-1},
  SBox(x) = (α · Z(x))^{2^{-β}} · A_{8×8}.
If the initial Top and Bottom matrices for the forward and inverse SBoxes were T_F, B_F, T_I, B_I, respectively, then one can choose any α = 1, ..., 255 and β = 0, ..., 7, and change the matrices as follows:

  T′_F = T_F · E · C_α · P_β · E,
  B′_F = E · A · P_β^{-1} · C_α · A^{-1} · E · B_F,
  T′_I = T_I · E · A · C_α · P_β · A^{-1} · E,
  B′_I = E · P_β^{-1} · C_α · E · B_I,

where:
  E is the 8x8 matrix that switches bit endianness (in our circuits input and output bits are in Big Endian);
  A is the 8x8 matrix that performs the SBox's affine transformation;
  C_α is the 8x8 matrix that multiplies a field element by the selected constant α;
  P_β is the 8x8 matrix that raises an element of the Rijndael field to the power of 2^β;
  T_F/T_I are the original (without modifications) 18x8 matrices for the top linear transformation of the Forward/Inverse SBoxes, respectively;
  B_F/B_I are the original (without modifications) 8x18 matrices for the bottom linear transformation of the Forward/Inverse SBoxes, respectively.
There are 2040 choices for the (α, β) pair, and each choice gives new linear matrices. It is easy to test all of them and find the best combination that gives the smallest SBox circuit. We have applied this idea to both the forward and the inverse SBox, for both architectures A and D. Note that a similar approach was recently and independently considered in [UHNA19], but in that work they only considered multiplication by a constant, and not squaring.
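A short sketch (our illustration) of how the matrices C_α and P_β can be generated as 8×8 bit matrices over GF(2); the bit convention (bit i of a byte is the coefficient of x^i) is our own choice:

# Sketch (ours): build C_alpha (multiplication by a constant) and P_beta
# (raising to the power 2^beta) in the Rijndael field x^8 + x^4 + x^3 + x + 1.
def gmul(a, b):                      # multiplication in GF(2^8) modulo 0x11B
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return r

def matrix_of(f):                    # 8x8 bit matrix of a GF(2)-linear map f on bytes
    cols = [f(1 << i) for i in range(8)]
    return [[(cols[c] >> r) & 1 for c in range(8)] for r in range(8)]

def C(alpha):
    return matrix_of(lambda x: gmul(alpha, x))

def P(beta):                         # x -> x^(2^beta); linear since squaring is linear
    def f(x):
        for _ in range(beta):
            x = gmul(x, x)
        return x
    return matrix_of(f)

# alpha = 1..255 and beta = 0..7 give the 2040 (alpha, beta) pairs mentioned above.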
5.4.1 ATM approach for the combined SBox
For the combined SBoxes we can apply the ATM approach to the forward and the inverse parts independently. This means that we have 2040^2 = 4,161,600 variants of linear matrices to test. We have focused on architecture D, since there is no bottom matrix and thus we can do a more extensive search. We searched through all those 4 million variants and applied the heuristic algorithm from Section 4.1 as a quick analysis method to select a set of around 4000 promising cases. We then applied the algorithm given in Section 4.2 to find a solution with floating multiplexers. In our case we have n = 8 input bits and thus each point is encoded with 18 bits, and the complexity of calculating the distance δ(S, t_i) is quadratic in N = 2^18 points. In the search we used the search tree with the maximum depth TD ≤ 20 and a truncation level of 400 leaves.
Table 3: Summary of which algorithms were used to derive the new SBoxes. BM is Bottom Matrix, and TM is Top Matrix.

  Section 3.2, Cancellation-free heuristic (probabilistic approach with final exhaustive search): used for the BMs of the Bonus and Tradeoff (Arch. A) Fwd/Inv SBoxes, and for the BM of the combined SBox together with optimization of the MUXes by hand.
  Section 3.3, Boyar's basic algorithm and [LSL+19]: not used, since the probabilistic heuristic with final exhaustive search gave better results.
  Section 3.4, Exhaustive search (new contribution): used for the TMs of the Fwd/Inv SBoxes.
  Section 4.1, Floating multiplexers, approximative solution (new contribution): used for the combined SBox in combination with ATM to select a preliminary set from the 4M choices.
  Section 4.2, Generic floating multiplexers (new contribution): used for the TM of the combined SBoxes, applied after a first selection using the ATM approach, respectively using 4.1 + ATM.
  Sections 5.2 and 5.3, MULN scenario used (new contribution): S0 (Bonus) and S1 (Tradeoff) for the Arch. A SBoxes; the INV formulae for the Fast (Arch. D) SBoxes.
  Section 5.4, Additional Transformation Matrices (new contribution; a similar approach was independently derived in [UHNA19], but only constant multiplication was considered): used for all SBoxes.
6 Results and comparisons
In this section we present our best solutions for the AES SBox, both forward and combined. The stand-alone inverse SBox is perhaps not as widely used, and those results can be found in Appendix C. We compare our area and depth using the techniques described in Appendix A and, where possible, we have recalculated the corresponding GE for other academic results for easier comparison. We present three different solutions for each SBox (forward, inverse, and combined): “fast”, “tradeoff”, and “bonus”. The fast one is the solution with the lowest critical path, the tradeoff solution is a well-balanced trade-off between area and speed, and the bonus solution is given to establish a new record in terms of the smallest number of gates. Exact circuit expressions for all the derived solutions can be found in Appendix D, where we also indicate which algorithm was used in deriving each solution.
6.1 Synthesis results
We have performed a synthesis of the results and compared with other recent academic work. The technology process is GlobalFoundries 22nm CSC20L [Glo19], and we have synthesized using Design Compiler 2017 from Synopsys in topological mode with the compile_ultra command. We also turned on the flag compile_timing_high_effort to force the compiler to make the circuits as fast as possible. In the graphs, the X axis is the clock period (in ps) and the Y axis is the resulting topology estimated area (in µm²). We have not restricted the available gates in any way, so the compiler was free to use non-standard gates, e.g., a 3-input AND-OR gate. To get the graphs in the following subsections, we have started at a 1200 ps clock period (approximately 833 MHz) and reduced the clock period by 20 ps until the timing constraints could not be met. We note that the area estimates by the compiler fluctuate heavily, and we believe that this is a result of the many different strategies the compiler has to minimize the depth. One strategy might be successful for, say, a 700 ps clock period, but a different strategy (which results in a significantly larger area) could be successful for 720 ps. There is also an element of randomness involved in the strategies of the compiler.
Table 4: Forward SBox: Comparison of the results.

Previous results
  Canright [Can05] (most famous design): area 80XO+34ND+6NR = 120 gates, 226.40 GE; depth 19XO+3ND+1NR = 23 gates, 20.796 XORs
  Boyar et al. [BP12] (our starting point): area 94XO+34AD = 128 gates, 264.24 GE; depth 13XO+3AD = 16 gates, 14.932 XORs
  Boyar et al. [Boy] (record smallest): area 81XO+32ND = 113 gates, 220.73 GE; depth 21XO+6ND = 27 gates, 23.508 XORs
  Ueno et al. [UHS+15] (record fastest, formulas from [RMTA18a]): area 91XO+48ND+13NR (+4IV) = 151(+4) gates, 270.71 GE; depth 10XO+5ND (+1IV) = 15(+1) gates, 12.449 XORs
  Reyhani-Light [RMTA18a] (at CHES 2018): area 69XO+43ND+7NR (+4IV) = 119(+4) gates, 213.45 GE; depth 16XO+4ND (+1IV) = 20(+1) gates, 18.031 XORs
  Reyhani-Fast [RMTA18a] (at CHES 2018): area 79XO+43ND+7NR (+4IV) = 129(+4) gates, 236.75 GE; depth 11XO+5ND (+1IV) = 16(+1) gates, 13.449 XORs
  Ueno et al. [UHNA19] (recent result): area 90XO+4XN+10OR+45AD (+10IV) = 149(+10) gates, 298.87 GE; depth 11XO+1OR+3AD (+1IV) = 15(+1) gates, 14.131 XORs

Our results
  Forward (fast), fast with depth 12: area 77XO+1XN+4AD+37ND+5NR+6MX = 130 gates, 243.04 GE; depth 7XO+1XN+1AD+2NR+1MX = 12 gates, 10.496 XORs
  Forward (tradeoff), area/speed tradeoff: area 61XO+8XN+27ND+5NR+8MX+2MI = 111 gates, 216.75 GE; depth 8XO+2ND+1ND+2NR+1MX = 14 gates, 12.263 XORs
  Forward (bonus), new record smallest: area 58XO+6XN+27ND+5NR+6MX = 102 gates, 195.10 GE; depth 18XO+2XN+1ND+2NR+1MX = 24 gates, 22.263 XORs
[Figure 4: Forward SBox: Synthesis results, area (µm²) versus clock period (ps), for Our-fast, Our-tradeoff, Our-bonus, Reyhani-fast, Reyhani-light, Ueno'15, Ueno'19, and Boyar-small (the closer the curve is to the axes the better the result in terms of the area/speed trade-off).]
Table 5: Combined SBox: Comparison of the results.

Previous results
  Canright [Can05] (most famous design): area 94XO+34ND+6NR+16MX (+2IV) = 150(+2) gates, 297.64 GE; depth 20XO+3ND+2OR+5NR = 30 gates, 25.644 XORs
  Reyhani et al. [RMTA18b]: area 81XO+32ND+4OR+16NR+16MI (+8IV) = 149(+8) gates, 290.13 GE; depth 17XO+2ND+3OR+6NR = 28 gates, 23.608 XORs
  Ueno et al. [UHNA19] (recent result): area 112XO+7XN+10OR+45AN+16MX (+10IV) = 190(+10) gates, 393.40 GE; depth 11XO+3AN+1OR+2MX (+1IV) = 17(+1) gates, 15.681 XORs

Our results
  Combined (fast), fast with depth 14: area 77XO+27XN+41ND+6NR+13MX+12MI = 176 gates, 351.65 GE; depth 6XO+3XN+1ND+2NR+1MX+1MI = 14 gates, 12.312 XORs
  Combined (tradeoff), area/speed tradeoff: area 70XO+21XN+27ND+5NR+17MX+5MI = 145 gates, 296.99 GE; depth 7XO+4XN+1ND+2NR+1MX+1MI = 16 gates, 14.305 XORs
  Combined (bonus), new record smallest: area 70XO+9XN+27ND+5NR+16MX = 127 gates, 253.35 GE; depth 15XO+4XN+2ND+1NR+3MX = 25 gates, 22.675 XORs
[Figure 5: Combined SBox: Synthesis results, area (µm²) versus clock period (ps), for Our-fast, Our-tradeoff, Our-bonus, Reyhani, Ueno'19, and Canright (the closer the curve is to the axes the better the result in terms of the area/speed trade-off).]
6.2 Forward SBoxes
We have included a number of interesting previous results for comparison in Table 4. The
most famous design, by Canright, is widely used and cited; our tradeoff SBox is both
faster and smaller. We also included the work by Boyar et al., as their design was the
starting point for our research.

The two results from CHES 2018 by Reyhani-Masoleh et al. are the most recent, and our tradeoff
SBox has a similar area to their "lightweight" version in terms of GE, but is around 30%
faster. The tradeoff SBox is both smaller and faster than their "fast" circuit. Also, our
"fast" version is 25% faster than their "fast" version, at the cost of only a modest area
increase. The currently fastest SBox, by Ueno et al. [UHS+15], has 270.71 GE and a depth of 12.449 XORs,
while our fast version is only 243 GE with a depth of 10.496 XORs, outperforming the known
fastest circuit by around 23%.

We also included the currently known smallest circuit (in terms of standard gates), done
by Boyar in 2016 [Boy], which has 113 gates (220.73 GE) and a depth of 27 gates. Our "bonus" circuit
is even smaller, with only 102 gates and depth 24, reaching as low as 195.10 GE. Synthesis
results are shown in Figure 4.
6.3 Combined SBoxes
Table 5 shows our results compared to the three previously known best results. Our
tradeoff combined SBox has a similar size to those of [Can05] and [RMTA18b], but it is
considerably faster due to the much lower depth of the circuit. The tradeoff circuit has depth
16 (in reality only 14.305 XORs) and 145 gates (297 GE), while Canright's combined SBox
has 150(+2) gates (298 GE) and depth 30 (25.644 XORs). The bonus solution
in this paper has a slightly smaller depth than the most recent result [RMTA18b] but is
significantly smaller in size (127 vs 149(+8) standard gates). Finally, the proposed "fast"
design using Architecture D has the best currently known depth. Our synthesis results are
shown in the comparison Figure 5.
7 Conclusions
In this paper we have introduced a number of heuristic and exhaustive search methods for
minimizing the circuit realization of the AES SBox. We have proposed a novel idea on
how to include the multiplexers of the combined SBox in the minimization algorithms, and
derived smaller and faster circuit realizations for the forward, inverse, and combined AES
SBox. We also introduced a new architecture where we remove the “irregular” bottom
linear matrix, in order to derive a faster solution than previously known.
Acknowledgements
We would like to thank the Ericsson Research Data Center team for their patience and
help with the compute resources that made this work possible, and our colleague Ben
Smeets and all reviewers for providing valuable comments to the manuscript.
References
[Art01] Artisan Components, Inc. TSMC 0.18 µm Process 1.8-Volt SAGE-X™ Standard Cell Library Databook, 2001. http://www.utdallas.edu/~mxl095420/EE6306/Final%20project/tsmc18_component.pdf.

[BFP18] Joan Boyar, Magnus Find, and René Peralta. Small low-depth circuits for cryptographic applications. Cryptography and Communications, 11, 03 2018.

[BHWZ94] Michael Bussieck, Hannes Hassler, Gerhard J. Woeginger, and Uwe T. Zimmermann. Fast algorithms for the maximum convolution problem. Oper. Res. Lett., 15(3):133–141, April 1994. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.3.5023&rep=rep1&type=pdf.

[BMP08] Joan Boyar, Philip Matthews, and René Peralta. On the shortest linear straight-line program for computing linear forms. In Edward Ochmański and Jerzy Tyszkiewicz, editors, Mathematical Foundations of Computer Science 2008, pages 168–179, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.

[BMP13] Joan Boyar, Philip Matthews, and René Peralta. Logic minimization techniques with applications to cryptology. J. Cryptol., 26(2):280–312, April 2013.

[Boy] Joan Boyar. Circuit minimization work. http://www.cs.yale.edu/homes/peralta/CircuitStuff/CMT.html.

[BP10a] Joan Boyar and René Peralta. A new combinational logic minimization technique with applications to cryptology. In Paola Festa, editor, Experimental Algorithms, pages 178–189, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.

[BP10b] Joan Boyar and René Peralta. A new combinational logic minimization technique with applications to cryptology. In Lecture Notes in Computer Science, pages 178–189. Springer, 2010.

[BP12] Joan Boyar and René Peralta. A small depth-16 circuit for the AES S-Box. In Dimitris Gritzalis, Steven Furnell, and Marianthi Theoharidou, editors, SEC, volume 376 of IFIP Advances in Information and Communication Technology, pages 287–298. Springer, 2012. https://link.springer.com/chapter/10.1007/978-3-642-30436-1_24.

[Can05] D. Canright. A very compact S-Box for AES. In Josyula R. Rao and Berk Sunar, editors, Cryptographic Hardware and Embedded Systems – CHES 2005, pages 441–455, Berlin, Heidelberg, 2005. Springer Berlin Heidelberg. https://www.iacr.org/archive/ches2005/032.pdf.

[FAR06] FARADAY Technology Co. FSD0A_A 90 nm Logic SP-RVT (Low-K) Process, 2006. https://www.cl.cam.ac.uk/research/srg/han/ACS-P35/documents/90nm-cell.pdf.

[Glo19] GlobalFoundries. 22nm FDX process, 2019. https://www.globalfoundries.com/technology-solutions/cmos/fdx/22fdx.

[Int01] International Business Machines Corporation. ASIC SA-27E Databook, Part I: Base Library and I/Os. Data Book, 2001. http://people.csail.mit.edu/jasonm/nigel/base_06-01.pdf.

[IT88] Toshiya Itoh and Shigeo Tsujii. A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases. Inf. Comput., 78(3):171–177, September 1988. http://dx.doi.org/10.1016/0890-5401(88)90024-7.

[JKL10] Yong-Sung Jeon, Young-Jin Kim, and Dong-Ho Lee. A compact memory-free architecture for the AES algorithm using resource sharing methods. Journal of Circuits, Systems, and Computers, 19:1109–1130, 2010.

[LSL+19] Shun Li, Siwei Sun, Chaoyun Li, Zihao Wei, and Lei Hu. Constructing low-latency involutory MDS matrices with lightweight circuits. IACR Transactions on Symmetric Cryptology, 2019(1):84–117, Mar. 2019.

[MNG00] Microelectronics Group, Carl F. Nielsen, and Samuel R. Girgis. WPI 0.5 mm CMOS Standard Cell Library Databook, 2000. https://lsm.epfl.ch/files/content/sites/lsm/files/shared/Resources%20documents/data_book.pdf.

[NNT+10] Yasuyuki Nogami, Kenta Nekado, Tetsumi Toyota, Naoto Hongo, and Yoshitaka Morikawa. Mixed bases for efficient inversion in F((2^2)^2)^2 and conversion matrices of SubBytes of AES. In Cryptographic Hardware and Embedded Systems, CHES 2010, pages 234–247. Springer Berlin Heidelberg, 08 2010.

[oST01] National Institute of Standards and Technology. Advanced encryption standard. NIST FIPS PUB 197, 2001.

[Paa97] Christof Paar. Optimized arithmetic for Reed-Solomon encoders. In Proceedings of IEEE International Symposium on Information Theory. IEEE, 1997.

[Pet] Graham Petley. Internet resource: VLSI and ASIC Technology Standard Cell Library Design. http://www.vlsitechnology.org/index.html.

[Rij00] Vincent Rijmen. Efficient implementation of the Rijndael S-Box, 2000. https://www.researchgate.net/publication/2621085_Efficient_Implementation_of_the_Rijndael_S-box.

[RMTA18a] Arash Reyhani-Masoleh, Mostafa Taha, and Doaa Ashmawy. Smashing the implementation records of AES S-Box. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018(2):298–336, May 2018.

[RMTA18b] Arash Reyhani-Masoleh, Mostafa M. I. Taha, and Doaa Ashmawy. New area record for the AES combined S-Box/inverse S-Box. In 2018 IEEE 25th Symposium on Computer Arithmetic (ARITH), pages 145–152, 2018.

[Sam00] Samsung Electronics Co., Ltd. STD90/MDL90 0.35 µm 3.3V CMOS Standard Cell Library for Pure Logic/MDL Products Databook, 2000. https://www.digchip.com/datasheets/download_datasheet.php?id=935791&part-number=STD90.

[SMTM01] Akashi Satoh, Sumio Morioka, Kohji Takano, and Seiji Munetoh. A compact Rijndael hardware architecture with S-Box optimization. In Colin Boyd, editor, ASIACRYPT, volume 2248 of Lecture Notes in Computer Science, pages 239–254. Springer, 2001.

[UHNA19] Rei Ueno, Naofumi Homma, Yasuyuki Nogami, and Takafumi Aoki. Highly efficient GF(2^8) inversion circuit based on hybrid GF representations. Journal of Cryptographic Engineering, 9(2):101–113, Jun 2019.

[UHS+15] Rei Ueno, Naofumi Homma, Yukihiro Sugawara, Yasuyuki Nogami, and Takafumi Aoki. Highly efficient GF(2^8) inversion circuit based on redundant GF arithmetic and its application to AES design. In Tim Güneysu and Helena Handschuh, editors, Cryptographic Hardware and Embedded Systems – CHES 2015 – 17th International Workshop, Saint-Malo, France, September 13–16, 2015, Proceedings, volume 9293 of Lecture Notes in Computer Science, pages 63–80. Springer, 2015.
A Area and speed measurement methods
Firstly, we introduce some notation. Gate names are written in capital letters, GATE
(examples: AND, OR). The notation mGATEn means m gates of type GATE, each of
which has n inputs (examples: XOR4, 8XOR4, NAND3, 2AND2). When the number of
inputs n is missing, the assumption is that the gate has the minimum number of inputs, usually
2 (3 for MUX).

Cells that are constructed as combinations of gates are described as GATES1-GATE2,
meaning that we first perform one or more gates on the first level (GATES1), and the result
then goes to the gate on the second level (GATE2). Example: NAND2-NOR2 means that the cell has
3 inputs (a, b, c) and the corresponding Boolean function is NOR2(a, NAND2(b, c)).

We present two different methods of comparing circuits: the standard method and the
technology method.
A.1 Standard method
Cells. The basic elements that are considered in the standard method are:
{XOR, XNOR, AND, NAND, OR, NOR, MUX, NMUX, NOT}.
Negotiation of NOT gates.
In some places of the circuit the inverted version of a signal may be needed. This can be
achieved in several ways, without the explicit use of a NOT gate. Here we list a few of them.
Method 1. One way to implement a NOT gate is to change the previous gate that
generates the signal so that it produces the inverted signal instead, for example switching XOR into
XNOR, AND into NAND, etc.
Method 2. In several technologies some gates can produce both the straight signal and
its inverted version. For example, XOR gates in many implementations produce both
signals simultaneously, and thus the inverted value is readily available.
Method 3. We can change the gates following the inverted signal such that the resulting
scheme produces the correct result given the inverted input, using e.g. De Morgan's
laws.
Summarizing the above, we believe that NOT gates may be ignored while evaluating a
circuit with the standard method, since a NOT gate can hardly be counted as a full gate. However,
for completeness, we print the number of NOT gates in the resulting tables.
Area.
For area comparisons the number of basic elements is counted without any size
distinction between them. The NOT-gates are ignored.
Depth.
The depth is counted in terms of the number of basic elements on the circuit
path. The overall depth of a circuit is therefore the delay of the critical path. The
NOT-gates are ignored.
A.2 Technology method
Cells.
Some papers complement the standard cells with a few extra combinatorial cells that are
often available in various technologies. For example, the gates NAND2-NAND2, NOR2-NOR2,
2AND2-NOR2, and XOR4 could be highly useful to improve and speed up our SBox circuits in
this paper. However, for comparison purposes with previous academic results, we stay
with the set of standard cells in order to make a fairer comparison. In this method we
do count NOT gates in both the delay and the area.
Area.
There exist many ASIC technologies (90nm, 45nm, 14nm, etc) from different
vendors (Intel, Samsung, GlobalFoundries, etc), with different specifics. In order to develop
an ASIC one needs to get a “standard cells library” of a certain technology, and that
library usually includes much more versatile cells than the basic elements listed above, so
that the designer has a wider choice of building blocks.
However, even if we take a standard cell, for example XOR, that cell has different areas
and delays in different technologies. This makes it harder to compare two circuits
of the same logic developed by two academic groups, when they chose to apply different
technologies.

For a fair comparison of the circuit area of various solutions in academia we usually utilize
the notion of gate equivalents (GE), where 1 GE is the size of the smallest NAND gate.
The size of a circuit in GE is then computed as Area(Circuit)/Area(NAND) GE.
Knowing the estimated GE values for each standard or technology cell makes it possible to
compute an estimated area of a circuit in terms of GE. Although various technologies
have slightly different GEs for the standard cells, those GE numbers are still pretty close to
each other.
We have studied several technologies for which data books are available, and decided
to utilize the GE values given in the data book of Samsung's STD90/MDL90
0.35 µm 3.3V CMOS technology [Sam00]. The cells used are those without the speed
x-factor.

Other data books that we checked include IBM's 0.18 µm [Int01], WPI 0.5 mm [MNG00],
FARADAY's 90 nm [FAR06], TSMC's 0.18 µm [Art01], the web resource [Pet], etc.; we verified
that the GE numbers given in [Sam00] are quite fair and close to reality. This makes it
possible to have an approximate comparison of the effectiveness of different circuits, even
though they may be developed for different technologies.
Depth.
Different cells, like XOR and NAND, not only differ in terms of GE but also
differ in terms of the maximum delay of the gate.
Normally, data books include the delays (e.g., in ns) for each gate, and for all input-output
combinations.

We propose to normalize the delays of all used gates by the delay of the XOR gate,
i.e., we adopt the worst-case delay of the XOR gate as 1 unit in our measurements of the
critical path. Then we look at each standard cell, pick the maximum of the switching
characteristics over all in-out paths of the cell, and divide it by the maximum delay of the
XOR gate, so that we get a normalized delay for each of the gates utilized.

For multiplexers (MUX and NMUX) we ignore the propagation delay of the select bit,
since in most cases the select bit is an input to the circuit. For example, in the combined
SBox the select bit says whether we compute the forward or the inverse SBox; that selection
is ready as an input signal and does not switch while the circuit signals propagate, so it can
be regarded as a stable signal.

The method proposed above is similar to the idea of GE, but adapted for computing
the depth of a circuit, normalized in XOR delays. The reason to choose XOR as the base
element for delay counting is that these circuits often contain many XOR gates, and it
thus becomes possible to compare depths between the standard and the technology
methods as well. For example, in our SBox the critical path contains 14 gates, most of
which are XORs, but in reality the depth is equivalent to only 12.26 XOR-delays,
since the critical path also contains faster gates.
The area and delays of Samsung's STD90/MDL90 0.35 µm gates are summarized
in Table 6.

Table 6: Technology gates' area and delays based on [Sam00].

Std. cell        XOR   XNOR  AND   NAND  OR    NOR   MUX   NMUX  NOT   D-Flop/Q
Ref. in [Sam00]  [XO2] [XN2] [AD2] [ND2] [OR2] [NR2] [MX2] [MX2I][IV]  [FD1Q]
Our short ref.   XO    XN    AD    ND    OR    NR    MX    MI    IV    FD
Area (GE)        2.33  2.33  1.33  1.00  1.33  1.00  2.33  2.67  0.67  4.33
Delay (XORs)     1.000 0.993 0.644 0.418 0.840 0.542 0.775 1.056 0.359 1.242
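To make the two measures concrete, the following small stand-alone snippet (our own illustration, not part of the paper's tooling) computes the technology-method depth of a critical path from the [Sam00] figures in Table 6; summing the area column over all gates of a circuit gives its GE area in the same way. The example path is the 14-gate critical path mentioned above.

```cpp
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Area (GE) and delay (XOR units) per standard cell, from Table 6 [Sam00].
    std::map<std::string, std::pair<double, double>> cell = {
        {"XOR", {2.33, 1.000}}, {"XNOR", {2.33, 0.993}}, {"AND", {1.33, 0.644}},
        {"NAND", {1.00, 0.418}}, {"OR", {1.33, 0.840}},  {"NOR", {1.00, 0.542}},
        {"MUX", {2.33, 0.775}}, {"NMUX", {2.67, 1.056}}, {"NOT", {0.67, 0.359}}};

    // Example: 8 XOR + 2 XNOR + 1 NAND + 2 NOR + 1 MUX (14 standard gates).
    std::vector<std::string> path = {"XOR","XOR","XOR","XOR","XOR","XOR","XOR","XOR",
                                     "XNOR","XNOR","NAND","NOR","NOR","MUX"};
    double depth = 0.0;
    for (const std::string& g : path) depth += cell[g].second;
    std::printf("technology depth = %.3f XOR-delays (%zu std. gates)\n",
                depth, path.size());                 // 12.263 for this path
    // Summing cell[g].first over all gates of a circuit gives its area in GE.
    return 0;
}
```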
B Algorithmic details and improvements
In this section we present some more details to various algorithms previously described in
the paper.
B.1 On the computation of δ(S, t_i) in Section 4.2.5

In this section we give a more detailed presentation of how the computation of δ(S, t_i)
can be done. A slightly re-organized set of algorithms for computing δ(S, t_i) is given by
Algorithms 2, 3, and 4.
Algorithm 2 Computation of all distances
1: function Distances2(S, T, maxδ) = {δ_i}, i = 0, …, m − 1
2:   Init δ_i = ∞ for i = 0, …, m − 1
3:   Init ∀p: V_0[p] = p.d if p ∈ S, otherwise ∞
4:   Init k = 0
5:   while true do
6:     while ∃i: δ_i = ∞ and V_k[t_i] ≤ t_i.d do δ_i = k
7:     if ∀i: δ_i < ∞ then return OK
8:     if k ≥ maxδ then return FAIL
9:     k ← k + 1
10:    Init ∀p: V_k[p] = ∞
11:    for l ← ⌊k/2⌋ to k − 1 do
12:      ConvolutionXOR(V_k, V_{k−l−1}, V_l)
13:      ConvolutionMUX(V_k, V_{k−l−1}, V_l)
14:      ConvolutionMUX(V_k, V_l, V_{k−l−1})
Algorithm 3 Convolution of XOR gates
1: function ConvolutionXOR(V, A, B)
2:   for a = 0 … 2^(2n+2) − 1 do
3:     for b = 0 … 2^(2n+2) − 1 do
4:       d = max{A[a], B[b]} + 1
5:       p = a ⊕ b                              ▷ XOR(a, b) gate
6:       if V[p] > d then V[p] = d
7:       p = a ⊕ b ⊕ (1; 0…0; 1; 0…0)           ▷ XNOR(a, b) gate
8:       if V[p] > d then V[p] = d
There are two convolution algorithms, one for XOR gates and one for MUX gates, and they
can be performed independently. The MUX-convolution can be done in linear time O(N):
we first collect the smallest distances for all possible F-values and I-values independently
(each of which has √N possible indexes), then the MUX gate can be applied to any of the
combinations, so the convolution costs O(√N · √N) = O(N). The XOR-convolution is a bit more
complicated and has quadratic complexity O(N²) in the general case.
Algorithm 4 Convolution of MUX gates
1: function ConvolutionMUX(V, A, B)
2:   Init ∀i = 0, …, 2^(n+1) − 1: F[i] = I[i] = ∞
3:   for a = 0 … 2^(2n+2) − 1 do
4:     Set f = a ÷ 2^(n+1)                      ▷ high half of a, related to the F part
5:     Set i = a mod 2^(n+1)                    ▷ low half of a, related to the I part
6:     if F[f] > A[a] then F[f] = A[a]
7:     if I[i] > B[a] then I[i] = B[a]
8:   for f = 0 … 2^(n+1) − 1 do
9:     for i = 0 … 2^(n+1) − 1 do
10:      d = max{F[f], I[i]} + 1
11:      p = (f · 2^(n+1) + i)                  ▷ MUX(ZF; f; i) gate
12:      if V[p] > d then V[p] = d
13:      p = p ⊕ (1; 0…0; 1; 0…0)               ▷ NMUX(ZF; f; i) gate
14:      if V[p] > d then V[p] = d

Algorithmic improvements.
Assume that for some S we have already computed all distances δ_i = δ(S, t_i). For each
candidate c from C we add it to S so that S′ = S ∪ c, and then we need to compute all
distances δ′_i = δ(S′, t_i) in order to compute the metrics and decide which c is good.
Note that adding a single candidate c implies δ′_i ≤ δ_i for every target t_i. Therefore, we
should modify the algorithm Distances(S′, T, maxδ) such that we set maxδ = max{δ_i} − 1,
and check in the end: if δ′_i = ∞ then δ′_i = maxδ. This simple trick helps to avoid the
computation of the last vector V_k and effectively speeds up the computations by up to
a factor of 20.
Generation of the candidates C involves testing whether a candidate is already in C or in S
(with the same delay); such candidates need to be ignored. To speed up this part we can use a
temporary vector Z[N] of length N, where all cells are initialized to ∞, and then for each
point s from S we set Z[s.p] = s.d. Then, when a new candidate c is generated, we simply
update the table Z[c.p] = min{c.d, Z[c.p]}. In the end we remove the S points from Z and
generate C from Z as follows: for all i = 0, …, N − 1, if Z[i] < ∞ then add the candidate
c = {.p = i, .d = Z[i]} to C. This way we construct C with unique candidates that also
have the smallest depths.
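A minimal sketch of this bookkeeping (our own reading of the description above, with assumed data types; not the authors' code) could look as follows:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// A point is a pair {p, d}: index of a computable signal and its depth/delay.
struct Point { uint32_t p; uint8_t d; };
constexpr uint8_t INF = std::numeric_limits<uint8_t>::max();

// S: points already in the circuit; raw: newly generated candidate points;
// N: total number of possible points. Returns unique candidates with the
// smallest delay per point, excluding points that are already in S.
std::vector<Point> buildCandidates(const std::vector<Point>& S,
                                   const std::vector<Point>& raw, std::size_t N) {
    std::vector<uint8_t> Z(N, INF);
    for (const Point& s : S) Z[s.p] = s.d;        // mark existing points
    for (const Point& c : raw)                    // keep the smallest delay per point
        if (c.d < Z[c.p]) Z[c.p] = c.d;
    for (const Point& s : S) Z[s.p] = INF;        // remove the S points again
    std::vector<Point> C;
    for (uint32_t i = 0; i < N; ++i)
        if (Z[i] != INF) C.push_back({i, Z[i]});  // unique, minimal-depth candidates
    return C;
}
```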
Architectural improvements.
MUX(a, b) and MUX(b, a) can be combined in a single MUX-convolution function. In
max{d1, d2} + 1, the +1 operation can be moved outside the convolution functions and
applied after the convolutions instead. The XOR with the pattern (1; 0…0; 1; 0…0)
is done in order to include gates with negated output; those operations can also be moved
outside the convolution functions and be performed in the main function Distances() in
linear time. This helps to reduce the number of operations in the critical loop of the
function ConvolutionXOR(); basically, this doubles the speed. When A = B, then in
ConvolutionXOR() we only need to run b starting from a. When B is not equal to V_0,
then ConvolutionXOR() can be done on only half of the values of b, since we know
that all vectors V_k for k > 0 are symmetric with regard to NOT-gates. When A[a] = ∞ in
ConvolutionXOR() we do not need to enter the inner loop over b. The same check
for B[b] ≠ ∞ is not justified since it adds an unnecessary branch to the critical loop.
Leveraging SIMD (SSSE3).
It is quite clear that ConvolutionMUX() can easily be refactored to utilize SIMD vectorized
instructions and, for example, 128-bit registers (SSE). However, it is a bit tricky to find a
way to use SIMD instructions for the function ConvolutionXOR(). First of all, assume that
each cell A[a], B[b] is of char type (one byte); then we must process b in blocks aligned to
16 bytes, since our registers are 128 bits long. Secondly, the results of p = a ⊕ b for the 16
values of b in such a block end up in a permuted destination block, but that permutation
only affects the low 4 bits of p. With the help of _mm_shuffle_epi8() we can perform the
corresponding permutation of the destination 16-byte block, where the permutation vector
only depends on the value of a mod 16 (recall that b ≡ 0 mod 16 at the start of each block).
Those permutation vectors can be hard-coded in a constant table. The other operations within
ConvolutionXOR() are trivial to implement. One could also try to utilize 256-bit
registers, thus speeding up the algorithms even more.
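As an illustration of the SIMD idea above, here is a hedged sketch of the inner loop for one fixed source point a (the data layout and helper name are our assumptions; the real implementation moves the "+1" outside the convolution and hard-codes the shuffle controls in a constant table, as described):

```cpp
#include <emmintrin.h>   // SSE2
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8
#include <cstdint>
#include <cstddef>

// One row of ConvolutionXOR for a fixed source point 'a' with finite delay da
// (the caller skips a when A[a] is "infinite", encoded here as 0xFF).
// V and B are byte arrays of length N = 2^(2n+2), 16-byte aligned.
static void convolution_xor_row(uint8_t* V, const uint8_t* B,
                                std::size_t N, std::size_t a, uint8_t da) {
    // Inside a 16-byte block the destination index p = a ^ b differs from b
    // only in the low 4 bits, so one shuffle control per (a mod 16) suffices.
    alignas(16) uint8_t ctrl[16];
    for (int i = 0; i < 16; ++i) ctrl[i] = static_cast<uint8_t>(i ^ (a & 15));
    const __m128i shuf = _mm_load_si128(reinterpret_cast<const __m128i*>(ctrl));
    const __m128i va   = _mm_set1_epi8(static_cast<char>(da));
    const __m128i one  = _mm_set1_epi8(1);

    for (std::size_t b = 0; b < N; b += 16) {
        __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i*>(B + b));
        // d = max(da, B[b..b+15]) + 1; the saturating add keeps 0xFF as "infinity".
        // (Here the "+1" is kept inline for clarity.)
        __m128i d = _mm_adds_epu8(_mm_max_epu8(va, vb), one);
        d = _mm_shuffle_epi8(d, shuf);   // reorder lanes to match the p indexes
        uint8_t* dst = V + ((a ^ b) & ~static_cast<std::size_t>(15));
        __m128i old = _mm_load_si128(reinterpret_cast<const __m128i*>(dst));
        _mm_store_si128(reinterpret_cast<__m128i*>(dst), _mm_min_epu8(old, d));
    }
}
```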
B.1.1 More on ConvolutionXOR()

One can notice that ConvolutionXOR() may be done with the help of the following
convolution:

    V[p] = Σ_{a=0}^{N−1} A[a] · B[p ⊕ a],

where the operation x · y ↦ max{x, y} and x + y ↦ min{x, y}. Thus, we have a convolution
to be done in the (min, max)-algebra. One could think of applying the Fast Walsh–Hadamard
Transform (FWHT) in O(N log N), but the problem is that this algebra does not have an
inverse element.
In [BHWZ94] there is an algorithm "MinConv" that can be converted into our convolution
problem, and it is claimed to work in "around and on average" O(N log N) time. The
idea behind MinConv is to sort the A and B vectors, so that we get the smallest delays in the
beginning of A and B. Thus, we can enumerate the max{A[a], B[b]} delays
starting from the smallest. We should also take care of the indexes while sorting A and B,
so that we can find the destination point p = a ⊕ b. Every point p hit the first time
receives the smallest possible delay, and can thus be skipped later on. The idea is that the
expected number of hits needed to cover all N points of the result should be around N log N.
We have implemented this, but it did not demonstrate a speed-up and actually performed
slower than our SIMD-improved quadratic algorithm on our input size (n = 8, N = 2^18).
Also, the above algorithm cannot be parallelized.
B.1.2 ConvolutionXOR() in O(maxDelay² · N log N) time

Usually the delay values stored in the V vectors are small. We can rely on that fact in order
to develop an algorithm that may be faster than O(N²).

The idea is simple. Construct two vectors Ax[] and By[] such that Ax[p] = 1 if A[p] = x,
otherwise Ax[p] = 0; do the same for By[]. Then compute the convolution of the two Boolean
vectors Ax and By through the classical FWHT in O(N log N). Let Cd[] be the
result of that convolution, with d = max{x, y} + 1. Then we know that if Cd[p] ≠ 0 then
the point p may have depth d, so we just make a linear loop over Cd[p] and check: if
Cd[p] ≠ 0 and V[p] > d, then V[p] = d. We should repeat the above for all combinations
of x, y = 0, …, maxDelay, each step of which has complexity O(N log N). The value of
maxDelay can be determined at the beginning of the algorithm in linear time. Also note
that maxDelay may be different for A and B, so that x and y may have different ranges.
B.1.3 ConvolutionXOR() in O(|S|²) time

When constructing the vector V_1 from the initial V_0 it is worth doing it the classical way and
running through pairs of points of S, instead of doing the full-scale convolution over N points.
However, the number of newly generated points grows very rapidly, and this method can
only be applied to the very first V's (in our experiments we have only seen a "win" for
V_1; for further V_k, k > 1, we have used our SIMD-optimized convolution algorithms).
B.2 Alternative equations for the INV block

In case we want to avoid multiplexers in the INV block, there is an alternative set
of equations that we also present in this section. We have considered each expression
independently, using a general depth-3 expression:

    Y_i = ((X_a op_1 X_b) op_5 (X_c op_2 X_d)) op_7 ((X_e op_3 X_f) op_6 (X_g op_4 X_h)),

where X_a, …, X_h are terms from {0, 1, X0, X1, X2, X3} and op_1, …, op_7 are operators from the
set of standard gates {AND, OR, XOR, NAND, NOR, XNOR}. Note that the above does not
need to have all terms; for example, the expression AND(x, x) is simply x.
The exhaustive search can be organized as follows. Let us have an object Term which
consists of a truth table TT of length 16 bits, based on the 4 bits X0, …, X3, and a Boolean
function associated with the term. We start with the initial set of available terms
T^(0) = {0, 1, X0, …, X3}, and then construct an expression for a chosen Y_i iteratively.
Assume that at some step k we have the set of available terms T^(k); then the next set of terms
and associated expressions can be obtained as

    T^(k+1) = {T^(k), T^(k) operator T^(k)},

taking care of unique terms. At some step k we will get one or more term(s) whose TTs
are equal to the target TTs (the Y_i's).
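A compact sketch of this enumeration over 16-bit truth tables (our own simplified illustration under assumed conventions; a real tool would also record the expression that produced each truth table, and the target value below is only a stand-in) might look like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_set>
#include <vector>

int main() {
    // 16-bit truth tables of {0, 1, X0, X1, X2, X3} over all assignments of
    // (X0, X1, X2, X3); bit i of a table is the value at input combination i.
    std::vector<uint16_t> T = {0x0000, 0xFFFF, 0xAAAA, 0xCCCC, 0xF0F0, 0xFF00};
    std::unordered_set<uint16_t> seen(T.begin(), T.end());
    // Stand-in target: the parity X0^X1^X2^X3 (a real run targets the TT of Yi).
    const uint16_t target = 0x6996;

    bool found = seen.count(target) != 0;
    for (int depth = 1; depth <= 3 && !found; ++depth) {
        std::vector<uint16_t> next = T;
        for (uint16_t a : T)
            for (uint16_t b : T) {
                // the six standard two-input gates
                const uint16_t cand[6] = {
                    static_cast<uint16_t>(a & b),    static_cast<uint16_t>(a | b),
                    static_cast<uint16_t>(a ^ b),    static_cast<uint16_t>(~(a & b)),
                    static_cast<uint16_t>(~(a | b)), static_cast<uint16_t>(~(a ^ b))};
                for (uint16_t c : cand)
                    if (seen.insert(c).second) next.push_back(c);
            }
        T.swap(next);
        found = seen.count(target) != 0;
        std::printf("depth <= %d: %zu distinct truth tables, target %s\n",
                    depth, seen.size(), found ? "found" : "not found");
    }
    return 0;
}
```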
Since we could actually get multiple Boolean functions for each Y_i, we should select
only the "best" functions according to the following criteria: there are no NOT gates (due to better
sharing capabilities), the maximum number of gates can be shared between the
4 expressions for Y0, …, Y3, and the area/depth in terms of GE is small.

Using this technique, we have found a depth-3, 15-gate solution for the inversion. The
equations are given below, where we also provide depth-3 solutions for the additional 5
signals {Y01, Y23, Y02, Y13, Y00} such that they can share many gates in the mentioned
scenarios S0-S5.
Y0 = xnor( and(X0, X2) , nand(nand(X1, X2), xor(X2, X3)))
Y1 = xor(nand(xor(X2, X3), X1) , nor( and(X0, X2), X3))
Y2 = xnor( and(X0, X2) , nand( xor(X0, X1), nand(X0, X3)))
Y3 = xor(nand(xor(X0, X1), X3) , nor( and(X0, X2), X1))
Y01 = nand(nand(xor(X2, X3), X1) , nand(nand(X0, X3), X2))
Y23 = nand(nand(xor(X0, X1), X3) , nand(nand(X1, X2), X0))
Y13 = xor( nor(and(X0, X2), xnor(X1, X3)), xor(nand(X0, X3), nand(X1, X2)))
Y02 = xor(nand(xor(X2, X3), nand(X1, X2)), nand( xor(X0, X1), nand(X0, X3)))
Y00 = and(nand(and(X0, X2), xnor(X1, X3)), nor( nor(X0, X2), and(X1, X3)))
Listing 1: INV refactored, without multiplexers.
When implementing the above circuits for the scenarios S0-S5 and sharing the gates
in the best possible way, we obtained the results shown in Table 7.
Table 7: Alternative INV block and scenarios S0-S5.
INV S0 S1 S2 S3 S4 S5
Std. area (gates) 15 20 22 23 25 25 29
Std. depth (gates) 3 5 4 4 4 4 3
Tech. area (GE) 23.31 34.96 34.30 40.62 40.62 39.96 43.29
Tech. depth(XORs) 2.42 4.42 3.42 3.54 3.42 2.84 2.54
C Inverse SBoxes

The stand-alone inverse SBox is, as far as we know, not used very much, but we provide
a comparison with previously known solutions in Table 8.

Table 8: Inverse SBox: Comparison of the results.

Previous results:
- Canright [Can05] '05 (most famous design): area 81XO+34ND+6NR; 121 std. gates; 228.73 GE. Depth: composition not reported; 25(?) gates; tech. XORs unknown.
- Boyar et al. [BP12] '12 (our starting point): area 93XO+34AD; 127 std. gates; 261.91 GE. Depth 13XO+3AD; 16 gates; 14.932 XORs.

Our results:
- Inverse (fast), fast with depth 12: area 68XO+10XN+41ND+5NR+6MX; 130 std. gates; 241.72 GE. Depth 7XO+1XN+1ND+2NR+1MX; 12 gates; 10.270 XORs.
- Inverse (tradeoff), area/speed tradeoff: area 64XO+4XN+27ND+5NR+8MX+2MI (+1IV); 110(+1) std. gates; 215.09 GE. Depth 9XO+1XN+1ND+2NR+1MX; 14 gates; 12.270 XORs.
- Inverse (bonus), new record smallest: area 56XO+7XN+27ND+5NR+6MX (+1IV); 101(+1) std. gates; 193.44 GE. Depth 19XO+2XN+1ND+2NR+1MX; 25 gates; 23.263 XORs.
D Circuits
D.1 Preliminaries
In the listings below we present 9 circuits for the forward, inverse, and combined SBoxes,
utilizing the two architectures A (small) and D (fast).

The symbols used are:

#comment — a comment line.
@filename — include the code from another file 'filename', the listing of which is also given in this section.
a^b — the usual XOR gate; other gates are denoted explicitly and are taken from the set {XNOR, AND, NAND, OR, NOR, MUX, NMUX, NOT}.
(a op b) — where the order of execution (the order of gate connections) is important, we specify it with brackets.

The input to all SBoxes is the 8 signals {U0..U7} and the output is the 8 signals
{R0..R7}. The input and output bits are represented in big-endian bit order. For the
combined SBoxes the input has the additional signals ZF and ZI, where ZF=1 if we perform the
forward SBox and ZF=0 otherwise (inverse); the signal ZI is the complement of ZF. We
have tested all the proposed circuits and verified their correctness.

The circuits are divided into sub-programs, according to Figure 3. In Section D.2 we
describe the common shared components, and then for each solution we give the components
(common or specific) of the circuit.
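For reference, one possible way (our own hedged sketch, not the authors' verification code) to map the listing notation onto software gates when checking the circuits; in particular, the MUX/NMUX selector convention below is our assumption, not something stated explicitly in the listings:

```cpp
#include <cstdint>

// One bit per signal (0 or 1). The two-input gates are unambiguous; for the
// 3-input MUX/NMUX we assume the first operand is the select signal and that
// it picks the second operand when 0 and the third when 1 -- swap the last two
// arguments if the opposite convention turns out to be intended.
typedef uint8_t bit;

static inline bit XOR2 (bit a, bit b) { return a ^ b; }
static inline bit XNOR2(bit a, bit b) { return (a ^ b) ^ 1; }
static inline bit AND2 (bit a, bit b) { return a & b; }
static inline bit NAND2(bit a, bit b) { return (a & b) ^ 1; }
static inline bit OR2  (bit a, bit b) { return a | b; }
static inline bit NOR2 (bit a, bit b) { return (a | b) ^ 1; }
static inline bit NOT1 (bit a)        { return a ^ 1; }
static inline bit MUX3 (bit s, bit a, bit b) { return s ? b : a; }
static inline bit NMUX3(bit s, bit a, bit b) { return MUX3(s, a, b) ^ 1; }

// A verification harness then translates a listing line by line into calls of
// these helpers, runs all 256 inputs {U7..U0} (plus ZF/ZI for the combined
// SBox), and compares {R7..R0} against a reference AES SBox table.
```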
The circuits are divided into sub-programs, according to Figure 3. In Section D.2 we
describe the common shared components, and then for each solution we give components
(common or specific) for the circuits.
D.2 Shared components
#File: mulx.a
T20 = NAND(Q6, Q12)
T21 = NAND(Q3, Q14)
T22 = NAND(Q1, Q16)
T10 = (NOR(Q3, Q14) ^ NAND(Q0, Q7))
T11 = (NOR(Q4, Q13) ^ NAND(Q10, Q11))
T12 = (NOR(Q2, Q17) ^ NAND(Q5, Q9))
T13 = (NOR(Q8, Q15) ^ NAND(Q2, Q17))
X0 = T10 ^ (T20 ^ T22)
X1 = T11 ^ (T21 ^ T20)
X2 = T12 ^ (T21 ^ T22)
X3 = T13 ^ (T21 ^ NAND(Q4, Q13))
#File: 8xor4.d
R0 = (K0 ^ K1 ) ^ (K2 ^ K3 )
R1 = (K4 ^ K5 ) ^ (K6 ^ K7 )
R2 = (K8 ^ K9 ) ^ (K10 ^ K11)
R3 = (K12 ^ K13) ^ (K14 ^ K15)
R4 = (K16 ^ K17) ^ (K18 ^ K19)
R5 = (K20 ^ K21) ^ (K22 ^ K23)
R6 = (K24 ^ K25) ^ (K26 ^ K27)
R7 = (K28 ^ K29) ^ (K30 ^ K31)
Listing 2: MULX/8XOR4: Shared components.
#File: inv.a
T0 = NAND(X0, X2)
T1 = NOR(X1, X3)
T2 = XNOR(T0, T1)
Y0 = MUX(X2, T2, X3)
Y2 = MUX(X0, T2, X1)
T3 = MUX(X1, X2, 1)
Y1 = MUX(T2, X3, T3)
T4 = MUX(X3, X0, 1)
Y3 = MUX(T2, X1, T4)
#File: s0.a
@inv.a
Y02 = Y2 ^ Y0
Y13 = Y3 ^ Y1
Y23 = Y3 ^ Y2
Y01 = Y1 ^ Y0
Y00 = Y02 ^ Y13
#File: s1.a
@inv.a
T5 = MUX(X0, T0, X3)
Y23 = MUX(X1, T5, X0)
T6 = NMUX(T3, X2, X3)
Y01 = NMUX(T0, T6, X3)
Y02 = Y2 ^ Y0
Y13 = Y3 ^ Y1
Y00 = Y01 ^ Y23
#File: s2.a
T0 = XNOR(X1, X3)
T1 = OR(X1, X3)
T2 = XOR(X0, X2)
T3 = XOR(T1, T2)
Y0 = MUX(X2, T3, X3)
Y2 = MUX(X0, T3, X1)
T4 = MUX(T2, X3, T0)
Y3 = MUX(T4, X0, X1)
T5 = NMUX(X1, X0, X2)
Y02 = MUX(T0, T2, T5)
T6 = MUX(X0, T0, X1)
Y1 = MUX(T6, X2, X3)
T7 = NMUX(X2, X3, X1)
Y13 = NMUX(T2, T7, T0)
Y23 = Y3 ^ Y2
Y01 = Y1 ^ Y0
Y00 = Y02 ^ Y13
#File: s3.a
T0 = XNOR(X1, X3)
T1 = OR(X1, X3)
T2 = XNOR(X0, X2)
T3 = XNOR(T1, T2)
Y0 = MUX(X2, T3, X3)
Y2 = MUX(X0, T3, X1)
T4 = MUX(T2, T0, X3)
Y3 = MUX(T4, X0, X1)
T5 = MUX(T2, X0, T1)
Y00 = NMUX(T5, T0, T2)
T6 = MUX(T2, T0, X1)
Y1 = MUX(T6, X2, X3)
Y02 = Y2 ^ Y0
Y13 = Y3 ^ Y1
Y23 = Y3 ^ Y2
Y01 = Y1 ^ Y0
#File: s4.a
T0 = NAND(X0, X2)
T1 = NOR(X1, X3)
T2 = NMUX(X2, T1, X3)
Y0 = XOR(T0, T2)
T3 = MUX(X0, T1, X1)
Y2 = XNOR(T0, T3)
Y02 = XNOR(T2, T3)
T4 = MUX(X3, X0, T0)
T5 = MUX(X1, X2, T0)
Y13 = XOR(T4, T5)
T6 = XOR(X3, T0)
Y3 = MUX(T6, X1, X0)
T7 = MUX(X0, T0, X3)
Y23 = MUX(X1, T7, X0)
T8 = NMUX(T0, X1, T1)
Y1 = MUX(T8, X3, X2)
T9 = MUX(X2, T0, X1)
Y01 = MUX(X3, T9, X2)
Y00 = Y02 ^ Y13
#File: s5.a
T0 = XOR(X0, X2)
T1 = XNOR(X1, X3)
T2 = OR(X1, X3)
T3 = XOR(T0, T2)
Y0 = MUX(X2, T3, X3)
Y2 = MUX(X0, T3, X1)
T4 = MUX(X2, T1, X3)
Y3 = MUX(T4, X0, X1)
T5 = MUX(T0, X1, T1)
Y1 = MUX(T5, X2, X3)
T6 = NMUX(T1, T0, X0)
Y00 = NMUX(T3, T6, T1)
T7 = MUX(X0, T0, T1)
Y23 = MUX(X1, T7, X0)
Y13 = NMUX(T5, T6, T7)
T8 = MUX(X1, X0, X2)
Y02 = NMUX(T1, T6, T8)
T9 = MUX(X2, T0, T1)
Y01 = MUX(X3, T9, X2)
Listing 3: INV/S0-S5: Shared components.
An alternative set of equations for the INV block is given in Appendix B.2.
#File: muln.a
N0 = NAND(Y01, Q11)
N1 = NAND(Y0 , Q12)
N2 = NAND(Y1 , Q0 )
N3 = NAND(Y23, Q17)
N4 = NAND(Y2 , Q5 )
N5 = NAND(Y3 , Q15)
N6 = NAND(Y13, Q14)
N7 = NAND(Y00, Q16)
N8 = NAND(Y02, Q13)
N9 = NAND(Y01, Q7 )
N10 = NAND(Y0 , Q10)
N11 = NAND(Y1 , Q6 )
N12 = NAND(Y23, Q2 )
N13 = NAND(Y2 , Q9 )
N14 = NAND(Y3 , Q8 )
N15 = NAND(Y13, Q3 )
N16 = NAND(Y00, Q1 )
N17 = NAND(Y02, Q4 )
#File: mull.d
K0 = NAND(Y0, L0 )
K12 = NAND(Y0, L12)
K16 = NAND(Y0, L16)
K20 = NAND(Y0, L20)
K1 = NAND(Y1, L1 )
K5 = NAND(Y1, L5 )
K9 = NAND(Y1, L9 )
K13 = NAND(Y1, L13)
K17 = NAND(Y1, L17)
K21 = NAND(Y1, L21)
K25 = NAND(Y1, L25)
K29 = NAND(Y1, L29)
K2 = NAND(Y2, L2 )
K6 = NAND(Y2, L6 )
K10 = NAND(Y2, L10)
K14 = NAND(Y2, L14)
K18 = NAND(Y2, L18)
K22 = NAND(Y2, L22)
K26 = NAND(Y2, L26)
K30 = NAND(Y2, L30)
K3 = NAND(Y3, L3 )
K7 = NAND(Y3, L7 )
K11 = NAND(Y3, L11)
K15 = NAND(Y3, L15)
K19 = NAND(Y3, L19)
K23 = NAND(Y3, L23)
K27 = NAND(Y3, L27)
K31 = NAND(Y3, L31)
#File: mull.f
K4 = AND(Y0, L4 )
K8 = AND(Y0, L8 )
K24 = AND(Y0, L24)
K28 = AND(Y0, L28)
#File: mull.i
K4 = NAND(Y0, L4 )
K8 = NAND(Y0, L8 )
K24 = NAND(Y0, L24)
K28 = NAND(Y0, L28)
#File: mull.c
K4 = NAND(Y0, L4 ) ^ ZF
K8 = NAND(Y0, L8 ) ^ ZF
K24 = NAND(Y0, L24) ^ ZF
K28 = NAND(Y0, L28) ^ ZF
Listing 4: MULN/MULL: Shared components.
D.3 Forward SBox (fast)
#Forward (fast)
@ftop.d
@mulx.a
@inv.a
@mull.f
@mull.d
@8xor4.d
#File: ftop.d
# Exhaustive search
Z18 = U1 ^ U4
L28 = Z18 ^ U6
Q0 = U2 ^ L28
Z96 = U5 ^ U6
Q1 = U0 ^ Z96
Z160= U5 ^ U7
Q2 = U6 ^ Z160
Q11 = U2 ^ U3
L6 = U4 ^ Z96
Q3 = Q11 ^ L6
Q16 = U0 ^ Q11
Q4 = Q16 ^ U4
Q5 = Z18 ^ Z160
Z10 = U1 ^ U3
Q6 = Z10 ^ Q2
Q7 = U0 ^ U7
Z36 = U2 ^ U5
Q8 = Z36 ^ Q5
L19 = U2 ^ Z96
Q9 = Z18 ^ L19
Q10 = Z10 ^ Q1
Q12 = U3 ^ L28
Q13 = U3 ^ Q2
L10 = Z36 ^ Q7
Q14 = U6 ^ L10
Q15 = U0 ^ Q5
L8 = U3 ^ Q5
L12 = Q16 ^ Q2
L16 = U2 ^ Q4
L15 = U1 ^ Z96
L31 = Q16 ^ L15
L5 = Q12 ^ L31
L13 = U3 ^ Q8
L17 = U4 ^ L10
L29 = Z96 ^ L10
L14 = Q11 ^ L10
L26 = Q11 ^ Q5
L30 = Q11 ^ U6
L7 = Q12 ^ Q1
L11 = Q12 ^ L15
L27 = L30 ^ L10
Q17 = U0
L0 = Q10
L4 = U6
L20 = Q0
L24 = Q16
L1 = Q6
L9 = U5
L21 = Q11
L25 = Q13
L2 = Q9
L18 = U1
L22 = Q15
L3 = Q8
L23 = U0
Listing 5: Forward SBox with the smallest delay (fast)
D.4 Forward SBox (tradeoff)
#Forward (tradeoff)
@ftop.a
@mulx.a
@s1.a
@muln.a
@fbot.a
#File: ftop.a
# Exhaustive search
Z6 = U1 ^ U2
Q12 = Z6 ^ U3
Q11 = U4 ^ U5
Q0 = Q12 ^ Q11
Z9 = U0 ^ U3
Z80 = U4 ^ U6
Q1 = Z9 ^ Z80
Q7 = Z6 ^ U7
Q2 = Q1 ^ Q7
Q3 = Q1 ^ U7
Q13 = U5 ^ Z80
Q5 = Q12 ^ Q13
Z66 = U1 ^ U6
Z114= Q11 ^ Z66
Q6 = U7 ^ Z114
Q8 = Q1 ^ Z114
Q9 = Q7 ^ Z114
Q10 = U2 ^ Q13
Q16 = Z9 ^ Z66
Q14 = Q16 ^ Q13
Q15 = U0 ^ U2
Q17 = Z9 ^ Z114
Q4 = U7
#File: fbot.a
# Probabilistic heuristic
H0 = N3 ^ N8
H1 = N5 ^ N6
H2 = XNOR(H0, H1)
H3 = N1 ^ N4
H4 = N9 ^ N10
H5 = N13 ^ N14
H6 = N15 ^ H4
H7 = N0 ^ H3
H8 = N17 ^ H5
H9 = N3 ^ H7
H10 = N15 ^ N17
H11 = N9 ^ N11
H12 = N12 ^ N14
H13 = N1 ^ N2
H14 = N5 ^ N16
H15 = N7 ^ H11
H16 = H10 ^ H11
H17 = N16 ^ H8
H18 = H6 ^ H8
H19 = H10 ^ H12
H20 = N2 ^ H3
H21 = H6 ^ H14
H22 = N8 ^ H12
H23 = H13 ^ H15
R0 = XNOR(H16, H2)
R1 = H2
R2 = XNOR(H20, H21)
R3 = XNOR(H17, H2)
R4 = XNOR(H18, H2)
R5 = H22 ^ H23
R6 = XNOR(H19, H9)
R7 = XNOR(H9, H18)
Listing 6: Forward SBox circuit with area/depth trade-off (tradeoff)
D.5 Forward SBox (bonus)
We include these bonus circuits just to update the world record for the smallest SBox.
The new record is 102 gates with depth 24.
#Forward (bonus)
@ftop.b
@mulx.a
@s0.a
@muln.a
@fbot.b
#File: ftop.b
Z24 = U3 ^ U4
Q17 = U1 ^ U7
Q16 = U5 ^ Q17
Q0 = Z24 ^ Q16
Z66 = U1 ^ U6
Q7 = Z24 ^ Z66
Q2 = U2 ^ Q0
Q1 = Q7 ^ Q2
Q3 = U0 ^ Q7
Q4 = U0 ^ Q2
Q5 = U1 ^ Q4
Q6 = U2 ^ U3
Q10 = Q6 ^ Q7
Q8 = U0 ^ Q10
Q9 = Q8 ^ Q2
Q12 = Z24 ^ Q17
Q15 = U7 ^ Q4
Q13 = Z24 ^ Q15
Q14 = Q15 ^ Q0
Q11 = U5
#File: fbot.b
H0 = N1 ^ N5
H1 = N4 ^ H0
R2 = XNOR(N2, H1)
H2 = N9 ^ N15
H3 = N11 ^ N17
R6 = XNOR(H2, H3)
H4 = N11 ^ N14
H5 = N9 ^ N12
R5 = H4 ^ H5
H6 = N16 ^ H2
H7 = R2 ^ R6
H8 = N10 ^ H7
R7 = XNOR(H6, H8)
H9 = N8 ^ H1
H10 = N13 ^ H8
R3 = H5 ^ H10
H11 = H9 ^ H10
H12 = N7 ^ H11
H13 = H4 ^ H12
R4 = N1 ^ H13
H14 = XNOR(N0, R7)
H15 = H9 ^ H14
H16 = H7 ^ H15
R1 = XNOR(N6, H16)
H17 = N4 ^ H14
H18 = N3 ^ H17
R0 = H13 ^ H18
Listing 7: Forward SBox circuit with the smallest number of gates (bonus)
D.6 Combined SBox (fast)
#Combined (fast)
@ctop.d
@mulx.a
@inv.a
@mull.c
@mull.d
@8xor4.d
#File: ctop.d
# Floating multiplexers
A0 = XNOR(U2, U4)
A1 = XNOR(U1, A0)
A2 = XNOR(U5, U7)
A3 = U0 ^ U5
A4 = XNOR(U3, U6)
A5 = U2 ^ U6
A6 = NMUX(ZF, A4, U1)
Q11 = A5 ^ A6
Q16 = U0 ^ Q11
A7 = U3 ^ A1
L24 = MUX(ZF, Q16, A7)
A8 = NMUX(ZF, A3, U6)
L5 = A0 ^ A8
L11 = Q16 ^ L5
A9 = MUX(ZF, U2, U6)
A10 = XNOR(A2, A9)
Q5 = A1 ^ A10
Q15 = U0 ^ Q5
A11 = U2 ^ U3
A12 = NMUX(ZF, A2, A11)
Q13 = A6 ^ A12
Q12 = Q5 ^ Q13
A13 = A5 ^ A12
Q0 = Q5 ^ A13
Q14 = U0 ^ A13
A14 = XNOR(U3, A3)
A15 = NMUX(ZF, A0, U3)
A16 = XNOR(U5, A15)
Q3 = A4 ^ A16
L6 = Q11 ^ Q3
A17 = U2 ^ A10
Q7 = XNOR(A8, A17)
A18 = NMUX(ZF, A14, A2)
Q1 = XNOR(A4, A18)
Q4 = XNOR(A16, A18)
L7 = Q12 ^ Q1
L8 = Q7 ^ L7
A19 = NMUX(ZF, U1, A4)
A20 = XNOR(U6, A19)
Q9 = XNOR(A16, A20)
Q10 = A18 ^ A20
L9 = Q0 ^ Q9
A21 = U1 ^ A2
A22 = NMUX(ZF, A21, A5)
Q2 = A20 ^ A22
Q6 = XNOR(A4, A22)
Q8 = XNOR(A16, A22)
A23 = XNOR(Q5, Q9)
L10 = XNOR(Q1, A23)
L4 = Q14 ^ L10
A24 = NMUX(ZF, Q2, L4)
L12 = XNOR(Q16, A24)
L25 = XNOR(U3, A24)
A25 = MUX(ZF, L10, A3)
L17 = U4 ^ A25
A26 = MUX(ZF, A10, Q4)
L14 = L24 ^ A26
L23 = A25 ^ A26
A27 = MUX(ZF, A1, U5)
L30 = Q12 ^ A27
A28 = NMUX(ZF, L10, L5)
L21 = XNOR(L14, A28)
L27 = XNOR(L30, A28)
A29 = XNOR(U5, L4)
L29 = A28 ^ A29
L15 = A19 ^ A29
A30 = XNOR(A3, A10)
L18 = NMUX(ZF, A19, A30)
A31 = XNOR(A7, A21)
L16 = A25 ^ A31
L26 = L18 ^ A31
A32 = MUX(ZF, U7, A5)
L13 = A7 ^ A32
A33 = NMUX(ZF, A15, U0)
L19 = XNOR(L6, A33)
A34 = NOR(ZF, U6)
L20 = Q0 ^ A34
A35 = XNOR(A4, A8)
L28 = XNOR(L7, A35)
A36 = NMUX(ZF, Q6, L11)
L31 = A30 ^ A36
A37 = MUX(ZF, L26, A0)
L22 = Q16 ^ A37
Q17 = U0
L0 = Q10
L1 = Q6
L2 = Q9
L3 = Q8
Listing 8: Combined SBox circuit with the smallest delay
D.7 Combined SBox (tradeoff)
#Combined (tradeoff)
@ctop.a
@mulx.a
@s1.a
@muln.a
@cbot.a
#File: ctop.a
# Floating multiplexers
A0 = XNOR(U0, U6)
Q1 = XNOR(U1, ZF)
A1 = U2 ^ U5
A2 = XNOR(U3, U4)
A3 = XNOR(U3, U7)
A4 = MUX(ZF, A2, U2)
A5 = A0 ^ A1
Q6 = A4 ^ A5
A6 = XNOR(Q1, A1)
A7 = NMUX(ZF, U0, A3)
Q4 = A5 ^ A7
Q3 = Q1 ^ Q4
A8 = NMUX(ZF, U6, A2)
A9 = Q1 ^ A3
Q9 = A8 ^ A9
Q10 = Q4 ^ Q9
A10 = XNOR(A4, A7)
Q7 = XNOR(Q9, A10)
Q8 = XNOR(Q1, A10)
A11 = XNOR(U0, U2)
Q0 = ZF ^ A11
A12 = U1 ^ U3
A13 = A1 ^ A12
A14 = MUX(ZF, A13, A11)
Q15 = U4 ^ A14
A15 = NMUX(ZF, U5, A0)
Q5 = XNOR(A14, A15)
Q17 = XNOR(U4, A15)
A16 = MUX(ZF, A5, A2)
Q16 = XNOR(A13, A16)
A17 = A3 ^ A8
Q2 = XNOR(A10, A17)
A18 = U4 ^ U6
A19 = U1 ^ U2
Q11 = Q6 ^ A19
A20 = MUX(ZF, A18, A19)
Q13 = U5 ^ A20
A21 = XNOR(U4, Q0)
Q14 = XNOR(A14, A21)
A22 = XNOR(A4, A6)
Q12 = XNOR(U6, A22)
#File: cbot.a
# Probabilistic heuristic
H1 = N1 ^ N3
H3 = N15 ^ N17
H4 = N12 ^ N13
H5 = N0 ^ H1
H6 = N7 ^ N8
H8 = N10 ^ N11
H9 = H4 ^ H8
S4 = H3 ^ H9
H10 = N12 ^ N14
H11 = N16 ^ H8
S14 = N17 ^ H11
H12 = N1 ^ N2
H13 = N3 ^ N5
H14 = N4 ^ N5
H15 = N9 ^ N11
H16 = N6 ^ H13
H17 = H6 ^ H14
H18 = N4 ^ H5
H30 = H18 ^ ZF
S1 = H17 ^ H30
H19 = H3 ^ H15
S6 = XNOR(H18, H19)
S11 = H17 ^ H19
H20 = H10 ^ H15
S0 = XNOR(S6, H20)
S5 = H17 ^ H20
H21 = N7 ^ H12
H22 = H16 ^ H21
S12 = H20 ^ H22
S13 = S4 ^ H22
H23 = N15 ^ N16
H24 = N9 ^ N10
H25 = N8 ^ H24
H26 = H12 ^ H14
S7 = XNOR(S4, H26)
H27 = H4 ^ H23
S2 = H30 ^ H27
H28 = N8 ^ H16
S3 = S14 ^ H28
H29 = H21 ^ H25
S15 = H23 ^ H29
R0 = S0
R1 = S1
R2 = S2
R3 = MUX(ZF, S3, S11)
R4 = MUX(ZF, S4, S12)
R5 = MUX(ZF, S5, S13)
R6 = MUX(ZF, S6, S14)
R7 = MUX(ZF, S7, S15)
Listing 9: Combined SBox circuit with a good area/depth trade-off (tradeoff)
D.8 Combined SBox (bonus)
#Combined (bonus)
@ctop.b
@mulx.a
@s0.a
@muln.a
@cbot.b
#File: ctop.b
# Floating multiplexers
A0 = XNOR(U3, U6)
Q15 = XNOR(U1, ZF)
A1 = U5 ^ Q15
A2 = U2 ^ A0
A3 = U4 ^ A1
A4 = U4 ^ U6
A5 = MUX(ZF, A2, A4)
Q4 = XNOR(A3, A5)
Q0 = U0 ^ Q4
Q14 = Q15 ^ Q0
A6 = XNOR(U0, U2)
Q3 = ZF ^ A6
Q1 = Q4 ^ Q3
A7 = MUX(ZF, U1, Q0)
Q6 = XNOR(A5, A7)
Q8 = Q3 ^ Q6
A8 = MUX(ZF, Q1, A4)
Q9 = U6 ^ A8
Q2 = Q8 ^ Q9
Q10 = Q4 ^ Q9
Q7 = Q6 ^ Q10
A9 = MUX(ZF, A0, U4)
Q12 = XNOR(U7, A9)
Q11 = Q0 ^ Q12
A10 = MUX(ZF, A6, Q12)
A11 = A2 ^ A10
A12 = A4 ^ A11
Q5 = Q0 ^ A12
Q13 = Q11 ^ A12
Q17 = Q14 ^ A12
Q16 = Q14 ^ Q13
#File: cbot.b
H0 = N9 ^ N10
H1 = N16 ^ H0
H2 = N4 ^ N5
S4 = N7 ^ (N8 ^ H2)
H4 = N0 ^ N2
H6 = N15 ^ H1
H7 = H4 ^ (N3 ^ N5)
H20= H6 ^ ZF
S2 = H20 ^ H7
S14 = S4 ^ H7
H8 = N13 ^ H0
H9 = N12 ^ H8
S1 = H20 ^ H9
H10 = N17 ^ H1
H12 = H2 ^ (N1 ^ N2)
S0 = H6 ^ H12
S5 = N6 ^ (H9 ^ (N8 ^ H4))
S11 = H12 ^ S5
S6 = S1 ^ S11
H15 = N14 ^ H10
H16 = H8 ^ H15
S12 = S5 ^ H16
S7 = XNOR(S4, H10 ^ (N9 ^ N11))
H19 = XNOR(H7, S7)
S3 = H16 ^ H19
S15 = S11 ^ H19
S13 = S4 ^ (N12 ^ H15)
R0 = S0
R1 = S1
R2 = S2
R3 = MUX(ZF, S3, S11)
R4 = MUX(ZF, S4, S12)
R5 = MUX(ZF, S5, S13)
R6 = MUX(ZF, S6, S14)
R7 = MUX(ZF, S7, S15)
Listing 10: Combined SBox circuit with the smallest number of gates (bonus)
D.9 Inverse SBox (fast)
#Inverse (fast)
@itop.d
@mulx.a
@inv.a
@mull.i
@mull.d
@8xor4.d
#File: itop.d
# Exhaustive search
Q8 = XNOR(U1, U3)
Q0 = Q8 ^ U5
Q1 = U6 ^ U7
Q7 = U3 ^ U4
Q2 = Q7 ^ Q1
Q3 = U0 ^ U4
Q4 = Q3 ^ Q1
Q5 = XNOR(U1, Q3)
Q10 = XNOR(U0, U1)
Q6 = Q10 ^ Q7
Q9 = Q10 ^ Q4
L12 = U4 ^ U5
Z132= U2 ^ U7
Q11 = L12 ^ Z132
Q12 = Q0 ^ Q11
L27 = U3 ^ Z132
Q13 = U0 ^ L27
Q14 = XNOR(Q10, U2)
Q15 = Q14 ^ Q0
Q16 = XNOR(Q8, U7)
Q17 = Q16 ^ Q11
L23 = Q15 ^ Z132
L0 = U0 ^ L23
L3 = Q11 ^ Q2
L4 = Q6 ^ L3
L16 = Q3 ^ L27
L1 = XNOR(U2, U3)
L6 = L1 ^ Q0
L20 = L6 ^ Q2
L15 = XNOR(U2, Q6)
L24 = U0 ^ L15
L5 = L27 ^ Q2
L19 = Q14 ^ U5
L26 = Q3 ^ L3
L13 = L19 ^ L26
L17 = U0 ^ L12
L21 = XNOR(U1, Q1)
L25 = Q5 ^ L3
L14 = U3 ^ Q12
L18 = U0 ^ Q1
L22 = XNOR(Q5, U6)
L8 = Q11
L28 = Q7
L9 = Q12
L29 = Q10
L2 = U5
L10 = Q17
L30 = Q2
L7 = U4
L11 = Q5
L31 = Q9
Listing 11: Inverse SBox with the smallest delay (fast)
D.10 Inverse SBox (tradeoff)
#Inverse (tradeoff)
@itop.a
@mulx.a
@s1.a
@muln.a
@ibot.a
#File: itop.a
# Exhaustive search
Z20 = U2 ^ U4
Z129= U0 ^ U7
Q0 = Z20 ^ Z129
Q4 = U1 ^ Z20
Z66 = U1 ^ U6
Q3 = U3 ^ Z66
Q1 = Q4 ^ Q3
Q2 = U6 ^ Z129
Z40 = U3 ^ U5
Z132= U2 ^ U7
Q6 = Z40 ^ Z132
Q5 = U0 ^ Q6
Q7 = U3 ^ Q0
Q17 = Z66 ^ Z132
Q8 = U5 ^ Q17
Z33 = U0 ^ U5
Q10 = U4 ^ Z33
Q9 = Q4 ^ Q10
Q12 = XNOR(U4, Z129)
Q13 = XNOR(Z20, Z40)
Q16 = XNOR(Z66, U7)
Q14 = Q13 ^ Q16
Q15 = Z33 ^ Q3
Q11 = NOT(U2)
#File: ibot.a
# Probabilistic heuristic
H0 = N2 ^ N14
H1 = N1 ^ N5
H2 = N10 ^ N11
H3 = N13 ^ H0
H4 = N16 ^ N17
H5 = N1 ^ H2
H6 = N4 ^ H1
H7 = N0 ^ H4
H8 = N15 ^ N16
H9 = N9 ^ N10
H10 = N6 ^ N8
H11 = H3 ^ H6
H12 = N7 ^ N12
H13 = N8 ^ H0
H14 = N3 ^ N5
H15 = H5 ^ H8
H16 = N6 ^ N7
H17 = H12 ^ H13
H18 = H5 ^ H16
H19 = H3 ^ H10
H20 = H10 ^ H14
R0 = H7 ^ H18
R1 = H7 ^ H19
R2 = H2 ^ H11
R4 = H8 ^ H9
R3 = R4 ^ H20
R5 = N2 ^ H6
R6 = H15^ H17
R7 = H4 ^ H11
Listing 12: Inverse SBox circuit with good area/depth trade-off (tradeoff)
Note: the above ‘NOT(U2)’ in the file ‘itop.a’ is removable by setting Q11=U2 and
accurately negating some of the gates and variables downwards where Q11 is involved.
For example, the variable Y01 should be negated as well due to: N0 = NAND(Y01, Q11);
consequently, all gates involving Y01 should be negated, leading to negation of other Q
variables, and so on.
D.11 Inverse SBox (bonus)
#Inverse (bonus)
@itop.b
@mulx.a
@s0.a
@muln.a
@ibot.b
#File: itop.b
Z33 = U0 ^ U5
Z3 = U0 ^ U1
Q1 = XNOR(Z3, U3)
Q16 = XNOR(Z33, U6)
Q17 = XNOR(U1, Q16)
Q8 = U4 ^ Q17
Q3 = XNOR(U2, Z33)
Q4 = Q1 ^ Q3
Q15 = XNOR(U4, U7)
Q10 = U3 ^ Q15
Q9 = Q4 ^ Q10
Q2 = Q8 ^ Q9
Q7 = Q1 ^ Q2
Q0 = Z33 ^ Q7
Q5 = Q17 ^ Q15
Q6 = Q3 ^ Q8
Q12 = XNOR(U1, Q0)
Q14 = Q15 ^ Q0
Q13 = Q16 ^ Q14
Q11 = NOT(U1)
#File: ibot.b
H0 = N4 ^ N5
H1 = N1 ^ N2
R6 = H0 ^ H1
H2 = N13 ^ N14
H3 = R6 ^ H2
H4 = N17 ^ H3
R0 = N16 ^ H4
H5 = N15 ^ H4
H6 = N10 ^ N11
R3 = H3 ^ H6
H7 = N9 ^ H5
R5 = N10 ^ H7
H8 = N8 ^ H0
H9 = N6 ^ H8
H10 = N7 ^ R3
H11 = N1 ^ R0
H12 = N0 ^ H11
R2 = H9 ^ H12
H13 = H8 ^ H10
R1 = R2 ^ H13
H14 = H5 ^ H13
H15 = N13 ^ H14
R7 = N12 ^ H15
H16 = N4 ^ H9
H17 = R5 ^ H16
R4 = N3 ^ H17
Listing 13: Inverse SBox circuit with the smallest number of gates (bonus)
... It should be emphasized that, as for the hardware implementation, the minimization criterion should not only be the number of logic gates, but also the crystal area required for a given technology or the path length (Depth) of signal propagation or energy consumption. From the physical point of view, not all logic gates are equivalent according to the criteria mentioned, which was taken into account, for example, in studies [14], [15]. ...
... The research described in [11], [12], [13], [14], [15] is exploiting the algebraic structure assumed in the AES S-Box table, which of course significantly simplifies the minimization, compared to randomly generated data, so such or similar approaches cannot be used directly in this paper. ...
Article
Full-text available
The article is devoted to software bitsliced implementation of randomly generated 8×8 S-Box block ciphers, focused on the use of logical SIMD instructions from the SSE, AVX and AVX-512 extensions in x86-64 processors. A heuristic algorithm for minimizing non-algebraic S-Boxes in three logical bases is proposed: universal - based on logical instructions AND, OR, XOR, NOT, which allows implementation on any 8/16/32/64-bit processors; extended - based on the instructions AND, OR, XOR, NOT, AND-NOT, which allows implementation on x86-64 processors; ternary - based on ternary logic instructions, for implementation on x86-64 processors with AVX-512 support. On average, bitsliced representations of non-algebraic S-Boxes in these logical bases require 400/380/200 logical instructions, respectively. The performance of bitsliced implementations of the S-Box cipher “Kalyna” using logical instructions SSE/AVX/AVX-512 for the Intel Xeon Skylake-SP processor was measured. A fast alternative – non-bitsliced approach to the bytesliced SubBytes operation based on the AVX-512VBMI extension, resistant to timing and cache attacks, is proposed.
... Thanks to the compact implementation, (inv)MixColumns only consume 895 LUTs (8.6 KGEs) and occupy about 24.8% (29.1%) within each NLT module. Each combined (Inv)S-box is also optimized with the state-ofthe-art tower field technique to pursue high area efficiency [ME19]. ...
Article
Full-text available
White-box cryptography (WBC) seeks to protect secret keys even if the attacker has full control over the execution environment. One of the techniques to hide the key is space hardness approach, which conceals the key into a large lookup table generated from a reliable small block cipher. Despite its provable security, space-hard WBC also suffers from heavy performance overhead when executed on general purpose hardware platform, hundreds of magnitude slower than conventional block ciphers. Specifically, recent studies adopt nested substitution permutation network (NSPN) to construct dedicated white-box block cipher [BIT16], whose performance is limited by a massive number of rounds, nested loop dependency and high-dimension dynamic maximal distance separable (MDS) matrices.To address these limitations, we put forward UpWB, an uncoupled and efficient accelerator for NSPN-structure WBC. We propose holistic optimization techniques across timing schedule, algorithms and operators. For the high-level timing schedule, we propose a fine-grained task partition (FTP) mechanism to decouple the parameteroriented nested loop with different trip counts. The FTP mechanism narrows down the idle time for synchronization and avoids the extra usage of FIFO, which efficiently increases the computation throughput. For the optimization of arithmetic operators, we devise a flexible and vectorized modular multiplier (VMM) based on the complexity-reduced Montgomery algorithm, which can process multi-precision variable data, multi-size matrix-vector multiplication and different irreducible polynomials. Then, a configurable matrix-vector multiplication (MVM) architecture with diagonal-major dataflow is presented to handle the dynamic MDS matrix. The multi-scale (Inv)Mixcolumns are also unified in a compact manner by intensively sharing the common sub-operations and customizing the constant multiplier.To verify the proposed methodology, we showcase the unified design implementation for three recent families of WBCs, including SPNbox-8/16/24/32, Yoroi-16/32 and WARX-16. Evaluated on FPGA platform, UpWB outperforms the optimized software counterpart (executed on 3.2 GHz Intel CPU with AES-NI and AVX2 instructions) by 7x to 30x in terms of computation throughput. Synthesized under TSMC 28nm technology, 36x to 164x improvement of computation throughput is achieved when UpWB operates at the maximum frequency of 1.3 GHz and consumes a modest area 0.14 mm2. Besides, the proposed VMM also offers about 30% improvement of area efficiency without pulling flexibility down when compared to state-of-the-art work.
... При цьому S-Box розбивається на лінійну і нелінійну частини, які мінімізуються евристичними методами [5,11,13]. ...
Article
Bitsliced-підхід до імплементації блокових шифрів поєднує такі переваги як потенційно високу швидкодію, безпеку і невимогливість до обчислювальних ресурсів. Головною проблемою при переході до bitsliced-опису шифру є представлення S-Box мінімальною кількістю логічних операцій. Відомі методи мінімізації логічного опису S-Box мають низку обмежень, наприклад, працюють лише з S-Box невеликих розмірів, є повільними або неефективними, що загалом стримує використання bitsliced-підходу. У роботі запропоновано новий евристичний метод bitsliced-опису довільних криптографічних S-Box та здійснено порівняння його ефективності з існуючими методами на прикладі S-Box шифру DES. Запропонований метод орієнтований на програмну реалізацію в логічному базисі AND, OR, XOR, NOT, що допускає імплементацію з використанням стандартних логічних інструкцій на будь-яких 8/16/32/64-бітних процесорах. Метод використовує низку евристичних технік, таких як, швидкі алгоритми вичерпного пошуку на невелику глибину, гнучку процедуру планування процесу пошуку, пошук в глибину тощо, що в комплексі забезпечують високу ефективність і швидкодію. Це дає змогу адаптувати його для мінімізації 8×8 S-Box, що на сьогодні є дуже актуальним для багатьох блокових шифрів, зокрема вітчизняного шифру «Калина». Запропонований підхід до bitsliced-опису довільних S-Box усуває обмеження відомих методів такого подання, що стримували використання bitcliced-підходу при удосконаленні програмних реалізацій блокових шифрів для широкого кола процесорних архітектур.
Chapter
In this paper, we propose Rocca-S, an authenticated encryption scheme with a 256-bit key and a 256-bit tag targeting 6G applications bootstrapped from AES.Rocca-S achieves an encryption/decryption speed of more than 200 Gbps in the latest software environments. In hardware implementation, Rocca-S is the first cryptographic algorithm to achieve speeds more than 2 Tbps without sacrificing other metrics such as occupied silicon area or power/energy consumption making Rocca-S a competitive choice satisfying the requirements of a wide spectrum of environments for 6G applications.
Chapter
Cryptographic primitives are fundamental blocks for ensuring security. Starting from AES, in the past few years, a number of block ciphers and authenticated encryption algorithms have been proposed and, sometimes, even standardized. These primitives can be used also to secure cloud application, including cloud-based FPGAs, but also their communication with edge devices or IoT devices. To this end, this chapter presents FPGA implementations of the most relevant cryptographic primitives and discusses their performance. The chapter starts by reporting results of implementation of block ciphers, the design choices that can be followed to implement them, and the performance obtained when implementing the most common ones on reconfigurable FPGA devices. The chapter continues by presenting stream ciphers and authenticated encryption algorithms and their implementation on FPGA. The chapter concludes by reporting on the current activities related to the transition to post-quantum cryptographic (PQC) algorithms and their implementation on FPGAs.
Preprint
Full-text available
We introduce a novel logic style with self-checking capability to enhance hardware reliability at logic level. The proposed logic cells have two-rail inputs/outputs, and the functionality for each rail of outputs enables construction of faulttolerant configurable circuits. The AND and OR gates consist of 8 transistors based on CNFET technology, while the proposed XOR gate benefits from both CNFET and low-power MGDI technologies in its transistor arrangement. To demonstrate the feasibility of our new logic gates, we used an AES S-box implementation as the use case. The extensive simulation results using HSPICE indicate that the case-study circuit using on proposed gates has superior speed and power consumption compared to other implementations with error-detection capability
Article
With the advent of Internet of Things (IoT), the call for hardware security has been seriously demanding due to the risks of side-channel attacks from adversaries. Advanced Encryption Standard (AES) is the de facto security standard for such applications and needs to ensure a low power, low area and moderate throughput design apart from providing high security to these devices. Substitution-box (S-box), being the core component of AES, has always drawn the attention of the cryptographic community. A chronological development of the S-box over a period of 20-years since the inception of AES is presented. This paper provides the first comprehensive review of the state-of-the-art S-box design techniques, identifying current advancements and analysing their impact on gate count, area, maximum frequency of operation, throughput and power. The other goal of the survey is to study the countermeasures designed for AES to protect it against side-channel attacks. In particular, we consider the power analysis attacks, and the countermeasures are investigated in terms of their security metrics and design overheads, such as area, power, and performance. The countermeasures are based on hiding or masking approaches depending on their design principle. Similar to the S-box survey, a chronological development of the countermeasures since the discovery of power analysis attacks in 1999, is presented. Finally, we suggest some open research gaps and possible direction of research in terms of S-box and countermeasure designs.
Chapter
In this paper, we propose new construction structures, in other words, transposition-permutation path patterns for \(3 \times 3\) involutory and MDS permutation-equivalent matrices over \(\mathbb {F}_{2^{3}}\) and \(\mathbb {F}_{2^{4}}\). We generate \(3 \times 3\) involutory and MDS matrices over \(\mathbb {F}_{2^{3}}\) and \(\mathbb {F}_{2^{4}}\) by using the matrix form given in [1], and then all these matrices are analyzed by finding all their permutation-equivalent matrices. After that, we extract whether there are any special permutation patterns, especially for this size of the matrix. As a result, we find new 28,088 different transposition-permutation path patterns to directly construct \(3 \times 3\) involutory and MDS matrices from any \(3 \times 3\) involutory and MDS representative matrix over \(\mathbb {F}_{2^{3}}\) and \(\mathbb {F}_{2^{4}}\). The 35 patterns are in common with these finite fields. By using these new transposition-permutation path patterns, new \(3 \times 3\) involutory and MDS matrices can be generated especially for different finite fields such as \(\mathbb {F}_{2^{8}}\) (is still an open problem because of the large search space). Additionally, the idea of finding the transposition-permutation path patterns can be applicable to larger dimensions such as \(8 \times 8\), \(16 \times 16\), and \(32 \times 32\). To the best of our knowledge, the idea given in this paper to find the common and unique transposition-permutation path patterns over different finite fields is the first work in the literature.KeywordsMDS matrixLightweight cryptographyDiffusion matricesPermutation-equivalent matrices
Article
The development of new technologies has put forward higher requirements for the performance and security of block ciphers. To meet these needs, bitslicing is a block cipher implementation technique that can provide high efficiency and resistance to cache-timing attacks. The S-box is usually the most time-consuming component of bitsliced implementations. In this paper, we present techniques to optimize the S-box implementation and derive the most compact representations available for bitslicing. We then show that the efficiency of the linear layers can be further improved by introducing a state-of-the-art technique. As a result, our implementations have significant advantages over previous implementations. The throughputs of our AES and SM4 implementations reach 27068 Mbps and 30026 Mbps on an i5-11335G7, both exceeding the throughput of AES-NI on the same platform.
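To make the bitsliced representation concrete, here is a minimal, hedged sketch (the packing helper and the toy 2-bit function are our own illustrative assumptions, not the cited implementation): each bit position of many independent blocks is packed into one machine word, so a single Boolean instruction evaluates that gate for all blocks at once.

```python
# Minimal bitslicing sketch (illustrative, not the cited implementation).
# Pack bit i of 64 independent 2-bit inputs into lane word x[i]; a single
# XOR/AND then evaluates the same gate for all 64 inputs in parallel.

import random

def pack(values, bits):
    """values: list of integers; returns one lane word per bit position."""
    lanes = [0] * bits
    for pos, v in enumerate(values):
        for i in range(bits):
            if (v >> i) & 1:
                lanes[i] |= 1 << pos
    return lanes

def toy_sbox(v):
    """Reference 2-bit toy function: (b1 AND b0, b1 XOR b0)."""
    b0, b1 = v & 1, (v >> 1) & 1
    return ((b1 & b0) << 1) | (b1 ^ b0)

values = [random.randrange(4) for _ in range(64)]
x0, x1 = pack(values, 2)

# Bitsliced evaluation: one AND and one XOR cover all 64 inputs.
y1 = x1 & x0
y0 = x1 ^ x0

# Check against the reference, lane by lane.
for pos, v in enumerate(values):
    out = (((y1 >> pos) & 1) << 1) | ((y0 >> pos) & 1)
    assert out == toy_sbox(v)
print("bitsliced result matches reference")
```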
Conference Paper
We introduce a novel logic style with self-checking capability to enhance hardware reliability at the logic level. The proposed logic cells have two-rail inputs/outputs, and the functionality of each output rail enables the construction of fault-tolerant configurable circuits. The AND and OR gates consist of 8 transistors based on CNFET technology, while the proposed XOR gate benefits from both CNFET and low-power MGDI technologies in its transistor arrangement. To demonstrate the feasibility of our new logic gates, we used an AES S-box implementation as the use case. Extensive simulation results using HSPICE indicate that the case-study circuit built from the proposed gates has superior speed and power consumption compared to other implementations with error-detection capability.
Article
We present techniques to obtain small circuits which also have low depth. The techniques apply to typical cryptographic functions, as these are often specified over the field GF(2), and they produce circuits containing only AND, XOR and XNOR gates. The emphasis is on the linear components (those portions containing no AND gates). A new heuristic, DCLO (for depth-constrained linear optimization), is used to create small linear circuits given depth constraints. DCLO is repeatedly used in a See-Saw method, alternating between optimizing the upper linear component and the lower linear component. The depth constraints specify both the depth at which each input arrives and restrictions on the depth for each output. We apply our techniques to cryptographic functions, obtaining new results for the S-Box of the Advanced Encryption Standard, for multiplication of binary polynomials, and for multiplication in finite fields. Additionally, we constructed a 16-bit S-Box using inversion in GF(2^16), which may be significantly smaller than alternatives.
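As a small, hedged illustration of why depth constraints matter for the linear parts (the construction below is our own toy example, not from the cited work): the same 4-input XOR can be realized as a chain of depth 3 or as a balanced tree of depth 2 with the same gate count.

```python
# Toy illustration of depth-constrained linear circuits (not from the cited work).
# Both circuits compute y = x1 XOR x2 XOR x3 XOR x4 with 3 XOR gates, but the
# chain has depth 3 while the balanced tree has depth 2.
from itertools import product

def chain(x1, x2, x3, x4):
    t1 = x1 ^ x2          # depth 1
    t2 = t1 ^ x3          # depth 2
    return t2 ^ x4        # depth 3

def tree(x1, x2, x3, x4):
    t1 = x1 ^ x2          # depth 1
    t2 = x3 ^ x4          # depth 1
    return t1 ^ t2        # depth 2

assert all(chain(*bits) == tree(*bits) for bits in product([0, 1], repeat=4))
print("same function, depth 3 (chain) vs depth 2 (tree)")
```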
Conference Paper
A lot of improvements and optimizations for the hardware implementation of SubBytes of Rijndael, specifically inversion in \({\mathbb F}_{2^8}\), have been reported. Instead of the original Rijndael field \({\mathbb F}_{2^8}\), it is known that its isomorphic tower field \({\mathbb F}_{((2^2)^2)^2}\) has a more efficient inversion. For the towerings, several kinds of bases, such as polynomial and normal bases, can be used in mixture. Different from the meaning of this mixture of bases, this paper proposes another mixture that contributes to the reduction of the critical path delay of SubBytes. For the \({\mathbb F}_{(2^2)^2}\)-inversion architecture, for example, the proposed mixture inputs and outputs elements represented with normal and polynomial bases, respectively.
Conference Paper
A key step in the Advanced Encryption Standard (AES) algorithm is the "S-box." Many implementations of AES have been proposed, for various goals, that effect the S-box in various ways. In particular, the most compact implementations to date, of Satoh et al. [14] and Mentens et al. [6], perform the 8-bit Galois field inversion of the S-box using subfields of 4 bits and of 2 bits. Our work refines this approach to achieve a more compact S-box. We examined many choices of basis for each subfield, not only polynomial bases as in previous work, but also normal bases, giving 432 cases. The isomorphism bit matrices are fully optimized, improving on the "greedy algorithm." Introducing some NOR gates gives further savings. The best case improves on [14] by 20%. This decreased size could help for area-limited hardware implementations, e.g., smart cards, and allow more copies of the S-box for parallelism and/or pipelining of AES.
Conference Paper
A new technique for combinational logic optimization is described. The technique is a two-step process. In the first step, the non-linearity of a circuit – as measured by the number of non-linear gates it contains – is reduced. The second step reduces the number of gates in the linear components of the already reduced circuit. The technique can be applied to arbitrary combinational logic problems, and often yields improvements even after optimization by standard methods has been performed. In this paper we show the results of our technique when applied to the S-box of the Advanced Encryption Standard (AES [6]). This is an experimental proof of concept, as opposed to a full-fledged circuit optimization effort. Nevertheless the result is, as far as we know, the circuit with the smallest gate count yet constructed for this function. We have also used the technique to improve the performance (in software) of several candidates to the Cryptographic Hash Algorithm Competition. Finally, we have experimentally verified that the second step of our technique yields significant improvements over conventional methods when applied to randomly chosen linear transformations.
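As a hedged, self-contained illustration of the second (linear) step, here is a toy example of our own (not taken from the cited paper) in which sharing an intermediate XOR reduces the gate count of a small linear layer from 5 to 4:

```python
# Toy illustration of linear-component optimization (not from the cited paper).
# Naively, y0 = x0^x1^x2, y1 = x1^x2^x3, y2 = x0^x3 needs 2+2+1 = 5 XOR gates;
# sharing t = x1^x2 between y0 and y1 brings it down to 4 XOR gates.
from itertools import product

def naive(x0, x1, x2, x3):
    return (x0 ^ x1 ^ x2, x1 ^ x2 ^ x3, x0 ^ x3)       # 5 XORs

def shared(x0, x1, x2, x3):
    t = x1 ^ x2                                         # gate 1 (shared)
    return (x0 ^ t, t ^ x3, x0 ^ x3)                    # gates 2, 3, 4

assert all(naive(*b) == shared(*b) for b in product([0, 1], repeat=4))
print("4-gate circuit matches the 5-gate naive circuit")
```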
Conference Paper
The Canright S-box has been known as the most compact S-box design since its introduction back in CHES '05. Boyar and Peralta proposed logic-minimization heuristics that could reduce the gate count of the Canright S-box from 120 gates to 113 gates; however, synthesis results did not reflect much improvement. In CHES '15, Ueno et al. proposed an S-box that has a slightly larger area but is significantly faster than the previous designs, hence it was the most efficient (measured by area × delay) S-box implementation to date. In this paper, we propose two new designs for the AES S-box. One design has a smaller implementation area than both the Canright and the 113-gate S-boxes. Hence, our first design is the smallest AES S-box to date, breaking the 13-year implementation record of Canright. The second design is faster and smaller than the Ueno S-box. Hence, our second design is both the fastest and the most efficient S-box design to date. While doing so, we also propose new logic-minimization heuristics that outperform the previous algorithms of Boyar and Peralta. Finally, we conduct an exhaustive evaluation of each and every block in the S-box circuit, using both structural and behavioral HDL modeling, to reach the optimum synergy between theoretical algorithms and technology-supported optimization tools. We show that involving the technology-supported CAD tools in the analysis yields several counter-intuitive results.
Conference Paper
The AES combined S-box/inverse S-box is a single construction that is shared between the encryption and decryption data paths of the AES. The currently most compact implementation of the AES combined S-box/inverse S-box is Canright's design, introduced back in 2005. Since then, the research community has introduced several optimizations over the S-box only; however, the combined S-box/inverse S-box has received little attention. In this paper, we propose a new AES combined S-box/inverse S-box design that is both smaller and faster than Canright's design. We achieve this goal by proposing a new tower field and optimizing each and every block inside the combined architecture for this field. Our complexity analysis and ASIC implementation results in the CMOS STM 65nm and NanGate 15nm technologies show that our design outperforms the counterparts in terms of area and speed.
Article
This paper proposes a compact and highly efficient \(\textit{GF}(2^8)\) inversion circuit design based on a combination of non-redundant and redundant Galois field (GF) (or finite field) arithmetic. The proposed design utilizes an optimal normal basis and redundant GF representations, called polynomial ring representation and redundantly represented basis, to implement \(\textit{GF}(2^8)\) inversion using a tower field \(\textit{GF}((2^4)^2)\). The flexibility of the redundant representations provides efficient mappings from/to the \(\textit{GF}(2^8)\). This paper evaluates the efficacy of the proposed circuit by gate counts and logic synthesis with a 65-nm CMOS standard cell library in comparison with conventional circuits. Consequently, we show that the proposed circuit achieves approximately 25% higher area–time efficiency than the conventional best inversion circuit in our environment. We also demonstrate that AES S-Box with the proposed circuit achieves the best area–time efficiency.
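For context, a brief sketch of the standard tower-field identity that such \(\textit{GF}((2^4)^2)\) inversion circuits build on (the concrete basis and reduction polynomial vary by design; the form below assumes an extension defined by \(y^2 + y + \nu\) over \(\textit{GF}(2^4)\) and is our own summary, not a formula quoted from the cited paper): for \(a = a_h y + a_l\),
\[
(a_h y + a_l)^{-1} = a_h \Delta^{-1}\, y + (a_h + a_l)\, \Delta^{-1},
\qquad \Delta = a_h^2 \nu + a_h a_l + a_l^2 ,
\]
so one \(\textit{GF}(2^8)\) inversion reduces to one \(\textit{GF}(2^4)\) inversion (of \(\Delta\)) plus a few \(\textit{GF}(2^4)\) multiplications and squarings, and the chosen representation determines how cheap those subfield operations become in hardware.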
Conference Paper
We study the complexity of the Shortest Linear Program (SLP) problem, which is to minimize the number of linear operations necessary to compute a set of linear forms. SLP is shown to be NP-hard. Furthermore, a special case of the corresponding decision problem is shown to be Max SNP-Complete. Algorithms producing cancellation-free straight-line programs, those in which there is never any cancellation of variables in GF(2), have been proposed for circuit minimization for various cryptographic applications. We show that such algorithms have approximation ratios of at least 3/2 and therefore cannot be expected to yield optimal solutions to non-trivial inputs.
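To make the cancellation notion concrete, here is a small example of our own construction (not from the cited paper): for the four targets below, a program that allows cancellation needs only 4 XORs, whereas a cancellation-free program needs 5, since the last target cannot reuse any previously computed value that contains \(x_1\).

```python
# Toy example (our own, not from the cited paper): with cancellation, the four
# targets below are computed with 4 XORs; a cancellation-free program needs 5,
# because y4 = x2^x3^x4 cannot reuse any earlier value that contains x1.
from itertools import product

def slp_with_cancellation(x1, x2, x3, x4):
    y1 = x1 ^ x2          # gate 1
    y2 = y1 ^ x3          # gate 2
    y3 = y2 ^ x4          # gate 3
    y4 = y3 ^ x1          # gate 4: x1 cancels, leaving x2^x3^x4
    return y1, y2, y3, y4

for bits in product([0, 1], repeat=4):
    x1, x2, x3, x4 = bits
    assert slp_with_cancellation(*bits) == (
        x1 ^ x2,
        x1 ^ x2 ^ x3,
        x1 ^ x2 ^ x3 ^ x4,
        x2 ^ x3 ^ x4,
    )
print("4-XOR program with cancellation verified")
```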
Article
We describe two algorithms for solving the maximum convolution problem, i.e., the calculation of \(c_k := \max\{a_{k-i} + b_i \mid 0 \le i \le n-1\}\) for all \(k\) with respect to given sequences \((a_0, \ldots, a_{n-1})\) and \((b_0, \ldots, b_{n-1})\) of real numbers. Our first algorithm, with expected running time \(O(n \log n)\), is mainly of theoretical interest, while our second algorithm allows a simpler, more practicable implementation and showed quite fast performance in numerical experiments.
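A minimal reference sketch of the quantity being computed (a direct \(O(n^2)\) evaluation, purely illustrative; the cited algorithms are asymptotically faster and the index bounds below are our reading of the definition):

```python
# Direct O(n^2) evaluation of the (max,+) convolution
# c_k = max{ a_{k-i} + b_i : 0 <= i <= n-1, 0 <= k-i <= n-1 },
# used here only as an illustrative reference, not the cited fast algorithms.

def max_convolution(a, b):
    n = len(a)
    c = []
    for k in range(2 * n - 1):
        best = max(
            a[k - i] + b[i]
            for i in range(n)
            if 0 <= k - i < n
        )
        c.append(best)
    return c

print(max_convolution([1, 5, 2], [0, 3, 4]))  # [1, 5, 8, 9, 6]
```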
Article
This paper proposes a fast algorithm for computing multiplicative inverses in \(GF(2^m)\) using normal bases. Normal bases have the following useful property: if an element \(x\) in \(GF(2^m)\) is represented in a normal basis, then raising \(x\) to the power \(2^k\) can be carried out by \(k\) cyclic shifts of its vector representation. C. C. Wang et al. proposed an algorithm for computing multiplicative inverses using normal bases, which requires \(m-2\) multiplications in \(GF(2^m)\) and \(m-1\) cyclic shifts. The fast algorithm proposed in this paper also uses normal bases and computes multiplicative inverses by iterating multiplications in \(GF(2^m)\). It requires at most \(2\lfloor \log_2(m-1) \rfloor\) multiplications in \(GF(2^m)\) and \(m-1\) cyclic shifts, which is much less than required by Wang's method. The same idea of the proposed fast algorithm is applicable to the general power operation in \(GF(2^m)\) and to the computation of multiplicative inverses in \(GF(q^m)\) (\(q = 2^n\)).
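As a hedged illustration of why only a logarithmic number of multiplications suffice, here is our own worked chain for \(m = 8\) in a polynomial-basis software model (so squarings appear as explicit multiplications rather than the cyclic shifts available in a normal basis); the AES field polynomial and helper names are assumptions made for the sketch, not taken from the cited paper.

```python
# Itoh-Tsujii style inversion in GF(2^8): a^-1 = a^254 = (a^(2^7 - 1))^2.
# An addition chain on exponents of the form 2^k - 1 needs only 4 field
# multiplications (plus squarings, which become cyclic shifts in a normal
# basis). Polynomial basis with the AES polynomial x^8+x^4+x^3+x+1 is used
# here purely as a software reference model.

AES_POLY = 0x11B

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= AES_POLY
    return r

def gf_sqr_n(a, n):
    for _ in range(n):
        a = gf_mul(a, a)
    return a

def gf_inv(a):
    assert a != 0
    t1 = a                                   # a^(2^1 - 1)
    t2 = gf_mul(gf_sqr_n(t1, 1), t1)         # a^(2^2 - 1)   (mult 1)
    t3 = gf_mul(gf_sqr_n(t2, 1), t1)         # a^(2^3 - 1)   (mult 2)
    t6 = gf_mul(gf_sqr_n(t3, 3), t3)         # a^(2^6 - 1)   (mult 3)
    t7 = gf_mul(gf_sqr_n(t6, 1), t1)         # a^(2^7 - 1)   (mult 4)
    return gf_sqr_n(t7, 1)                   # a^(2^8 - 2) = a^-1

# Check the whole multiplicative group.
assert all(gf_mul(x, gf_inv(x)) == 1 for x in range(1, 256))
print("4-multiplication inversion chain verified for all of GF(2^8)*")
```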