CU Partition Mode Decision for HEVC Hardwired
Intra Encoder Using Convolution Neural Network
Zhenyu Liu, Member, IEEE, Xianyu Yu, Yuan Gao, Shaolin Chen, Xiangyang Ji, Member, IEEE,
and Dongsheng Wang, Member, IEEE
Abstract—The intensive computation of High Efficiency Video Coding (HEVC) engenders challenges for the hardwired encoder in terms of hardware overhead and power dissipation. On the other hand, the constraints in hardwired encoder design seriously degrade the efficiency of software-oriented fast coding unit (CU) partition mode decision algorithms. A fast algorithm is attributed as VLSI friendly when it possesses the following properties. First, the maximum complexity of encoding a coding tree unit (CTU) could be reduced. Second, the parallelism of the hardwired encoder should not be deteriorated. Third, the process engine of the fast algorithm must be of low hardware and power overhead. In this paper, we devise a convolution neural network (CNN) based fast algorithm to remove no less than two CU partition modes in each CTU from full rate-distortion optimization (RDO) processing, thereby reducing the encoder's hardware complexity. As our algorithm does not depend on the correlations among CU depths or spatially nearby CUs, it is friendly to parallel processing and does not deteriorate the rhythm of RDO pipelining. Experiments illustrated that an average of 61.1% intra encoding time was saved, whereas the Bjøntegaard-Delta bit-rate (BDBR) increase is 2.67%. Capitalizing on the optimal arithmetic representation, we developed a high-speed [714 MHz in the worst conditions (125 °C, 0.9 V)] and low-cost (42.5k-gate) accelerator for our fast algorithm using TSMC 65-nm CMOS technology. One accelerator can support HD1080p at 55 frames/s real-time encoding. The corresponding power dissipation was 16.2 mW at 714 MHz. Finally, our accelerator is provided with good scalability: four accelerators fulfill the throughput requirements of UltraHD-4K at 55 frames/s.
Index Terms—HEVC, fast CU/PU mode decision, CNN, VLSI, intra encoding.
Manuscript received December 27, 2015; revised June 5, 2016 and July 17, 2016; accepted August 8, 2016. Date of publication August 18, 2016; date of current version September 13, 2016. This work was supported in part by Huawei Technologies, in part by the National Science and Technology Major Project under Grant 2016YFB0200505, and in part by the National Natural Science Foundation of China under Grant 61325003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yui-Lam Chan.
Z. Liu and D. Wang are with the Tsinghua National Laboratory for Information Science and Technology, Research Institute of Information Technology, Tsinghua University, Beijing 100084, China (e-mail: liuzhenyu73@tsinghua.edu.cn).
X. Yu is with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China.
Y. Gao is with the Department of Computer Science, Tsinghua University, Beijing 100084, China.
X. Ji is with the Department of Automation, Tsinghua University, Beijing 100084, China.
S. Chen is with Huawei Technologies Company, Ltd., Shenzhen 518129, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2016.2601264
I. INTRODUCTION
The state-of-the-art video coding standard, High Efficiency Video Coding (HEVC) [1], [2], was developed by the Joint Collaborative Team on Video Coding (JCT-VC). With equivalent subjective video quality, HEVC can double the compression ratio as compared to its predecessor H.264/AVC, especially when operating on low bit-rate, high-resolution, and low-delay communication applications [3].
HEVC introduces the quadtree-based coding data structure. The encoder splits one picture into basic square regions. Each region is denoted as one coding tree unit (CTU), of which the luma partition size can be chosen as 2N×2N (N ∈ {32, 16, 8}). The data organization of a CTU involves the quadtree structure. One CU (including the CTU) can be coded as a whole (2N×2N mode) or be split into four N×N sub-CUs. The splitting of a CU can be iterated on the basis of the signal features, until the minimum CU size is reached. When the CU size is larger than 8×8, the prediction unit (PU) size of an Intra CU is identical to its CU size; otherwise, it can adopt either the 8×8 or the 4×4 PU size. In general, when the minimum CU size is set, a larger CTU size always provides better compression efficiency, especially for high-resolution pictures. The experiments in [3] showed that, when the 8×8 minimum CU size was applied, using the 64×64 CTU size reduced the bit rate by 11.0% on average as compared to the 16×16 CTU size. With Class-A test sequences, this performance gap widened to 28.2%.
To alleviate the Intra encoding complexity, numerous algorithms have been developed for fast Intra coding mode decision. The previous methods can be classified into the following categories. The first kind of fast algorithms reduce the complexity of prediction mode rate-distortion optimization (RDO) [4]–[6]. For example, Ma et al. applied the L best modes from rough mode decision and the most probable modes (MPMs) from the neighboring coded blocks as the candidates to undergo the full RDO processing [6]. The methods of the second category, which are the continuation of the H.264/AVC low-complexity encoding algorithm study [7]–[10], simplify the complexity of rate-distortion (RD) cost computation [11], [12]. To relieve the complexity induced by CU/PU modes, the algorithms belonging to the third category dynamically skip the unpromising CU/PU depths or early terminate the CU/PU mode RDO procedure [13]–[23]. For instance, [13]–[16] defined the dynamic CU depth range on the basis of the CU depth information
of previously coded slices and CUs. The algorithm in [17] terminates the splitting of the current CU when this CU node selects SKIP mode as the best prediction mode. Shen et al. proposed dynamic CU depth range and early termination methods by exploring the coding information of spatially neighboring and co-located coding blocks [18]. In [19], Zhang exploited an early CU split termination algorithm by using the approximate aggregated RD-cost with the sub-CUs' RDO results. To improve the coding performance, machine-learning methods are employed to evaluate the critical parameters in fast CU mode decision [16], [22], [23].
On the other hand, the hardwired HEVC encoders adopt CTU pipeline processing [24]–[26], which incurs the following constraints: First, the process latency for one CTU is fixed (6.6k clock cycles in [25]). Second, a high degree of parallelism, especially at the CU level, is a must to guarantee the required throughput. A fast algorithm is considered as VLSI friendly only when it possesses the following properties:
The CTU-grain maximum computational complexity should be reduced. Only with this precondition can the encoder hardware cost, which is designed to satisfy the stipulated process latency under the maximum burden, be saved. The dynamic CU level skipping and early split termination algorithms cannot contribute to simplifying the encoder's hardware complexity. The same problem exists in the online training stage of the machine learning based algorithms [16], [22], [23].
The parallelism should not be degraded by the fast algorithm. In high-resolution video encoder designs [24], [25], the 4×4 PU is always being encoded in parallel with other CU modes to improve the throughput as far as possible. If the current CU mode is inferred from the neighboring CUs' coding information, the CU-level parallel processing is infeasible.
The hardware and power costs for implementing the fast algorithm should be low. From the system view, the fast algorithm is a part of the whole encoder. Its overheads directly affect the encoder performance.
For the above reasons, the software-oriented fast algorithms are not adopted in the real-time Intra encoders in [25] and [26].
In [26], Zhu et al. proposed to skip a certain number of CU/PU modes in the full RDO procedure according to the RD-cost estimated from a source image texture investigation. Specifically, the prediction residue of a pixel is first evaluated according to its edge strength, edge direction, and its location in the CU. Next, the RD-cost of one CU, with which the fast CU mode decision could be made, is estimated from the prediction residue evaluations. The drawbacks of [26] lie in the empirical feature extractors and the ignorance of the topology of the feature points, which degraded the compression efficiency with an average BDBR of +4.53%. The convolution neural network (CNN) [27] is a model inspired by the animal visual cortex. A CNN with appropriate architecture can be trained with gradient-based learning algorithms [28] to classify two-dimensional image patterns, with applications such as handwriting recognition [29], image classification [30], image segmentation [31], and so on. In this study, we introduce CNN to
Fig. 1. Pseudo codes of the CNN oriented XCOMPRESSCU function in the HM software (M denotes the predicted CU mode; pCurCU is the pointer of the current CU data structure; PO and QP represent the Y-component of the current CU and the quantization parameter, respectively; D_CUR is the current CU depth; D_MAX is the maximum CU depth; C_2N and C_N stand for the RD-costs of modes 2N×2N and N×N, respectively).
circumvent the aforementioned hindrances of [26]: First, the input layer's convolution kernels, which are viewed as the feature extractors, are trained from the samples instead of by a rule of thumb; second, the topology information of the obtained features can be exploited by the CNN during the classification processing. In addition, the proposed CNN for fast CU/PU mode decision is rectified according to the specific RDO task. For a VLSI friendly algorithm, it is desired that the process engine of the fast algorithm be of low hardware and power costs, as well as high throughput. To this end, we provide a reconfigurable Propagate Partial Multiply-Accumulate (PPMAC) CNN accelerator using the optimal floating-point arithmetic. As compared to the fast mode decision engine in [26], the proposed CNN accelerator reduces the chip area by 80.2%.
The rest of this article is organized as follows. Section II briefly introduces the CU encoding procedure integrated with the CNN oriented fast mode decision scheme. The proposed CNN and its training method are described in Section III. The architecture of our CNN accelerator is presented in Section IV. The experimental results are illustrated in Section V, followed by the conclusions in Section VI.
II. CNN BASED FAST CU/PU MODE DECISION
As a universal approximator of nonlinear systems, CNN has the following virtues: First, the feature extractors in CNN, derived by training, are propitious to recognizing complex singularities, such as stroke end points or corners. Second, CNN can exploit the topology information among the singularities to improve the estimation accuracy. Finally, as compared with the fully connected network, the weight scale of CNN is greatly reduced, which contributes to the reduction in hardware. These inherent properties of CNN make it favored in our CU/PU coding mode decision task.
The pseudo-code of the XCOMPRESSCU function integrated with our CNN based mode prediction algorithm is described in Fig. 1, and the main optimizations to the reference software
Fig. 2. Pseudo codes of the FASTCUMODE function (IsBoundary being true indicates the current CU is on the picture boundary; CNN_s is the mode decision CNN for the s×s CU (s ∈ {32, 16, 8}); E(s) enables the fast mode decision at the s×s CU level).
have been highlighted. We provide a unified expression of the CU and PU mode decision, because the 8×8/4×4 PU mode selection can be viewed as a special case of the 2N×2N/N×N CU mode decision. The function FASTCUMODE is the essential procedure, in which the CNN is adopted to determine the optimal CU mode. When the current CU size is 64×64, the return values of FASTCUMODE include three cases: HOMO, SPLIT, and COMB. In the other cases (the CU size is 32×32, 16×16 or 8×8), if the corresponding fast mode decision is enabled (E(s) == true), FASTCUMODE merely returns HOMO or SPLIT. If XCOMPRESSCU receives HOMO, the RD-cost of N×N is set as infinite (C_N ← ∞), and the CU splitting test is eliminated. Otherwise (SPLIT is received), C_2N is assigned infinite (C_2N ← ∞) and the 2N×2N mode search is skipped. In the case of COMB, modes 2N×2N and N×N are both traversed as in the original HM software.
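To make this control flow concrete, the following minimal Python sketch (the function names and callable interfaces are ours, not from the HM code base) shows how the FASTCUMODE verdict gates the two RDO branches:

```python
import math

HOMO, SPLIT, COMB = range(3)  # verdicts returned by FASTCUMODE

def x_compress_cu(cu, qp, fast_cu_mode, rdo_2nx2n, rdo_nxn):
    """Sketch of the CNN-gated mode decision inside XCOMPRESSCU.

    fast_cu_mode, rdo_2nx2n and rdo_nxn stand for the pre-decision
    and the two full-RDO branches of the encoder (stubs here).
    """
    verdict = fast_cu_mode(cu, qp)
    # HOMO prunes the split test; SPLIT prunes the 2Nx2N test;
    # COMB (64x64 only) keeps both, as in the original HM flow.
    c_2n = math.inf if verdict == SPLIT else rdo_2nx2n(cu, qp)
    c_n = math.inf if verdict == HOMO else rdo_nxn(cu, qp)
    return min(c_2n, c_n)  # best RD-cost among the surviving modes
```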
The pseudo codes of FASTCUMODE are depicted in Fig. 2. The inputs of FASTCUMODE include the Y-component of the current CU (PO) and the quantization parameter (QP). To reduce the computational complexity of the CNN, we apply the local averaging and sub-sampling function AVGSUB(PO) to derive the 8×8 matrix P. Specifically, if the size of PO is s×s (s ∈ {8, 16, 32, 64}), each entry in P, that is, p_{i,j}, is derived as

$$p_{i,j}=\frac{1}{(s/8)^2}\sum_{m=i\cdot s/8}^{(i+1)\cdot s/8-1}\ \sum_{n=j\cdot s/8}^{(j+1)\cdot s/8-1}po_{m,n},\qquad(1)$$

where po_{m,n} is one entry of the receptive field in PO.
FASTCUMODE first carries out a coarse edge strength analysis to detect two special cases, i.e., the homogeneous blocks and the strong-edge blocks on picture boundaries. The edge at (i, j) in P is denoted as ∇_{i,j}, where i, j ∈ [0, 6]. ∇_{i,j} is a vector with horizontal component δx_{i,j} and vertical component δy_{i,j}, which are written as

$$\begin{aligned}\delta x_{i,j}&=p_{i,j}+p_{i+1,j}-p_{i,j+1}-p_{i+1,j+1}\\ \delta y_{i,j}&=p_{i,j}+p_{i,j+1}-p_{i+1,j}-p_{i+1,j+1}.\end{aligned}\qquad(2)$$
We further define a threshold E_T as

$$E_T=\max(QP^2,\ Q^2).\qquad(3)$$

Q is the quantization step that depends on QP,

$$Q=\mathrm{MF}(QP\,\%\,6)\cdot 2^{\lfloor QP/6\rfloor},$$

in which MF = [0.625, 0.7031, 0.7969, 0.8906, 1, 1.125]. With (2) and (3), three auxiliary parameters, i.e., E_C, E_M, and E_P, are devised. E_C has the form

$$E_C=\aleph\left\{\nabla_{i,j}\,\middle|\,\delta x_{i,j}>E_T\ \text{and}\ \delta y_{i,j}>E_T\right\},\qquad(4)$$

where ℵ represents the cardinal number (what is normally referred to as the counting number) of a set. E_M is

$$E_M=\max_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right).\qquad(5)$$

And E_P is the power of all edges, i.e.,

$$E_P=\sum_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right).\qquad(6)$$
When E_P < 5E_T and E_M ≤ QP², the current CU is identified as homogeneous and the return value is HOMO. On the other hand, when the current CU is on the picture boundary (IsBoundary == true) and possesses strong edges (E_C > 2), the return value is SPLIT. The coarse analysis contributes from two aspects: First, the simple coarse analysis relieves the burden of the CNN; second, the homogeneous samples, which trap the CNN in ill-conditions, can be filtered out. The coding performance loss of the coarse analysis is BDBR = +0.42%. When the flag E(s) is set, a CU that is not identified by the coarse analysis is dispatched to the dedicated neural network (CNN_32 for the 32×32 CU, CNN_16 for the 16×16 CU, and CNN_8 for the 8×8 CU) to determine the proper mode from P and QP. The return value of CNN_s is either HOMO or SPLIT. That is, the CNN just chooses one from the two CU candidate modes for the following RDO processing. E(s) makes the tradeoff between coding quality and computational complexity, which will be described in Section V.
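The coarse stage amounts to a few array operations. Below is a compact NumPy sketch under our reading of (1)–(6); the function and label names are ours, and the E_M ≤ QP² comparison reflects our reconstruction of the homogeneity test:

```python
import numpy as np

MF = [0.625, 0.7031, 0.7969, 0.8906, 1.0, 1.125]

def avg_sub(po):
    """Equation (1): average-and-subsample an s-by-s block to 8x8."""
    s = po.shape[0]
    f = s // 8
    return po.reshape(8, f, 8, f).mean(axis=(1, 3))

def coarse_analysis(po, qp, is_boundary):
    """Coarse edge-strength test of FASTCUMODE, equations (2)-(6)."""
    p = avg_sub(po.astype(np.float64))
    # Equation (2): edge components on the 7x7 interior grid.
    dx = p[:-1, :-1] + p[1:, :-1] - p[:-1, 1:] - p[1:, 1:]
    dy = p[:-1, :-1] + p[:-1, 1:] - p[1:, :-1] - p[1:, 1:]
    q = MF[qp % 6] * 2.0 ** (qp // 6)              # quantization step
    e_t = max(qp ** 2, q ** 2)                     # equation (3)
    e_c = np.count_nonzero((dx > e_t) & (dy > e_t))  # equation (4)
    power = dx ** 2 + dy ** 2
    e_m, e_p = power.max(), power.sum()            # equations (5), (6)
    if e_p < 5 * e_t and e_m <= qp ** 2:
        return "HOMO"
    if is_boundary and e_c > 2:
        return "SPLIT"
    return None  # defer the decision to the CNN
```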
III. CNN BASED FAST CU MODE DECISION ALGORITHM
The block diagram of the CTU pipelined HEVC Intra encoder integrated with the CNN based CU mode decision engine is illustrated in Fig. 3(a), which is inherited from our previous study [26]. Similar to the typical HEVC encoders [24], [25], a two-stage CTU pipeline is applied in our encoder. With the aforementioned methods, the first stage decides the promising CU/PU competitors that will be dispatched to the second stage for full RDO processing. The second stage comprises the reconfigurable predictor, the 8×8/4×4 RDO engine, the 32×32/16×16 RDO engine, and the reconstruction datapath. One 8×8 CU is fed into the 8×8/4×4 RDO engine with the assigned PU mode (2N×2N or N×N) to calculate its best RD-cost. The other, larger CUs (16×16, 32×32 and 64×64) enter the 32×32/16×16 RDO engine. After deriving the optimal coding configuration, the reconstruction datapath produces the coding coefficients and the reconstructed pixels. Two successive CTUs can be processed
Fig. 3. Architecture of intra encoder embedded with CNN based
fast CU mode decision engine. (a) Intra Encoder Top Block Diagram.
(b) CNN Architecture.
simultaneously in the two stages. Specifically, if the (k+1)-th CTU (CTU(k+1)) is undergoing the CNN based mode decision in the first stage, the previous one (CTU(k)) is carrying out RDO and reconstruction in the second stage. For the VLSI implementation, our CNN based CU/PU pre-decision engine does not degrade the parallelism of the RDO processing in the second stage, even while removing at least two CU/PU modes.
In the CNN based CU mode decision unit, CNN_32, CNN_16, and CNN_8 adopt the same architecture, as depicted in Fig. 3(b), to share the hardwired accelerator. The proposed CNN is composed of the lower alternating convolution and max-pooling layers and the upper fully connected Multilayer Perceptron (MLP). Counting the input, our CNN comprises six layers, which are explained as follows:
The input is an 8×8 matrix P, which is derived as in (1).
The first hidden layer is a convolution layer with 6 feature maps. Each neuron is connected to a 3×3 receptive field in the input. The size of each feature map is 6×6, to prevent the convolution from falling off the boundary. The kernels in this layer are regarded as feature extractors. There are 60 trainable parameters, composed of six 3×3 kernels and six biases.
The second hidden layer performs the local maxing and sub-sampling. There is no trainable parameter.
The third hidden layer implements the second convolution. There are seventeen 1×1 output neurons, of which one input is QP. As the kernel size is 3×3, the trainable parameter number is 16 × (6×3×3+1) = 880.
The last two layers are the fully connected MLP. The fourth hidden layer consists of 11 units, and the output layer contains 2 output units. Including biases, the trainable parameter numbers in the fourth hidden layer and the output layer are 180 and 24, respectively.
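These parameter counts can be cross-checked with a few lines of Python. The bookkeeping below rests on our reading that QP enters the third and fourth layers as an extra weight-free input, so that only 16 and 10 neurons in those layers carry trainable weights; under this assumption the stated counts of 60, 880, 180 and 24 are reproduced exactly:

```python
# Trainable-parameter bookkeeping for the CNN of Fig. 3(b).
# Assumption (ours): QP is a weight-free extra unit in the third
# and fourth layers, so only 16 and 10 neurons there carry weights.
def conv_params(n_kernels, in_maps, k):          # kernels + biases
    return n_kernels * (in_maps * k * k + 1)

def fc_params(n_units, n_inputs):                # weights + biases
    return n_units * (n_inputs + 1)

layer1 = conv_params(6, 1, 3)    # 6 feature maps, 3x3 receptive field
layer3 = conv_params(16, 6, 3)   # 16 computed neurons (+ QP -> 17)
layer4 = fc_params(10, 17)       # 10 computed units  (+ QP -> 11)
output = fc_params(2, 11)        # O_2N and O_N

assert (layer1, layer3, layer4, output) == (60, 880, 180, 24)
print("total trainable parameters:", layer1 + layer3 + layer4 + output)
```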
The gradient-based learning algorithm [28] is adopted in our CNN training phase. To improve the CNN performance, the training strategy is rectified from two aspects, i.e., sample selection and target value definition, which will be described in the following sub-sections.
A. Training Sample Selection
In the training sample selection stage, we introduce the following techniques.
Firstly, the samples must not belong to the homogeneous type, which is explained in Fig. 2. We define the parameter γ as follows to indicate the edge strength in one CU,

$$\gamma=\frac{\sum_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right)}{49\cdot Q^2},$$

where δx_{i,j} and δy_{i,j} are defined as in (2). In the CNN training phase, it is desired that the samples be evenly distributed over all output categories [32]. When the value of γ is small, there is a high probability of using the 2N×2N mode. To prevent too many homogeneous samples in training, for 32×32 and 16×16 CUs, we use the samples with γ > 0.1; for the 8×8 CU, the samples with γ > 1.3 are adopted.
To capitalize on the edge strength information, we did not normalize the input signals. The vanishing gradient problem (the gradient of the activation function approaches 0 when its input amplitude is large enough) is avoided by using the modified sigmoidal activation function.
We eliminate the samples in which the RD-cost difference between the 2N×2N mode and the N×N mode is too small. The parameter ΔRD is defined as

$$\Delta RD=\frac{C_{2N}-C_{N}}{C_{2N}+C_{N}},$$

where C_2N and C_N denominate the RD-costs of mode 2N×2N and mode N×N, respectively. The samples with |ΔRD| ≤ 0.02 are discarded. This scheme improves the compression efficiency by BDBR = −0.20% on average.
Finally, six typical video sequences, including PeopleOnStreet, BasketballDrive, BQTerrace, Cactus, Vidyo3, and Johnny, are adopted to generate the training samples.
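A sketch of the resulting sample filter is given below, reusing avg_sub and MF from the earlier coarse-analysis sketch; the function name and its interface are ours:

```python
import numpy as np

def keep_sample(po, qp, c_2n, c_n):
    """Training-sample filter sketched from Section III-A.

    po: s-by-s luma block; c_2n, c_n: RD-costs of modes 2Nx2N / NxN.
    """
    s = po.shape[0]
    p = avg_sub(po.astype(np.float64))               # equation (1)
    dx = p[:-1, :-1] + p[1:, :-1] - p[:-1, 1:] - p[1:, 1:]
    dy = p[:-1, :-1] + p[:-1, 1:] - p[1:, :-1] - p[1:, 1:]
    q = MF[qp % 6] * 2.0 ** (qp // 6)
    gamma = (dx ** 2 + dy ** 2).sum() / (49.0 * q * q)
    if gamma <= (1.3 if s == 8 else 0.1):            # too homogeneous
        return False
    delta_rd = (c_2n - c_n) / (c_2n + c_n)
    return abs(delta_rd) > 0.02                      # drop ambiguous labels
```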
B. Target Value Definition
We now derive the target output values for the nodes O_2N and O_N (as shown in Fig. 3(b)) during the training phase. Let the vector e denote the singularities in one CU. The variance of the residual transform coefficients (σ_c) is a function of the singularities, i.e., σ_c(e). From the distortion model provided in [33], the nonlinear relation between the distortion D and Q² can be formulated as

$$D=\Psi(e)\,Q^2,\qquad(7)$$
in which, (e)is a nonlinear function of variable σc(e).Onthe
basis of RD relation [34], [35]
R=1
2ln σ2
c(e)
D,
and the distortion model of (7), the rate has the form
R=1
2ln (e)
Q2,(8)
where, (e)=σ2
c(e)/(e).
With (7) and (8), for the 2N×2N mode, its distortion and rate are both nonlinear functions of e, expressed as

$$D_{2N}=\Psi_{2N}(e)\,Q^2,\qquad R_{2N}=\frac{1}{2}\ln\frac{\Gamma_{2N}(e)}{Q^2},\qquad(9)$$

where Ψ_2N(e) and Γ_2N(e) are nonlinear functions to be approximated. Consequently, we have

$$C_{2N}=D_{2N}+\lambda_{2N}R_{2N}.\qquad(10)$$

The relationships of λ_2N, Q and QP are expressed as

$$\lambda_{2N}=a_{2N}\,2^{QP/3},\qquad(11)$$

and

$$Q\approx b\,2^{QP/6}.\qquad(12)$$

Substituting (9), (11) and (12) into (10), we deduce that the logarithm of C_2N can be formulated as

$$\ln C_{2N}=\mu_0+\mu_1\,QP+\ln\left\{\mu_2\Psi_{2N}(e)+\mu_3\ln[\Gamma_{2N}(e)]+\mu_4\,QP+\mu_5\right\},\qquad(13)$$

in which {μ_i | i = 0, 1, ..., 5} represent the coefficients derived from the parameters a_2N and b. The expression of
ln C_N can be traced by analogy. That is,

$$C_N=D_N+\lambda_N R_N=\Psi_N(e)\,Q^2+\frac{a_N\,2^{QP/3}}{2}\ln\frac{\Gamma_N(e)}{Q^2}.$$

It should be noticed that, because of the changed residue distribution, the Lagrange multipliers of mode 2N×2N and mode N×N are different, λ_2N ≠ λ_N [36]. Then, we have

$$\ln C_N=\nu_0+\nu_1\,QP+\ln\left\{\nu_2\Psi_N(e)+\nu_3\ln[\Gamma_N(e)]+\nu_4\,QP+\nu_5\right\},\qquad(14)$$

in which {ν_i | i = 0, 1, ..., 5} represent the coefficients derived from the parameters a_N and b.
Because the sigmoidal-like activation function in our CNN is an odd function, it is desired that the mean of the two output nodes be zero. Consequently, the teaching output value for O_N is set as

$$O_N=\ln C_{2N}-\ln C_N,$$

and the corresponding teaching output value for O_2N is the negative of O_N. Considering the effects of QP in (13) and (14), QP is adopted as an input to the third and the fourth hidden
TABLE I
DEFINITION OF THRESHOLD τ
Fig. 4. Feature extraction kernels of the first hidden layer in the 32×32/16×16 and 8×8/4×4 CU mode decision CNNs. (a) Kernels of CNN_32. (b) Kernels of CNN_8.
layers, as shown in Fig. 3(b). Introducing QP as an input of the CNN is the prominent improvement as compared with our previous study [32]; up to a BDBR = −2.04% rate reduction could be achieved for the test of Kimono.
As |ln C_2N − ln C_N| always increases with the magnitudes of QP and e, the output of the activation function in our design should not be constrained. Therefore, the activation function in our CNN has the form

$$\Phi(x)=\begin{cases}1.716\tanh(0.667x)&|x|<\tau\\ 1.716\left[\tanh(2\tau/3)+\tanh'(2\tau/3)(x-\tau)\right]&x\geq\tau\\ 1.716\left[-\tanh(2\tau/3)+\tanh'(2\tau/3)(x+\tau)\right]&x\leq-\tau.\end{cases}$$

The values of τ on the different CNN layers are derived by experiments. Let τ_i represent the i-th layer's threshold. The candidate value set of τ_i is defined as τ_i ∈ {1.0, 1.1, 1.2, ..., 3.5}. We test the performance of various combinations of τ_i and find the optimal one. The threshold values in our design are provided in Table I.
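A direct Python rendition of this pseudo sigmoid, under our reading that the linear extensions use the tanh slope at the junction point:

```python
import math

def pseudo_sigmoid(x, tau):
    """Pseudo-sigmoid: scaled tanh inside [-tau, tau], extended
    linearly outside so the output keeps growing instead of
    saturating (our reading of the formula above)."""
    if abs(x) < tau:
        return 1.716 * math.tanh(0.667 * x)
    t = math.tanh(2.0 * tau / 3.0)
    slope = 1.0 - t * t                     # tanh'(2*tau/3)
    if x >= tau:
        return 1.716 * (t + slope * (x - tau))
    return 1.716 * (-t + slope * (x + tau))
```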
The kernels in the first convolution layers of CNN_32 and CNN_8 are illustrated in Fig. 4. The white squares represent positive poles and the black ones denote negative poles. Obviously, these kernels are efficient at detecting singularities, such as end points and corners. For example, Kernel#0 of CNN_32 is composed of a negative pole on the first row and a positive one on the second row; one corner is included in Kernel#0 of CNN_8. The computational complexity of the proposed CNN is summarized in Table II. In total, the CNN processing consumes 3000 multiplications, 3000 additions/subtractions and 244 pseudo sigmoidal functions. The experiments show that the FASTCUMODE function accounts for 2–3% of the HM-12.0 overall Intra encoding time.
Fig. 5. Noisy fixed-point convolution operation (y_{l−1}(i) and ê_{l−1}(i) represent the output signal and the additive roundoff error in the (l−1)-th layer's convolution result; x_l(i) and ε̂_l(i), coming from the activation function, are the input signal and additive noise of the l-th layer; the kernel coefficients and bias are k(i) and b; ε̂(i) is the additive rounding error from the multiplication operation; ŷ_l, which is composed of the signal y_l and the noise ê_l, indicates the convolution result of the l-th layer).
TABLE II
LAYER-WISE COMPUTATIONAL COMPLEXITY OF PROPOSED CNN
IV. VLSI IMPLEMENTATION OF CNN ACCELERATOR
Our analytical models reveal that, when the CNN scale is small, the hardware and power costs of the CNN accelerator can be greatly reduced by using the optimal arithmetic and the associated representation format. Based on the optimal arithmetic, a reconfigurable hardwired accelerator is devised to speed up the computation.
A. Effect of Finite Bit-Depth
In this section, we reveal the impacts of the finite bit-depth and the network scale on the output noise-to-signal ratio (NSR), for fixed-point and floating-point arithmetic, respectively.
Let us build the data-flow graph for the computation error analysis. In the convolution procedure, the feature maps in the current layer are obtained from the neuron outputs of the previous layer. This procedure is formulated as

$$\begin{aligned}y_l(m,n)&=\sum_{t=0}^{T-1}\sum_{p=0}^{K_l-1}\sum_{q=0}^{K_l-1}x_l(t,m+p,n+q)\,k_l(t,p,q)+b\\ x_{l+1}(m,n)&=\mathrm{sigmoid}(y_l(m,n)),\end{aligned}\qquad(15)$$

in which x_l(t, m+p, n+q) denotes the pixel (neuron output) from the t-th feature map in the previous layer, and k_l(t, p, q) and b denote the kernel coefficients and the bias, respectively. Equation (15) is composed of two stages: the biased sum of weighted input pixels (calculating y_l(m,n)), and the limitation of y_l(m,n) by the activation function, i.e., sigmoid(y_l(m,n)). x_{l+1}(m,n) is the input pixel for the following layer.
Each pixel y_l(m,n) can be viewed as the biased weighted average of all pixels in the receptive field. Without loss of generality, it is assumed that the input signals x_l are independent identically distributed (i.i.d.) random variables (r.v.) [37], [38]. Consequently, the statistical properties of y_l are invariant with respect to the indices m and n. In our error analysis, (15) is simplified with 1-D notation, written as

$$y_l=\sum_{i=0}^{I_l-1}x_l(i)\,k_l(i)+b,\qquad(16)$$

where i = t·K_l² + p·K_l + q and I_l = T·K_l².
For fixed-point arithmetic, a (1 + Â + B̂)-bit number is composed of a 1-bit sign, an Â-bit integer part, and a B̂-bit fractional part. That is,

$$\underbrace{b_{S0}}_{\text{sign}}\ \underbrace{b_{I\hat A-1}\ldots b_{I0}}_{\text{integer}}\ \blacktriangle\ \underbrace{b_{F\hat B-1}\ldots b_{F0}}_{\text{fraction}},$$

in which ▲ denotes the binary point.
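As a concrete illustration, the sketch below (our helper, not part of the accelerator RTL) rounds a real value to this sign/integer/fraction format; the saturation policy is an assumption, and Â = 5, B̂ = 6 is the 12-bit word evaluated in Section V:

```python
def quantize_fixed(x, a_hat=5, b_hat=6):
    """Round x to the (1 + a_hat + b_hat)-bit sign/integer/fraction
    format above (a_hat = 5, b_hat = 6 gives the 12-bit word)."""
    scale = 1 << b_hat                        # 2 ** b_hat
    top = (1 << a_hat) - 1.0 / scale          # largest magnitude
    x = max(-top, min(top, x))                # saturate: no overflow
    return round(x * scale) / scale

# The roundoff error is at most 2 ** -(b_hat + 1) = 1/128 for b_hat = 6:
assert abs(quantize_fixed(3.14159) - 3.14159) <= 2 ** -7
```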
Â is devised to prevent overflow when representing the maximum amplitude of the CNN input and output signals. The bit-depth of the fractional part (B̂) affects the energy of the roundoff noises. With no overflow, the additive roundoff noises in the fixed-point CNN merely come from the multiplications in (16). The error analysis model of two successive layers is illustrated in Fig. 5. As mentioned in [37] and [38], it is assumed that all signals and noises are zero-mean independent variables, and the variance of the error ε̂(i) is determined by B̂, i.e.,

$$\begin{aligned}&E[\hat\varepsilon(i)\cdot\hat\varepsilon(j)]=0\quad\text{when }i\neq j\\ &E[\hat\varepsilon^2(i)]=2^{-2\hat B}/12\\ &E[x_l(i)\cdot x_l(j)]=0\quad\text{when }i\neq j\\ &E[x_l^2(i)]=\sigma_{x_l}^2\\ &E[x_l(i)\cdot\hat\varepsilon(j)]=0\quad\forall\,i\text{ and }j,\end{aligned}$$

in which E[·] denotes the expectation of its input variable.
In general, the activation function in a CNN is non-linear. We need to investigate the effects of the activation function on the NSR. From Appendix A, we derive Lemma 1. Because the traditional sigmoid and ReLU functions, as well as the pseudo sigmoid function adopted in this article, always satisfy the premises of Lemma 1, it is concluded that the output NSR of the activation function is less than its input NSR.
Lemma 1: For an activation function Φ(y), if it possesses the properties that Φ(0) = 0, the sign of Φ′(·) is invariant, and Φ′(y₂) ≤ Φ′(y₁) when |y₂| > |y₁|, the output NSR is always not greater than the input NSR.
Let NSR_l denote the local NSR of the l-th layer.
Fig. 6. Noisy floating-point convolution operation (y_{l−1}(i) and ẽ_{l−1}(i) denote the output signal and the associated roundoff error in the (l−1)-th layer's convolution result; the kernel coefficients and bias are k(i) and b; η̃(i) and ζ̃(i) represent the multiplicative roundoff errors of the floating-point multiplication and addition operations, respectively; ỹ_l, being composed of the signal y_l and the noise ẽ_l, indicates the convolution result of the l-th layer).
Specifically, if σ_{ε̂_l} denotes the variance of the noises stemming from the l-th layer's multiplications, and σ_{y_l} represents the variance of y_l, NSR_l is defined as

$$\mathrm{NSR}_l=\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{y_l}^2}.$$

From the analysis in Appendix B, NSR_l is formulated as

$$\mathrm{NSR}_l=\frac{2^{-2\hat B}(I_l+1)}{12\,\sigma_{y_l}^2}.\qquad(17)$$

NSR_l always increases linearly with the number of noise sources (I_l), and is inversely proportional to the power of the output signal (σ²_{y_l}).
Theorem 1: For a fixed-point CNN whose activation functions meet the requirements provided in Lemma 1, the NSR upper bound of the l-th layer output is equal to the sum of all NSR_i (0 ≤ i ≤ l), i.e.,

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\sum_{i=0}^{l}\mathrm{NSR}_i.\qquad(18)$$

The detailed analysis of Theorem 1 is given in Appendix B. From (17) and (18), the final output NSR always increases with the network scale, which is determined by the parameters I_i and l.
For floating-point arithmetic, a (1 + Ã + B̃)-bit number is composed of three sections, i.e., a 1-bit sign, an Ã-bit exponent, and a B̃-bit mantissa, which is described as

$$\underbrace{b_{S0}}_{\text{sign}}\ \underbrace{b_{E\tilde A-1}\ldots b_{E0}}_{\text{exponent}}\ \underbrace{b_{M\tilde B-1}\ldots b_{M0}}_{\text{mantissa}}.\qquad(19)$$

The value of a number x represented by (19) is

$$x=(-1)^{b_{S0}}\times\left(1+\sum_{i=0}^{\tilde B-1}b_{Mi}\cdot 2^{i-\tilde B}\right)\times 2^{\left(\sum_{i=0}^{\tilde A-1}b_{Ei}\cdot 2^{i}\right)-E_{bias}},$$

where E_bias is the exponent bias. In our design, E_bias is set as 2^{Ã−1}.
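The sketch below (our helper; subnormal and overflow corner cases are simplified away) encodes a real value into this format and back, illustrating the roughly 2^{−(B̃+1)} relative roundoff error assumed in the analysis:

```python
import math

def quantize_float(x, a_tilde=4, b_tilde=5):
    """Round x to the (1 + a_tilde + b_tilde)-bit format of (19):
    sign, a_tilde-bit exponent with bias 2**(a_tilde - 1), and a
    b_tilde-bit mantissa with an implicit leading one."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    bias = 1 << (a_tilde - 1)
    e = math.floor(math.log2(abs(x)))
    e = max(-bias, min(bias - 1, e))          # representable exponents
    m = round(abs(x) / 2.0 ** e * 2 ** b_tilde) / 2 ** b_tilde
    return sign * m * 2.0 ** e

# Relative roundoff error stays within 2 ** -(b_tilde + 1):
x = 0.7031
assert abs(quantize_float(x) - x) / x <= 2.0 ** -6
```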
The data flow of the noisy floating-point operation based CNN is depicted in Fig. 6. Different from the fixed-point counterpart, the roundoff errors in the floating-point operations (η̃(·) and ζ̃(·) in Fig. 6) are multiplicative. For example, when we perform the first floating-point multiplication in (16), besides the signal x_l(0)k_l(0), the noise x_l(0)k_l(0)η̃(0) is also generated. In addition, all multiplications and additions in the floating-point CNN generate noises.
As in the investigation of the fixed-point CNN, it is assumed that η̃(i) and ζ̃(i) are zero-mean i.i.d. r.v. [37]. If η̃(i) and ζ̃(i) are both uniformly distributed in the range of [−2^{−(B̃+1)}, 2^{−(B̃+1)}], the variances of η̃(i) and ζ̃(i) can be expressed as

$$\sigma_{\tilde\eta}^2=\sigma_{\tilde\zeta}^2=\frac{2^{-2\tilde B}}{28}.$$
From the above properties, we can derive the local NSR, i.e., NSR_l, which is produced by the floating-point operations in the l-th layer. The mathematical analysis in Appendix C yields the expression of NSR_l as

$$\mathrm{NSR}_l=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\,\sigma_{x_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\,\sigma_{x_l}^2(i)}.$$

Consequently, the NSR upper bound for the multiple-layer floating-point CNN is given by Theorem 2. The associated investigation can be referred to in Appendix C.
Theorem 2: For a floating-point CNN whose activation functions meet the requirements provided in Lemma 1, the NSR upper bound of the l-th layer output is

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\left(\sum_{i=0}^{l}I_i+2l\right)\frac{2^{-2\tilde B}}{28}.$$
From Theorem 1 and Theorem 2, it is concluded that the NSR upper bound is linear with the network scale. Namely, for a small-scale CNN, we can reduce the bit-depth of the fixed-point fractional part and that of the floating-point mantissa part, while maintaining the desired precision. As illustrated in Section V, the 12-bit fixed-point (Â = 5 and B̂ = 6) and 10-bit floating-point (Ã = 4 and B̃ = 5) formats both fulfill the precision requirements.
As discussed in Section IV-B, the primary arithmetic modules in our CNN accelerator include nine multipliers and three four-operand adder-trees. As compared to the fixed-point arithmetic, because of the bit-depth reduction, the hardware saving gains of the floating-point multipliers outweigh the losses arising from the floating-point adder-trees, which finally leads to an average 31% gate count reduction for the arithmetic operations. Therefore, we adopted the floating-point arithmetic in our CNN accelerator design.
Fig. 7. Reconfigurable PPMAC architecture of the CNN accelerator design (LZC: leading zero counter; sigmoid: sigmoid function shared by the convolution and MLP layers; S_i, E_i and M_i (i ∈ {0, 1, 2}): sign, exponent and mantissa of the i-th operand to the 3-Operand Adder; S_R, E_R and M_R: sign, exponent and mantissa of the 3-Operand Adder result; w_ij is the kernel coefficient in the convolutional layers and the weight in the MLP layers).
TABLE III
DATA FLOW OF PROPOSED PPMAC IN CONVOLUTION PROCESSING
B. Propagate Partial Multiply-Accumulate CNN Accelerator
To accelerate the 2D convolution and MLP processing, we propose the reconfigurable PPMAC architecture, as shown in Fig. 7. The data flow of PPMAC is inspired by the Propagate Partial SAD architecture of the motion estimation accelerator in the H.264/AVC video encoder [39]. The architecture is composed of three row-wise 1D PE arrays (1D-PEA_i, i ∈ {0, 1, 2}). 1D-PEA_i is a 4-stage pipeline, including a 1-stage multiplier, a 2-stage 3-operand adder, and a last-stage 2-operand adder.
For the convolution, the last-stage adder of 1D-PEA_i (i ∈ {1, 2}) is configured to fetch the output of 1D-PEA_{i−1} as an operand. This operand of 1D-PEA_0 is the bias, i.e., b in (15). The row-wise input vectors of 1D-PEA_i are X_0 and X_1, which are both of size 1×3 (one row by three columns). The data flow of PPMAC is described in Table III. With the 4-stage pipelined 1D-PEA_i, our design achieves a 714 MHz clock speed even under the worst working conditions (0.9 V, 125 °C). It should be emphasized that our design need not be compliant with the IEEE-754 specifications, which provides more design space for hardware optimizations. For example, the 3-operand adder in our design applies the 2-stage uniform-aligning based addition architecture, as shown in Fig. 7, to reduce the circuit complexity. To maintain the precision, after the uniform aligning, we extend two least significant bits. The rounding operation is a must; otherwise, the result amplitude always decreases. With the above optimizations, 6.4–38.9% of the chip area could be saved for the 3-operand adder. Because the last-stage adder in 1D-PEA_i is used as the accumulator in the MLP, it cannot be divided into more pipeline stages. For this adder, we apply the 1-stage dual-path approach [40]. Except for the two initial cycles, the hardware utilization of the 2D PE-Array is 100% during the first convolution layer.
In the first-layer convolution, the output of the 2D PE-Array is directly dispatched to the sigmoid module, which then feeds its output to the 2×2 max-pooling stage. Because the convolution
TABLE IV
DATA FLOW OF PPMAC IN MLP PROCESSING
is column-wise and one pixel is generated in each cycle, three pixel registers (r0, r1, and r2) are equipped to store the local maximum pixels in the corresponding 2×2 receptive fields.
For the MLP layers, each neuron y_j in the current layer first forms the biased inner product of its weight vector (w_ij) and the output vector of the previous layer (x_i), which is formulated as

$$y_j'=\sum_{i=0}^{I-1}w_{ij}x_i+b_j,\qquad(20)$$

and then emits

$$y_j=\mathrm{sigmoid}(y_j').\qquad(21)$$

In the MLP computation, each 1D-PEA is configured to work independently. Specifically, the final adder in 1D-PEA_i fetches one operand from the last-stage accumulation registers. The 3-pixel inputs of 1D-PEA_i all come from X_1. The outputs of 1D-PEA_i belong to different neurons in the output layer. The initial value of the accumulation registers is set as the corresponding bias. The data flow of the Hidden 17@1×1 layer to Hidden 11@1×1 layer processing is described in Table IV. It should be noticed that the Feature maps 6@3×3 to Hidden 17@1×1 convolution operations also adopt the MLP-like data flow, instead of the one in Table III. This is mainly because the output number for each input feature map is one; considering the bubbles in initialization, the hardware utilization of the 2D PE-Array would be merely 33% with the convolution data flow. In contrast, using the MLP data flow, this metric is improved up to 84%. In the MLP procedure, 1D-PEA_i implements the MAC operations in (20), and the intermediate results are saved in the Data Buffer, as shown in Fig. 7. The sigmoid module fetches the intermediate MAC results and carries out the activation processing (expressed as (21)), which does not disturb the pipeline scheduling of 1D-PEA_i.
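The two operating modes can be summarized by the behavioural Python sketch below (our functional model, not a cycle-accurate description of the pipeline): in convolution mode the partial sum propagates through the three 1D-PEAs, while in MLP mode each 1D-PEA accumulates one neuron on its own, consuming three input terms per cycle:

```python
import numpy as np

def ppmac_conv(window, kernel, bias):
    """Convolution mode: 1D-PEA_r forms the dot product of one 3-wide
    kernel row with one input row and adds the partial sum propagated
    from 1D-PEA_(r-1); 1D-PEA_0 starts from the bias."""
    partial = bias
    for r in range(3):                       # three row-wise 1D PE arrays
        partial += float(np.dot(kernel[r], window[r]))
    return partial

def ppmac_mlp(weights, x, biases):
    """MLP mode: each 1D-PEA accumulates one neuron independently,
    three weighted inputs per step, starting from its bias register."""
    y = np.array(biases, dtype=float)
    for j in range(len(y)):                  # one neuron per 1D-PEA
        for k in range(0, len(x), 3):        # 3-operand adder per cycle
            y[j] += float(np.dot(weights[j][k:k + 3], x[k:k + 3]))
    return y
```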
V. EXPERIMENTS
In this section, we first evaluate the coding performance of the proposed fast CU/PU mode decision algorithm. Thereafter, we analyze the effects of finite-precision arithmetic on the coding quality. Finally, we investigate the throughput, the hardware cost, and the power dissipation of our CNN accelerator in detail.
A. Performance Analysis of Fast CU/PU Mode Decision
The proposed method is conducted on the HEVC reference test model HM12.0. The test platform is a Huawei RH5885, which combines an Intel® Xeon™ E7-4830-v2 2.20 GHz processor and 128.0 GB RAM. Twenty-seven typical open test sequences, including Classes A–F, were tested with QP = {22, 27, 32, 37} and the intra_main configuration [41]. To verify the robustness of our algorithm, we introduced an additional six HD1080p@25fps in-house sequences provided by Huawei. These videos cover representative surveillance scenarios, such as a building interior environment (B2Inside), open spaces with moving people (3PeopleD4, GrassTreeD4), traffic scenes (RoadB, RoadCar), and surveillance under weak light (LowlightNight).
The coding performance results are shown in Table V, in which the original HM12.0 is the standard benchmark. BP and BR in Table V stand for the average PSNR difference (BDPSNR) and the average bit-rate difference (BDBR) [42], respectively. ΔT represents the encoding time reduction, which is defined as

$$\Delta T=\frac{T_{HM}-T_{CNN}}{T_{HM}}\times 100\%,$$

with T_HM and T_CNN denoting the encoding times of the original HM12.0 and the CNN based counterpart, respectively. Recalling the pseudo codes in Fig. 2, the configuration indicates the specific CU/PU depths at which the fast mode decision is performed. In our evaluations, four configurations, i.e., [E(32), E(16), E(8)] = {[0,0,1], [0,1,1], [1,0,1], [1,1,1]}, were tested. Obviously, our algorithm possesses computational scalability. That is, we can reduce more complexity by sacrificing compression efficiency. It was observed that the configuration [0,0,1] achieved the best coding quality, i.e., BDBR = +1.54% on average, while the encoding time reduction was merely 43.7%. In contrast, the time saving of [1,1,1] was 72.0%; meanwhile, its coding performance was degraded with BDBR = +4.79%. The configuration [1,0,1] achieved a good balance between the coding efficiency and the computational reduction. In this context, the BDBR increase was +2.67%, while 61.1% of the encoding complexity was saved. As compared with the counterpart [0,1,1], the setting [1,0,1] had the advantages of both coding quality and processing speed.
The performance comparisons of the proposed algorithm and other recent works [14], [16], [19]–[21], [26], [32] are illustrated in Table VI. ΔT_C and ΔT_P stand for the encoding time reductions from the fast CU/PU mode and the prediction mode methods, respectively. BR_MAX represents the maximum BDBR increase. VLSI indicates the algorithm's friendliness to
TABLE V
CODING PERFORMANCE ANALYSIS OF OUR FAST CU/PU MODE DECISION UNDER VARIOUS CONFIGURATIONS
the hardwired encoder design. All performance statistics of the previous works are cited from the references.
The software-oriented fast algorithms, i.e., [14], [16], [19], [21], provide superior coding quality to ours. However, for the hardwired encoder design, the above methods have the following hindrances: First, the CTU-grain maximum encoding complexity is not pruned, which prevents the fast algorithm from contributing to the optimization of the encoder's hardware cost. For instance, to maintain the coding quality, [19] and [21] define conservative thresholds. Only when one CU satisfies these thresholds is its CU splitting trial terminated; otherwise, the CU mode decision is carried out as in the original full RDO procedure. We can see that the fast CU mode method in [19] merely reduced 26% of the overall encoding time. The algorithm of [14] has a similar problem: the maximum encoding complexity of one CTU, which occurs in the parameter estimation phase, does not diminish. Second, the CU-level data dependency in [16], which uses the current CU depth coding information to prune the RDO processing of the deeper CU levels, encumbers the encoding parallelism. Namely, if this algorithm is adopted, the encoding throughput will be degraded.
The methods of [20], [26], [32] are VLSI friendly, because they merely employ source image texture analysis to predict the promising CU mode competitors. As depicted in Fig. 3, since the CU mode pre-decision engine can be allocated to the preceding CTU pipeline stage, the rhythm of the following RDO based CTU encoding will not be hindered. However, because [20] and [26] did not use the topology information of the feature points, their coding efficiency losses are obvious, i.e., BDBR = +5.10% and BDBR = +4.53%, respectively. In the
Fig. 8. CU/PU partition comparisons between HM-12.0 and our fast algorithm (BasketballPass, QP = 22). (a) HM-12.0. (b) Proposed algorithm.
TABLE VI
PERFORMANCE COMPARISONS BETWEEN THE PROPOSED SOLUTION AND EXISTING ALGORITHMS
work [32], we adopted a CNN to resolve the above problem, which ameliorates the coding loss to BDBR = +3.39%. As compared to [32], the primary improvements of our most recent study include the introduction of QP into the CNN and the refinement of the training strategies. Consequently, the average BDBR is merely +2.67%. The worst case comes from Vidyo3, for which the BDBR increases of this study and the counterpart [32] are +5.01% and +5.72%, respectively.
The visualized CU/PU partition comparisons between the HM12.0 benchmark and ours (with the configuration [E(32), E(16), E(8)] = [1,0,1]) are illustrated in Fig. 8. Our results for CU blocks with strong singularities, such as the blocks on object outlines, agree well with the benchmarks. In contrast, as highlighted with the red borders, for the blocks lacking features, especially on picture boundaries, the probability of false prediction increases.
B. Effects of Finite Bit-Depth Arithmetic
We analyze the effects of finite precision on the video coding quality, for both fixed-point and floating-point arithmetic. As aforementioned, the computational precision is improved by increasing the bit-depth. On the other hand, the CNN hardware cost also rises proportionally to the bit-depth. Our target is to find the minimum bit-depth that satisfies the coding quality requirements. Twenty representation formats were
TABLE VII
CODING QUALITY ANALYSIS WITH DIFFERENT REPRESENTATION FORMATS (BDBR UNIT: %)
tested for floating-point and fixed-point arithmetic, respectively, as shown in Table VII. With a 4-bit exponent and a 5-bit mantissa, the floating-point arithmetic achieved a compression ratio competitive with the traditional 32-bit floating-point computation. The sequence-wise statistics are provided in Table VIII. As compared with Table V, the fluctuation in performance incurred by the short bit-depth is less than BDBR = 0.6%, observed in the sequence Kimono. For fixed-point arithmetic, comparable coding quality is obtained with the 12-bit representation format, in which Â = 5 and B̂ = 6. Experiments revealed that the 10-bit floating-point CNN accelerator could save 21% of the hardware cost as compared with the 12-bit fixed-point counterpart.
C. Performance Analysis of CNN Accelerator
With TSMC 65nm CMOS technology, our CNN accelerator was described in Verilog-HDL and synthesized with the Synopsys Design Compiler. IC Compiler was adopted to implement the place-and-route job.
The synaptic weight storage is one hardware-consuming component of a CNN VLSI implementation. In [43], eDRAM is applied in a general-purpose CNN processor to hold all synapse values on-chip. In our design, if we use a register file to buffer the weights, the corresponding hardware cost is 23.2k gates. As the weight coefficients were trained off-line, we adopted combinational logic to realize the synaptic weight storage, which merely accounts for 8.9k gates. As compared with the traditional on-chip register-file approach, 14.3k gates were saved.
TABLE VIII
CODING QUALITY ANALYSIS OF FLOATING-POINT WITH Ã = 4, B̃ = 5 AND [E(32), E(16), E(8)] = [1,0,1]
TABLE IX
HARDWARE COST AND POWER CONSUMPTION ANALYSIS OF PROPOSED CNN ACCELERATOR (TSMC 65nm CMOS TECHNOLOGY, WORST CONDITIONS: 125 °C, 0.9 V)
The hardware cost and power consumption statistics of the proposed CNN accelerator design under typical working frequencies are illustrated in Table IX. Under the worst working conditions (0.9 V, 125 °C), the peak clock speed is 714 MHz and the associated gate count is 42.5k. The CNN accelerator is merely 2.1–3.9% of the overall encoder cost, which is
Fig. 9. Data flow of the noisy signal propagating through the activation function.
2055k gates and 1086k gates in [25] and [26], respectively. The power consumption of our design is 16.2 mW at 714 MHz. It takes 372 cycles to process one 8×8 image block. One CNN accelerator can fulfill the throughput of HD1080p@55fps real-time encoding. Because our algorithm is based on source texture analysis, multiple 8×8 image blocks can be processed in parallel. When facing higher resolution specifications, the throughput requirements are met by using parallelism. For example, the real-time processing of 4K@55fps videos can be handled by using four of the proposed CNN accelerators.
VI. CONCLUSION
This paper presents the CNN based fast CU/PU mode decision to reduce the maximum Intra coding complexity of one CTU for hardwired HEVC encoder design. Specifically, the CNN investigates the textures of one CU, and then determines the promising candidate in each of the 32×32/16×16 and 8×8/4×4 CU/PU mode pairs. The contributions of our proposals include: (1) Because the maximum number of CU/PU candidate modes in one CTU is reduced, the corresponding VLSI encoder hardware complexity is ameliorated; (2) With the CTU pipeline architecture, the parallelism of the critical RDO processing is not deteriorated by our fast algorithm; (3) On the basis of the theoretical analysis, a reconfigurable CNN accelerator is developed using the optimal floating-point arithmetic, which greatly reduces the hardware overhead of our algorithm. The experiments demonstrate that, on average, 61.1% of the Intra encoding complexity is reduced, whereas the incurred compression loss is merely BDBR = +2.67%, or equivalently a BDPSNR = −0.15 dB quality loss. Using TSMC 65nm CMOS technology, one hardwired CNN accelerator, which achieves a 714 MHz clock speed in the worst conditions, is implemented with 42.5k logic gates. The power dissipation of our accelerator is 16.2 mW at 714 MHz. One CNN accelerator fulfills the throughput requirement of HD1080p@55fps real-time encoding, and higher performance can be achieved by applying parallelism.
APPENDIX A
NSR PROPAGATION PROPERTY OF ACTIVATION FUNCTION
As shown in Fig. 9, if the input signal y and the additive noise e are independent zero-mean r.v., the output signal and the corresponding noise of the activation function are labeled as x and ε, respectively. It is reasonable to assume that the variance of e, i.e., σ_e, is much less than that of the input signal, i.e., σ_y. With the aid of Taylor's theorem, we can derive that

$$x=\Phi(y),\qquad \varepsilon=\Phi'(y)\cdot e.\qquad(A.1)$$
For any y,x=(y)can be formulated as
x=(0)+N1
k=0(k) ·
where, =y/Nand N→∞. Using the properties of (·)
claimed in lemma1, we deduce
|x|(y) N1
k=0=(y)y.(A.2)
From (A.1) and (A.2), we obtain that
Eε2
x2=Ee2((y))2
2(y)
Ee2((y))2
y2((y))2
=Ee2
y2.
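This property is easy to probe numerically. The sketch below (our check, using a vectorized version of the pseudo_sigmoid sketch from Section III) pushes a noisy Gaussian signal through the activation and verifies that the NSR does not grow:

```python
import numpy as np

rng = np.random.default_rng(7)

def pseudo_sigmoid_vec(y, tau):
    """Vectorized pseudo-sigmoid (see the sketch in Section III)."""
    t = np.tanh(2.0 * tau / 3.0)
    slope = 1.0 - t * t
    inner = 1.716 * np.tanh(0.667 * y)
    upper = 1.716 * (t + slope * (y - tau))
    lower = 1.716 * (-t + slope * (y + tau))
    return np.where(y >= tau, upper, np.where(y <= -tau, lower, inner))

def activation_nsr(tau=1.8, sigma_y=2.0, sigma_e=0.05, n=200_000):
    """Compare the NSR before and after the activation (Lemma 1)."""
    y = rng.normal(0.0, sigma_y, n)
    e = rng.normal(0.0, sigma_e, n)
    x = pseudo_sigmoid_vec(y, tau)
    x_noisy = pseudo_sigmoid_vec(y + e, tau)
    nsr_in = np.mean(e ** 2) / np.mean(y ** 2)
    nsr_out = np.mean((x_noisy - x) ** 2) / np.mean(x ** 2)
    assert nsr_out <= nsr_in   # the activation does not amplify the NSR
    return nsr_in, nsr_out

print("NSR in %.3e -> NSR out %.3e" % activation_nsr())
```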
APPENDIX B
ROUNDING ERROR ANALYSIS OF FIXED-POINT CNN
Let the input noise signals ε̂_l(i) be i.i.d. r.v. with variance σ_{ε̂_l}. From the data flow in Fig. 5, we have

$$\sigma_{y_l}^2=E\left[\left(\sum_{i=0}^{I_l-1}x_l(i)k_l(i)\right)^2\right].\qquad(B.1)$$

Using E[x_l²(i)] = σ²_{x_l} and E[x_l(i)·x_l(j)] = 0 (if i ≠ j), (B.1) is recast as

$$\sigma_{y_l}^2=\sigma_{x_l}^2\cdot\|k_l\|^2,\qquad(B.2)$$

where ‖k_l‖² = Σ_{i=0}^{I_l−1} k_l²(i). Assuming the roundoff error ε̂(i) is uniformly distributed in [−2^{−(B̂+1)}, 2^{−(B̂+1)}], E[ε̂²(i)] is equal to 2^{−2B̂}/12. The variance of ê_l is formulated as

$$\begin{aligned}\sigma_{\hat e_l}^2&=E\left[\left(\sum_{i=0}^{I_l}\hat\varepsilon(i)+\sum_{i=0}^{I_l-1}\hat\varepsilon_l(i)k_l(i)\right)^2\right]\\ &=\sum_{i=0}^{I_l}E[\hat\varepsilon^2(i)]+\sum_{i=0}^{I_l-1}E[\hat\varepsilon_l^2(i)]\,k_l^2(i)\\ &=\frac{2^{-2\hat B}}{12}(I_l+1)+\sigma_{\hat\varepsilon_l}^2\|k_l\|^2.\end{aligned}\qquad(B.3)$$

The output NSR of the l-th layer is defined as

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}.\qquad(B.4)$$

Substituting (B.2) and (B.3) into (B.4), we obtain

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}=\frac{2^{-2\hat B}(I_l+1)}{12\,\sigma_{y_l}^2}+\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{x_l}^2}.\qquad(B.5)$$

It should be noted that the first term in (B.5) is the NSR generated by the local noises. In addition, with the conclusion of Lemma 1, the second term in (B.5) is always less than the output NSR of the (l−1)-th layer, i.e.,

$$\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{x_l}^2}\leq\frac{\sigma_{\hat e_{l-1}}^2}{\sigma_{y_{l-1}}^2}.$$

In consequence, the upper bound of the l-th layer's NSR is derived as

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\mathrm{NSR}_l+\frac{\sigma_{\hat e_{l-1}}^2}{\sigma_{y_{l-1}}^2},\qquad(B.6)$$

in which the variable NSR_l = 2^{−2B̂}(I_l+1)/(12σ²_{y_l}) is the local NSR. Using (B.6) and noticing σ²_{ê_0}/σ²_{y_0} = NSR_0, the NSR upper bound of the l-th layer has the simple expression

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\sum_{i=0}^{l}\frac{2^{-2\hat B}(I_i+1)}{12\,\sigma_{y_i}^2}=\sum_{i=0}^{l}\mathrm{NSR}_i.$$
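A quick Monte-Carlo experiment (our sketch, with arbitrary kernel and signal statistics) reproduces the local-NSR expression: rounding the I_l products and the bias to B̂ fraction bits yields a measured NSR that agrees with 2^{−2B̂}(I_l+1)/(12σ²_{y_l}) to within a few percent:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_nsr(i_l=54, b_hat=6, trials=50_000):
    """Measure the local NSR of one fixed-point convolution layer by
    rounding the i_l products and the bias to b_hat fraction bits,
    then compare with 2**(-2*b_hat) * (i_l + 1) / (12 * var(y))."""
    q = 2.0 ** -b_hat                         # quantization step
    k = rng.uniform(-1.0, 1.0, i_l)           # fixed kernel coefficients
    x = rng.normal(0.0, 0.3, (trials, i_l))   # i.i.d. input pixels
    b = 0.437                                 # arbitrary bias value
    y = x @ k + b                             # exact convolution result
    y_hat = (np.round(x * k / q) * q).sum(axis=1) + np.round(b / q) * q
    measured = np.mean((y_hat - y) ** 2) / np.var(y)
    predicted = q * q * (i_l + 1) / (12.0 * np.var(y))
    return measured, predicted

print("measured %.3e / predicted %.3e" % local_nsr())
```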
APPENDIX C
ROUNDING ERROR ANALYSIS OF FLOATING-POINT CNN
From the data flow depicted in Fig. 6, we have

$$\tilde y_l=\sum_{i=0}^{I_l-1}\Theta(i)\,k_l(i)\left[x_l(i)+\tilde\varepsilon_l(i)\right]+\Theta(I_l)\,b,$$

where Θ(·) is defined as

$$\Theta(i)=\begin{cases}\left(1+\tilde\eta(0)\right)\prod_{j=1}^{I_l}\left(1+\tilde\zeta(j)\right)&\text{for }i=0\\ \left(1+\tilde\eta(i)\right)\prod_{j=i}^{I_l}\left(1+\tilde\zeta(j)\right)&\text{otherwise}.\end{cases}$$

Then, it is straightforward to obtain

$$\begin{aligned}\tilde e_l&=\sum_{i=0}^{I_l-1}(\Theta(i)-1)k_l(i)x_l(i)+\sum_{i=0}^{I_l-1}k_l(i)\tilde\varepsilon_l(i)\\ &\quad+\sum_{i=0}^{I_l-1}(\Theta(i)-1)k_l(i)\tilde\varepsilon_l(i)+(\Theta(I_l)-1)\,b.\end{aligned}\qquad(C.1)$$

From the zero-mean and i.i.d. properties of η̃(i) and ζ̃(i), we have

$$E[\Theta(i)]=1,\qquad E\left[(\Theta(i)-1)^2\right]=E\left[\Theta^2(i)\right]-1.\qquad(C.2)$$

As mentioned in Section IV-A, because the mantissa and the associated roundoff error are assumed to possess uniform distributions in [1, 2] and [−2^{−(B̃+1)}, 2^{−(B̃+1)}] respectively, the variances of η̃(i) and ζ̃(i) are the same:

$$E[\tilde\eta^2(i)]=E[\tilde\zeta^2(i)]=\frac{1}{2^{-\tilde B}}\cdot\frac{\int_{-2^{-(\tilde B+1)}}^{2^{-(\tilde B+1)}}x^2\,dx}{\int_{1}^{2}x^2\,dx}=\frac{2^{-2\tilde B}}{28}.$$

In consequence, we obtain

$$E[\Theta^2(i)]=\begin{cases}\left(1+\frac{2^{-2\tilde B}}{28}\right)^{I_l+1}&\text{for }i=0\\ \left(1+\frac{2^{-2\tilde B}}{28}\right)^{I_l+2-i}&\text{otherwise}.\end{cases}\qquad(C.3)$$

When I_l ≫ 1, we can approximate E[Θ²(0)] with the general expression of E[Θ²(i)].
From (C.1) and (C.2), we can see that ẽ_l is zero-mean and its variance is expressed as

$$\begin{aligned}\sigma_{\tilde e_l}^2&=\sum_{i=0}^{I_l-1}\left(E[\Theta^2(i)]-1\right)k_l^2(i)\sigma_{x_l}^2(i)+\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)\\ &\quad+\sum_{i=0}^{I_l-1}\left(E[\Theta^2(i)]-1\right)k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)+\left(E[\Theta^2(I_l)]-1\right)b^2.\end{aligned}\qquad(C.4)$$

In addition, when 2^{−2B̃}/28 ≪ 1, which always holds for B̃ > 1, (C.3) can be approximated by discarding the high-order terms of 2^{−2B̃}/28. That is,

$$E[\Theta^2(i)]\approx 1+(I_l+2-i)\,\frac{2^{-2\tilde B}}{28}.\qquad(C.5)$$

Substituting (C.5) into (C.4) yields

$$\begin{aligned}\sigma_{\tilde e_l}^2&=\sum_{i=0}^{I_l-1}\frac{2^{-2\tilde B}}{28}(I_l+2-i)\,k_l^2(i)\sigma_{x_l}^2(i)+\frac{2^{-2\tilde B}}{28}(I_l+2-I_l)\,b^2\\ &\quad+\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)+\sum_{i=0}^{I_l-1}\frac{2^{-2\tilde B}}{28}(I_l+2-i)\,k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i).\end{aligned}\qquad(C.6)$$

Because 2^{−2B̃}/28 ≪ 1 and σ²_{ε̃_l}(i) ≪ σ²_{x_l}(i), we discard the last term in (C.6) to get the expression of the l-th layer's NSR as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\sigma_{x_l}^2(i)+\frac{2b^2}{I_l+2}}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.7)$$

The first term in (C.7), which is labelled as NSR_l, represents the effect of the local arithmetic operations. When I_l ≫ 1, the effect of b² can be neglected. Therefore, (C.7) is recast as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}=\mathrm{NSR}_l+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)},\quad \mathrm{NSR}_l=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\sigma_{x_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.8)$$

It should be emphasized that the second fraction in NSR_l must be less than 1. Therefore, the upper bound of σ²_{ẽ_l}/σ²_{y_l} is

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\frac{2^{-2\tilde B}(I_l+2)}{28}+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.9)$$

By mathematical induction, we can derive the concise expression of the output NSR upper bound as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{l}I_i+2l\right).\qquad(C.10)$$

Proof: When l = 0, because σ²_{ε̃_l}(i) = 0, (C.10) is true. Assuming that (C.10) holds when l = t−1, we just need to demonstrate the truth of (C.10) when l = t. From Lemma 1 and the assumption for l = t−1, we have

$$\frac{\sigma_{\tilde\varepsilon_t}^2(i)}{\sigma_{x_t}^2(i)}\leq\frac{\sigma_{\tilde e_{t-1}}^2}{\sigma_{y_{t-1}}^2}\leq\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t-1}I_i+2(t-1)\right).\qquad(C.11)$$

With (C.11) and (C.9), when l = t, we have

$$\frac{\sigma_{\tilde e_t}^2}{\sigma_{y_t}^2}\leq\frac{2^{-2\tilde B}(I_t+2)}{28}+\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t-1}I_i+2(t-1)\right)=\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t}I_i+2t\right).$$

Therefore, the theorem is proved.
REFERENCES
[1] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T. Wiegand, High Efficiency Video Coding (HEVC) Text Specification Draft 10, document JCTVC-L1003, Geneva, CH, 2013.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards—Including high efficiency video coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[4] Y. Piao, J. Min, and J. Chen, Encoder Improvement of Unified Intra Prediction, document JCTVC-C207, Guangzhou, CN, 2010.
[5] L. Zhao, L. Zhang, X. Zhao, S. Ma, D. Zhao, and W. Gao, Further Encoder Improvement of Intra Mode Decision, document JCTVC-D283, Daegu, South Korea, 2011.
[6] S. Ma, S. Wang, S. Wang, L. Zhao, Q. Yu, and W. Gao, "Low complexity rate distortion optimization for HEVC," in Proc. Data Compress. Conf. (DCC), Mar. 2013, pp. 73–82.
[7] Q. Chen and Y. He, "A fast bits estimation method for rate-distortion optimization in H.264/AVC," in Proc. Picture Coding Symp. (PCS), Dec. 2004, pp. 133–134.
[8] Y.-K. Tu, J.-F. Yang, and M.-T. Sun, "Efficient rate-distortion estimation for H.264/AVC coders," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 5, pp. 600–611, May 2006.
[9] M. G. Sarwer and L.-M. Po, "Fast bit rate estimation for mode decision of H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1402–1407, Oct. 2007.
[10] X. Zhao, J. Sun, S. Ma, and W. Gao, "Novel statistical modeling, analysis and implementation of rate-distortion estimation for H.264/AVC coders," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 5, pp. 647–660, May 2010.
[11] J. Zhu, Z. Liu, D. Wang, Q. Han, and Y. Song, "Fast prediction mode decision with Hadamard transform based rate-distortion cost estimation for HEVC intra coding," in Proc. 20th IEEE Int. Conf. Image Process. (ICIP), Sep. 2013, pp. 1977–1981.
[12] Z. Liu, S. Guo, and D. Wang, "Binary classification based linear rate estimation model for HEVC RDO," in Proc. 21st IEEE Int. Conf. Image Process. (ICIP), Oct. 2014, pp. 3676–3680.
[13] X. Li, J. An, X. Guo, and S. Lei, Adaptive CU Depth Range, document JCTVC-E090, Geneva, CH, 2011.
5102 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 11, NOVEMBER 2016
[14] N. Hu and E.-H. Yang, “Fast mode selection for HEVC intra-frame
coding with entropy coding refinement based on a transparent composite
model,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 9,
pp. 1521–1532, Sep. 2015.
[15] L. Shen, Z. Zhang, and P. An, “Fast CU size decision and mode decision
algorithm for HEVC intra coding,IEEE Trans. Consum. Electron.,
vol. 59, no. 1, pp. 207–213, Feb. 2013.
[16] L. Shen, Z. Zhang, and Z. Liu, “Effective CU size decision for
HEVC intracoding,” IEEE Trans. Image Process., vol. 23, no. 10,
pp. 4232–4241, Oct. 2014.
[17] K. Choi, S.-H. Park, and E. S. Jang, Coding Tree Pruning Based CU
Early Termination, document JCTVC-F092, Torino, IT, 2011.
[18] L. Shen, Z. Liu, X. Zhang, W. Zhao, and Z. Zhang, “An effective CU
size decision method for HEVC encoders, IEEE Trans. Multimedia,
vol. 15, no. 2, pp. 465–470, Feb. 2013.
[19] H. Zhang and Z. Ma, “Fast intra mode decision for high efficiency video
coding (HEVC),” IEEE Trans. Circuits Syst. Video Technol., vol. 24,
no. 4, pp. 660–668, Apr. 2014.
[20] Y. Zhang, Z. Li, and B. Li, “Gradient-based fast decision for intra
prediction in HEVC,” in Proc. Vis. Commun. Image Process., Nov. 2012,
pp. 1–6.
[21] B. Min and R. C. C. Cheung, “A fast CU size decision algorithm for
the HEVC intra encoder,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 25, no. 5, pp. 892–896, May 2015.
[22] S. Cho and M. Kim, “Fast CU splitting and pruning for suboptimal CU
partitioning in HEVC intra coding,” IEEE Trans. Circuits Syst. Video
Technol., vol. 23, no. 9, pp. 1555–1564, Sep. 2013.
[23] Q. Hu, X. Zhang, Z. Shi, and Z. Gao, “Neyman–Pearson-based early
mode decision for HEVC encoding, IEEE Trans. Multimedia, vol. 18,
no. 3, pp. 379–391, Mar. 2016.
[24] V. Sze, M. Budagavi, and G. J. Sullivan, Eds., High Efficiency Video
Coding (HEVC): Algorithms and Architectures. New York, NY, USA:
Springer-Verlag, Jul. 2014, pp. 343–375.
[25] G. Pastuszak and A. Abramowski, “Algorithm and architecture design
of the H.265/HEVC intra encoder,” IEEE Trans. Circuits Syst. Video
Technol., vol. 26, no. 1, pp. 210–222, Jan. 2016.
[26] J. Zhu, Z. Liu, D. Wang, Q. Han, and Y. Song, “HDTV1080p HEVC
intra encoder with source texture based CU/PU mode pre-decision,”
in Proc. 19th Asia South Pacific Design Autom. Conf. (ASP-DAC),
Jan. 2014, pp. 367–372.
[27] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech,
and time series,” in The Handbook of Brain Theory and Neural
Networks, M. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1995,
pp. 255–258.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
pp. 2278–2324, Nov. 1998.
[29] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber,
“Convolutional neural network committees for handwritten charac-
ter classification,” in Proc. 11th IEEE Int. Conf. Document Anal.
Recognit. (ICDAR), Sep. 2011, pp. 1135–1139.
[30] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural
networks for image classification,” in Proc. 25th IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3642–3649.
[31] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2015, pp. 3431–3440.
[32] X. Yu, Z. Liu, J. Liu, Y. Gao, and D. Wang, “VLSI friendly fast CU/PU
mode decision for HEVC intra encoding: Leveraging convolution neural
network,” in Proc. 22nd IEEE Int. Conf. Image Process. (ICIP),
Sep. 2015, pp. 1285–1289.
[33] Z. Liu, D. Wang, J. Zhou, and T. Ikenaga, “Lagrangian multiplier
optimization using correlations in residues,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), Mar. 2012, pp. 1185–1188.
[34] T. Berger, Rate-Distortion Theory, T. Kailath, Ed. Englewood Cliffs, NJ,
USA: Prentice-Hall, 1971.
[35] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video
compression,IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74–90,
Nov. 1998.
[36] X. Li, N. Oertel, A. Hutter, and A. Kaup, “Laplace distribution based
Lagrangian rate distortion optimization for hybrid video coding,” IEEE
Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 193–205,
Feb. 2009.
[37] A. V. Oppenheim and C. J. Weinstein, “Effects of finite register length
in digital filtering and the fast Fourier transform,” Proc. IEEE, vol. 60,
no. 8, pp. 957–976, Aug. 1972.
[38] A. V. Oppenheim, R. W. Schafer, M. T. Yoder, and W. T. Padgett,
Discrete-Time Signal Processing, vol. 2. Englewood Cliffs, NJ, USA:
Prentice-Hall, 1989.
[39] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and
L.-G. Chen, “Analysis and architecture design of variable block-size
motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 53, no. 3, pp. 578–593, Mar. 2006.
[40] P. M. Farmwald, “On the design of high performance digital arith-
metic units,” Ph.D. dissertation, Stanford Univ., Stanford, CA, USA,
Aug. 1981.
[41] F. Bossen, Common HM Test Conditions and Software Reference Con-
figurations, document JCTVC-I1100, Geneva, CH, 2012.
[42] G. Bjøntegaard, Calculation of Average PSNR Differences Between RD-
Curves, document VCEG-M33, Austin, TX, USA, Apr. 2001.
[43] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in
Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitect. (MICRO),
Dec. 2014, pp. 609–622.
Zhenyu Liu (M'07) received the B.E., M.E., and Ph.D. degrees from the Beijing Institute of Technology, China, in 1996, 1999, and 2002, respectively, all in electrical engineering. From 2002 to 2004, he held a post-doctoral position with Tsinghua University, China, where he was involved in embedded processor architecture design. From 2004 to 2009, he was a Visiting Researcher with the Graduate School of IPS, Waseda University, Japan. He joined Tsinghua University, China, in 2009, where he is currently an Associate Professor with RIIT&TNList. His research interests include signal processing, energy-efficient real-time video encoding, and application-specific processor design.
Xianyu Yu was born in 1989. He received the B.E. degree in automation from Anhui University, China, in 2012, and the M.E. degree in integrated circuit engineering from Tsinghua University, China, in 2016. He is currently with Huawei Corporation. His research interests include embedded systems and the Android framework associated with multimedia processing (audio/video/graphics).
Yuan Gao was born in 1991. He received the B.E. degree in electrical engineering from the Beijing Institute of Technology, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Computer Science, Tsinghua University. His research interests include neural network algorithms and the associated very large scale integration (VLSI) architecture design.
Shaolin Chen received the B.S. degree in automation from Anhui University, Hefei, China, in 2005, the M.S. degree in geodynamics from the College of Earth Science, Graduate University of the Chinese Academy of Sciences, Beijing, China, in 2008, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2012. He is currently a Senior Algorithm Engineer with Huawei Technologies Co., Ltd. His research interests include video encoding and image enhancement.
Xiangyang Ji (M'10) received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He joined Tsinghua University, Beijing, in 2008, where he is currently a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 refereed conference and journal papers. His current research interests include signal processing, image/video compression and communication, and intelligent imaging.
Dongsheng Wang (M'09) was born in China in 1966. He received the B.E., M.E., and Ph.D. degrees from the Harbin Institute of Technology, China, in 1989, 1992, and 1995, respectively, all in computer science. He is currently a Professor with RIIT & TNList, Tsinghua University. His research areas include many-core computer architecture, real-time application-oriented SoCs, disaster recovery, and high-availability computing.