CU Partition Mode Decision for HEVC Hardwired
Intra Encoder Using Convolution Neural Network
Zhenyu Liu, Member, IEEE, Xianyu Yu, Yuan Gao, Shaolin Chen, Xiangyang Ji, Member, IEEE,
and Dongsheng Wang, Member, IEEE
Abstract—The intensive computation of High Efficiency Video Coding (HEVC) engenders challenges for the hardwired encoder in terms of hardware overhead and power dissipation. On the other hand, the constraints in hardwired encoder design seriously degrade the efficiency of software-oriented fast coding unit (CU) partition mode decision algorithms. A fast algorithm is attributed as VLSI friendly when it possesses the following properties. First, the maximum complexity of encoding a coding tree unit (CTU) could be reduced. Second, the parallelism of the hardwired encoder should not be deteriorated. Third, the process engine of the fast algorithm must be of low hardware and power overhead. In this paper, we devise a convolution neural network (CNN) based fast algorithm to remove no less than two CU partition modes in each CTU from full rate-distortion optimization (RDO) processing, thereby reducing the encoder's hardware complexity. As our algorithm does not depend on the correlations among CU depths or spatially nearby CUs, it is friendly to parallel processing and does not deteriorate the rhythm of RDO pipelining. Experiments illustrated that an average of 61.1% intra encoding time was saved, whereas the Bjøntegaard-Delta bit-rate (BDBR) increase is 2.67%. Capitalizing on the optimal arithmetic representation, we developed a high-speed [714 MHz in the worst conditions (125 °C, 0.9 V)] and low-cost (42.5k-gate) accelerator for our fast algorithm using TSMC 65-nm CMOS technology. One accelerator can support HD1080p at 55 frames/s real-time encoding. The corresponding power dissipation was 16.2 mW at 714 MHz. Finally, our accelerator is provided with good scalability: four accelerators fulfill the throughput requirements of UltraHD-4K at 55 frames/s.
Index Terms—HEVC, fast CU/PU mode decision, CNN, VLSI, intra encoding.
Manuscript received December 27, 2015; revised June 5, 2016 and July 17, 2016; accepted August 8, 2016. Date of publication August 18, 2016; date of current version September 13, 2016. This work was supported in part by Huawei Technologies, in part by the National Science and Technology Major Project under Grant 2016YFB0200505, and in part by the National Natural Science Foundation of China under Grant 61325003. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yui-Lam Chan.
Z. Liu and D. Wang are with the Tsinghua National Laboratory for Information Science and Technology, Research Institute of Information Technology, Tsinghua University, Beijing 100084, China (e-mail: liuzhenyu73@tsinghua.edu.cn).
X. Yu is with the Institute of Microelectronics, Tsinghua University, Beijing 100084, China.
Y. Gao is with the Department of Computer Science, Tsinghua University, Beijing 100084, China.
X. Ji is with the Department of Automation, Tsinghua University, Beijing 100084, China.
S. Chen is with Huawei Technologies Company, Ltd., Shenzhen 518129, China.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIP.2016.2601264
I. INTRODUCTION
The state-of-the-art video coding standard, High Efficiency Video Coding (HEVC) [1], [2], was developed by the Joint Collaborative Team on Video Coding (JCT-VC). With equivalent subjective video quality, HEVC can double the compression ratio as compared to its predecessor H.264/AVC, especially when operating on low bit-rate, high-resolution, and low-delay communication applications [3].
HEVC introduces the quadtree-based coding data structure. The encoder splits one picture into basic square regions. Each region is denoted as one coding tree unit (CTU), of which the luma partition size can be chosen as 2N×2N (N ∈ {32, 16, 8}). The data organization of a CTU involves the quadtree structure. One CU (including the CTU) can be coded as a whole (2N×2N mode) or be split into four N×N sub-CUs. The splitting of a CU can be iterated on the basis of the signal features, until the minimum CU size is reached. When the CU size is larger than 8×8, the prediction unit (PU) size of an Intra CU is identical to its CU size; otherwise, it can adopt either the 8×8 or the 4×4 PU size. In general, when the minimum CU size is set, a larger CTU size always provides better compression efficiency, especially for high-resolution pictures. The experiments in [3] showed that, when the 8×8 minimum CU size was applied, using the 64×64 CTU size reduced the bit rate by 11.0% on average as compared to the 16×16 CTU size. With Class-A test sequences, this performance gap widened to 28.2%.
To alleviate the Intra encoding complexity, numerous algorithms have been developed for fast Intra coding mode decision. The previous methods can be classified into the following categories. The first kind of fast algorithms reduce the complexity of prediction mode rate-distortion optimization (RDO) [4]–[6]. For example, Ma et al. applied the L best modes from rough mode decision and the most probable modes (MPMs) from the neighboring coded blocks as the candidates to undergo the full RDO processing [6]. The methods of the second category, which are the continuation of the H.264/AVC low-complexity encoding algorithm study [7]–[10], simplify the complexity of rate-distortion (RD) cost computation [11], [12]. To relieve the complexity induced by CU/PU modes, the algorithms belonging to the third category dynamically skip the unpromising CU/PU depths or early terminate the CU/PU mode RDO procedure [13]–[23]. For instance, [13]–[16] defined the dynamic CU depth range on the basis of the CU depth information
of previously coded slices and CUs. The algorithm in [17] terminates the splitting of the current CU when this CU node selects SKIP mode as the best prediction mode. Shen et al. proposed dynamic CU depth range and early termination methods by exploring the coding information of spatially neighboring and co-located coding blocks [18]. In [19], Zhang exploited an early CU split termination algorithm by using the approximate aggregated RD-cost with the sub-CUs' RDO results. To improve the coding performance, machine-learning methods are employed to evaluate the critical parameters in fast CU mode decision [16], [22], [23].
On the other hand, the hardwired HEVC encoders adopt CTU pipeline processing [24]–[26], which incurs the following constraints: First, the process latency for one CTU is fixed (6.6k clock cycles in [25]). Second, a high degree of parallelism, especially at the CU level, is a must to guarantee the required throughput. A fast algorithm is considered as VLSI friendly only when it possesses the following properties:
The CTU-grain maximum computational complexity should be reduced. Only with this precondition can the encoder hardware cost, which is designed to satisfy the stipulated process latency under the maximum burden, be saved. The dynamic CU level skipping and early split termination algorithms cannot contribute to simplifying the encoder's hardware complexity. The same problem exists in the online training stage of the machine learning based algorithms [16], [22], [23].
The parallelism should not be degraded by the fast algorithm. In high-resolution video encoder designs [24], [25], the 4×4 PU is always being encoded in parallel with other CU modes to improve the throughput as far as possible. If the current CU mode is inferred from the neighboring CUs' coding information, the CU-level parallel processing is infeasible.
The hardware and power costs for implementing the fast algorithm should be low. From the system view, the fast algorithm is a part of the whole encoder. Its overheads directly affect the encoder performance.
For the above reasons, the software-oriented fast algorithms are not adopted in the real-time Intra encoders in [25] and [26].
In [26], Zhu et al. proposed to skip a certain number of CU/PU modes in the full RDO procedure according to the RD-cost estimated from a source image texture investigation. Specifically, the prediction residue of a pixel is first evaluated according to its edge strength, edge direction, and its location in the CU. Next, the RD-cost of one CU, with which the fast CU mode decision could be made, is estimated from the prediction residue evaluations. The drawbacks of [26] lie in the empirical feature extractors and the ignorance of the topology of the feature points, which degraded the compression efficiency with an average BDBR of +4.53%. The convolution neural network (CNN) [27] is a model inspired by the animal visual cortex. A CNN with appropriate architecture can be trained with gradient-based learning algorithms [28] to classify two-dimensional image patterns, with applications such as handwriting recognition [29], image classification [30], image segmentation [31], and so on. In this study, we introduce CNN to
Fig. 1. Pseudo codes of the CNN oriented XCOMPRESSCU function in the HM software (M denotes the predicted CU mode; pCurCU is the pointer of the current CU data structure; PO and QP represent the Y-component of the current CU and the quantization parameter, respectively; D_CUR is the current CU depth; D_MAX is the maximum CU depth; C_2N and C_N stand for the RD-costs of modes 2N×2N and N×N, respectively).
circumvent the aforementioned hindrances of [26]: First, the input layer's convolution kernels, which are viewed as the feature extractors, are trained from the samples instead of by a rule of thumb; second, the topology information of the obtained features can be exploited by the CNN during the classification processing. In addition, the proposed CNN for fast CU/PU mode decision is rectified according to the specific RDO task. For a VLSI friendly algorithm, it is desired that the process engine of the fast algorithm be of low hardware and power costs, as well as high throughput. To this end, we provide a reconfigurable Propagate Partial Multiply-Accumulate (PPMAC) CNN accelerator using the optimal floating-point arithmetic. As compared to the fast mode decision engine in [26], the proposed CNN accelerator reduces the chip area by 80.2%.
The rest of this article is organized as follows. Section II briefly introduces the CU encoding procedure integrated with the CNN oriented fast mode decision scheme. The proposed CNN and its training method are described in Section III. The architecture of our CNN accelerator is presented in Section IV. The experimental results are illustrated in Section V, followed by the conclusions in Section VI.
II. CNN BASED FAST CU/PU MODE DECISION
As a universal approximator of nonlinear systems, CNN has the following virtues: First, the feature extractors in CNN, derived by training, are propitious to recognizing complex singularities, such as stroke end points or corners. Second, CNN can exploit the topology information among the singularities to improve the estimation accuracy. Finally, as compared with the fully connected network, the weight scale of CNN is greatly reduced, which contributes to the reduction in hardware. These inherent properties of CNN make it favored in our CU/PU coding mode decision task.
The pseudo-code of the XCOMPRESSCU function integrated with our CNN based mode prediction algorithm is described in Fig. 1, and the main optimizations to the reference software
Fig. 2. Pseudo codes of the FASTCUMODE function (IsBoundary being true indicates the current CU is on the picture boundary; CNN_s is the mode decision CNN for the s×s CU (s ∈ {32, 16, 8}); E(s) enables the fast mode decision at the s×s CU level).
have been highlighted. We provide a unified expression of the CU and PU mode decision, because the 8×8/4×4 PU mode selection can be viewed as a special case of the 2N×2N/N×N CU mode decision. The function FASTCUMODE is the essential procedure, in which the CNN is adopted to determine the optimal CU mode. When the current CU size is 64×64, the return values of FASTCUMODE include three cases: HOMO, SPLIT, and COMB. In the other cases (the CU size is 32×32, 16×16 or 8×8), if the corresponding fast mode decision is enabled (E(s) == true), FASTCUMODE merely returns HOMO or SPLIT. If XCOMPRESSCU receives HOMO, the RD-cost of N×N is set as infinite (C_N ← ∞), and the CU splitting test is eliminated. Otherwise (SPLIT is received), C_2N is assigned infinite (C_2N ← ∞) and the 2N×2N mode search is skipped. In the case of COMB, modes 2N×2N and N×N are both traversed as in the original HM software.
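To make this control flow concrete, the following minimal Python sketch (the function names and callable interfaces are ours, not from the HM code base) shows how the FASTCUMODE verdict gates the two RDO branches:

```python
import math

HOMO, SPLIT, COMB = range(3)  # verdicts returned by FASTCUMODE

def x_compress_cu(cu, qp, fast_cu_mode, rdo_2nx2n, rdo_nxn):
    """Sketch of the CNN-gated mode decision inside XCOMPRESSCU.

    fast_cu_mode, rdo_2nx2n and rdo_nxn stand for the pre-decision
    and the two full-RDO branches of the encoder (stubs here).
    """
    verdict = fast_cu_mode(cu, qp)
    # HOMO prunes the split test; SPLIT prunes the 2Nx2N test;
    # COMB (64x64 only) keeps both, as in the original HM flow.
    c_2n = math.inf if verdict == SPLIT else rdo_2nx2n(cu, qp)
    c_n = math.inf if verdict == HOMO else rdo_nxn(cu, qp)
    return min(c_2n, c_n)  # best RD-cost among the surviving modes
```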
The pseudo codes of FASTCUMODE are depicted in Fig. 2. The inputs of FASTCUMODE include the Y-component of the current CU (PO) and the quantization parameter (QP). To reduce the computational complexity of the CNN, we apply the local averaging and sub-sampling function AVGSUB(PO) to derive the 8×8 matrix P. Specifically, if the size of PO is s×s (s ∈ {8, 16, 32, 64}), each entry in P, that is, p_{i,j}, is derived as

$$p_{i,j}=\frac{1}{(s/8)^2}\sum_{m=i\cdot s/8}^{(i+1)\cdot s/8-1}\ \sum_{n=j\cdot s/8}^{(j+1)\cdot s/8-1}po_{m,n},\qquad(1)$$

where po_{m,n} is one entry of the receptive field in PO.
FASTCUMODE first carries out a coarse edge strength analysis to detect two special cases, i.e., the homogeneous blocks and the strong-edge blocks on picture boundaries. The edge at (i, j) in P is denoted as ∇_{i,j}, where i, j ∈ [0, 6]. ∇_{i,j} is a vector with horizontal component δx_{i,j} and vertical component δy_{i,j}, which are written as

$$\begin{aligned}\delta x_{i,j}&=p_{i,j}+p_{i+1,j}-p_{i,j+1}-p_{i+1,j+1}\\ \delta y_{i,j}&=p_{i,j}+p_{i,j+1}-p_{i+1,j}-p_{i+1,j+1}.\end{aligned}\qquad(2)$$
We further define a threshold E_T as

$$E_T=\max(QP^2,\ Q^2).\qquad(3)$$

Q is the quantization step that depends on QP,

$$Q=\mathrm{MF}(QP\,\%\,6)\cdot 2^{\lfloor QP/6\rfloor},$$

in which MF = [0.625, 0.7031, 0.7969, 0.8906, 1, 1.125]. With (2) and (3), three auxiliary parameters, i.e., E_C, E_M, and E_P, are devised. E_C has the form

$$E_C=\aleph\left\{\nabla_{i,j}\,\middle|\,\delta x_{i,j}>E_T\ \text{and}\ \delta y_{i,j}>E_T\right\},\qquad(4)$$

where ℵ represents the cardinal number (what is normally referred to as the counting number) of a set. E_M is

$$E_M=\max_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right).\qquad(5)$$

And E_P is the power of all edges, i.e.,

$$E_P=\sum_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right).\qquad(6)$$
When E_P < 5E_T and E_M ≤ QP², the current CU is identified as homogeneous and the return value is HOMO. On the other hand, when the current CU is on the picture boundary (IsBoundary == true) and possesses strong edges (E_C > 2), the return value is SPLIT. The coarse analysis contributes from two aspects: First, the simple coarse analysis relieves the burden of the CNN; second, the homogeneous samples, which trap the CNN in ill-conditions, can be filtered out. The coding performance loss of the coarse analysis is BDBR = +0.42%. When the flag E(s) is set, a CU that is not identified by the coarse analysis is dispatched to the dedicated neural network (CNN_32 for the 32×32 CU, CNN_16 for the 16×16 CU, and CNN_8 for the 8×8 CU) to determine the proper mode from P and QP. The return value of CNN_s is either HOMO or SPLIT. That is, the CNN just chooses one from the two CU candidate modes for the following RDO processing. E(s) makes the tradeoff between coding quality and computational complexity, which will be described in Section V.
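The coarse stage amounts to a few array operations. Below is a compact NumPy sketch under our reading of (1)–(6); the function and label names are ours, and the E_M ≤ QP² comparison reflects our reconstruction of the homogeneity test:

```python
import numpy as np

MF = [0.625, 0.7031, 0.7969, 0.8906, 1.0, 1.125]

def avg_sub(po):
    """Equation (1): average-and-subsample an s-by-s block to 8x8."""
    s = po.shape[0]
    f = s // 8
    return po.reshape(8, f, 8, f).mean(axis=(1, 3))

def coarse_analysis(po, qp, is_boundary):
    """Coarse edge-strength test of FASTCUMODE, equations (2)-(6)."""
    p = avg_sub(po.astype(np.float64))
    # Equation (2): edge components on the 7x7 interior grid.
    dx = p[:-1, :-1] + p[1:, :-1] - p[:-1, 1:] - p[1:, 1:]
    dy = p[:-1, :-1] + p[:-1, 1:] - p[1:, :-1] - p[1:, 1:]
    q = MF[qp % 6] * 2.0 ** (qp // 6)              # quantization step
    e_t = max(qp ** 2, q ** 2)                     # equation (3)
    e_c = np.count_nonzero((dx > e_t) & (dy > e_t))  # equation (4)
    power = dx ** 2 + dy ** 2
    e_m, e_p = power.max(), power.sum()            # equations (5), (6)
    if e_p < 5 * e_t and e_m <= qp ** 2:
        return "HOMO"
    if is_boundary and e_c > 2:
        return "SPLIT"
    return None  # defer the decision to the CNN
```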
III. CNN BASED FAST CU MODE DECISION ALGORITHM
The block diagram of the CTU pipelined HEVC Intra encoder integrated with the CNN based CU mode decision engine is illustrated in Fig. 3(a), which is inherited from our previous study [26]. Similar to the typical HEVC encoders [24], [25], a two-stage CTU pipeline is applied in our encoder. With the aforementioned methods, the first stage decides the promising CU/PU competitors that will be dispatched to the second stage for full RDO processing. The second stage comprises the reconfigurable predictor, the 8×8/4×4 RDO engine, the 32×32/16×16 RDO engine, and the reconstruction datapath. One 8×8 CU is fed into the 8×8/4×4 RDO engine with the assigned PU mode (2N×2N or N×N) to calculate its best RD-cost. The other, larger CUs (16×16, 32×32 and 64×64) enter the 32×32/16×16 RDO engine. After deriving the optimal coding configuration, the reconstruction datapath produces the coding coefficients and the reconstructed pixels. Two successive CTUs can be processed
Fig. 3. Architecture of intra encoder embedded with CNN based
fast CU mode decision engine. (a) Intra Encoder Top Block Diagram.
(b) CNN Architecture.
simultaneously in the two stages. Specifically, if the (k+1)-th CTU (CTU(k+1)) is undergoing the CNN based mode decision in the first stage, the previous one (CTU(k)) is carrying out RDO and reconstruction in the second stage. For the VLSI implementation, our CNN based CU/PU pre-decision engine does not degrade the parallelism of the RDO processing in the second stage, even while removing at least two CU/PU modes.
In the CNN based CU mode decision unit, CNN_32, CNN_16, and CNN_8 adopt the same architecture, as depicted in Fig. 3(b), to share the hardwired accelerator. The proposed CNN is composed of the lower alternating convolution and max-pooling layers and the upper fully connected Multilayer Perceptron (MLP). Counting the input, our CNN comprises six layers, which are explained as follows:
The input is an 8×8 matrix P, which is derived as in (1).
The first hidden layer is a convolution layer with 6 feature maps. Each neuron is connected to a 3×3 receptive field in the input. The size of each feature map is 6×6, to prevent the convolution from falling off the boundary. The kernels in this layer are regarded as feature extractors. There are 60 trainable parameters, composed of six 3×3 kernels and six biases.
The second hidden layer performs the local maxing and sub-sampling. There is no trainable parameter.
The third hidden layer implements the second convolution. There are seventeen 1×1 output neurons, of which one input is QP. As the kernel size is 3×3, the trainable parameter number is 16 × (6×3×3+1) = 880.
The last two layers are the fully connected MLP. The fourth hidden layer consists of 11 units, and the output layer contains 2 output units. Including biases, the trainable parameter numbers in the fourth hidden layer and the output layer are 180 and 24, respectively.
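These parameter counts can be cross-checked with a few lines of Python. The bookkeeping below rests on our reading that QP enters the third and fourth layers as an extra weight-free input, so that only 16 and 10 neurons in those layers carry trainable weights; under this assumption the stated counts of 60, 880, 180 and 24 are reproduced exactly:

```python
# Trainable-parameter bookkeeping for the CNN of Fig. 3(b).
# Assumption (ours): QP is a weight-free extra unit in the third
# and fourth layers, so only 16 and 10 neurons there carry weights.
def conv_params(n_kernels, in_maps, k):          # kernels + biases
    return n_kernels * (in_maps * k * k + 1)

def fc_params(n_units, n_inputs):                # weights + biases
    return n_units * (n_inputs + 1)

layer1 = conv_params(6, 1, 3)    # 6 feature maps, 3x3 receptive field
layer3 = conv_params(16, 6, 3)   # 16 computed neurons (+ QP -> 17)
layer4 = fc_params(10, 17)       # 10 computed units  (+ QP -> 11)
output = fc_params(2, 11)        # O_2N and O_N

assert (layer1, layer3, layer4, output) == (60, 880, 180, 24)
print("total trainable parameters:", layer1 + layer3 + layer4 + output)
```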
The gradient-based learning algorithm [28] is adopted in our CNN training phase. To improve the CNN performance, the training strategy is rectified from two aspects, i.e., sample selection and target value definition, which will be described in the following sub-sections.
A. Training Sample Selection
In the training sample selection stage, we introduce the following techniques.
Firstly, the samples must not belong to the homogeneous type, which is explained in Fig. 2. We define the parameter γ as follows to indicate the edge strength in one CU,

$$\gamma=\frac{\sum_{i,j}\left(\delta x_{i,j}^2+\delta y_{i,j}^2\right)}{49\cdot Q^2},$$

where δx_{i,j} and δy_{i,j} are defined as in (2). In the CNN training phase, it is desired that the samples be evenly distributed over all output categories [32]. When the value of γ is small, there is a high probability of using the 2N×2N mode. To prevent too many homogeneous samples in training, for 32×32 and 16×16 CUs, we use the samples with γ > 0.1; for the 8×8 CU, the samples with γ > 1.3 are adopted.
To capitalize on the edge strength information, we did not normalize the input signals. The vanishing gradient problem (the gradient of the activation function approaches 0 when its input amplitude is large enough) is avoided by using the modified sigmoidal activation function.
We eliminate the samples in which the RD-cost difference between the 2N×2N mode and the N×N mode is too small. The parameter ΔRD is defined as

$$\Delta RD=\frac{C_{2N}-C_{N}}{C_{2N}+C_{N}},$$

where C_2N and C_N denominate the RD-costs of mode 2N×2N and mode N×N, respectively. The samples with |ΔRD| ≤ 0.02 are discarded. This scheme improves the compression efficiency by BDBR = −0.20% on average.
Finally, six typical video sequences, including PeopleOnStreet, BasketballDrive, BQTerrace, Cactus, Vidyo3, and Johnny, are adopted to generate the training samples.
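A sketch of the resulting sample filter is given below, reusing avg_sub and MF from the earlier coarse-analysis sketch; the function name and its interface are ours:

```python
import numpy as np

def keep_sample(po, qp, c_2n, c_n):
    """Training-sample filter sketched from Section III-A.

    po: s-by-s luma block; c_2n, c_n: RD-costs of modes 2Nx2N / NxN.
    """
    s = po.shape[0]
    p = avg_sub(po.astype(np.float64))               # equation (1)
    dx = p[:-1, :-1] + p[1:, :-1] - p[:-1, 1:] - p[1:, 1:]
    dy = p[:-1, :-1] + p[:-1, 1:] - p[1:, :-1] - p[1:, 1:]
    q = MF[qp % 6] * 2.0 ** (qp // 6)
    gamma = (dx ** 2 + dy ** 2).sum() / (49.0 * q * q)
    if gamma <= (1.3 if s == 8 else 0.1):            # too homogeneous
        return False
    delta_rd = (c_2n - c_n) / (c_2n + c_n)
    return abs(delta_rd) > 0.02                      # drop ambiguous labels
```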
B. Target Value Definition
We now derive the target output values for the nodes O_2N and O_N (as shown in Fig. 3(b)) during the training phase. Let the vector e denote the singularities in one CU. The variance of the residual transform coefficients (σ_c) is a function of the singularities, i.e., σ_c(e). From the distortion model provided in [33], the nonlinear relation between the distortion D and Q² can be formulated as

$$D=\Psi(e)\,Q^2,\qquad(7)$$
in which, (e)is a nonlinear function of variable σc(e).Onthe
basis of RD relation [34], [35]
R=1
2ln σ2
c(e)
D,
and the distortion model of (7), the rate has the form
R=1
2ln (e)
Q2,(8)
where, (e)=σ2
c(e)/(e).
With (7) and (8), for the 2N×2N mode, its distortion and rate are both nonlinear functions of e, expressed as

$$D_{2N}=\Psi_{2N}(e)\,Q^2,\qquad R_{2N}=\frac{1}{2}\ln\frac{\Gamma_{2N}(e)}{Q^2},\qquad(9)$$

where Ψ_2N(e) and Γ_2N(e) are nonlinear functions to be approximated. Consequently, we have

$$C_{2N}=D_{2N}+\lambda_{2N}R_{2N}.\qquad(10)$$

The relationships of λ_2N, Q and QP are expressed as

$$\lambda_{2N}=a_{2N}\,2^{QP/3},\qquad(11)$$

and

$$Q\approx b\,2^{QP/6}.\qquad(12)$$

Substituting (9), (11) and (12) into (10), we deduce that the logarithm of C_2N can be formulated as

$$\ln C_{2N}=\mu_0+\mu_1\,QP+\ln\left\{\mu_2\Psi_{2N}(e)+\mu_3\ln[\Gamma_{2N}(e)]+\mu_4\,QP+\mu_5\right\},\qquad(13)$$

in which {μ_i | i = 0, 1, ..., 5} represent the coefficients derived from the parameters a_2N and b. The expression of
ln C_N can be traced by analogy. That is,

$$C_N=D_N+\lambda_N R_N=\Psi_N(e)\,Q^2+\frac{a_N\,2^{QP/3}}{2}\ln\frac{\Gamma_N(e)}{Q^2}.$$

It should be noticed that, because of the changed residue distribution, the Lagrange multipliers of mode 2N×2N and mode N×N are different, λ_2N ≠ λ_N [36]. Then, we have

$$\ln C_N=\nu_0+\nu_1\,QP+\ln\left\{\nu_2\Psi_N(e)+\nu_3\ln[\Gamma_N(e)]+\nu_4\,QP+\nu_5\right\},\qquad(14)$$

in which {ν_i | i = 0, 1, ..., 5} represent the coefficients derived from the parameters a_N and b.
Because the sigmoidal-like activation function in our CNN is an odd function, it is desired that the mean of the two output nodes be zero. Consequently, the teaching output value for O_N is set as

$$O_N=\ln C_{2N}-\ln C_N,$$

and the corresponding teaching output value for O_2N is the negative of O_N. Considering the effects of QP in (13) and (14), QP is adopted as an input to the third and the fourth hidden
TABLE I
DEFINITION OF THRESHOLD τ
Fig. 4. Feature extraction kernels of the first hidden layer in the 32×32/16×16 and 8×8/4×4 CU mode decision CNNs. (a) Kernels of CNN_32. (b) Kernels of CNN_8.
layers, as shown in Fig. 3(b). Introducing QP as an input of the CNN is the prominent improvement as compared with our previous study [32]; up to a BDBR = −2.04% rate reduction could be achieved for the test of Kimono.
As |ln C_2N − ln C_N| always increases with the magnitudes of QP and e, the output of the activation function in our design should not be constrained. Therefore, the activation function in our CNN has the form

$$\Phi(x)=\begin{cases}1.716\tanh(0.667x)&|x|<\tau\\ 1.716\left[\tanh(2\tau/3)+\tanh'(2\tau/3)(x-\tau)\right]&x\geq\tau\\ 1.716\left[-\tanh(2\tau/3)+\tanh'(2\tau/3)(x+\tau)\right]&x\leq-\tau.\end{cases}$$

The values of τ on the different CNN layers are derived by experiments. Let τ_i represent the i-th layer's threshold. The candidate value set of τ_i is defined as τ_i ∈ {1.0, 1.1, 1.2, ..., 3.5}. We test the performance of various combinations of τ_i and find the optimal one. The threshold values in our design are provided in Table I.
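A direct Python rendition of this pseudo sigmoid, under our reading that the linear extensions use the tanh slope at the junction point:

```python
import math

def pseudo_sigmoid(x, tau):
    """Pseudo-sigmoid: scaled tanh inside [-tau, tau], extended
    linearly outside so the output keeps growing instead of
    saturating (our reading of the formula above)."""
    if abs(x) < tau:
        return 1.716 * math.tanh(0.667 * x)
    t = math.tanh(2.0 * tau / 3.0)
    slope = 1.0 - t * t                     # tanh'(2*tau/3)
    if x >= tau:
        return 1.716 * (t + slope * (x - tau))
    return 1.716 * (-t + slope * (x + tau))
```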
The kernels in the first convolution layers of CNN_32 and CNN_8 are illustrated in Fig. 4. The white squares represent positive poles and the black ones denote negative poles. Obviously, these kernels are efficient at detecting singularities, such as end points and corners. For example, Kernel#0 of CNN_32 is composed of a negative pole on the first row and a positive one on the second row; one corner is included in Kernel#0 of CNN_8. The computational complexity of the proposed CNN is summarized in Table II. In total, the CNN processing consumes 3000 multiplications, 3000 additions/subtractions and 244 pseudo sigmoidal functions. The experiments show that the FASTCUMODE function accounts for 2–3% of the HM-12.0 overall Intra encoding time.
Fig. 5. Noisy fixed-point convolution operation (y_{l−1}(i) and ê_{l−1}(i) represent the output signal and the additive roundoff error in the (l−1)-th layer's convolution result; x_l(i) and ε̂_l(i), coming from the activation function, are the input signal and additive noise of the l-th layer; the kernel coefficients and bias are k(i) and b; ε̂(i) is the additive rounding error from the multiplication operation; ŷ_l, which is composed of the signal y_l and the noise ê_l, indicates the convolution result of the l-th layer).
TABLE II
LAYER-WISE COMPUTATIONAL COMPLEXITY OF PROPOSED CNN
IV. VLSI IMPLEMENTATION OF CNN ACCELERATOR
Our analytical models reveal that, when the CNN scale is small, the hardware and power costs of the CNN accelerator can be greatly reduced by using the optimal arithmetic and the associated representation format. Based on the optimal arithmetic, a reconfigurable hardwired accelerator is devised to speed up the computation.
A. Effect of Finite Bit-Depth
In this section, we reveal the impacts of the finite bit-depth and the network scale on the output noise-to-signal ratio (NSR), for fixed-point and floating-point arithmetic, respectively.
Let us build the data-flow graph for the computation error analysis. In the convolution procedure, the feature maps in the current layer are obtained from the neuron outputs of the previous layer. This procedure is formulated as

$$\begin{aligned}y_l(m,n)&=\sum_{t=0}^{T-1}\sum_{p=0}^{K_l-1}\sum_{q=0}^{K_l-1}x_l(t,m+p,n+q)\,k_l(t,p,q)+b\\ x_{l+1}(m,n)&=\mathrm{sigmoid}(y_l(m,n)),\end{aligned}\qquad(15)$$

in which x_l(t, m+p, n+q) denotes the pixel (neuron output) from the t-th feature map in the previous layer, and k_l(t, p, q) and b denote the kernel coefficients and the bias, respectively. Equation (15) is composed of two stages: the biased sum of weighted input pixels (calculating y_l(m,n)), and the limitation of y_l(m,n) by the activation function, i.e., sigmoid(y_l(m,n)). x_{l+1}(m,n) is the input pixel for the following layer.
Each pixel y_l(m,n) can be viewed as the biased weighted average of all pixels in the receptive field. Without loss of generality, it is assumed that the input signals x_l are independent identically distributed (i.i.d.) random variables (r.v.) [37], [38]. Consequently, the statistical properties of y_l are invariant with respect to the indices m and n. In our error analysis, (15) is simplified with 1-D notation, written as

$$y_l=\sum_{i=0}^{I_l-1}x_l(i)\,k_l(i)+b,\qquad(16)$$

where i = t·K_l² + p·K_l + q and I_l = T·K_l².
For fixed-point arithmetic, a (1 + Â + B̂)-bit number is composed of a 1-bit sign, an Â-bit integer part, and a B̂-bit fractional part. That is,

$$\underbrace{b_{S0}}_{\text{sign}}\ \underbrace{b_{I\hat A-1}\ldots b_{I0}}_{\text{integer}}\ \blacktriangle\ \underbrace{b_{F\hat B-1}\ldots b_{F0}}_{\text{fraction}},$$

in which ▲ denotes the binary point.
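As a concrete illustration, the sketch below (our helper, not part of the accelerator RTL) rounds a real value to this sign/integer/fraction format; the saturation policy is an assumption, and Â = 5, B̂ = 6 is the 12-bit word evaluated in Section V:

```python
def quantize_fixed(x, a_hat=5, b_hat=6):
    """Round x to the (1 + a_hat + b_hat)-bit sign/integer/fraction
    format above (a_hat = 5, b_hat = 6 gives the 12-bit word)."""
    scale = 1 << b_hat                        # 2 ** b_hat
    top = (1 << a_hat) - 1.0 / scale          # largest magnitude
    x = max(-top, min(top, x))                # saturate: no overflow
    return round(x * scale) / scale

# The roundoff error is at most 2 ** -(b_hat + 1) = 1/128 for b_hat = 6:
assert abs(quantize_fixed(3.14159) - 3.14159) <= 2 ** -7
```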
Â is devised to prevent overflow when representing the maximum amplitude of the CNN input and output signals. The bit-depth of the fractional part (B̂) affects the energy of the roundoff noises. With no overflow, the additive roundoff noises in the fixed-point CNN merely come from the multiplications in (16). The error analysis model of two successive layers is illustrated in Fig. 5. As mentioned in [37] and [38], it is assumed that all signals and noises are zero-mean independent variables, and the variance of the error ε̂(i) is determined by B̂, i.e.,

$$\begin{aligned}&E[\hat\varepsilon(i)\cdot\hat\varepsilon(j)]=0\quad\text{when }i\neq j\\ &E[\hat\varepsilon^2(i)]=2^{-2\hat B}/12\\ &E[x_l(i)\cdot x_l(j)]=0\quad\text{when }i\neq j\\ &E[x_l^2(i)]=\sigma_{x_l}^2\\ &E[x_l(i)\cdot\hat\varepsilon(j)]=0\quad\forall\,i\text{ and }j,\end{aligned}$$

in which E[·] denotes the expectation of its input variable.
In general, the activation function in a CNN is non-linear. We need to investigate the effects of the activation function on the NSR. From Appendix A, we derive Lemma 1. Because the traditional sigmoid and ReLU functions, as well as the pseudo sigmoid function adopted in this article, always satisfy the premises of Lemma 1, it is concluded that the output NSR of the activation function is less than its input NSR.
Lemma 1: For an activation function Φ(y), if it possesses the properties that Φ(0) = 0, the sign of Φ′(·) is invariant, and Φ′(y₂) ≤ Φ′(y₁) when |y₂| > |y₁|, the output NSR is always not greater than the input NSR.
Let NSR_l denote the local NSR of the l-th layer.
Fig. 6. Noisy floating-point convolution operation (y_{l−1}(i) and ẽ_{l−1}(i) denote the output signal and the associated roundoff error in the (l−1)-th layer's convolution result; the kernel coefficients and bias are k(i) and b; η̃(i) and ζ̃(i) represent the multiplicative roundoff errors of the floating-point multiplication and addition operations, respectively; ỹ_l, being composed of the signal y_l and the noise ẽ_l, indicates the convolution result of the l-th layer).
Specifically, if σ_{ε̂_l} denotes the variance of the noises stemming from the l-th layer's multiplications, and σ_{y_l} represents the variance of y_l, NSR_l is defined as

$$\mathrm{NSR}_l=\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{y_l}^2}.$$

From the analysis in Appendix B, NSR_l is formulated as

$$\mathrm{NSR}_l=\frac{2^{-2\hat B}(I_l+1)}{12\,\sigma_{y_l}^2}.\qquad(17)$$

NSR_l always increases linearly with the number of noise sources (I_l), and is inversely proportional to the power of the output signal (σ²_{y_l}).
Theorem 1: For a fixed-point CNN whose activation functions meet the requirements provided in Lemma 1, the NSR upper bound of the l-th layer output is equal to the sum of all NSR_i (0 ≤ i ≤ l), i.e.,

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\sum_{i=0}^{l}\mathrm{NSR}_i.\qquad(18)$$

The detailed analysis of Theorem 1 is given in Appendix B. From (17) and (18), the final output NSR always increases with the network scale, which is determined by the parameters I_i and l.
For floating-point arithmetic, a (1 + Ã + B̃)-bit number is composed of three sections, i.e., a 1-bit sign, an Ã-bit exponent, and a B̃-bit mantissa, which is described as

$$\underbrace{b_{S0}}_{\text{sign}}\ \underbrace{b_{E\tilde A-1}\ldots b_{E0}}_{\text{exponent}}\ \underbrace{b_{M\tilde B-1}\ldots b_{M0}}_{\text{mantissa}}.\qquad(19)$$

The value of a number x represented by (19) is

$$x=(-1)^{b_{S0}}\times\left(1+\sum_{i=0}^{\tilde B-1}b_{Mi}\cdot 2^{i-\tilde B}\right)\times 2^{\left(\sum_{i=0}^{\tilde A-1}b_{Ei}\cdot 2^{i}\right)-E_{bias}},$$

where E_bias is the exponent bias. In our design, E_bias is set as 2^{Ã−1}.
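The sketch below (our helper; subnormal and overflow corner cases are simplified away) encodes a real value into this format and back, illustrating the roughly 2^{−(B̃+1)} relative roundoff error assumed in the analysis:

```python
import math

def quantize_float(x, a_tilde=4, b_tilde=5):
    """Round x to the (1 + a_tilde + b_tilde)-bit format of (19):
    sign, a_tilde-bit exponent with bias 2**(a_tilde - 1), and a
    b_tilde-bit mantissa with an implicit leading one."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0.0 else 1.0
    bias = 1 << (a_tilde - 1)
    e = math.floor(math.log2(abs(x)))
    e = max(-bias, min(bias - 1, e))          # representable exponents
    m = round(abs(x) / 2.0 ** e * 2 ** b_tilde) / 2 ** b_tilde
    return sign * m * 2.0 ** e

# Relative roundoff error stays within 2 ** -(b_tilde + 1):
x = 0.7031
assert abs(quantize_float(x) - x) / x <= 2.0 ** -6
```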
The data flow of the noisy floating-point operation based CNN is depicted in Fig. 6. Different from the fixed-point counterpart, the roundoff errors in the floating-point operations (η̃(·) and ζ̃(·) in Fig. 6) are multiplicative. For example, when we perform the first floating-point multiplication in (16), besides the signal x_l(0)k_l(0), the noise x_l(0)k_l(0)η̃(0) is also generated. In addition, all multiplications and additions in the floating-point CNN generate noises.
As in the investigation of the fixed-point CNN, it is assumed that η̃(i) and ζ̃(i) are zero-mean i.i.d. r.v. [37]. If η̃(i) and ζ̃(i) are both uniformly distributed in the range of [−2^{−(B̃+1)}, 2^{−(B̃+1)}], the variances of η̃(i) and ζ̃(i) can be expressed as

$$\sigma_{\tilde\eta}^2=\sigma_{\tilde\zeta}^2=\frac{2^{-2\tilde B}}{28}.$$
From the above properties, we can derive the local NSR, i.e., NSR_l, which is produced by the floating-point operations in the l-th layer. The mathematical analysis in Appendix C yields the expression of NSR_l as

$$\mathrm{NSR}_l=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\,\sigma_{x_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\,\sigma_{x_l}^2(i)}.$$

Consequently, the NSR upper bound for the multiple-layer floating-point CNN is given by Theorem 2. The associated investigation can be referred to in Appendix C.
Theorem 2: For a floating-point CNN whose activation functions meet the requirements provided in Lemma 1, the NSR upper bound of the l-th layer output is

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\left(\sum_{i=0}^{l}I_i+2l\right)\frac{2^{-2\tilde B}}{28}.$$
From Theorem 1 and Theorem 2, it is concluded that the NSR upper bound is linear with the network scale. Namely, for a small-scale CNN, we can reduce the bit-depth of the fixed-point fractional part and that of the floating-point mantissa part, while maintaining the desired precision. As illustrated in Section V, the 12-bit fixed-point (Â = 5 and B̂ = 6) and 10-bit floating-point (Ã = 4 and B̃ = 5) formats both fulfill the precision requirements.
As discussed in Section IV-B, the primary arithmetic modules in our CNN accelerator include nine multipliers and three four-operand adder-trees. As compared to the fixed-point arithmetic, because of the bit-depth reduction, the hardware saving gains of the floating-point multipliers outweigh the losses arising from the floating-point adder-trees, which finally leads to an average 31% gate count reduction for the arithmetic operations. Therefore, we adopted the floating-point arithmetic in our CNN accelerator design.
Fig. 7. Reconfigurable PPMAC architecture of the CNN accelerator design (LZC: leading zero counter; sigmoid: sigmoid function shared by the convolution and MLP layers; S_i, E_i and M_i (i ∈ {0, 1, 2}): sign, exponent and mantissa of the i-th operand to the 3-Operand Adder; S_R, E_R and M_R: sign, exponent and mantissa of the 3-Operand Adder result; w_ij is the kernel coefficient in the convolutional layers and the weight in the MLP layers).
TABLE III
DATA FLOW OF PROPOSED PPMAC IN CONVOLUTION PROCESSING
B. Propagate Partial Multiply-Accumulate CNN Accelerator
To accelerate the 2D convolution and MLP processing, we propose the reconfigurable PPMAC architecture, as shown in Fig. 7. The data flow of PPMAC is inspired by the Propagate Partial SAD architecture of the motion estimation accelerator in the H.264/AVC video encoder [39]. The architecture is composed of three row-wise 1D PE arrays (1D-PEA_i, i ∈ {0, 1, 2}). 1D-PEA_i is a 4-stage pipeline, including a 1-stage multiplier, a 2-stage 3-operand adder, and a last-stage 2-operand adder.
For the convolution, the last-stage adder of 1D-PEA_i (i ∈ {1, 2}) is configured to fetch the output of 1D-PEA_{i−1} as an operand. This operand of 1D-PEA_0 is the bias, i.e., b in (15). The row-wise input vectors of 1D-PEA_i are X_0 and X_1, which are both of size 1×3 (one row by three columns). The data flow of PPMAC is described in Table III. With the 4-stage pipelined 1D-PEA_i, our design achieves a 714 MHz clock speed even under the worst working conditions (0.9 V, 125 °C). It should be emphasized that our design need not be compliant with the IEEE-754 specifications, which provides more design space for hardware optimizations. For example, the 3-operand adder in our design applies the 2-stage uniform-aligning based addition architecture, as shown in Fig. 7, to reduce the circuit complexity. To maintain the precision, after the uniform aligning, we extend two least significant bits. The rounding operation is a must; otherwise, the result amplitude always decreases. With the above optimizations, 6.4–38.9% of the chip area could be saved for the 3-operand adder. Because the last-stage adder in 1D-PEA_i is used as the accumulator in the MLP, it cannot be divided into more pipeline stages. For this adder, we apply the 1-stage dual-path approach [40]. Except for the two initial cycles, the hardware utilization of the 2D PE-Array is 100% during the first convolution layer.
In the first-layer convolution, the output of the 2D PE-Array is directly dispatched to the sigmoid module, which then feeds its output to the 2×2 max-pooling stage. Because the convolution
TABLE IV
DATA FLOW OF PPMAC IN MLP PROCESSING
is column-wise and one pixel is generated in each cycle, three pixel registers (r0, r1, and r2) are equipped to store the local maximum pixels in the corresponding 2×2 receptive fields.
For the MLP layers, each neuron y_j in the current layer first forms the biased inner product of its weight vector (w_ij) and the output vector of the previous layer (x_i), which is formulated as

$$y_j'=\sum_{i=0}^{I-1}w_{ij}x_i+b_j,\qquad(20)$$

and then emits

$$y_j=\mathrm{sigmoid}(y_j').\qquad(21)$$

In the MLP computation, each 1D-PEA is configured to work independently. Specifically, the final adder in 1D-PEA_i fetches one operand from the last-stage accumulation registers. The 3-pixel inputs of 1D-PEA_i all come from X_1. The outputs of 1D-PEA_i belong to different neurons in the output layer. The initial value of the accumulation registers is set as the corresponding bias. The data flow of the Hidden 17@1×1 layer to Hidden 11@1×1 layer processing is described in Table IV. It should be noticed that the Feature maps 6@3×3 to Hidden 17@1×1 convolution operations also adopt the MLP-like data flow, instead of the one in Table III. This is mainly because the output number for each input feature map is one; considering the bubbles in initialization, the hardware utilization of the 2D PE-Array would be merely 33% with the convolution data flow. In contrast, using the MLP data flow, this metric is improved up to 84%. In the MLP procedure, 1D-PEA_i implements the MAC operations in (20), and the intermediate results are saved in the Data Buffer, as shown in Fig. 7. The sigmoid module fetches the intermediate MAC results and carries out the activation processing (expressed as (21)), which does not disturb the pipeline scheduling of 1D-PEA_i.
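The two operating modes can be summarized by the behavioural Python sketch below (our functional model, not a cycle-accurate description of the pipeline): in convolution mode the partial sum propagates through the three 1D-PEAs, while in MLP mode each 1D-PEA accumulates one neuron on its own, consuming three input terms per cycle:

```python
import numpy as np

def ppmac_conv(window, kernel, bias):
    """Convolution mode: 1D-PEA_r forms the dot product of one 3-wide
    kernel row with one input row and adds the partial sum propagated
    from 1D-PEA_(r-1); 1D-PEA_0 starts from the bias."""
    partial = bias
    for r in range(3):                       # three row-wise 1D PE arrays
        partial += float(np.dot(kernel[r], window[r]))
    return partial

def ppmac_mlp(weights, x, biases):
    """MLP mode: each 1D-PEA accumulates one neuron independently,
    three weighted inputs per step, starting from its bias register."""
    y = np.array(biases, dtype=float)
    for j in range(len(y)):                  # one neuron per 1D-PEA
        for k in range(0, len(x), 3):        # 3-operand adder per cycle
            y[j] += float(np.dot(weights[j][k:k + 3], x[k:k + 3]))
    return y
```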
V. EXPERIMENTS
In this section, we first evaluate the coding performance of the proposed fast CU/PU mode decision algorithm. Thereafter, we analyze the effects of finite-precision arithmetic on the coding quality. Finally, we investigate the throughput, the hardware cost, and the power dissipation of our CNN accelerator in detail.
A. Performance Analysis of Fast CU/PU Mode Decision
The proposed method is conducted on the HEVC reference test model HM12.0. The test platform is a Huawei RH5885, which combines an Intel® Xeon™ E7-4830-v2 2.20 GHz processor and 128.0 GB RAM. Twenty-seven typical open test sequences, including Classes A–F, were tested with QP = {22, 27, 32, 37} and the intra_main configuration [41]. To verify the robustness of our algorithm, we introduced an additional six HD1080p@25fps in-house sequences provided by Huawei. These videos cover representative surveillance scenarios, such as a building interior environment (B2Inside), open spaces with moving people (3PeopleD4, GrassTreeD4), traffic scenes (RoadB, RoadCar), and surveillance under weak light (LowlightNight).
The coding performance results are shown in Table V, in which the original HM12.0 is the standard benchmark. BP and BR in Table V stand for the average PSNR difference (BDPSNR) and the average bit-rate difference (BDBR) [42], respectively. ΔT represents the encoding time reduction, which is defined as

$$\Delta T=\frac{T_{HM}-T_{CNN}}{T_{HM}}\times 100\%,$$

with T_HM and T_CNN denoting the encoding times of the original HM12.0 and the CNN based counterpart, respectively. Recalling the pseudo codes in Fig. 2, the configuration indicates the specific CU/PU depths at which the fast mode decision is performed. In our evaluations, four configurations, i.e., [E(32), E(16), E(8)] = {[0,0,1], [0,1,1], [1,0,1], [1,1,1]}, were tested. Obviously, our algorithm possesses computational scalability. That is, we can reduce more complexity by sacrificing compression efficiency. It was observed that the configuration [0,0,1] achieved the best coding quality, i.e., BDBR = +1.54% on average, while the encoding time reduction was merely 43.7%. In contrast, the time saving of [1,1,1] was 72.0%; meanwhile, its coding performance was degraded with BDBR = +4.79%. The configuration [1,0,1] achieved a good balance between the coding efficiency and the computational reduction. In this context, the BDBR increase was +2.67%, while 61.1% of the encoding complexity was saved. As compared with the counterpart [0,1,1], the setting [1,0,1] had the advantages of both coding quality and processing speed.
The performance comparisons of the proposed algorithm and other recent works [14], [16], [19]–[21], [26], [32] are illustrated in Table VI. ΔT_C and ΔT_P stand for the encoding time reductions from the fast CU/PU mode and the prediction mode methods, respectively. BR_MAX represents the maximum BDBR increase. VLSI indicates the algorithm's friendliness to
TABLE V
CODING PERFORMANCE ANALYSIS OF OUR FAST CU/PU MODE DECISION UNDER VARIOUS CONFIGURATIONS
the hardwired encoder design. All performance statistics of the previous works are cited from the references.
The software-oriented fast algorithms, i.e., [14], [16], [19], [21], provide superior coding quality to ours. However, for the hardwired encoder design, the above methods have the following hindrances: First, the CTU-grain maximum encoding complexity is not pruned, which prevents the fast algorithm from contributing to the optimization of the encoder's hardware cost. For instance, to maintain the coding quality, [19] and [21] define conservative thresholds. Only when one CU satisfies these thresholds is its CU splitting trial terminated; otherwise, the CU mode decision is carried out as in the original full RDO procedure. We can see that the fast CU mode method in [19] merely reduced 26% of the overall encoding time. The algorithm of [14] has a similar problem: the maximum encoding complexity of one CTU, which occurs in the parameter estimation phase, does not diminish. Second, the CU-level data dependency in [16], which uses the current CU depth coding information to prune the RDO processing of the deeper CU levels, encumbers the encoding parallelism. Namely, if this algorithm is adopted, the encoding throughput will be degraded.
The methods of [20], [26], [32] are VLSI friendly, because they merely employ source image texture analysis to predict the promising CU mode competitors. As depicted in Fig. 3, since the CU mode pre-decision engine can be allocated to the preceding CTU pipeline stage, the rhythm of the following RDO based CTU encoding will not be hindered. However, because [20] and [26] did not use the topology information of the feature points, their coding efficiency losses are obvious, i.e., BDBR = +5.10% and BDBR = +4.53%, respectively. In the
Fig. 8. CU/PU partition comparisons between HM-12.0 and our fast algorithm (BasketballPass, QP = 22). (a) HM-12.0. (b) Proposed algorithm.
TABLE VI
PERFORMANCE COMPARISONS BETWEEN THE PROPOSED SOLUTION AND EXISTING ALGORITHMS
work [32], we adopted a CNN to resolve the above problem, which ameliorates the coding loss to BDBR = +3.39%. As compared to [32], the primary improvements of our most recent study include the introduction of QP into the CNN and the refinement of the training strategies. Consequently, the average BDBR is merely +2.67%. The worst case comes from Vidyo3, for which the BDBR increases of this study and the counterpart [32] are +5.01% and +5.72%, respectively.
The visualized CU/PU partition comparisons between the HM12.0 benchmark and ours (with the configuration [E(32), E(16), E(8)] = [1,0,1]) are illustrated in Fig. 8. Our results for CU blocks with strong singularities, such as the blocks on object outlines, agree well with the benchmarks. In contrast, as highlighted with the red borders, for the blocks lacking features, especially on picture boundaries, the probability of false prediction increases.
B. Effects of Finite Bit-Depth Arithmetic
We analyze the effects of finite precision on the video coding quality, for both fixed-point and floating-point arithmetic. As aforementioned, the computational precision is improved by increasing the bit-depth. On the other hand, the CNN hardware cost also rises proportionally to the bit-depth. Our target is to find the minimum bit-depth that satisfies the coding quality requirements. Twenty representation formats were
TABLE VII
CODING QUALITY ANALYSIS WITH DIFFERENT REPRESENTATION FORMATS (BDBR UNIT: %)
tested for floating-point and fixed-point arithmetic, respectively, as shown in Table VII. With a 4-bit exponent and a 5-bit mantissa, the floating-point arithmetic achieved a compression ratio competitive with the traditional 32-bit floating-point computation. The sequence-wise statistics are provided in Table VIII. As compared with Table V, the fluctuation in performance incurred by the short bit-depth is less than BDBR = 0.6%, observed in the sequence Kimono. For fixed-point arithmetic, comparable coding quality is obtained with the 12-bit representation format, in which Â = 5 and B̂ = 6. Experiments revealed that the 10-bit floating-point CNN accelerator could save 21% of the hardware cost as compared with the 12-bit fixed-point counterpart.
C. Performance Analysis of CNN Accelerator
With TSMC 65nm CMOS technology, our CNN accelerator was described in Verilog-HDL and synthesized with the Synopsys Design Compiler. IC Compiler was adopted to implement the place-and-route job.
The synaptic weight storage is one hardware-consuming component of a CNN VLSI implementation. In [43], eDRAM is applied in a general-purpose CNN processor to hold all synapse values on-chip. In our design, if we use a register file to buffer the weights, the corresponding hardware cost is 23.2k gates. As the weight coefficients were trained off-line, we adopted combinational logic to realize the synaptic weight storage, which merely accounts for 8.9k gates. As compared with the traditional on-chip register-file approach, 14.3k gates were saved.
TABLE VIII
CODING QUALITY ANALYSIS OF FLOATING-POINT WITH Ã = 4, B̃ = 5 AND [E(32), E(16), E(8)] = [1,0,1]
TABLE IX
HARDWARE COST AND POWER CONSUMPTION ANALYSIS OF PROPOSED CNN ACCELERATOR (TSMC 65nm CMOS TECHNOLOGY, WORST CONDITIONS: 125 °C, 0.9 V)
The hardware cost and power consumption statistics of the proposed CNN accelerator design under typical working frequencies are illustrated in Table IX. Under the worst working conditions (0.9 V, 125 °C), the peak clock speed is 714 MHz and the associated gate count is 42.5k. The CNN accelerator is merely 2.1–3.9% of the overall encoder cost, which is
Fig. 9. Data flow of the noisy signal propagating through the activation function.
2055k gates and 1086k gates in [25] and [26], respectively. The power consumption of our design is 16.2 mW at 714 MHz. It takes 372 cycles to process one 8×8 image block. One CNN accelerator can fulfill the throughput of HD1080p@55fps real-time encoding. Because our algorithm is based on source texture analysis, multiple 8×8 image blocks can be processed in parallel. When facing higher resolution specifications, the throughput requirements are met by using parallelism. For example, the real-time processing of 4K@55fps videos can be handled by using four of the proposed CNN accelerators.
VI. CONCLUSION
This paper presents the CNN based fast CU/PU mode decision to reduce the maximum Intra coding complexity of one CTU for hardwired HEVC encoder design. Specifically, the CNN investigates the textures of one CU, and then determines the promising candidate in each of the 32×32/16×16 and 8×8/4×4 CU/PU mode pairs. The contributions of our proposals include: (1) Because the maximum number of CU/PU candidate modes in one CTU is reduced, the corresponding VLSI encoder hardware complexity is ameliorated; (2) With the CTU pipeline architecture, the parallelism of the critical RDO processing is not deteriorated by our fast algorithm; (3) On the basis of the theoretical analysis, a reconfigurable CNN accelerator is developed using the optimal floating-point arithmetic, which greatly reduces the hardware overhead of our algorithm. The experiments demonstrate that, on average, 61.1% of the Intra encoding complexity is reduced, whereas the incurred compression loss is merely BDBR = +2.67%, or equivalently a BDPSNR = −0.15 dB quality loss. Using TSMC 65nm CMOS technology, one hardwired CNN accelerator, which achieves a 714 MHz clock speed in the worst conditions, is implemented with 42.5k logic gates. The power dissipation of our accelerator is 16.2 mW at 714 MHz. One CNN accelerator fulfills the throughput requirement of HD1080p@55fps real-time encoding, and higher performance can be achieved by applying parallelism.
APPENDIX A
NSR PROPAGATION PROPERTY OF ACTIVATION FUNCTION
As shown in Fig. 9, if the input signal y and the additive noise e are independent zero-mean r.v., the output signal and the corresponding noise of the activation function are labeled as x and ε, respectively. It is reasonable to assume that the variance of e, i.e., σ_e, is much less than that of the input signal, i.e., σ_y. With the aid of Taylor's theorem, we can derive that

$$x=\Phi(y),\qquad \varepsilon=\Phi'(y)\cdot e.\qquad(A.1)$$
For any y,x=(y)can be formulated as
x=(0)+N1
k=0(k) ·
where, =y/Nand N→∞. Using the properties of (·)
claimed in lemma1, we deduce
|x|(y) N1
k=0=(y)y.(A.2)
From (A.1) and (A.2), we obtain that
Eε2
x2=Ee2((y))2
2(y)
Ee2((y))2
y2((y))2
=Ee2
y2.
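This property is easy to probe numerically. The sketch below (our check, using a vectorized version of the pseudo_sigmoid sketch from Section III) pushes a noisy Gaussian signal through the activation and verifies that the NSR does not grow:

```python
import numpy as np

rng = np.random.default_rng(7)

def pseudo_sigmoid_vec(y, tau):
    """Vectorized pseudo-sigmoid (see the sketch in Section III)."""
    t = np.tanh(2.0 * tau / 3.0)
    slope = 1.0 - t * t
    inner = 1.716 * np.tanh(0.667 * y)
    upper = 1.716 * (t + slope * (y - tau))
    lower = 1.716 * (-t + slope * (y + tau))
    return np.where(y >= tau, upper, np.where(y <= -tau, lower, inner))

def activation_nsr(tau=1.8, sigma_y=2.0, sigma_e=0.05, n=200_000):
    """Compare the NSR before and after the activation (Lemma 1)."""
    y = rng.normal(0.0, sigma_y, n)
    e = rng.normal(0.0, sigma_e, n)
    x = pseudo_sigmoid_vec(y, tau)
    x_noisy = pseudo_sigmoid_vec(y + e, tau)
    nsr_in = np.mean(e ** 2) / np.mean(y ** 2)
    nsr_out = np.mean((x_noisy - x) ** 2) / np.mean(x ** 2)
    assert nsr_out <= nsr_in   # the activation does not amplify the NSR
    return nsr_in, nsr_out

print("NSR in %.3e -> NSR out %.3e" % activation_nsr())
```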
APPENDIX B
ROUNDING ERROR ANALYSIS OF FIXED-POINT CNN
Let the input noise signals ε̂_l(i) be i.i.d. r.v. with variance σ_{ε̂_l}. From the data flow in Fig. 5, we have

$$\sigma_{y_l}^2=E\left[\left(\sum_{i=0}^{I_l-1}x_l(i)k_l(i)\right)^2\right].\qquad(B.1)$$

Using E[x_l²(i)] = σ²_{x_l} and E[x_l(i)·x_l(j)] = 0 (if i ≠ j), (B.1) is recast as

$$\sigma_{y_l}^2=\sigma_{x_l}^2\cdot\|k_l\|^2,\qquad(B.2)$$

where ‖k_l‖² = Σ_{i=0}^{I_l−1} k_l²(i). Assuming the roundoff error ε̂(i) is uniformly distributed in [−2^{−(B̂+1)}, 2^{−(B̂+1)}], E[ε̂²(i)] is equal to 2^{−2B̂}/12. The variance of ê_l is formulated as

$$\begin{aligned}\sigma_{\hat e_l}^2&=E\left[\left(\sum_{i=0}^{I_l}\hat\varepsilon(i)+\sum_{i=0}^{I_l-1}\hat\varepsilon_l(i)k_l(i)\right)^2\right]\\ &=\sum_{i=0}^{I_l}E[\hat\varepsilon^2(i)]+\sum_{i=0}^{I_l-1}E[\hat\varepsilon_l^2(i)]\,k_l^2(i)\\ &=\frac{2^{-2\hat B}}{12}(I_l+1)+\sigma_{\hat\varepsilon_l}^2\|k_l\|^2.\end{aligned}\qquad(B.3)$$

The output NSR of the l-th layer is defined as

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}.\qquad(B.4)$$

Substituting (B.2) and (B.3) into (B.4), we obtain

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}=\frac{2^{-2\hat B}(I_l+1)}{12\,\sigma_{y_l}^2}+\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{x_l}^2}.\qquad(B.5)$$

It should be noted that the first term in (B.5) is the NSR generated by the local noises. In addition, with the conclusion of Lemma 1, the second term in (B.5) is always less than the output NSR of the (l−1)-th layer, i.e.,

$$\frac{\sigma_{\hat\varepsilon_l}^2}{\sigma_{x_l}^2}\leq\frac{\sigma_{\hat e_{l-1}}^2}{\sigma_{y_{l-1}}^2}.$$

In consequence, the upper bound of the l-th layer's NSR is derived as

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\mathrm{NSR}_l+\frac{\sigma_{\hat e_{l-1}}^2}{\sigma_{y_{l-1}}^2},\qquad(B.6)$$

in which the variable NSR_l = 2^{−2B̂}(I_l+1)/(12σ²_{y_l}) is the local NSR. Using (B.6) and noticing σ²_{ê_0}/σ²_{y_0} = NSR_0, the NSR upper bound of the l-th layer has the simple expression

$$\frac{\sigma_{\hat e_l}^2}{\sigma_{y_l}^2}\leq\sum_{i=0}^{l}\frac{2^{-2\hat B}(I_i+1)}{12\,\sigma_{y_i}^2}=\sum_{i=0}^{l}\mathrm{NSR}_i.$$
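A quick Monte-Carlo experiment (our sketch, with arbitrary kernel and signal statistics) reproduces the local-NSR expression: rounding the I_l products and the bias to B̂ fraction bits yields a measured NSR that agrees with 2^{−2B̂}(I_l+1)/(12σ²_{y_l}) to within a few percent:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_nsr(i_l=54, b_hat=6, trials=50_000):
    """Measure the local NSR of one fixed-point convolution layer by
    rounding the i_l products and the bias to b_hat fraction bits,
    then compare with 2**(-2*b_hat) * (i_l + 1) / (12 * var(y))."""
    q = 2.0 ** -b_hat                         # quantization step
    k = rng.uniform(-1.0, 1.0, i_l)           # fixed kernel coefficients
    x = rng.normal(0.0, 0.3, (trials, i_l))   # i.i.d. input pixels
    b = 0.437                                 # arbitrary bias value
    y = x @ k + b                             # exact convolution result
    y_hat = (np.round(x * k / q) * q).sum(axis=1) + np.round(b / q) * q
    measured = np.mean((y_hat - y) ** 2) / np.var(y)
    predicted = q * q * (i_l + 1) / (12.0 * np.var(y))
    return measured, predicted

print("measured %.3e / predicted %.3e" % local_nsr())
```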
APPENDIX C
ROUNDING ERROR ANALYSIS OF FLOATING-POINT CNN
From the data flow depicted in Fig. 6, we have

$$\tilde y_l=\sum_{i=0}^{I_l-1}\Theta(i)\,k_l(i)\left[x_l(i)+\tilde\varepsilon_l(i)\right]+\Theta(I_l)\,b,$$

where Θ(·) is defined as

$$\Theta(i)=\begin{cases}\left(1+\tilde\eta(0)\right)\prod_{j=1}^{I_l}\left(1+\tilde\zeta(j)\right)&\text{for }i=0\\ \left(1+\tilde\eta(i)\right)\prod_{j=i}^{I_l}\left(1+\tilde\zeta(j)\right)&\text{otherwise}.\end{cases}$$

Then, it is straightforward to obtain

$$\begin{aligned}\tilde e_l&=\sum_{i=0}^{I_l-1}(\Theta(i)-1)k_l(i)x_l(i)+\sum_{i=0}^{I_l-1}k_l(i)\tilde\varepsilon_l(i)\\ &\quad+\sum_{i=0}^{I_l-1}(\Theta(i)-1)k_l(i)\tilde\varepsilon_l(i)+(\Theta(I_l)-1)\,b.\end{aligned}\qquad(C.1)$$

From the zero-mean and i.i.d. properties of η̃(i) and ζ̃(i), we have

$$E[\Theta(i)]=1,\qquad E\left[(\Theta(i)-1)^2\right]=E\left[\Theta^2(i)\right]-1.\qquad(C.2)$$

As mentioned in Section IV-A, because the mantissa and the associated roundoff error are assumed to possess uniform distributions in [1, 2] and [−2^{−(B̃+1)}, 2^{−(B̃+1)}] respectively, the variances of η̃(i) and ζ̃(i) are the same:

$$E[\tilde\eta^2(i)]=E[\tilde\zeta^2(i)]=\frac{1}{2^{-\tilde B}}\cdot\frac{\int_{-2^{-(\tilde B+1)}}^{2^{-(\tilde B+1)}}x^2\,dx}{\int_{1}^{2}x^2\,dx}=\frac{2^{-2\tilde B}}{28}.$$

In consequence, we obtain

$$E[\Theta^2(i)]=\begin{cases}\left(1+\frac{2^{-2\tilde B}}{28}\right)^{I_l+1}&\text{for }i=0\\ \left(1+\frac{2^{-2\tilde B}}{28}\right)^{I_l+2-i}&\text{otherwise}.\end{cases}\qquad(C.3)$$

When I_l ≫ 1, we can approximate E[Θ²(0)] with the general expression of E[Θ²(i)].
From (C.1) and (C.2), we can see that ẽ_l is zero-mean and its variance is expressed as

$$\begin{aligned}\sigma_{\tilde e_l}^2&=\sum_{i=0}^{I_l-1}\left(E[\Theta^2(i)]-1\right)k_l^2(i)\sigma_{x_l}^2(i)+\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)\\ &\quad+\sum_{i=0}^{I_l-1}\left(E[\Theta^2(i)]-1\right)k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)+\left(E[\Theta^2(I_l)]-1\right)b^2.\end{aligned}\qquad(C.4)$$

In addition, when 2^{−2B̃}/28 ≪ 1, which always holds for B̃ > 1, (C.3) can be approximated by discarding the high-order terms of 2^{−2B̃}/28. That is,

$$E[\Theta^2(i)]\approx 1+(I_l+2-i)\,\frac{2^{-2\tilde B}}{28}.\qquad(C.5)$$

Substituting (C.5) into (C.4) yields

$$\begin{aligned}\sigma_{\tilde e_l}^2&=\sum_{i=0}^{I_l-1}\frac{2^{-2\tilde B}}{28}(I_l+2-i)\,k_l^2(i)\sigma_{x_l}^2(i)+\frac{2^{-2\tilde B}}{28}(I_l+2-I_l)\,b^2\\ &\quad+\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)+\sum_{i=0}^{I_l-1}\frac{2^{-2\tilde B}}{28}(I_l+2-i)\,k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i).\end{aligned}\qquad(C.6)$$

Because 2^{−2B̃}/28 ≪ 1 and σ²_{ε̃_l}(i) ≪ σ²_{x_l}(i), we discard the last term in (C.6) to get the expression of the l-th layer's NSR as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\sigma_{x_l}^2(i)+\frac{2b^2}{I_l+2}}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.7)$$

The first term in (C.7), which is labelled as NSR_l, represents the effect of the local arithmetic operations. When I_l ≫ 1, the effect of b² can be neglected. Therefore, (C.7) is recast as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}=\mathrm{NSR}_l+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)},\quad \mathrm{NSR}_l=\frac{2^{-2\tilde B}(I_l+2)}{28}\cdot\frac{\sum_{i=0}^{I_l-1}\left(1-\frac{i}{I_l+2}\right)k_l^2(i)\sigma_{x_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.8)$$

It should be emphasized that the second fraction in NSR_l must be less than 1. Therefore, the upper bound of σ²_{ẽ_l}/σ²_{y_l} is

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\frac{2^{-2\tilde B}(I_l+2)}{28}+\frac{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{\tilde\varepsilon_l}^2(i)}{\sum_{i=0}^{I_l-1}k_l^2(i)\sigma_{x_l}^2(i)}.\qquad(C.9)$$

By mathematical induction, we can derive the concise expression of the output NSR upper bound as

$$\frac{\sigma_{\tilde e_l}^2}{\sigma_{y_l}^2}\leq\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{l}I_i+2l\right).\qquad(C.10)$$

Proof: When l = 0, because σ²_{ε̃_l}(i) = 0, (C.10) is true. Assuming that (C.10) holds when l = t−1, we just need to demonstrate the truth of (C.10) when l = t. From Lemma 1 and the assumption for l = t−1, we have

$$\frac{\sigma_{\tilde\varepsilon_t}^2(i)}{\sigma_{x_t}^2(i)}\leq\frac{\sigma_{\tilde e_{t-1}}^2}{\sigma_{y_{t-1}}^2}\leq\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t-1}I_i+2(t-1)\right).\qquad(C.11)$$

With (C.11) and (C.9), when l = t, we have

$$\frac{\sigma_{\tilde e_t}^2}{\sigma_{y_t}^2}\leq\frac{2^{-2\tilde B}(I_t+2)}{28}+\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t-1}I_i+2(t-1)\right)=\frac{2^{-2\tilde B}}{28}\left(\sum_{i=0}^{t}I_i+2t\right).$$

Therefore, the theorem is proved.
REFERENCES
[1] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and T. Wiegand, High Efficiency Video Coding (HEVC) Text Specification Draft 10, document JCTVC-L1003, Geneva, CH, 2013.
[2] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
[3] J.-R. Ohm, G. J. Sullivan, H. Schwarz, T. K. Tan, and T. Wiegand, "Comparison of the coding efficiency of video coding standards—Including high efficiency video coding (HEVC)," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1669–1684, Dec. 2012.
[4] Y. Piao, J. Min, and J. Chen, Encoder Improvement of Unified Intra Prediction, document JCTVC-C207, Guangzhou, CN, 2010.
[5] L. Zhao, L. Zhang, X. Zhao, S. Ma, D. Zhao, and W. Gao, Further Encoder Improvement of Intra Mode Decision, document JCTVC-D283, Daegu, South Korea, 2011.
[6] S. Ma, S. Wang, S. Wang, L. Zhao, Q. Yu, and W. Gao, "Low complexity rate distortion optimization for HEVC," in Proc. Data Compress. Conf. (DCC), Mar. 2013, pp. 73–82.
[7] Q. Chen and Y. He, "A fast bits estimation method for rate-distortion optimization in H.264/AVC," in Proc. Picture Coding Symp. (PCS), Dec. 2004, pp. 133–134.
[8] Y.-K. Tu, J.-F. Yang, and M.-T. Sun, "Efficient rate-distortion estimation for H.264/AVC coders," IEEE Trans. Circuits Syst. Video Technol., vol. 16, no. 5, pp. 600–611, May 2006.
[9] M. G. Sarwer and L.-M. Po, "Fast bit rate estimation for mode decision of H.264/AVC," IEEE Trans. Circuits Syst. Video Technol., vol. 17, no. 10, pp. 1402–1407, Oct. 2007.
[10] X. Zhao, J. Sun, S. Ma, and W. Gao, "Novel statistical modeling, analysis and implementation of rate-distortion estimation for H.264/AVC coders," IEEE Trans. Circuits Syst. Video Technol., vol. 20, no. 5, pp. 647–660, May 2010.
[11] J. Zhu, Z. Liu, D. Wang, Q. Han, and Y. Song, "Fast prediction mode decision with Hadamard transform based rate-distortion cost estimation for HEVC intra coding," in Proc. 20th IEEE Int. Conf. Image Process. (ICIP), Sep. 2013, pp. 1977–1981.
[12] Z. Liu, S. Guo, and D. Wang, "Binary classification based linear rate estimation model for HEVC RDO," in Proc. 21st IEEE Int. Conf. Image Process. (ICIP), Oct. 2014, pp. 3676–3680.
[13] X. Li, J. An, X. Guo, and S. Lei, Adaptive CU Depth Range, document JCTVC-E090, Geneva, CH, 2011.
5102 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 25, NO. 11, NOVEMBER 2016
[14] N. Hu and E.-H. Yang, “Fast mode selection for HEVC intra-frame
coding with entropy coding refinement based on a transparent composite
model,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 9,
pp. 1521–1532, Sep. 2015.
[15] L. Shen, Z. Zhang, and P. An, “Fast CU size decision and mode decision
algorithm for HEVC intra coding,IEEE Trans. Consum. Electron.,
vol. 59, no. 1, pp. 207–213, Feb. 2013.
[16] L. Shen, Z. Zhang, and Z. Liu, “Effective CU size decision for
HEVC intracoding,” IEEE Trans. Image Process., vol. 23, no. 10,
pp. 4232–4241, Oct. 2014.
[17] K. Choi, S.-H. Park, and E. S. Jang, Coding Tree Pruning Based CU
Early Termination, document JCTVC-F092, Torino, IT, 2011.
[18] L. Shen, Z. Liu, X. Zhang, W. Zhao, and Z. Zhang, “An effective CU
size decision method for HEVC encoders, IEEE Trans. Multimedia,
vol. 15, no. 2, pp. 465–470, Feb. 2013.
[19] H. Zhang and Z. Ma, “Fast intra mode decision for high efficiency video
coding (HEVC),” IEEE Trans. Circuits Syst. Video Technol., vol. 24,
no. 4, pp. 660–668, Apr. 2014.
[20] Y. Zhang, Z. Li, and B. Li, “Gradient-based fast decision for intra
prediction in HEVC,” in Proc. Vis. Commun. Image Process., Nov. 2012,
pp. 1–6.
[21] B. Min and R. C. C. Cheung, “A fast CU size decision algorithm for
the HEVC intra encoder,” IEEE Trans. Circuits Syst. Video Technol.,
vol. 25, no. 5, pp. 892–896, May 2015.
[22] S. Cho and M. Kim, “Fast CU splitting and pruning for suboptimal CU
partitioning in HEVC intra coding,” IEEE Trans. Circuits Syst. Video
Technol., vol. 23, no. 9, pp. 1555–1564, Sep. 2013.
[23] Q. Hu, X. Zhang, Z. Shi, and Z. Gao, “Neyman–Pearson-based early
mode decision for HEVC encoding, IEEE Trans. Multimedia, vol. 18,
no. 3, pp. 379–391, Mar. 2016.
[24] V. Sze, M. Budagavi, and G. J. Sullivan, Eds., High Efficiency Video
Coding (HEVC): Algorithms and Architectures. New York, NY, USA:
Springer-Verlag, Jul. 2014, pp. 343–375.
[25] G. Pastuszak and A. Abramowski, “Algorithm and architecture design
of the H.265/HEVC intra encoder,” IEEE Trans. Circuits Syst. Video
Technol., vol. 26, no. 1, pp. 210–222, Jan. 2016.
[26] J. Zhu, Z. Liu, D. Wang, Q. Han, and Y. Song, “HDTV1080p HEVC
intra encoder with source texture based CU/PU mode pre-decision,”
in Proc. 19th Asia South Pacific Design Autom. Conf. (ASP-DAC),
Jan. 2014, pp. 367–372.
[27] Y. LeCun and Y. Bengio, “Convolutional networks for images, speech,
and time series,” in The Handbook of Brain Theory and Neural
Networks, M. Arbib, Ed. Cambridge, MA, USA: MIT Press, 1995,
pp. 255–258.
[28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based
learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11,
pp. 2278–2324, Nov. 1998.
[29] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber,
“Convolutional neural network committees for handwritten charac-
ter classification,” in Proc. 11th IEEE Int. Conf. Document Anal.
Recognit. (ICDAR), Sep. 2011, pp. 1135–1139.
[30] D. Ciregan, U. Meier, and J. Schmidhuber, “Multi-column deep neural
networks for image classification,” in Proc. 25th IEEE Conf. Comput.
Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3642–3649.
[31] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2015, pp. 3431–3440.
[32] X. Yu, Z. Liu, J. Liu, Y. Gao, and D. Wang, “VLSI friendly fast CU/PU
mode decision for HEVC intra encoding: Leveraging convolution neural
network,” in Proc. 22nd IEEE Int. Conf. Image Process. (ICIP),
Sep. 2015, pp. 1285–1289.
[33] Z. Liu, D. Wang, J. Zhou, and T. Ikenaga, “Lagrangian multiplier
optimization using correlations in residues,” in Proc. IEEE Int. Conf.
Acoust., Speech, Signal Process. (ICASSP), Mar. 2012, pp. 1185–1188.
[34] T. Berger, Rate-Distortion Theory, T. Kailath, Ed. Englewood Cliffs, NJ,
USA: Prentice-Hall, 1971.
[35] G. J. Sullivan and T. Wiegand, “Rate-distortion optimization for video
compression,IEEE Signal Process. Mag., vol. 15, no. 6, pp. 74–90,
Nov. 1998.
[36] X. Li, N. Oertel, A. Hutter, and A. Kaup, “Laplace distribution based
Lagrangian rate distortion optimization for hybrid video coding,” IEEE
Trans. Circuits Syst. Video Technol., vol. 19, no. 2, pp. 193–205,
Feb. 2009.
[37] A. V. Oppenheim and C. J. Weinstein, “Effects of finite register length
in digital filtering and the fast Fourier transform,” Proc. IEEE, vol. 60,
no. 8, pp. 957–976, Aug. 1972.
[38] A. V. Oppenheim, R. W. Schafer, M. T. Yoder, and W. T. Padgett,
Discrete-Time Signal Processing, vol. 2. Englewood Cliffs, NJ, USA:
Prentice-Hall, 1989.
[39] C.-Y. Chen, S.-Y. Chien, Y.-W. Huang, T.-C. Chen, T.-C. Wang, and
L.-G. Chen, “Analysis and architecture design of variable block-size
motion estimation for H.264/AVC,” IEEE Trans. Circuits Syst. I, Reg.
Papers, vol. 53, no. 3, pp. 578–593, Mar. 2006.
[40] P. M. Farmwald, “On the design of high performance digital arith-
metic units,” Ph.D. dissertation, Stanford Univ., Stanford, CA, USA,
Aug. 1981.
[41] F. Bossen, Common HM Test Conditions and Software Reference Con-
figurations, document JCTVC-I1100, Geneva, CH, 2012.
[42] G. Bjøntegaard, Calculation of Average PSNR Differences Between RD-
Curves, document VCEG-M33, Austin, TX, USA, Apr. 2001.
[43] Y. Chen et al., “DaDianNao: A machine-learning supercomputer,” in
Proc. 47th Annu. IEEE/ACM Int. Symp. Microarchitect. (MICRO),
Dec. 2014, pp. 609–622.
Zhenyu Liu (M'07) received the B.E., M.E., and Ph.D. degrees from the Beijing Institute of Technology, China, in 1996, 1999, and 2002, respectively, all in electrical engineering. From 2002 to 2004, he held a post-doctoral position with Tsinghua University, China, where he was involved in embedded processor architecture design. From 2004 to 2009, he was a Visiting Researcher with the Graduate School of IPS, Waseda University, Japan. He joined Tsinghua University, China, in 2009, where he is currently an Associate Professor with RIIT&TNList. His research interests include signal processing, energy-efficient real-time video encoding, and application-specific processor design.
Xianyu Yu was born in 1989. He received the B.E. degree in automation from Anhui University, China, in 2012, and the M.E. degree in integrated circuit engineering from Tsinghua University, China, in 2016. He is currently with Huawei Corporation. His research interests include embedded systems and the Android framework associated with multimedia processing (audio/video/graphics).
Yuan Gao was born in 1991. He received the B.E. degree in electrical engineering from the Beijing Institute of Technology, China, in 2012. He is currently pursuing the Ph.D. degree with the Department of Computer Science, Tsinghua University. His research interests include neural network algorithms and the associated very large scale integration (VLSI) architecture design.
Shaolin Chen received the B.S. degree in automation from Anhui University, Hefei, China, in 2005, the M.S. degree in geodynamics from the College of Earth Science, Graduate University of the Chinese Academy of Sciences, Beijing, China, in 2008, and the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2012. He is currently a Senior Algorithm Engineer with Huawei Technologies Co., Ltd. His research interests include video encoding and image enhancement.
Xiangyang Ji (M'10) received the B.S. degree in materials science and the M.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 1999 and 2001, respectively, and the Ph.D. degree in computer science from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He joined Tsinghua University, Beijing, in 2008, where he is currently a Professor with the Department of Automation, School of Information Science and Technology. He has authored over 100 refereed conference and journal papers. His current research interests include signal processing, image/video compression and communication, and intelligent imaging.
Dongsheng Wang (M'09) was born in China in 1966. He received the B.E., M.E., and Ph.D. degrees from the Harbin Institute of Technology, China, in 1989, 1992, and 1995, respectively, all in computer science. He is currently a Professor with RIIT & TNList, Tsinghua University. His research areas include many-core computer architecture, real-time application-oriented SoCs, disaster recovery, and high-availability computing.