Weight-based R-λ rate control for perceptual HEVC coding on conversational videos

Shengxi Li (a), Mai Xu (a,b,*), Xin Deng (a), Zulin Wang (a)

a School of Electronic and Information Engineering, Beihang University, Beijing 100191, China
b EDA Lab, Research Institute of Tsinghua University in Shenzhen, Shenzhen, China
* Corresponding author. E-mail address: MaiXu@buaa.edu.cn (M. Xu).
Article info
Available online 14 May 2015
Keywords: HEVC; Perceptual video coding; Rate control
This work was supported by NSFC under Grants 61202139 and 61471022, and the China 973 program under Grant 2013CB29006.

Abstract
This paper proposes a novel weight-based R-λ scheme for rate control in HEVC, to improve the perceived visual quality of conversational videos. For rate control in HEVC, the conventional R-λ scheme allocates bits on the basis of bit per pixel (bpp). However, bpp does not reflect the variation of visual importance across pixels. Therefore, we propose a novel weight-based R-λ scheme that takes this visual importance into account for rate control in HEVC. We first conducted an eye-tracking experiment on training videos to quantify the different importance of background, face, and facial features, thus generating weight maps of the videos to be encoded. Upon the weight maps, our scheme is capable of allocating more bits to the face (especially the facial features), using a new term, bit per weight. Consequently, the visual quality of the face and facial features can be improved, such that perceptual video coding is achieved for HEVC, as verified by our experimental results.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Supported by recent advances in related techniques, the popularity of multimedia applications has increased considerably. It has been pointed out in [1] that high-resolution video applications, such as FaceTime and Skype, account for a large proportion of the data generated by existing multimedia applications. The limited-bandwidth issue thus becomes more and more serious, causing a spectrum crunch. To relieve this bandwidth-hungry issue, the high efficiency video coding (HEVC) standard [1], also called H.265, has been formally established.
Rate control is a crucial module in HEVC, whose aim is to optimize visual quality by reasonably allocating bits to various frames and blocks at a given bit-rate. An excellent rate control scheme is able to allocate bits precisely and thus to output compressed videos with better visual quality. In other words, at the same visual quality, a better rate control scheme consumes a lower bit-rate and therefore helps relieve the bandwidth bottleneck. There are many rate control schemes for different video coding standards (e.g., TM5 for MPEG-2 [2], VM8 for MPEG-4 [3] and JVT-N046 [4] for H.264). For HEVC, a pixel-wise unified rate quantization (URQ) scheme was proposed in [5] to compute the quantization parameter (QP) at a given target bit-rate. Since this scheme works at the pixel level, it can be easily applied to blocks of various sizes. However, according to [6], the Lagrange multiplier λ [7], which represents the bit cost of encoding a block, is more important than QP in allocating bits. Therefore, a new scheme, the R-λ scheme, was proposed in [6] to better allocate bits in HEVC.
Nevertheless, high-resolution video delivery, especially in low bit-rate scenarios, still poses a great challenge to HEVC. In fact, according to the human visual system (HVS), there exists much perceptual redundancy that can be further exploited to greatly improve the coding efficiency of HEVC,
thus relieving the bandwidth-hungry issue [8]. For instance, when a person looks at a video, a small region around the point of fixation, called the region-of-interest (ROI), draws most of the attention [8] and is perceived at high resolution, while the peripheral region is captured at low resolution. Hence, in light of this phenomenon, a large number of bits can be saved by reducing perceptual redundancy in the peripheral region, with little loss of perceived quality. Consequently, along with the development of our understanding of the HVS, perceptual video coding is able to condense video data more efficiently.
Rate control for perceptual video coding has received a great deal of research effort from 2000 onwards, due to its great potential in improving coding efficiency [9-12]. In H.263, a perceptual rate control (PRC) scheme [9] was proposed. In this scheme, a perceptually sensitive weight map of a conversational scene (i.e., a scene with frontal human faces) is obtained by combining stimulus-driven (i.e., luminance adaptation and texture masking) and cognition-driven (i.e., skin colors) factors. According to such a map, more bits are allocated to ROIs by reducing QP values in these regions. Afterwards, for H.264/AVC, a novel resource allocation method [10] was proposed to optimize the subjective rate-distortion-complexity performance of conversational video coding, by improving the visual quality of the face region extracted by a skin-tone algorithm. Moreover, Xu et al. [13] utilized a novel window model to characterize the relationship between the window size and the variations of picture quality and buffer occupancy, ensuring better perceptual quality with less quality fluctuation. This model was advanced in [14] with an improved video quality metric for better correlation with the HVS. Most recently, in HEVC the perceptual model of structural similarity (SSIM) has been incorporated for perceptual video coding [15]. Instead of minimizing the mean squared error (MSE) or the sum of absolute differences (SAD), SSIM is minimized [15] to improve the subjective quality of perceptual video coding in HEVC. However, as pointed out in [16], assigning pixels with weights according to visual attention is much more accurate than SSIM for evaluating subjective quality. To this end, a scheme [12] was proposed to improve the visual quality and meanwhile to reduce the encoding complexity, by considering the visual attention on ROIs (e.g., face and facial features). However, to the best of our knowledge, although larger weights are imposed on ROIs in the above approaches, their values are assigned in an arbitrary manner. Moreover, there is no perceptual approach for the latest R-λ rate control scheme [6] in HEVC.
Therefore, we propose a novel weight-based R-λ rate control scheme to improve the perceived visual quality of compressed conversational videos, based on the weights of face regions and facial features learned from eye-tracking data. To be more specific, similar to [12], we consider face regions as ROIs, and facial features (e.g., mouth and eyes) as the most important ROIs. Different from [12], the weights allocated to background, face, and facial features are more precise and reasonable, as they are obtained from the saliency distribution learnt from our eye-tracking data on several training videos. Based on these weights, the weight-based R-λ rate control scheme is proposed, using a new term, bit per weight (bpw), to enhance the quality of face regions, especially the facial features. Since perceptual video coding is the main goal of our scheme, we review it in the following.
2. The related work on perceptual video coding
Generally speaking, the main parts of perceptual video coding are perceptual models, the incorporation of perceptual models in video coding, and performance evaluation, as illustrated in Fig. 1. Specifically, perceptual models, which imitate the output of the HVS to specify the ROIs and non-ROIs, need to be designed first for perceptual video coding. Secondly, on the basis of the perceptual models and existing video coding standards, the incorporation of perceptual models in video coding needs to be developed to encode/decode the videos, mainly by removing their perceptual redundancy. Rather than incorporating perceptual models into video coding, some machine-learning-based image/video compression approaches have also been proposed during the past decade. A summarized literature review is depicted in Fig. 2, which is explained in detail in the next two subsections.

[Fig. 1. The framework of perceptual video coding.]
2.1. Perceptual model
Perceptual models can be classified into two categories:
manual and automatic identification.
2.1.1. Manual identification
This kind of perceptual model requires manual effort to distinguish the important regions which need to be encoded with high quality. In the early years, Geisler and Perry [17] employed a foveated multi-resolution pyramid (FMP) video encoder/decoder to compress each image of varying resolutions into 5 or 6 regions in real-time, using a pointing device. This model requires the users to specify which regions attract them most during the video transmission. Thus, this kind of model may lead to transmission and processing delay between the receiver and transmitter sides when specifying the ROIs. Another way [18] is to specify the ROIs before watching, hence avoiding the transmission and processing delay. However, considering the workload on humans, these models cannot be widely applied to various videos.
In summary, the advantage of manual identification models is the accurate detection of ROIs. However, as the cost, it is expensive and intractable to apply these models extensively, due to the involvement of manual effort or hardware support. In addition, for models based on user-input selection, there exists transmission and processing delay, which makes real-time applications impractical.
2.1.2. Automatic identification
Just as its name implies, this category of perceptual models automatically recognizes ROIs in videos, according to visual attention mechanisms. Therefore, visual attention models are widely used among various perceptual models. There are two classes of visual attention models: bottom-up and top-down models. Itti's model [19] is one of the most popular bottom-up visual attention models in perceptual video coding. Mimicking processing in the primate occipital and posterior parietal cortex, Itti's model integrates low-level visual cues, in terms of color, intensity, orientation, flicker, and motion, to generate a saliency map for selecting ROIs [11].
The other class of visual attention models is top-down processing [20-25,12]. Top-down visual attention models are more frequently applied to video applications, since they are more correlated with what attracts human attention. For instance, the human face [10,12,21] is one of the most important factors that draw top-down attention, especially for conversational video applications. Also, a hierarchical perceptual model of the face [12] has been established, endowing regions within the face with unequal importance. However, the above-mentioned approaches are unable to quantify the importance of the face region.
In this paper, we quantify the saliency of face and facial features by learning the saliency distribution from the eye fixation data of training videos, collected in our eye-tracking experiment. Then, after detecting the face and facial features to automatically identify ROIs [12], the saliency map of each frame of an encoded conversational video is assigned using the learnt saliency distribution. Although the same ROIs as in [12] are utilized, the weight map of our scheme is more reasonable as a perceptual model for video coding, as it is in light of the learnt distribution of saliency over face regions. Note that the difference between ROI and saliency is that the former refers to a place that may attract visual attention, while the latter refers to the probability of each pixel/region attracting visual attention.

[Fig. 2. The literature on perceptual video coding.]
2.2. Perceptual model incorporation in video coding
After setting up the perceptual model, the next task is to apply it in existing video coding approaches. One category of approaches, called pre-processing, controls the non-uniform distribution of distortion before encoding [26-28]. A common way of pre-processing is spatial blurring [26,27]. For instance, the spatial blurring approach in [26] separates the scene into foreground and background. The background is blurred to remove high-frequency information in the spatial domain, so that fewer bits are allocated to this region. However, this may cause obvious boundaries between the background and foreground.
Another category is to control the non-uniform distribution of distortion during encoding, and is therefore called embedded encoding [29,10,30,31,12]. As it is embedded into the whole coding process, this category of approaches can compress videos more flexibly under different demands. In [10], Liu et al. established an importance map at the macroblock (MB) level based on face detection results. Moreover, combining texture and non-texture information, a linear rate-quantization (R-Q) model is applied to H.264/AVC. Based on the importance map and the R-Q model, optimized QP values are assigned to all MBs, which enhances the perceived visual quality of compressed videos. In addition, after obtaining the importance map, other encoding parameters, such as mode decision and motion estimation (ME) search, are adjusted to provide ROIs with more encoding resources. Xu et al. [12] proposed a new weight-based URQ rate control scheme for compressing conversational videos, which assigns bits according to bpw instead of bit per pixel (bpp) as in the conventional URQ scheme. Then, the quality of face regions is improved such that the perceived visual quality is enhanced.
The scheme in [12] is based on the URQ model [5], which aims at establishing the relationship between bit-rate R and quantization parameter Q, i.e., the R-Q relationship. However, since various flexible coding parameters and structures are applied in HEVC, the R-Q relationship is hard to estimate precisely [32]. Therefore, the Lagrange multiplier λ [7], which stands for the slope of the R-D curve, has been investigated. According to [32], the relationship between λ and R can be better characterized in comparison with the R-Q relationship. This way, on the basis of the R-λ model, the state-of-the-art R-λ rate control scheme [6] has better performance than the URQ scheme. Therefore, on the basis of the latest R-λ scheme, this paper proposes a novel weight-based R-λ scheme to further improve the perceived video quality of HEVC.
2.3. Machine learning based compression
From the viewpoint of machine learning, the pixels or blocks from one image or several images may have high similarity. Such similarity can be discovered by machine learning techniques, and then utilized to decrease the redundancy of video coding. To exploit the similarity within an image/video, image inpainting has been applied in [33,34] to use image blocks from spatial or temporal neighbors to synthesize unimportant content, which is deliberately deleted at the encoder side. As such, bits can be saved by not encoding the missing areas of the image/video. Also, rather than predicting the missing intensity information as in [33,34], several approaches [35-38] have been proposed to learn to predict the color in images using the color information of some representative pixels. Then, only the representative pixels and the gray-scale image need to be stored, such that image [38,36,37] or video [35] coding can be achieved.

[Fig. 3. (a) The procedure of the conventional R-λ rate control scheme and (b) our rate control scheme.]

To exploit similarity across various images or frames of videos, dictionary learning has been developed to discover the inherent patterns of image blocks. Together with dictionary learning, sparse representation can then be used to effectively represent an image for image [39] or video coding [40], instead of conventional image transforms such as the discrete cosine transform (DCT).
3. Review of the HEVC R-λ scheme

The main goal of rate control in video coding is to minimize the distortion of a compressed video at a given bit-rate. In order to better achieve this goal, as illustrated in Fig. 3(a), the R-λ rate control scheme [6] calculates a Lagrange multiplier λ before computing the QP. From Fig. 3(a), we can also see that the main steps of this scheme are working out the bpp-λ and λ-QP relationships, to finally output the QP values. We therefore review the bpp-λ and λ-QP relationships in the following. Note that this paper only focuses on rate control at the largest coding unit (LCU) level.
3.1. bpp-λ relationship

The parameter λ, which is the slope of the rate-distortion (R-D) curve [6] (also seen as the Lagrange multiplier), is crucial during the rate control process. The relationship between λ and the R-D curve can be formulated by

λ = −∂D/∂R,   (1)

where D and R represent the distortion and the bit-rate for one LCU.

Furthermore, the hyperbolic model D = C·R^(−K), characterizing the relationship between R and D, is adopted in the rate control scheme [7,41]. Here, C and K are parameters determined by the characteristics of the video content. Then, with (1), the relationship between R and λ can be obtained by

λ = −∂D/∂R = C·K·R^(−K−1) = a·R^b,   (2)

where a = C·K and b = −K−1 are parameters related to the video content as well. As different LCUs have different content, a and b need to be updated along with the encoding process of each LCU.
Next, once R is obtained for an LCU, λ can be output for estimating the QP of the currently processed LCU. Here, the bit-rate R can be modeled in terms of bpp by

R = bpp · f · w · h,   (3)

where w and h are the width and height of the video frame, and f represents the frame rate. Recall that bpp is the bit per pixel for this LCU. Upon (3), (2) can be rewritten as

λ = α · bpp^β,   (4)

where α = a·(f·w·h)^b and β = b are parameters also related to the video content. In HEVC, α and β need to be updated during encoding, with their initial values set to 3.2003 and −1.367; i.e., after encoding an LCU, α and β are updated for the co-located LCU of the subsequent frames. Note that, as shown in [32], different initial values of α and β have little impact on the compressed videos, in terms of both R-D performance and bit-rate error. Assuming that the actually encoded bpp is ^bpp and the actually used λ is ^λ for the current encoding, α and β can be updated to α′ and β′ by

α′ = α + δ_α · (ln ^λ − ln(α·(^bpp)^β)) · α,
β′ = β + δ_β · (ln ^λ − ln(α·(^bpp)^β)) · ln ^bpp,   (5)

where δ_α and δ_β are constants empirically set to 0.1 and 0.05 [6], respectively. Note that ^bpp means the bpp actually consumed after encoding each LCU, while ^λ represents the actual λ used for calculating the QP during the encoding of each LCU. In general, ^λ is not equal to α·(^bpp)^β, since α and β are unable to accurately fit the relationship between distortion and bit-rate for each LCU. The proof of this updating rule can be found in the appendix of [32].
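To make (4) and (5) concrete, the following minimal Python sketch (our own illustration rather than the HM implementation; the function names are hypothetical) estimates λ from bpp and refines α and β after an LCU has been encoded:

```python
import math

ALPHA_INIT, BETA_INIT = 3.2003, -1.367   # initial values of the R-lambda model [6]
DELTA_ALPHA, DELTA_BETA = 0.1, 0.05      # empirical update step sizes in Eq. (5) [6]

def lambda_from_bpp(bpp, alpha, beta):
    """Eq. (4): estimate the Lagrange multiplier from the bits per pixel of an LCU."""
    return alpha * (bpp ** beta)

def update_alpha_beta(alpha, beta, bpp_actual, lambda_actual):
    """Eq. (5): refine alpha/beta with the actually consumed bpp and the lambda actually used."""
    error = math.log(lambda_actual) - math.log(alpha * (bpp_actual ** beta))
    alpha_new = alpha + DELTA_ALPHA * error * alpha
    beta_new = beta + DELTA_BETA * error * math.log(bpp_actual)
    return alpha_new, beta_new
```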
Then, once the bpp value is obtained for each LCU, λ can be estimated with (4). Here, we assume that the bpp and λ values for the j-th LCU are bpp_j and λ_j, respectively. Next, assuming that the number of pixels in the j-th LCU is N_j, we obtain bpp_j through the target bits T_j for the j-th LCU:

bpp_j = T_j / N_j,   (6)

and

T_j = (^T − B) · c_j / Σ_{i=j}^{M−1} c_i,   (7)

where ^T is the target number of bits remaining for encoding this frame and B is the number of remaining header bits. Moreover, there are M LCUs in this frame, and c_i denotes the texture complexity of the i-th LCU. To be more specific, for inter frames, the target bits are allocated according to the MAD of the co-located LCU in the previous pictures, i.e., c_i is related to the MAD of the i-th LCU. For intra frames, c_i is related to the sum of absolute transformed differences (SATD) of the i-th LCU [42]. For the computation of c_i, see [43].
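A small sketch of the bit allocation in (6) and (7), under the assumption that the complexities c_i of the remaining LCUs are available (helper names are ours, not from HM):

```python
def lcu_target_bits(remaining_frame_bits, remaining_header_bits, complexities, j):
    """Eq. (7): target bits for the j-th LCU, proportional to its complexity among LCUs j..M-1."""
    payload = remaining_frame_bits - remaining_header_bits
    return payload * complexities[j] / sum(complexities[j:])

def lcu_bpp(target_bits, num_pixels):
    """Eq. (6): bits per pixel of one LCU."""
    return target_bits / num_pixels
```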
In addition, before establishing the λ-QP relationship, there exists a step to smooth λ, producing the value ~λ_j for the j-th LCU:

~λ_j = max{ max( λ_P · 2^(−2.0/3.0), ~λ_{j−1} · 2^(−1.0/3.0) ),
            min{ λ_j, min( λ_P · 2^(2.0/3.0), ~λ_{j−1} · 2^(1.0/3.0) ) } },   (8)

where λ_P represents the λ value of the current frame, and ~λ_{j−1} represents the λ value that has been smoothed for the (j−1)-th LCU. The calculation of λ_P is detailed in [6].
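The clipping in (8) can be sketched as follows (an illustration under the notation above, not the HM code):

```python
def smooth_lambda(lambda_j, lambda_prev_smoothed, lambda_frame):
    """Eq. (8): clip lambda_j within 2^(+-1/3) of the previous smoothed lambda
    and within 2^(+-2/3) of the frame-level lambda."""
    low = max(lambda_frame * 2 ** (-2.0 / 3.0), lambda_prev_smoothed * 2 ** (-1.0 / 3.0))
    high = min(lambda_frame * 2 ** (2.0 / 3.0), lambda_prev_smoothed * 2 ** (1.0 / 3.0))
    return max(low, min(lambda_j, high))
```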
3.2. λ-QP relationship

After establishing the bpp-λ relationship, the remaining task is to find the λ-QP relationship. Mathematically, the QP value can be obtained through a multiple-QP optimization process:

min J(QP) = D(QP) + λ·R(QP),   (9)

to provide the smallest R-D cost J(QP), with distortion D(QP) and rate R(QP).

The optimal QP can be obtained as the final output of rate control by solving (9). However, this optimization hugely increases the encoding complexity. To reduce the encoding complexity, a fitted formulation, rather than multiple-QP optimization, is proposed [44,45] to determine the QP value QP_j for the j-th LCU:

QP_j = θ_0 · ln ~λ_j + θ_1.   (10)

Recall that ~λ_j is the smoothed λ value of the j-th LCU. In (10), θ_0 and θ_1 are coefficients fitting the linear relationship between QP and ln λ. Note that θ_0 and θ_1 are empirically set to 4.2005 and 13.7122, respectively, in [44]. Their values remain the same throughout the coding of each video. Similar to (8), QP_j needs to be smoothed as well:

~QP_j = max{ max( ~QP_{j−1} − 1, QP_P − 2 ),
             min{ QP_j, min( ~QP_{j−1} + 1, QP_P + 2 ) } },   (11)

where QP_P is the QP value of the current frame (for its calculation, please refer to [6]), and ~QP_{j−1} denotes the smoothed QP value of the (j−1)-th LCU. Finally, ~QP_j is output by the R-λ rate control scheme.
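A minimal sketch of (10) and (11), with the fitted coefficients from [44] (illustrative only; the actual encoder also restricts QP to its valid range):

```python
import math

THETA_0, THETA_1 = 4.2005, 13.7122   # fitted coefficients of Eq. (10) [44]

def qp_from_lambda(lambda_smoothed):
    """Eq. (10): QP_j = theta_0 * ln(lambda_j) + theta_1 (rounded to an integer in practice)."""
    return round(THETA_0 * math.log(lambda_smoothed) + THETA_1)

def smooth_qp(qp_j, qp_prev_smoothed, qp_frame):
    """Eq. (11): clip QP_j to [prev-1, prev+1] and [frame-2, frame+2]."""
    low = max(qp_prev_smoothed - 1, qp_frame - 2)
    high = min(qp_prev_smoothed + 1, qp_frame + 2)
    return max(low, min(qp_j, high))
```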
From the definition of bpp, it is worth pointing out that each pixel is endowed with equal visual importance over the whole video frame. Therefore, the R-λ scheme wastes many bits on encoding the non-ROIs, to which humans pay less attention.
4. The proposed rate control scheme
This section proposes the weight-based R-λ rate control scheme, which takes into account the local visual importance of video content. Fig. 3(b) shows the procedure of our weight-based R-λ rate control scheme. Specifically, we first establish a perceptual model of the face by learning from the eye fixation points of our eye-tracking experiment. Note that such a perceptual model is learnt offline from training videos that are different from the videos to be encoded. Based on this perceptual model, the weight-based R-λ rate control scheme is proposed to improve the visual quality of the face and facial features, thus providing better perceived quality.
4.1. Learning for perceptual model
In [8], the authors have shown that the face draws the majority of attention in conversational videos. It is interesting to further quantify the unequal importance of background, face, and facial features to human attention. In this section, we conducted an eye-tracking experiment on training conversational videos to obtain the values of such unequal importance, so that these values can be used for encoding other videos.
Before the experiment, it is necessary to first extract the face and its facial features in the conversational videos using the method of [12]. Generally speaking, our extraction technique is based on a real-time face alignment method [46]. To be more specific, several key landmarks obeying the point distribution model (PDM) are located in the face of an image using the method in [46], which combines local detection (texture information) and global optimization (facial structure). Here, 66 landmarks, produced by the PDM, are connected to precisely identify the contours and regions of the face and facial features. Note that the extraction in our scheme can be achieved in real-time, as the face alignment method [46] is indeed fast. Also, the 3000 fps face alignment of [47] may be used to further speed up the extraction of the face and its facial features.
For the eye-tracking experiment, 18 conversational video clips (resolution: 720 × 480) were collected, and each of them was cut to 750 frames at 25 Hz. Note that these conversational video clips were collected from movies, news, and videos captured by a Nikon D800 camera. Also note that all training videos are different from the test videos of Section 5. These video clips were then presented in a random order to 24 subjects (14 males, 10 females, aged 22-32). The subjects were seated on an adjustable chair at a viewing distance of 60 cm, ensuring that the subject's horizontal sight was in the center of the screen. The eye fixation points of all subjects were recorded over the frames of each clip by a Tobii T60 eye tracker. Some of the recorded eye-tracking data are available at our website http://www.ee.buaa.edu.cn/xumfiles. One example of the eye-tracking results is shown in Fig. 4. Next, we focus on quantifying the visual attention on different regions of conversational videos by combining the eye fixation points of all subjects.
After the eye-tracking experiment, f_r, f_l, f_m, f_n, f_o, and f_b, which denote the numbers of eye fixation points of all subjects falling into the right eye, left eye, mouth, nose, other parts of the face, and the background, were counted. Given the counted eye fixation points (efp) of the different regions, we have the degrees of visual attention in these regions:

c_r = f_r / p_r,  c_l = f_l / p_l,  c_m = f_m / p_m,
c_n = f_n / p_n,  c_o = f_o / p_o,  c_b = f_b / p_b,   (12)
where p_r, p_l, p_m, p_n, p_o, and p_b are defined as the numbers of pixels in the regions of the right eye, left eye, mouth, nose, other parts of the face, and the background, and c_r, c_l, c_m, c_n, c_o, and c_b are the corresponding visual attention degrees of these regions. Their values, output by our eye-tracking experiment, are reported in Table 1. Note that the degrees of visual attention for the face and facial features may vary across videos, according to the video content. However, the values in Table 1 are kept constant to simply predict the visual attention devoted to the face and facial features, as it is hard to take into account the attention variation of the face and facial features caused by different video content.

[Fig. 4. Example of eye tracking results. The blue circles show the positions of eye fixation points and their sizes represent the staying duration of eye fixation points.]

Table 1
The values of the visual attention degrees of the different regions.

         c_r     c_l     c_m     c_n     c_o     c_b
efp/p    0.122   0.108   0.116   0.080   0.076   0.002
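As an illustration of (12), the attention degree of each region is simply its fixation count normalized by its pixel count; a minimal sketch (with hypothetical region names and counts) is:

```python
def attention_degrees(fixation_counts, pixel_counts):
    """Eq. (12): per-region attention degree = fixation points / pixel count.
    fixation_counts and pixel_counts are dicts keyed by region name (our own convention)."""
    return {region: fixation_counts[region] / pixel_counts[region]
            for region in fixation_counts}

# Hypothetical usage for one training video:
# attention_degrees({'right_eye': 5200, 'background': 3100},
#                   {'right_eye': 42000, 'background': 1500000})
```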
Finally, assuming that the background weight is 1, the weight map of a video frame can be computed upon the results of Table 1 by

w_n = 1            if n ∈ background,
w_n = c_r / c_b    if n ∈ right eye,
w_n = c_l / c_b    if n ∈ left eye,
w_n = c_m / c_b    if n ∈ mouth,
w_n = c_n / c_b    if n ∈ nose,
w_n = c_o / c_b    if n ∈ other parts of the face,   (13)

where w_n is the weight of the n-th pixel in the video frame. Given (13), we can obtain the weight map for each conversational video frame to be encoded, with the extracted face and facial features [12].
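The following sketch (our own illustration; the label map is assumed to come from the face/facial-feature extraction of [12]) builds the per-pixel weight map of (13) from the Table 1 values:

```python
# Learnt attention degrees from Table 1 (efp/p), keyed by region label.
ATTENTION = {'right_eye': 0.122, 'left_eye': 0.108, 'mouth': 0.116,
             'nose': 0.080, 'face_other': 0.076, 'background': 0.002}

def pixel_weight(region_label):
    """Eq. (13): weight of one pixel, normalized so that the background weight is 1."""
    return ATTENTION[region_label] / ATTENTION['background']

def weight_map(label_map):
    """label_map: 2-D list of region labels, one per pixel of the frame."""
    return [[pixel_weight(label) for label in row] for row in label_map]
```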
4.2. Weight-based R-λ rate control scheme
For the conventional R-λ rate control scheme, bpp is an important term for allocating bits. In our scheme, rather than replacing bpp as was done in [12], a new term bpw is introduced to calculate bpp, so that bits are allocated in accordance with the weight map of our perceptual model. Before encoding a frame, we first initialize bpw in order to separate the target bits of the whole frame into the target bits of the ROIs (face and facial features) and of the non-ROIs (background):

bpw = T / Σ_{n=1}^{N} w_n,   (14)

where T stands for the target bits of the frame. Recall that w_n means the weight of the n-th pixel, and N is the number of pixels in this frame. Then, the target bits T′ for the ROIs and T̄ for the non-ROIs satisfy

T′ + T̄ = T,
T̄ = ( Σ_{n∈n̄} w_n · bpw ) / ( Σ_{n∈n′} w_n · bpw ) · T′,   (15)

where n′ denotes the indices of the ROI pixels, whose weights are larger than 1, and n̄ denotes the indices of the non-ROI pixels, whose weights are equal to 1.

By solving (15), T′ and T̄ are obtained:

T′ = ( Σ_{n∈n′} w_n / ( Σ_{n∈n′} w_n + Σ_{n∈n̄} w_n ) ) · T,   (16)

T̄ = ( Σ_{n∈n̄} w_n / ( Σ_{n∈n′} w_n + Σ_{n∈n̄} w_n ) ) · T.   (17)
Then, the target bits for the ROIs and non-ROIs can be reasonably arranged according to the importance of these regions. Next, bpw_j for the j-th LCU can be calculated as

bpw_j = ^T′ / Σ_{n∈^n′} w_n   if j ∈ m′,
bpw_j = ^T̄ / Σ_{n∈^n̄} w_n   if j ∈ m̄,   (18)

where m′ and m̄ are the LCU indices of the ROIs and non-ROIs, respectively. Note that an LCU belongs to the ROIs if its average weight is larger than 1; otherwise, the LCU belongs to the non-ROIs. Besides, ^T′ and ^T̄ denote the remaining target bits for the ROIs and non-ROIs, and ^n′ and ^n̄ represent the pixel indices of the current and subsequent LCUs belonging to the ROIs and non-ROIs, respectively.

Afterwards, T_j, the target bits of the j-th LCU, can be estimated via

T_j = Σ_{n∈n_j} w_n · bpw_j,   (19)

where n_j denotes the pixel indices in the j-th LCU. It can be seen from (19) that an LCU with large bpw and w_n is assigned more target bits. So, the ROIs, especially the more important ROIs (e.g., facial features), are emphasized with more target bits.
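A simplified sketch of the bit allocation in (14)-(19) is given below (our own illustration; it keeps a single running budget per class and omits header-bit handling):

```python
def split_frame_bits(frame_bits, weights):
    """Eqs. (16)-(17): weights is the per-pixel weight list of the whole frame;
    pixels with weight > 1 form the ROI, background pixels have weight 1."""
    roi_weight = sum(w for w in weights if w > 1)
    nonroi_weight = sum(w for w in weights if w <= 1)
    total = roi_weight + nonroi_weight
    return frame_bits * roi_weight / total, frame_bits * nonroi_weight / total

def lcu_target_bits_weighted(lcu_weights, remaining_bits, remaining_weight):
    """Eqs. (18)-(19): bpw = remaining bits / remaining weight of the same class,
    and the LCU target bits are its total weight times bpw."""
    bpw = remaining_bits / remaining_weight
    return sum(lcu_weights) * bpw
```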
Finally, given the newly estimated T_j of (19), bpp_j can be acquired with (6). Then, rate control can be achieved through (4) and (10), with bpp_j known for each LCU.
In addition, we adjust the boundary settings of λ_j and QP_j based on the weights from our perceptual model, to give the ROIs higher priority in bit allocation. Specifically, (8) is rewritten as follows:

~λ_j = max{ max( ~λ_{j−1} · 2^(−1.0/3.0) / ((Σ_{n∈n_j} w_n) / N_j), λ_P · 2^(−2.0/3.0) / ((Σ_{n∈n_j} w_n) / N_j) ),
            min{ λ_j, min( ~λ_{j−1} · 2^(1.0/3.0) · (Σ_{n∈n_j} w_n) / N_j, λ_P · 2^(2.0/3.0) · (Σ_{n∈n_j} w_n) / N_j ) } },   (20)

and the QP boundary smoothing (11) is modified correspondingly:

~QP_j = max{ max( ~QP_{j−1} − (Σ_{n∈n_j} w_n) / N_j, QP_P − (2/N_j) · Σ_{n∈n_j} w_n ),
             min{ QP_j, min( ~QP_{j−1} + (Σ_{n∈n_j} w_n) / N_j, QP_P + (2/N_j) · Σ_{n∈n_j} w_n ) } }.   (21)
As can be seen from (20) and (21), more variation is allowed for λ_j and QP_j, so as to improve the quality of the ROI regions with more assigned bits. Consequently, a region with larger weights has a broader boundary, resulting in better visual quality in the ROIs.
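The weighted clipping of (20) and (21) can be sketched as follows (illustrative only; avg_weight denotes the average pixel weight of the LCU):

```python
def smooth_lambda_weighted(lambda_j, lambda_prev, lambda_frame, avg_weight):
    """Eq. (20): the bounds of Eq. (8) are relaxed by the LCU's average weight."""
    low = max(lambda_frame * 2 ** (-2.0 / 3.0) / avg_weight,
              lambda_prev * 2 ** (-1.0 / 3.0) / avg_weight)
    high = min(lambda_frame * 2 ** (2.0 / 3.0) * avg_weight,
               lambda_prev * 2 ** (1.0 / 3.0) * avg_weight)
    return max(low, min(lambda_j, high))

def smooth_qp_weighted(qp_j, qp_prev, qp_frame, avg_weight):
    """Eq. (21): the +-1 and +-2 bounds of Eq. (11) are scaled by the average weight."""
    low = max(qp_prev - avg_weight, qp_frame - 2 * avg_weight)
    high = min(qp_prev + avg_weight, qp_frame + 2 * avg_weight)
    return max(low, min(qp_j, high))
```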
[Fig. 5. Rate-distortion performance comparison over face, background, and whole regions between the conventional R-λ scheme and our scheme on compressing six conversational video sequences.]
In general, we utilize the new term bpw to estimate the target bits for each LCU. Then, bpp is acquired based on bpw, followed by the λ and QP values. After encoding one LCU, its QP can be output. In addition, the relevant parameters, such as α and β, need to be updated for the following LCU. This way, the weight-based R-λ scheme iterates to obtain the QP value of each LCU until the last LCU has been encoded. The main difference between our scheme and the conventional R-λ scheme is that we exploit the weight of each pixel from our perceptual model, together with bpw, to estimate bpp for each LCU. bpp is therefore adjusted according to the weights of the LCU and bpw. Larger values of the weights and of bpw, which indicate more important regions and greater bit-rates, lead to larger bpp, thus probably achieving better quality. This way, the ROIs, especially the more important ROIs, are allocated more bits to ensure better perceived quality.
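Putting the pieces together, the following self-contained sketch illustrates the per-LCU flow of the weight-based scheme described above (a simplified illustration, not the HM 16.0 implementation: it uses a single running budget instead of separate ROI/non-ROI budgets, and the encode_lcu stub stands in for the actual encoder):

```python
import math

def encode_frame_weight_based(lcu_list, frame_bits, lambda_frame, qp_frame,
                              alpha=3.2003, beta=-1.367, encode_lcu=None):
    """lcu_list: dicts with 'weights' (per-pixel list) and 'num_pixels';
    encode_lcu(lcu, qp) is assumed to return (used_bits, used_lambda)."""
    remaining_bits = frame_bits
    remaining_weight = sum(sum(l['weights']) for l in lcu_list)
    lam_prev, qp_prev = lambda_frame, qp_frame
    for lcu in lcu_list:
        bpw = remaining_bits / remaining_weight                 # Eq. (18), simplified
        target_bits = sum(lcu['weights']) * bpw                 # Eq. (19)
        bpp = target_bits / lcu['num_pixels']                   # Eq. (6)
        avg_w = sum(lcu['weights']) / lcu['num_pixels']
        lam = alpha * bpp ** beta                               # Eq. (4)
        lam = max(max(lambda_frame * 2 ** (-2 / 3) / avg_w, lam_prev * 2 ** (-1 / 3) / avg_w),
                  min(lam, min(lambda_frame * 2 ** (2 / 3) * avg_w,
                               lam_prev * 2 ** (1 / 3) * avg_w)))            # Eq. (20)
        qp = 4.2005 * math.log(lam) + 13.7122                   # Eq. (10)
        qp = max(max(qp_prev - avg_w, qp_frame - 2 * avg_w),
                 min(qp, min(qp_prev + avg_w, qp_frame + 2 * avg_w)))        # Eq. (21)
        used_bits, used_lambda = (encode_lcu(lcu, qp) if encode_lcu else (target_bits, lam))
        err = math.log(used_lambda) - math.log(alpha * (used_bits / lcu['num_pixels']) ** beta)
        alpha += 0.1 * err * alpha                              # Eq. (5)
        beta += 0.05 * err * math.log(used_bits / lcu['num_pixels'])
        remaining_bits -= used_bits
        remaining_weight -= sum(lcu['weights'])
        lam_prev, qp_prev = lam, qp
    return alpha, beta
```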
[Fig. 6. Rate-distortion performance comparison over face, background, and whole regions between the perceptual URQ scheme [12] and our scheme on compressing six conversational video sequences.]

[Fig. 7. Rate-distortion performance comparison over facial features between the conventional R-λ scheme and our scheme on compressing six conversational video sequences.]
5. Experimental results
In this section, experimental results are presented to validate the proposed weight-based R-λ scheme for perceptual coding of conversational videos on the HEVC platform. We used six test video sequences: two CIF conversational video sequences, Akiyo and Foreman; two 720p conversational videos, Johnny and Vidyo4, from the HEVC test database; and two 1080p conversational video sequences, Yan and Lee, from [12]. We utilized the HEVC test model (HM 16.0 software) with its default R-λ rate control scheme [6] as the reference scheme. Our weight-based R-λ rate control scheme was then embedded into HM 16.0 for comparison. Furthermore, the HEVC perceptual video coding work of [12], called perceptual URQ, was also used in the comparison. Note that Lee was captured in a dark room in order to validate the robustness of our scheme to poor illumination. The parameter setting file encoder_lowdelay_P_main.cfg was used, with videos of 150 frames at 25 Hz.

[Fig. 8. Rate-distortion performance comparison over facial features between the perceptual URQ scheme [12] and our scheme on compressing six conversational video sequences.]

Table 2
DMOS comparison of the conventional R-λ, perceptual URQ, and our schemes.

Sequences   Resolution    Bit-rate (kbps)   Conventional R-λ   Perceptual URQ   Ours
Akiyo       352 × 288     16                55.25              56.34            45.12
                          64                29.61              25.59            24.37
Foreman     352 × 288     32                70.65              72.64            66.58
                          128               34.25              36.90            29.86
Johnny      1280 × 720    64                60.78              59.53            54.54
                          256               33.89              29.48            25.69
Vidyo4      1280 × 720    64                67.16              68.85            61.67
                          256               37.21              34.29            27.56
Yan         1920 × 1080   128               70.64              67.26            65.99
                          512               36.35              32.48            29.64
Lee         1920 × 1080   128               56.40              53.40            47.70
                          512               42.06              36.70            29.26

[Fig. 9. Visual quality comparison of randomly selected frames of Foreman (CIF resolution), Johnny (720p resolution), and Lee (1080p resolution). (a), (b), and (c) show the 56th decoded frames of Foreman compressed at 32 kbps. (d), (e), and (f) show the 23rd decoded frames of Johnny compressed at 64 kbps. Moreover, (g), (h), and (i) show the 41st decoded frames of Lee compressed at 128 kbps.]
5.1. Objective quality comparison
Figs. 5 and 6 show the rate-distortion performance of the conventional, perceptual URQ, and our schemes over the face, background, and whole regions. As can be seen from these figures, our scheme outperforms the conventional R-λ scheme on the HM 16.0 platform in terms of the average Y-PSNR of the face regions, for all video sequences at different bit-rates. Moreover, the quality of the face regions under the perceptual URQ scheme is lower than that under our scheme. As the cost, the average Y-PSNR of the background is decreased in our scheme. However, thanks to the HVS, the perceived video quality is increased, as verified in the next subsection. Moreover, Figs. 7 and 8 are plotted to further show the improvement within the face regions. One may observe from these figures that the rate-distortion performance of the facial features is significantly improved at various bit-rates, for all CIF, 720p, and 1080p videos, in our scheme over both the conventional R-λ and perceptual URQ schemes. Furthermore, the quality improvement within a face for the 720p and 1080p videos is much greater than that for the CIF videos. This may be due to the fact that more bits can be allocated to the ROIs in the 720p and 1080p videos.
5.2. Subjective quality comparison
Since humans are the final receivers in video assessment, subjective evaluation is the most accurate and convincing metric [48]. In this paper, we adopted the single stimulus continuous quality scale (SSCQS) procedure, proposed in Rec. ITU-R BT.500, to rate the subjective quality. The evaluation we conducted was divided into three sessions, for the CIF, 720p, and HD videos, respectively. Note that the uncompressed reference and the test video sequences in each session were displayed in a random order. Before each session, the observers were required to view 5 other training videos (one per quality scale) to help them better understand the subjective quality assessment. 15 observers (5 females and 10 males), aged 19-34, were involved in this test. Note that the observers are different from the subjects in the eye-tracking experiment. We used a 24-inch HP LS24B370 LCD monitor with its resolution set to 1920 × 1080 to display the videos. Note that all videos were displayed at their original resolutions, to avoid the influence of scaling operations. The viewing distance was set to three to four times the video height for rational evaluation. The quality rating scales for the observers to use after viewing are excellent (100-81), good (80-61), fair (60-41), poor (40-21), and bad (20-1).
After the subjective evaluation, difference mean opinion scores (DMOS) were computed to reveal the difference in subjective quality between the compressed and uncompressed videos. A smaller DMOS value corresponds to better subjective quality of the compressed video sequence. Table 2 compares the average DMOS values of all compressed video sequences. From this table, we can see that the DMOS values of our scheme are smaller than those of the perceptual URQ scheme, and much smaller than those of the conventional R-λ scheme, especially at high resolutions. Therefore, our scheme provides higher subjective video quality. It can further be seen from this table that our scheme performs better than the perceptual URQ scheme at low bit-rates (in comparison with the conventional R-λ scheme). Moreover, the improvement in subjective quality of our scheme over perceptual URQ implies the effectiveness of the learning strategy for allocating weights to face regions, because our scheme has better subjective quality while maintaining Y-PSNRs comparable to the perceptual URQ scheme over whole video frames. Fig. 9 further shows the visual quality of our scheme and the conventional R-λ scheme.
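For reference, a DMOS value such as those in Table 2 is commonly obtained as the mean opinion score of the hidden (uncompressed) reference minus that of the test sequence, averaged over observers; a minimal sketch of this common practice (not necessarily the exact protocol used here) is:

```python
def dmos(reference_scores, test_scores):
    """DMOS = MOS of the uncompressed reference minus MOS of the compressed test video;
    reference_scores / test_scores are per-observer ratings on the 0-100 scale."""
    mos_ref = sum(reference_scores) / len(reference_scores)
    mos_test = sum(test_scores) / len(test_scores)
    return mos_ref - mos_test
```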
In summary, our subjective results here, together with the previous objective results, illustrate that our scheme is significantly superior in perceived visual quality for conversational video coding with HEVC.
6. Conclusion
This paper has proposed a novel weight-based R-λ scheme for the rate control of conversational videos in HEVC, to improve their perceived visual quality. First, a perceptual model was established by learning from training videos with the eye fixation points of our eye-tracking experiment, to reveal the importance of visual content for conversational video coding. Then, weight maps can be generated for the video frames to be encoded. With such maps, a novel weight-based R-λ rate control scheme was proposed for HEVC, using bpw to take into account the visual importance of each pixel. Thus, in accordance with the HVS, the perceived visual quality is improved by our scheme, as more bits are assigned to the ROIs (faces), especially the more important ROIs (facial features). Finally, the experimental results verified such an improvement over several conversational video sequences on the HEVC platform (HM 16.0).
References
[1] G.J. Sullivan, J. Ohm, W. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1649-1668.
[2] B.G. Haskell, Digital Video: An Introduction to MPEG-2, Springer Science and Business Media, New York, USA, 1997.
[3] A. Vetro, H. Sun, Y. Wang, MPEG-4 rate control for multiple video objects, IEEE Trans. Circuits Syst. Video Technol. 9 (1) (1999) 186-199.
[4] G.J. Sullivan, T. Wiegand, K.-P. Lim, Text description of joint model reference encoding methods and decoding concealment methods, Document: JVT-N046, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG.
[5] H. Choi, J. Yoo, J. Nam, D. Sim, I. Bajic, Pixel-wise unified rate-quantization model for multi-level rate control, IEEE J. Sel. Top. Signal Process. 7 (6) (2013) 1112-1123.
[6] B. Li, H. Li, L. Li, J. Zhang, Rate control by R-lambda model for HEVC, Document: JCTVC-K0103, Joint Collaborative Team on Video Coding.
[7] G.J. Sullivan, T. Wiegand, Rate-distortion optimization for video compression, IEEE Signal Process. Mag. 15 (6) (1998) 74-90.
[8] J. Lee, T. Ebrahimi, Perceptual video compression: a survey, IEEE J. Sel. Top. Signal Process. 6 (6) (2012) 684-697.
[9] X. Yang, W. Lin, Z. Lu, X. Lin, S. Rahardja, E. Ong, S. Yao, Rate control for videophone using local perceptual cues, IEEE Trans. Circuits Syst. Video Technol. 15 (4) (2005) 496-507.
[10] Y. Liu, Z.G. Li, Y.C. Soh, Region-of-interest based resource allocation for conversational video communication of H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 18 (1) (2008) 134-139.
[11] Z. Li, S. Qin, L. Itti, Visual attention guided bit allocation in video compression, Image Vis. Comput. 29 (1) (2011) 1-14.
[12] M. Xu, X. Deng, S. Li, Z. Wang, Region-of-interest based conversational HEVC coding with hierarchical perception model of face, IEEE J. Sel. Top. Signal Process. 8 (3) (2014) 475-489.
[13] L. Xu, D. Zhao, X. Ji, L. Deng, S. Kwong, W. Gao, Window-level rate control for smooth picture quality and smooth buffer occupancy, IEEE Trans. Image Process. 20 (3) (2011) 723-734.
[14] L. Xu, S. Li, K.N. Ngan, L. Ma, Consistent visual quality control in video coding, IEEE Trans. Circuits Syst. Video Technol. 23 (6) (2013) 975-989.
[15] A. Rehman, Z. Wang, SSIM-inspired perceptual video coding for HEVC, in: IEEE International Conference on Multimedia and Expo (ICME), 2012, pp. 497-502.
[16] Z. Wang, Q. Li, Information content weighting for perceptual image quality assessment, IEEE Trans. Image Process. 20 (5) (2011) 1185-1198.
[17] W.S. Geisler, J.S. Perry, A real-time foveated multi-resolution system for low-bandwidth video communication, in: Proceedings of the SPIE: The International Society for Optical Engineering, vol. 3299, 1998, pp. 294-305.
[18] M.G. Martini, C.T. Hewage, Flexible macroblock ordering for context-aware ultrasound video transmission over mobile WiMAX, Int. J. Telemed. Appl. 2010 (2010) 6.
[19] L. Itti, Automatic foveation for video compression using a neurobiological model of visual attention, IEEE Trans. Image Process. 13 (10) (2004) 1304-1318.
[20] M.-C. Chi, M.-J. Chen, C.-H. Yeh, J.-A. Jhu, Region-of-interest video coding based on rate and distortion variations for H.263+, Signal Process.: Image Commun. 23 (2) (2008) 127-142.
[21] M. Cerf, J. Harel, W. Einhäuser, C. Koch, Predicting human gaze using low-level saliency combined with face detection, Adv. Neural Inf. Process. Syst. 20 (2008) 241-248.
[22] N. Doulamis, A. Doulamis, D. Kalogeras, S. Kollias, Low bit-rate coding of image sequences using adaptive regions of interest, IEEE Trans. Circuits Syst. Video Technol. 8 (8) (1998) 928-934.
[23] D.M. Saxe, R.A. Foulds, Robust region of interest coding for improved sign language telecommunication, IEEE Trans. Inf. Technol. Biomed. 6 (4) (2002) 310-316.
[24] Y. Sun, I. Ahmad, D. Li, Y.-Q. Zhang, Region-based rate control and bit allocation for wireless video transmission, IEEE Trans. Multimed. 8 (1) (2006) 1-10.
[25] M.-C. Chi, C.-H. Yeh, M.-J. Chen, Robust region-of-interest determination based on user attention model through visual rhythm analysis, IEEE Trans. Circuits Syst. Video Technol. 19 (7) (2009) 1025-1038.
[26] A. Cavallaro, O. Steiger, T. Ebrahimi, Semantic video analysis for adaptive content delivery and automatic description, IEEE Trans. Circuits Syst. Video Technol. 15 (10) (2005) 1200-1209.
[27] G. Boccignone, A. Marcelli, P. Napoletano, G. di Fiore, G. Iacovoni, S. Morsa, Bayesian integration of face and low-level cues for foveated video coding, IEEE Trans. Circuits Syst. Video Technol. 18 (12) (2008) 1727-1740.
[28] L.S. Karlsson, M. Sjostrom, Improved ROI video coding using variable Gaussian pre-filters and variance in intensity, in: IEEE International Conference on Image Processing (ICIP 2005), vol. 2, 2005, pp. 313-316.
[29] D. Chai, K.N. Ngan, Face segmentation using skin-color map in videophone applications, IEEE Trans. Circuits Syst. Video Technol. 9 (4) (1999) 551-564.
[30] M. Wang, T. Zhang, C. Liu, S. Goto, Region-of-interest based dynamical parameter allocation for H.264/AVC encoder, in: IEEE Picture Coding Symposium (PCS 2009), 2009, pp. 1-4.
[31] Q. Chen, G. Zhai, X. Yang, W. Zhang, Application of scalable visual sensitivity profile in image and video coding, in: IEEE International Symposium on Circuits and Systems (ISCAS 2008), 2008, pp. 268-271.
[32] B. Li, H. Li, L. Li, J. Zhang, λ domain based rate control for high efficiency video coding, IEEE Trans. Image Process. 23 (9) (2014) 3841-3854.
[33] D. Liu, X. Sun, F. Wu, S. Li, Y.-Q. Zhang, Image compression with edge-based inpainting, IEEE Trans. Circuits Syst. Video Technol. 17 (10) (2007) 1273-1287.
[34] H. Xiong, Y. Xu, Y.F. Zheng, C.W. Chen, Priority belief propagation-based inpainting prediction with tensor voting projected structure in video compression, IEEE Trans. Circuits Syst. Video Technol. 21 (8) (2011) 1115-1129.
[35] L. Cheng, S. Vishwanathan, Learning to compress images and videos, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 161-168.
[36] X. He, M. Ji, H. Bao, A unified active and semi-supervised learning framework for image compression, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 65-72.
[37] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, in: ACM Transactions on Graphics (TOG), vol. 23, ACM, 2004, pp. 689-694.
[38] E. Kavitha, M.A. Ahmed, A machine learning approach to image compression, Int. J. Technol. Comput. Sci. Eng. 1 (2) (2014) 70-81.
[39] M. Xu, S. Li, J. Lu, W. Zhu, Compressibility constrained sparse representation with learnt dictionary for low bit-rate image compression, IEEE Trans. Circuits Syst. Video Technol. 24 (10) (2014) 1743-1757.
[40] Y. Sun, M. Xu, X. Tao, J. Lu, Online dictionary learning based intra-frame video coding, Wirel. Pers. Commun. 74 (4) (2014) 1281-1295.
[41] S. Mallat, F. Falzon, Analysis of low bit rate image transform coding, IEEE Trans. Signal Process. 46 (4) (1998) 1027-1042.
[42] M. Karczewicz, X. Wang, Intra frame rate control based on SATD, Document: JCTVC-M0257, Joint Collaborative Team on Video Coding.
[43] B. Li, H. Li, L. Li, Adaptive bit allocation for R-λ model rate control in HM, Document: JCTVC-M0036, Joint Collaborative Team on Video Coding.
[44] B. Li, D. Zhang, H. Li, J. Xu, QP determination by λ value, Document: JCTVC-I0426, Joint Collaborative Team on Video Coding.
[45] B. Li, L. Li, J. Zhang, J. Xu, H. Li, Encoding with fixed Lagrange multipliers, Document: JCTVC-I0242, Joint Collaborative Team on Video Coding.
[46] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, in: Proceedings of ICCV, 2009, pp. 1034-1041.
[47] S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685-1692.
[48] K. Seshadrinathan, R. Soundararajan, A.C. Bovik, L.K. Cormack, Study of subjective and objective quality assessment of video, IEEE Trans. Image Process. 19 (6) (2010) 1427-1441.
... However, JND-based methods usually are limited by the predefined viewing conditions. Furthermore, several coding methods that are focused on visual attention have been presented in [17][18][19][20][21][22][23], and these methods have been verified as effective for modelling visual characteristics. The idea behind these methods is that paying more attention to the important regions that are attractive for humans based on saliency detection. ...
... The idea behind these methods is that paying more attention to the important regions that are attractive for humans based on saliency detection. More specifically, face-based PVC approaches were proposed in [17][18][19] to improve the coding efficiency of HEVC by considering that human faces are anticipated as region of interest (ROI) in video conferencing/conversation scenarios. Wang et al. [20] presented an ROI-based compression method for light field videos by considering the differing importance of regions in a scene. ...
Article
Full-text available
Since the ultimate consumers and judgers of most video applications are human subjects, there has been growing interest in incorporating characteristics of the human visual system (HVS) in video coding for the further development of coding technology, called perceptual video coding (PVC). Although there have been numerous PVC methods reported in the literature to date, exploring various factors affecting the performance of PVC still remains challenging due to the complexity of HVS and its perceptual mechanisms. In this paper, we propose a perceptual video coding scheme based on a novel spatio-temporal influence map model (PVC-STIM). In the first step, we develop a novel perceptual model by considering multiple perceptual characteristics of HVS, with a special focus on the fusion of several spatial and temporal features, i.e. spatial masking effect, spatial stimuli, visual saliency and temporal motion attention. In the second step, the proposed perceptual model is incorporated into the classic video coding framework to adjust the Lagrange multiplier in order to reasonably allocate visual quality, which improves the rate and perceived distortion performance. Experimental results show that the proposed PVC-STIM method can achieve on average 8.76% bitrate savings while retaining similar perceived quality, compared to HEVC, and can also outperform two PVC approaches.
... The authors also presented a trade-off between PSNR and computational cost for bitrate by using the HEVC rate-distortion feature. In [31], an HEVC-based rate control technique is explored for smart-city visual surveillance and smart video conferencing, encoding the ROI at higher quality while leaving the rest of the image at low quality. HEVC compression has also been incorporated into a moving-object segmentation and classification methodology. ...
... The achieved pixel accuracy of segmentation is 67-96% for different time intervals and 3D-lidar frequencies. A pedestrian segmentation and detection scheme using mean-shift segmentation [31] focuses on unmanned aerial vehicles, using a locally collected dataset from surveillance cameras. This scheme was tested on available datasets and achieved a pixel accuracy of 76%, calculated by (1), where t_i is the total number of pixels belonging to class i, and p_ii represents the number of true positives. ...
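Equation (1) itself is not reproduced in the snippet above; assuming the usual definition consistent with the quantities it names (the accuracy of class i equals p_ii / t_i), a minimal sketch of the computation could look as follows.

```python
import numpy as np

def per_class_pixel_accuracy(pred, gt, num_classes):
    """Pixel accuracy p_ii / t_i for each class i.

    t_i  : total number of ground-truth pixels of class i
    p_ii : pixels of class i correctly predicted as class i
    """
    acc = np.full(num_classes, np.nan)
    for i in range(num_classes):
        t_i = np.count_nonzero(gt == i)
        if t_i:
            p_ii = np.count_nonzero((gt == i) & (pred == i))
            acc[i] = p_ii / t_i
    return acc

# toy example: 2-class segmentation masks
gt   = np.array([[0, 0, 1], [1, 1, 0]])
pred = np.array([[0, 1, 1], [1, 0, 0]])
print(per_class_pixel_accuracy(pred, gt, 2))   # [0.667, 0.667]
```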
Article
Full-text available
Smart video surveillance helps to build a more robust smart-city environment. Cameras at varied angles act as smart sensors, collecting visual data from the smart-city environment and transmitting it for further visual analysis. The transmitted visual data must be of high quality for efficient analysis, which is challenging when videos are transmitted over low-bandwidth communication channels. In the latest smart surveillance cameras, high-quality video transmission is maintained through video encoding techniques such as high efficiency video coding. However, these techniques still provide limited capabilities, and the demand for high-quality encoding of salient regions such as pedestrians, vehicles, cyclists/motorcyclists, and roads in video surveillance systems is still not met. This work contributes an efficient salient-region-based surveillance framework for smart cities. The proposed framework integrates a deep learning-based video surveillance technique that extracts salient regions from a video frame without information loss and then encodes them at reduced size. We applied this approach in diverse smart-city case studies to test the applicability of the framework. The outcome is a bitrate saving of 56.92%, a peak signal-to-noise ratio gain of 5.35 dB, and salient-region segmentation accuracies of 92% and 96% on two benchmark datasets. Consequently, generating less demanding region-based video data makes the framework adaptable for improving surveillance solutions in smart cities.
... To transmit videos within the constraint of limited network bandwidth, video compression is vital for reducing the bit rate. However, highly efficient video coding standards, such as H.264/AVC [1] and H.265/HEVC [2], introduce artifacts through their de-correlation and predictive coding techniques, degrading the quality of the video to some extent [3]. As illustrated in Figure 1, after transmission over a low-bandwidth channel, the reconstructed video is of low quality. ...
Article
Full-text available
For compressed images and videos, quality enhancement is essential. Although deep learning has produced remarkable achievements, deep models are often too large to apply to real-time tasks. Therefore, a fast multi-frame quality enhancement method for compressed video, named Fast-MFQE, is proposed to meet the requirement of video-quality enhancement for real-time applications. There are three main modules in this method. One is the image pre-processing building module (IPPB), which is used to reduce redundant information in the input images. The second is the spatio-temporal fusion attention (STFA) module, introduced to effectively merge temporal and spatial information of the input video frames. The third is the feature reconstruction network (FRN), developed to effectively reconstruct and enhance the spatio-temporal information. Experimental results demonstrate that the proposed method outperforms state-of-the-art methods in terms of parameter count, inference speed, and quality enhancement performance. Even at a resolution of 1080p, Fast-MFQE achieves a remarkable inference speed of over 25 frames per second, while providing a PSNR increase of 19.6% on average at QP = 37.
... Recently, Zhou et al. [51] also established an SSIM-based rate-distortion model, which was transformed into a global optimization problem to guide the LCU-level RC of HEVC. A weight-based R-λ perceptual RC scheme was presented by Li et al. [24]; based on the observation that faces draw more attention in conversational video, eye-tracking-derived weight maps were utilized in the bit allocation procedure. In [23], the researchers argue that visual saliency can represent the probability of human attention; hence, graph-based visual saliency was utilized to adjust QP, assigning fewer bits to regions with a low probability of visual attention. ...
Article
Full-text available
High efficiency video coding (HEVC) has achieved high coding efficiency as the current video coding standard. For rate control in HEVC, the conventional R-λ scheme allocates bits on the basis of the mean absolute difference; however, the scheme does not fully utilize the variation of perceptual importance to guide rate control, so the subjective and objective quality of coded videos has room to improve. Therefore, in this paper, we propose a rate control scheme that considers perceptual importance. We first develop a perceptual importance analysis scheme to accurately extract spatial and temporal perceptual importance maps of the video content. The results of this analysis are then used to guide bit allocation. Utilizing this model, a region-level bit allocation procedure is developed to maintain video quality balance. Subsequently, a largest coding unit (LCU)-level bit allocation scheme is designed to obtain the target bits of each LCU. To achieve a more accurate bitrate, an improved R-λ model based on the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is utilized to update the R-λ parameters. The experimental results showed that our method not only improved subjective and objective video quality with lower bitrate errors compared to the original RC in HEVC, but also outperformed state-of-the-art methods.
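The abstract does not give its allocation formula; a common baseline, and one plausible reading of "region-level bit allocation", is to split the frame budget across LCUs in proportion to their summed importance weights. The sketch below shows only that baseline; the function names and the 64-pixel CTU size are assumptions.

```python
import numpy as np

def weighted_bit_allocation(frame_bits, weight_map, ctu_size=64):
    """Distribute a frame-level bit budget over CTUs in proportion to the
    summed perceptual weights inside each CTU (an illustrative stand-in
    for an importance-driven LCU-level allocation)."""
    h, w = weight_map.shape
    rows, cols = h // ctu_size, w // ctu_size
    ctu_weight = weight_map[:rows * ctu_size, :cols * ctu_size] \
        .reshape(rows, ctu_size, cols, ctu_size).sum(axis=(1, 3))
    return frame_bits * ctu_weight / ctu_weight.sum()

weights = np.ones((128, 256))        # toy per-pixel weight map (uniform)
weights[:64, :64] = 4.0              # one "important" region
print(weighted_bit_allocation(20000, weights, 64))   # sums to 20000 bits
```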
... video compression standard, Xu et al. [34] used a new HP model for locating facial features in HD conversational video, which can adaptively partition the CTUs of the ROI and effectively reduce coding time. Li et al. [35] proposed a perceptual rate control algorithm for conversational video that detects the face region and uses bit per weight (bpw) in place of bpp for bit allocation at the CTU level. A saliency-aware QP search algorithm was proposed in [36], which searches different QP offsets for different CUs and then assigns different numbers of bits to regions of different saliency. ...
Article
Full-text available
The fast development of video technology and hardware has led to a great number of video applications in industry, such as video conferencing, video surveillance and live video streaming. Most of these applications face the problem of limited network bandwidth while trying to keep high video resolution. Perceptual video compression addresses the problem by introducing saliency information to reduce perceptual redundancy, retaining more information in salient regions while compressing non-salient regions as much as possible. In this paper, in order to combine multi-scale information in video saliency, an advanced video saliency detection model, SALDPC, is proposed, built on a deformable pyramid convolution decoder and multi-scale temporal recurrence. In order to better guide video coding through saliency, a saliency-aware rate-distortion optimization algorithm, SRDO, is proposed on top of the HEVC video coding standard, using block-based saliency information to change the rate-distortion balance and guide more reasonable bit allocation. Furthermore, a more flexible QP selection method, SAQP, is developed, adaptively changing the QP of each CU according to its saliency to ensure high quality in highly salient areas. The final results are available in three configurations: SRDO, SRDO + SQP, and SRDO + SAQP. Experimental results show that our method achieves a very high video quality improvement while significantly reducing the video encoding time. Compared to HEVC, the BD-EWPSNR of the SRDO method improves by 0.703 dB, and the BD-Rate based on EWPSNR saves 20.822%; the BD-EWPSNR of SRDO + SAQP improves by up to 1.217 dB, and the BD-Rate based on EWPSNR saves up to 32.41%. At the same time, in terms of compression time, the proposed method saves up to 29.06% compared to HEVC. Experimental results show the superiority of the proposed method in comparison with state-of-the-art methods.
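As a toy illustration of saliency-driven QP selection of the SAQP kind, the sketch below maps a per-CU saliency value to a bounded QP offset; the linear mapping and the ±4 offset range are assumptions, not the paper's actual rule.

```python
import numpy as np

def saliency_qp_offset(base_qp, cu_saliency, max_offset=4):
    """Map per-CU saliency in [0, 1] to a QP offset: salient CUs receive a
    negative offset (finer quantisation), non-salient CUs a positive one.
    A linear mapping is used here purely for illustration."""
    sal = np.clip(cu_saliency, 0.0, 1.0)
    offset = np.round(max_offset * (1.0 - 2.0 * sal)).astype(int)
    return np.clip(base_qp + offset, 0, 51)

cu_saliency = np.array([0.9, 0.5, 0.1])
print(saliency_qp_offset(32, cu_saliency))   # salient CUs get the lower QP
```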
Article
Despite the fact that Versatile Video Coding (VVC) has achieved superior coding performance, two major problems remain for the rate control (RC) model in VVC. First, the regions that human eyes attend to are not rendered clearly enough in the coded video, owing to the deviation between the coding tree unit (CTU) target bit allocation strategy in RC and the human visual attention mechanism (HVAM). Second, there are significant quality fluctuations across the coded video frames owing to an inappropriate parameter updating speed. To address these problems, we propose an efficient rate control (ERC) model. Specifically, in order to make the coded video more consistent with human visual attention, we extract texture- and motion-based spatial-temporal information to guide bit allocation at the CTU level. Furthermore, based on the quasi-Newton algorithm and the bit error, we propose an adaptive parameter updating (APU) method with a proper updating speed to precisely control the bits per frame. The proposed ERC outperforms the default RC model of VVC Test Model (VTM) 9.1, saving 3.60% and 4.94% average Bjøntegaard Delta Rate (BD-Rate) on full-frame video sequences under the low delay P (LDP) and random access (RA) configurations respectively, with higher bitrate accuracy. Moreover, the Peak Signal-to-Noise Ratio (PSNR) and the actual coded bits per frame of the video coded by the proposed ERC are more stable.
Article
The great success of deep learning has boosted the fast development of video quality enhancement. However, existing methods mainly focus on enhancing the objective quality of compressed video, and ignore their perceptual quality that plays a key role in determining quality of experience (QoE) of videos. In this paper, we aim at enhancing the perceptual quality of compressed video. Our main observation is that perceptual quality enhancement mostly relies on recovering the high-frequency details with fine textures. Accordingly, we propose a novel generative adversarial network (GAN) based on multi-level wavelet packet transform (WPT), which is called multi-level wavelet-based GAN+ (MW-GAN+), to exploit high-frequency details for enhancing the perceptual quality of compressed video. In MW-GAN+, we first propose a multi-level wavelet pixel-adaptive (MWP) module to extract temporal information across video frames, such that frame similarity can be utilized in recovering high-frequency details. Then, a wavelet reconstruction network, consisting of wavelet-dense residual blocks (WDRB), is developed to recover high-frequency details in a multi-level manner for enhanced frame reconstruction. Finally, we develop a 3D discriminator to encourage temporal coherence with a 3D-CNN based architecture. Experimental results demonstrate the superiority of our method over state-of-the-art methods in enhancing the perceptual quality of compressed video. Our code is available at https://github.com/IceClear/MW-GAN .
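Since that method rests on separating low- and high-frequency wavelet bands, a one-level 2-D Haar decomposition (a stand-in for the paper's multi-level wavelet packet transform) can be sketched in a few lines; the band naming and the numpy implementation below are illustrative, not the authors' code.

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2-D Haar transform, returning the low-pass band and
    the three detail bands that carry the fine textures perceptual
    enhancement tries to recover."""
    x = x.astype(np.float64)
    a, b = x[0::2, :], x[1::2, :]
    lo, hi = (a + b) / np.sqrt(2), (a - b) / np.sqrt(2)              # rows
    ll = (lo[:, 0::2] + lo[:, 1::2]) / np.sqrt(2)                    # columns
    lh = (lo[:, 0::2] - lo[:, 1::2]) / np.sqrt(2)
    hl = (hi[:, 0::2] + hi[:, 1::2]) / np.sqrt(2)
    hh = (hi[:, 0::2] - hi[:, 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

frame = np.random.rand(8, 8)
ll, lh, hl, hh = haar_dwt2(frame)
print(ll.shape, lh.shape, hl.shape, hh.shape)   # each band is (4, 4)
```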
Article
Full-text available
Typical image compression algorithms first transform the image from its spatial-domain representation to a frequency-domain representation using a transform such as the Discrete Cosine Transform or the Discrete Wavelet Transform, and then code the transformed values. Recently, instead of performing a frequency transformation, a machine learning based approach has been proposed that has two fundamental steps: selecting the most representative pixels (RP) and colorization. In this paper, a novel active learning method for automatically extracting the RP is proposed for image compression. To implement the active learning method, automatic RP extraction is required, and the choice of extraction method determines the performance of the approach. Here the active learning problem is formulated as an RP minimization problem, yielding an RP set that is optimal in the sense that it minimizes the error between the original and the reconstructed color image. The proposed method gives better results in comparison with other compression techniques.
Article
Full-text available
In this paper, we propose an online learning based intra-frame video coding approach, exploiting the texture sparsity of natural images. The proposed method is capable of learning the basic texture elements from previous frames with guaranteed convergence, leading to effective dictionaries for sparser representation of incoming frames. Benefiting from online learning, the proposed online dictionary learning based codec (ODL codec) achieves the goal that the more video frames are coded, the fewer non-zero coefficients need to be transmitted. These non-zero coefficients for image patches are then further quantized and coded, combined with dictionary synchronization. The experimental results demonstrate that the number of non-zero coefficients per frame decreases rapidly as more frames are encoded. Compared to off-line training, the proposed ODL codec, learning from the video on the fly, is able to reduce the computational complexity with fast convergence. Finally, the rate-distortion performance shows an improvement in terms of PSNR compared with the K-SVD dictionary based compression and H.264/AVC intra-frame coding at low bit rates.
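The paper's exact ODL algorithm is not reproduced here; the following sketch shows the generic online pattern such a codec relies on (sparse-code the incoming patch, then take a small gradient step on the dictionary), with matching pursuit, a fixed sparsity k and a fixed learning rate as simplifying assumptions.

```python
import numpy as np

def sparse_code_mp(x, D, k=3):
    """Greedy matching pursuit: pick k atoms for patch x given dictionary D
    (columns of D are assumed to have unit norm)."""
    a, r = np.zeros(D.shape[1]), x.copy()
    for _ in range(k):
        j = np.argmax(np.abs(D.T @ r))   # best-matching atom
        c = D[:, j] @ r
        a[j] += c
        r -= c * D[:, j]                 # remove its contribution
    return a

def online_dictionary_step(D, x, k=3, lr=0.1):
    """One online update: code the incoming patch, then nudge the dictionary
    towards reducing its reconstruction error."""
    a = sparse_code_mp(x, D, k)
    D += lr * np.outer(x - D @ a, a)     # stochastic gradient step
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    return D, a

rng = np.random.default_rng(0)
D = rng.standard_normal((64, 128))       # 8x8 patches, 128 atoms
D /= np.linalg.norm(D, axis=0, keepdims=True)
for _ in range(100):                     # stream of toy patches
    D, a = online_dictionary_step(D, rng.standard_normal(64))
print(np.count_nonzero(a), "non-zero coefficients for the last patch")
```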
Conference Paper
Full-text available
Recent advances in video capturing and display technologies, along with the exponentially increasing demand of video services, challenge the video coding research community to design new algorithms able to significantly improve the compression performance of the current H.264/AVC standard. This target is currently gaining evidence with the standardization activities in the High Efficiency Video Coding (HEVC) project. The distortion models used in HEVC are mean squared error (MSE) and sum of absolute difference (SAD). However, they are widely criticized for not correlating well with perceptual image quality. The structural similarity (SSIM) index has been found to be a good indicator of perceived image quality. Meanwhile, it is computationally simple compared with other state-of-the-art perceptual quality measures and has a number of desirable mathematical properties for optimization tasks. We propose a perceptual video coding method to improve upon the current HEVC based on an SSIM-inspired divisive normalization scheme as an attempt to transform the DCT domain frame prediction residuals to a perceptually uniform space before encoding. Based on the residual divisive normalization process, we define a distortion model for mode selection and show that such a divisive normalization strategy largely simplifies the subsequent perceptual rate-distortion optimization procedure. We further adjust the divisive normalization factors based on local content of the video frame. Experiments show that the proposed scheme can achieve significant gain in terms of rate-SSIM performance when compared with HEVC.
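For reference, the SSIM index that motivates such divisive-normalization schemes can be computed for a pair of blocks as follows; this is the standard single-window formula with the usual constants, not the paper's residual-domain normalization itself.

```python
import numpy as np

def ssim_block(x, y, L=255.0):
    """SSIM index between two equally sized blocks (global statistics,
    no sliding window), using the standard constants C1 and C2."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    C1, C2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx ** 2 + my ** 2 + C1) * (vx + vy + C2))

orig = np.random.randint(0, 256, (16, 16))
rec  = np.clip(orig + np.random.normal(0, 5, (16, 16)), 0, 255)
print(ssim_block(orig, rec))   # close to 1 for mild distortion
```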
Article
This paper presents a highly efficient, very accurate regression approach for face alignment. Our approach has two novel components: a set of local binary features, and a locality principle for learning those features. The locality principle guides us to learn a set of highly discriminative local binary features for each facial landmark independently. The obtained local binary features are used to jointly learn a linear regression for the final output. Our approach achieves state-of-the-art results when tested on the current most challenging benchmarks. Furthermore, because extracting and regressing local binary features is computationally very cheap, our system is much faster than previous methods. It achieves over 3,000 fps on a desktop or 300 fps on a mobile phone for locating a few dozen landmarks.
Article
This paper proposes a compressibility constrained sparse representation (CCSR) approach to low bit-rate image compression using a learnt over-complete dictionary of texture patches. Conventional sparse representation approaches for image compression are based on matching pursuit (MP) algorithms. The weakness of these approaches is that they are not stable in terms of the sparsity of the estimated coefficients, thereby resulting in inferior performance in low bit-rate image compression. In comparison with MP, convex relaxation approaches are more stable for sparse representation. However, it is intractable to directly apply convex relaxation approaches to image compression, as their coefficients are not always compressible. To utilize convex relaxation in image compression, we first propose in this paper a CCSR formulation, imposing the compressibility constraint on the coefficients of sparse representation for each image patch. In addition, we work out the CCSR formulation to obtain sparse and compressible coefficients, through recursively solving the ℓ1-norm optimization problem of sparse representation. Given these coefficients, each image patch can be represented by a linear combination of texture elements encoded in an over-complete dictionary, learnt from other training images. Finally, low bit-rate image compression can be achieved, owing to the sparsity and compressibility of the coefficients produced by our CCSR approach. The experimental results demonstrate the effectiveness and superiority of the CCSR approach in compressing natural and remote sensing images at low bit-rates.
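The CCSR recursion itself is not reproduced here; the sketch below solves the underlying ℓ1-norm sparse coding problem with plain ISTA (iterative soft-thresholding), one standard convex-relaxation solver that such an approach could build on. Dictionary size, regularization weight and iteration count are arbitrary toy values.

```python
import numpy as np

def ista(x, D, lam=0.05, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 with ISTA
    (gradient step followed by soft-thresholding)."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2            # 1 / Lipschitz constant
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        g = a - step * (D.T @ (D @ a - x))            # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)  # shrink
    return a

rng = np.random.default_rng(1)
D = rng.standard_normal((64, 256))                    # over-complete dictionary
D /= np.linalg.norm(D, axis=0)
x = 2.0 * D[:, 3] - 1.5 * D[:, 100]                   # patch made of two atoms
a = ista(x, D)
print(np.count_nonzero(np.abs(a) > 1e-3), "significant coefficients")
```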
Article
Rate control is a useful tool for video coding, especially in real-time communication applications. Most existing rate control algorithms are based on the R-Q model, which characterizes the relationship between bitrate R and quantization Q, under the assumption that Q is the critical factor in rate control. However, with video coding schemes becoming more and more flexible, it is very difficult to accurately model the R-Q relationship. In fact, we find that there exists a more robust correspondence between R and the Lagrange multiplier λ. Therefore, in this paper, we propose a novel λ-domain rate control algorithm based on the R-λ model, and implement it in the newest video coding standard, high efficiency video coding (HEVC). Experimental results show that the proposed λ-domain rate control can achieve the target bitrates more accurately than the original rate control algorithm in the HEVC reference software, as well as obtain significant R-D performance gain. Thanks to the highly accurate rate control algorithm, hierarchical bit allocation can be enabled in the implemented video coding scheme, which brings additional R-D performance gain. Experimental results demonstrate that the proposed λ-domain rate control algorithm is effective for HEVC; it outperforms the R-Q model based rate control in HM-8.0 (HEVC reference software) by 0.55 dB on average and up to 1.81 dB for the low delay coding structure, and 1.08 dB on average and up to 3.77 dB for the random access coding structure. The proposed λ-domain rate control algorithm has already been adopted by the Joint Collaborative Team on Video Coding and integrated into the HEVC reference software.
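Because the R-λ model is also central to this article, a compact sketch may help: λ is predicted from bpp through the hyperbolic model λ = α·bpp^β, and α, β are refreshed after each coded picture. The update constants and initial values below follow the commonly quoted HM defaults but should be read as illustrative rather than a faithful re-implementation.

```python
import numpy as np

def lambda_from_bpp(bpp, alpha, beta):
    """Hyperbolic R-lambda model: lambda = alpha * bpp ** beta."""
    return alpha * bpp ** beta

def update_alpha_beta(alpha, beta, bpp_real, lambda_real,
                      d_alpha=0.1, d_beta=0.05):
    """Refresh (alpha, beta) after coding a picture, in the spirit of
    lambda-domain rate control: compare the lambda actually used with the
    lambda the model would have predicted for the bits actually spent."""
    lambda_comp = lambda_from_bpp(bpp_real, alpha, beta)
    err = np.log(lambda_real) - np.log(lambda_comp)
    alpha += d_alpha * err * alpha
    beta  += d_beta * err * np.log(bpp_real)
    return alpha, beta

alpha, beta = 3.2003, -1.367          # typical initial values in HM
bpp_target = 0.05
lam = lambda_from_bpp(bpp_target, alpha, beta)
# pretend the encoder actually spent slightly more bits at this lambda
alpha, beta = update_alpha_beta(alpha, beta, bpp_real=0.055, lambda_real=lam)
print(lam, alpha, beta)
```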
Article
Visual quality consistency is one of the most important issues in video quality assessment. When people view a sequential video, they may have an unpleasant perceptual experience if the video has an inconsistent visual quality even though the average visual quality of the video is not compromised. Thus, consistent visual quality control is mostly expected in general video encoding with limited channel bandwidth and buffer resources. However, there still has not been enough study on such an issue. In this paper, a new objective visual quality metric (VQM) is proposed first, which can easily be incorporated into video coding for guiding video coding. Second, a VQM-based window model is proposed to handle the tradeoff between visual quality consistency and buffer constraint in video coding. Third, a window-level rate control algorithm is developed to accomplish visual quality control based on the above two proposals. Finally, experimental results prove that consistent visual quality, high rate-distortion efficiency, accurate bit control, and compliant buffer constraint can be achieved by the proposed rate control algorithm.
Article
In this paper, we present a pixel-wise unified rate-quantization (R-Q) model for low-complexity rate control on the configurable coding units of high efficiency video coding (HEVC). Since HEVC employs a hierarchical coding block structure, multiple R-Q models could be employed for the various block sizes. However, we found that the ratio of distortion to bits is nearly constant over all blocks because of the rate-distortion optimization technique. Hence, a single relationship model between rate and quantization can be derived from this characteristic, regardless of block size. Thus, we propose a pixel-wise unified R-Q model for HEVC rate control that works at multiple levels for all block sizes. We employ a simple leaky bucket model for bit control. The rate control based on the proposed pixel-wise unified R-Q model is implemented on HEVC test model 6.1 (HM6.1). In our evaluation, the average matching percentage to target bitrates is 99.47% and the average PSNR degradation is 0.76 dB. Based on the comparative study, we found that the proposed rate control shows low bit fluctuation and good RD performance compared to R-λ rate control for long sequences.
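A leaky bucket for bit control can be sketched in a few lines: coded bits fill the buffer, the channel drains it at a constant rate per frame, and an overflow (or empty buffer) signals that the rate control must react. The parameters below are arbitrary toy values, not those of the paper.

```python
def leaky_bucket(frame_bits, channel_rate, fps, buffer_size):
    """Track hypothetical buffer occupancy: each frame pours its coded bits
    in, the channel drains channel_rate / fps bits per frame.
    Returns (occupancy, overflow?) after every frame."""
    drain = channel_rate / fps
    level, trace = 0.0, []
    for bits in frame_bits:
        level = max(0.0, level + bits - drain)        # fill, then drain
        trace.append((level, level > buffer_size))
    return trace

# toy example: 1 Mbps channel, 30 fps, 0.5 Mbit buffer
frames = [40000, 30000, 35000, 60000, 20000]
for occ, overflow in leaky_bucket(frames, 1_000_000, 30, 500_000):
    print(f"occupancy = {occ:8.0f} bits, overflow = {overflow}")
```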
Article
With the advances in understanding perceptual properties of the human visual system and constructing their computational models, efforts toward incorporating human perceptual mechanisms in video compression to achieve maximal perceptual quality have received great attention. This paper thoroughly reviews the recent advances of perceptual video compression mainly in terms of the three major components, namely, perceptual model definition, implementation of coding, and performance evaluation. Furthermore, open research issues and challenges are discussed in order to provide perspectives for future research trends.