Weight-based R-λ rate control for perceptual HEVC coding on conversational videos ☆

Shengxi Li (a), Mai Xu (a,b,*), Xin Deng (a), Zulin Wang (a)

(a) School of Electronic and Information Engineering, Beihang University, Beijing 100191, China
(b) EDA Lab, Research Institute of Tsinghua University in Shenzhen, Shenzhen, China
Article info
Available online 14 May 2015
Keywords: HEVC; Perceptual video coding; Rate control

Abstract
This paper proposes a novel weight-based R-λ scheme for rate control in HEVC, to improve the perceived visual quality of conversational videos. For rate control in HEVC, the conventional R-λ scheme allocates bits on the basis of bits per pixel (bpp). However, bpp does not reflect the variation in visual importance across pixels. Therefore, we propose a novel weight-based R-λ scheme that takes this visual importance into account for rate control in HEVC. We first conducted an eye-tracking experiment on training videos to determine the different importance of the background, face, and facial features, thus generating weight maps for the videos to be encoded. Upon these weight maps, our scheme is capable of allocating more bits to the face (especially the facial features), using a new term, bits per weight. Consequently, the visual quality of the face and facial features is improved, such that perceptual video coding is achieved for HEVC, as verified by our experimental results.
© 2015 Elsevier B.V. All rights reserved.
1. Introduction
Supported by recent advances in related techniques, the popularity of multimedia applications has increased considerably. It has been pointed out in [1] that high-resolution video applications, such as FaceTime and Skype, occupy a large proportion of the data among existing multimedia applications. The limited-bandwidth issue thus becomes more and more serious, causing a "spectrum crunch". To relieve this bandwidth-hungry issue, the high efficiency video coding (HEVC) standard [1], also called H.265, has been formally established.
Rate control is a crucial module in HEVC, whose aim is
to optimize visual quality via reasonably allocating bits to
various frames and blocks, at a given bit-rate. An excellent
rate control scheme is able to precisely allocate bits, and to
output better visual quality of compressed videos. In other
words, at the same visual quality, a better rate control
scheme consumes less bit-rate and therefore achieves the
goal of relieving the bandwidth bottleneck. There are
many rate control schemes for different video coding
standards (e.g. TM5 for MPEG-2 [2], VM8 for MPEG-4 [3]
and JVT-N046 [4] for H.264). For HEVC, a pixel-wise
unified rate quantization (URQ) scheme has been proposed
in [5] to compute quantization parameter (QP) at a given
target bit-rate. Since this scheme works at pixel level, it
can be easily applied to blocks of various sizes. However, according to [6], the Lagrange multiplier λ [7], which represents the bit cost of encoding a block, is more important than QP in allocating bits. Therefore, a new scheme, the R-λ scheme, was proposed in [6] to better allocate the bits in HEVC.
[☆ This work was supported by NSFC under Grant nos. 61202139 and 61471022, and the China 973 program under Grant no. 2013CB29006. * Corresponding author. E-mail address: MaiXu@buaa.edu.cn (M. Xu). Signal Processing: Image Communication 38 (2015) 127–140, http://dx.doi.org/10.1016/j.image.2015.04.011]

Nevertheless, high-resolution video delivery, especially in low bit-rate scenarios, still poses a great challenge to HEVC. In fact, according to the human visual system (HVS), there exists much perceptual redundancy that can be further exploited to greatly improve the coding efficiency of HEVC, thus relieving the bandwidth-hungry issue [8]. For instance,
when a person looks at a video, a small region around the point of fixation, called the region-of-interest (ROI), receives the most attention [8] and is perceived at high resolution, while the peripheral region is captured at low resolution. Hence, in light of this phenomenon, a large amount of bits can be saved by reducing perceptual redundancy in the peripheral region, with little
loss of perceived quality. Consequently, along with the
development of the understanding of the HVS, perceptual
video coding is able to more efficiently condense video data.
Rate control for perceptual video coding has received a
great deal of research effort from 2000 onwards, due to its
great potential in improving coding efficiency [9–12]. In
H.263, a perceptual rate control (PRC) scheme [9] was
proposed. In this scheme, a perceptual sensitive weight
map of conversational scene (i.e., scene with frontal human
faces) is obtained by combining stimulus-driven (i.e., lumi-
nance adaptation and texture masking) and cognition-driven
(i.e., skin colors) factors together. According to such a map,
more bits are allocated to ROIs by reducing QP values in
these regions. Afterwards, for H.264/AVC, a novel resource allocation method [10] was proposed to optimize the subjective rate–distortion–complexity performance of conversational video coding, by improving the visual quality of the face region extracted by the skin-tone algorithm. Moreover, Xu
et al. [13] utilized a novel window model to characterize the
relationship between the size of window and variations of
picture quality and buffer occupancy, ensuring a better
perceptual quality with less quality fluctuation. This model
was advanced in [14] with an improved video quality metric
for better correlation to the HVS. Most recently, in HEVC the
perceptual model of structural similarity (SSIM) has been
incorporated for perceptual video coding [15]. Instead of
minimizing mean squared error (MSE) and sum of absolute
difference (SAD), SSIM is minimized [15] to improve the
subjective quality of perceptual video coding in HEVC.
However, as pointed out by [16], assigning pixels with
weights according to visual attention is much more accurate
than SSIM for evaluating the subjective quality. To this end, a
scheme [12] was proposed to improve the visual quality and meanwhile reduce the encoding complexity, by considering the visual attention on ROIs (e.g., face and facial features).
However, to the best of our knowledge, although larger weights are imposed on ROIs in the above approaches, their values are assigned in an arbitrary manner. Moreover, there is no perceptual approach for the latest R-λ rate control scheme [6] in HEVC.
Therefore, we propose a novel weight-based R-λ rate control scheme to improve the perceived visual quality of compressed conversational videos, based on the weights of face regions and facial features learned from eye-tracking data. To be more specific, similar to [12], we consider face regions as ROIs, and further consider facial features (e.g., mouth and eyes) as the most important ROIs. Different from [12], the weights allocated to the background, face, and facial features are more precise and reasonable, as they are obtained from the saliency distribution learnt from our eye-tracking data on several training videos. Based on these weights, the weight-based R-λ rate control scheme is proposed, using a new term, bits per weight (bpw), to enhance the quality of face regions, especially the facial features. Since perceptual video coding is the main goal of our scheme, we review related work on it in the following.
2. The related work on perceptual video coding

Generally speaking, the main parts of perceptual video coding are perceptual models, perceptual model incorporation in video coding, and performance evaluation, as illustrated in Fig. 1. Specifically, perceptual models, which imitate the output of the HVS to specify the ROIs and non-ROIs, need to be designed first for perceptual video coding. Secondly, on the basis of the perceptual models and existing video coding standards, perceptual model incorporation in video coding needs to be developed to encode/decode the videos, mainly by removing their perceptual redundancy. Besides incorporating perceptual models in video coding, some machine learning based image/video compression approaches have also been proposed during the past decade. A summarized literature review is depicted in Fig. 2, which is explained in detail in the next two subsections.

Fig. 1. The framework of perceptual video coding.
2.1. Perceptual model
Perceptual models can be classified into two categories:
manual and automatic identification.
2.1.1. Manual identification
This kind of perceptual models requires manual effort
to distinguish important regions which need to be
encoded with high quality. In the early years, Geisler and
Perry [17] employed a foveated multi-resolution pyramid (FMP) video encoder/decoder to compress each image into 5 or 6 regions of varying resolution in real-time, using a pointing device. This model requires the users to specify which regions attract them most during video transmission. Thus, this kind of model may lead to transmission and processing delay between the receiver and
sion and processing delay between the receiver and
transmitter sides, when specifying the ROIs. Another way
[18] is to specify ROIs before watching, hence avoiding the
transmission and processing delay. However, considering
the workload of humans, these models cannot be widely
applied to various videos.
In summary, the advantage of manual identification models is the accurate detection of ROIs. However, as the cost, it is expensive and intractable to apply these models extensively, due to the required manual effort or hardware support. In addition, for models based on user-input selection, there exists transmission and processing delay, making real-time applications impractical.
2.1.2. Automatic identification
Just as its name implies, this category of perceptual
models is to automatically recognize ROIs in videos,
according to visual attention mechanisms. Therefore,
visual attention models are widely used among various
perceptual models. There are two classes of visual attention models: bottom-up and top-down. Itti's
model [19] is one of the most popular bottom-up visual
attention models in perceptual video coding. Mimicking
processing in primate occipital and posterior parietal
cortex, Itti's model integrates low-level visual cues, in
terms of color, intensity, orientation, flicker, and motion,
to generate a saliency map for selecting ROIs [11].
The other class of visual attention models is top-down [20–25,12]. Top-down visual attention models are more frequently applied in video applications, since they correlate better with what attracts human attention.
For instance, the human face [10,12,21] is one of the most important factors that draw top-down attention, especially in conversational video applications. Also, a hierarchical perceptual model of the face [12] has been established, endowing unequal importance within the face region. However, the above-mentioned approaches are unable to quantify the importance of the face region.
In this paper, we quantify the saliency of the face and facial features by learning the saliency distribution from the eye fixation data of training videos, collected in our eye-tracking experiment. Then, after detecting the face and facial features to automatically identify the ROI [12], the saliency map of each frame of an encoded conversational video is assigned using the learnt saliency distribution. Although the same ROI is utilized as in [12], the weight map of our scheme is more reasonable as a perceptual model for video coding, since it follows the learnt distribution of saliency over face regions. Note that the difference between ROI and saliency is that the former refers to regions that may attract visual attention, while the latter refers to the probability of each pixel/region attracting visual attention.
Fig. 2. The literature on perceptual video coding.
2.2. Perceptual model incorporation in video coding
After setting up the perceptual model, the next task is to
apply it in the existing video coding approaches. One
category of approaches called pre-processing is to control
the non-uniform distribution of distortion before encoding
[26–28]. A common way of pre-processing is spatial blurring [26,27]. For instance, the spatial blurring approach [26] separates the scene into foreground and background. The background is blurred to remove high frequency information in the spatial domain, so that fewer bits are allocated to this region. However, this may cause obvious boundaries between the background and foreground.
Another category is to control the non-uniform distribu-
tion of distortion during encoding, therefore called embedded
encoding [29,10,30,31,12]. As it is embedded into the whole
coding process, this category of approaches is efficient in
more flexibly compressing videos with different demands. In
[10], Liu et al. established an importance map at the macroblock (MB) level based on face detection results. Moreover, combining texture and non-texture information, a linear rate–quantization (R–Q) model is applied to H.264/AVC. Based on the
importance map and R–Q model, the optimized QP values are
assigned to all MBs, which enhances the perceived visual
quality of compressed videos. In addition, after obtaining the
importance map, the other encoding parameters, such as
mode decision and motion estimation (ME) search, are
adjusted to provide ROIs with more encoding resources. Xu
et al. [12] proposed a new weight-based URQ rate control scheme for compressing conversational videos, which assigns bits according to bpw instead of bits per pixel (bpp) as in the conventional URQ scheme. The quality of face regions is thereby improved, such that the perceived visual quality is enhanced.
The scheme in [12] is based on the URQ model [5], which aims at establishing the relationship between bit-rate R and quantization parameter Q, i.e., the R–Q relationship. However, since various flexible coding parameters and structures are applied in HEVC, the R–Q relationship is hard to estimate precisely [32]. Therefore, the Lagrange multiplier λ [7], which stands for the slope of the R–D curve, has been investigated. According to [32], the relationship between λ and R can be better characterized than the R–Q relationship. This way, on the basis of the R-λ model, the state-of-the-art R-λ rate control scheme [6] achieves better performance than the URQ scheme. Therefore, on the basis of the latest R-λ scheme, this paper proposes a novel weight-based R-λ scheme to further improve the perceived video quality of HEVC.
2.3. Machine learning based compression
From the viewpoint of machine learning, the pixels or
blocks from one image or several images may have high
similarity. Such similarity can be discovered by machine
learning techniques, and then utilized to decrease the redundancy in video coding. To exploit the similarity within an image/video, image inpainting has been applied in [33,34], using image blocks from spatial or temporal neighbors to synthesize unimportant content that is deliberately deleted at the encoder side. As such, bits can be saved by not encoding the missing areas of the image/video. Also, rather than predicting the missing intensity information as in [33,34], several approaches [35–38] have been proposed to
learn to predict the color in images using the color informa-
tion of some representative pixels. Then, only representative
pixels and gray scale image need to be stored, such that the
image [38,36,37] or video [35] coding can be achieved.
Fig. 3. (a) The procedure of the conventional R-λ and (b) our rate control schemes.
To exploit similarity across different images or frames of videos, dictionary learning has been developed
to discover the inherent patterns of image blocks. Together
with dictionary learning, sparse representation can then
be used to effectively represent an image for image [39] or
video coding [40], instead of conventional image trans-
forms such as discrete cosine transform (DCT).
3. Review of the HEVC R-λ scheme

The main goal of rate control in video coding is minimizing the distortion of a compressed video at a given bit-rate. In order to better achieve this goal, as illustrated in Fig. 3(a), the R-λ rate control scheme [6] calculates a Lagrange multiplier λ before computing QP. From Fig. 3(a), we can also see that the main steps of this scheme are working out the bpp-λ and λ-QP relationships, to finally output QP values. So, we review the bpp-λ and λ-QP relationships in the following for the R-λ rate control scheme. Note that this paper only focuses on rate control at the largest coding unit (LCU) level.
3.1. bpp-λ relationship

The parameter λ, which is the slope of the rate–distortion (R–D) curve [6] (i.e., the Lagrange multiplier), is crucial during the rate control process. The relationship between λ and the R–D curve can be formulated by

    λ = −∂D/∂R,    (1)

where D and R represent the distortion and bit-rate for one LCU.

Furthermore, the hyperbolic model D = C·R^(−K), characterizing the relationship between R and D, is adopted in the rate control scheme [7,41]. Here, C and K are parameters determined by the characteristics of the video content. Then, with (1), the R-λ relationship can be obtained by

    λ = −∂D/∂R = C·K·R^(−K−1) = a·R^b,    (2)

where a = C·K and b = −K−1 are parameters related to the video content as well. As different LCUs have different content, a and b need to be updated along with the encoding process of each LCU.

Next, once R is obtained for an LCU, λ can be output for estimating the QP of the currently processed LCU. Here, the bit-rate R can be modeled in terms of bpp by

    R = bpp·f·w·h,    (3)

where w and h are the width and height of the video frame, and f represents the frame rate. Recall that bpp is the bits per pixel for this LCU. Upon (3), (2) can be rewritten as
    λ = α·bpp^β,    (4)

where α = a·(f·w·h)^b and β = b are parameters also related to the video content. In HEVC, α and β need to be updated during encoding, with their initial values set to 3.2003 and −1.367; i.e., after encoding an LCU, α and β are updated for the co-located LCU of the subsequent frames. Note that, as shown in [32], different initial values of α and β have little impact on the compressed videos, in terms of both R–D performance and bit-rate error. Assuming that the actually encoded bpp is bpp* and the actually used λ is λ* for the current encoding, α and β can be updated to α′ and β′:

    α′ = α + δ_α · (ln λ* − ln(α·(bpp*)^β)) · α,
    β′ = β + δ_β · (ln λ* − ln(α·(bpp*)^β)) · ln bpp*,    (5)

where δ_α and δ_β are constants, empirically set to 0.1 and 0.05 [6], respectively. Note that bpp* is the bpp actually consumed after encoding each LCU, while λ* is the λ actually used for calculating QP during the encoding of each LCU. In general, λ* is not equal to α·(bpp*)^β, since α and β cannot accurately fit the relationship between distortion and bit-rate for each LCU. The proof of this updating rule can be found in the appendix of [32].
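To make the update concrete, the model of (4) and the update rule of (5) can be sketched in a few lines of Python (a minimal illustration, not the HM reference implementation; the function and constant names are ours):

```python
import math

# Initial values and update steps as reported in the paper [6,32].
ALPHA_INIT, BETA_INIT = 3.2003, -1.367
DELTA_ALPHA, DELTA_BETA = 0.1, 0.05

def estimate_lambda(bpp, alpha, beta):
    """Eq. (4): lambda = alpha * bpp^beta."""
    return alpha * bpp ** beta

def update_params(alpha, beta, bpp_real, lambda_real):
    """Eq. (5): move alpha, beta by a fraction of the log-domain
    error between the lambda actually used and the model's prediction."""
    err = math.log(lambda_real) - math.log(alpha * bpp_real ** beta)
    alpha_new = alpha + DELTA_ALPHA * err * alpha
    beta_new = beta + DELTA_BETA * err * math.log(bpp_real)
    return alpha_new, beta_new
```

Since δ_α and δ_β are small, each LCU only nudges the parameters, which keeps the per-LCU update stable.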
Then, once the bpp value is obtained for each LCU, λ can be estimated with (4). Here, we assume that the bpp and λ values for the j-th LCU are bpp_j and λ_j, respectively. Next, assuming that the number of pixels in the j-th LCU is N_j, we obtain bpp_j through the target bits T_j for the j-th LCU:

    bpp_j = T_j / N_j,    (6)

and

    T_j = (T̂ − B) · c_j / Σ_{i=j}^{M−1} c_i,    (7)

where T̂ is the number of target bits remaining for encoding this frame and B is the number of remaining header bits. Moreover, there are M LCUs in this frame, and c_i denotes the texture complexity of the i-th LCU. To be more specific, for inter frames the target bits are allocated according to the MAD of the co-located LCU in the previous pictures, i.e., c_i is related to the MAD of the i-th LCU. For intra frames, c_i is related to the sum of absolute transformed differences (SATD) of the i-th LCU [42]. For the computation of c_i, see [43].
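The complexity-proportional allocation of (6) and (7) amounts to the following sketch (illustrative Python; `complexities` stands for the c_i values and the function names are ours):

```python
def lcu_target_bits(remaining_frame_bits, header_bits, complexities, j):
    """Eq. (7): share of the remaining frame budget for the j-th LCU,
    proportional to its texture complexity c_j among the LCUs still
    to be encoded (indices j .. M-1)."""
    return (remaining_frame_bits - header_bits) * complexities[j] / sum(complexities[j:])

def lcu_bpp(target_bits, num_pixels):
    """Eq. (6): bits per pixel of the LCU."""
    return target_bits / num_pixels
```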
In addition, before establishing the λ-QP relationship, there is a step that smooths λ, yielding the value λ̃_j for the j-th LCU:

    λ̃_j = max{ max(λ_P·2^(−2.0/3.0), λ̃_{j−1}·2^(−1.0/3.0)),
                min{ λ_j, min(λ_P·2^(2.0/3.0), λ̃_{j−1}·2^(1.0/3.0)) } },    (8)

where λ_P represents the λ value of the current frame, and λ̃_{j−1} represents the λ value that has been smoothed for the (j−1)-th LCU. The calculation of λ_P is described in [6].
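The smoothing of (8) is simply a clipping operation; a sketch under the same constants (illustrative Python, names ours):

```python
def smooth_lambda(lam_j, lam_prev_smoothed, lam_frame):
    """Eq. (8): clip the estimated lambda_j so it stays within
    2^(+/-1/3) of the previous smoothed lambda and within
    2^(+/-2/3) of the frame-level lambda_P."""
    lo = max(lam_frame * 2.0 ** (-2.0 / 3.0),
             lam_prev_smoothed * 2.0 ** (-1.0 / 3.0))
    hi = min(lam_frame * 2.0 ** (2.0 / 3.0),
             lam_prev_smoothed * 2.0 ** (1.0 / 3.0))
    return max(lo, min(lam_j, hi))
```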
3.2. λ-QP relationship

After establishing the bpp-λ relationship, the remaining task is to find the λ-QP relationship. Mathematically, the QP value can be obtained through a multiple-QP optimization process:

    min J(QP) = D(QP) + λ·R(QP),    (9)

to provide the smallest RD cost J(QP), with distortion D(QP) and rate R(QP).

The optimal QP can be achieved as the final output of rate control by solving (9). However, this optimization hugely increases encoding complexity. To reduce the encoding complexity, a fitting formulation, rather than multiple-QP optimization, was proposed [44,45] to determine the QP value QP_j for the j-th LCU:

    QP_j = θ_0·ln λ̃_j + θ_1.    (10)

Recall that λ̃_j is the smoothed λ value of the j-th LCU. In (10), θ_0 and θ_1 are coefficients fitting the linear relationship between QP and ln λ. Note that θ_0 and θ_1 are empirically set to 4.2005 and 13.7122, respectively, in [44]; their values remain the same throughout the coding of each video. Similar to (8), QP_j needs to be smoothed as well:

    QP̃_j = max{ max(QP̃_{j−1} − 1, QP_P − 2),
                min{ QP_j, min(QP̃_{j−1} + 1, QP_P + 2) } },    (11)

where QP_P is the QP value of the current frame. For the calculation of QP_P, please refer to [6]. Moreover, QP̃_{j−1} denotes the smoothed QP value of the (j−1)-th LCU. Finally, QP̃_j is output by the R-λ rate control scheme.
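Equations (10) and (11) can be sketched as follows (illustrative Python; the constants are those reported in [44], the function names are ours):

```python
import math

THETA_0, THETA_1 = 4.2005, 13.7122  # fitted constants from [44]

def qp_from_lambda(lam_smoothed):
    """Eq. (10): QP is linear in the logarithm of lambda."""
    return THETA_0 * math.log(lam_smoothed) + THETA_1

def smooth_qp(qp_j, qp_prev_smoothed, qp_frame):
    """Eq. (11): keep QP within +/-1 of the previous smoothed QP
    and within +/-2 of the frame-level QP_P."""
    lo = max(qp_prev_smoothed - 1, qp_frame - 2)
    hi = min(qp_prev_smoothed + 1, qp_frame + 2)
    return max(lo, min(qp_j, hi))
```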
From the definition of bpp, it is worth pointing out that every pixel is endowed with equal visual importance over the whole video frame. Therefore, the R-λ scheme wastes many bits on encoding the non-ROIs, to which humans pay less attention.
4. The proposed rate control scheme

This section proposes the weight-based R-λ rate control scheme, to take into account the local visual importance of video content. Fig. 3(b) shows the procedure of our weight-based R-λ rate control scheme. Specifically, we first establish a perceptual model of the face by learning from the eye fixation points of our eye-tracking experiment. Note that this perceptual model is learnt offline from training videos that are different from the videos to be encoded. Based on this perceptual model, the weight-based R-λ rate control scheme is proposed to improve the visual quality of the face and facial features, thus providing better perceived quality.
4.1. Learning for perceptual model

In [8], the authors have shown that the face draws the majority of attention in conversational videos. It is interesting to further quantify the unequal importance of the background, face, and facial features to human attention. In this section, we conducted an eye-tracking experiment on the training conversational videos to obtain the values of this unequal importance, so that these values can be used for encoding other videos.
Before the experiment, it is necessary to first extract the face and its facial features in conversational videos using the method of [12]. Generally speaking, our extraction technique is based on a real-time face alignment
method [46]. To be more specific, several key landmarks
obeying the point distribution model (PDM) are located in
the face of an image using the method in [46], which
combines the local detection (texture information) and
global optimization (facial structure) together. Here, 66
landmarks, produced by the PDM, are connected to precisely identify the contours and regions of the face and facial
features. Note that the extraction in our scheme can be
achieved in real-time, as the face alignment method [46] is
indeed fast. Also, the 3000 fps face alignment [47] may be
used to further speed up the extraction on face and its
facial features.
For the eye-tracking experiment, 18 conversational video clips (resolution: 720×480) were collected, and each of them was cut to 750 frames at 25 Hz. Note that these conversational video clips were collected from movies, news, and videos captured by a Nikon D800 camera. Also note that all training videos are different from the test videos of Section 5. These video clips were then presented in a random order to 24 subjects (14 males, 10 females, aged 22–32). The subjects were seated on an adjustable chair at a viewing distance of 60 cm, ensuring that the subject's horizontal sight was at the center of the screen. The eye fixation points of all subjects were recorded over the frames of each clip by a Tobii T60 eye tracker. Some of the recorded eye-tracking data are available at our website http://www.ee.buaa.edu.cn/xumfiles.
One example of eye tracking results is shown
in Fig. 4. Next, we focus on quantifying the visual attention
on different regions of conversational videos by combining
the eye fixation points of all subjects together.
Fig. 4. Example of eye tracking results. The blue circles show the positions of eye fixation points, and their sizes represent the dwell durations of the eye fixation points.

After the eye-tracking experiment, f_r, f_l, f_m, f_n, f_o, and f_b, which denote the numbers of eye fixation points of all subjects falling into the right eye, left eye, mouth, nose, other parts of the face, and background, respectively, were counted. Given the counted eye fixation points (efp) of the different regions, we obtain the degrees of visual attention in these regions:

    c_r = f_r / p_r,
    c_l = f_l / p_l,
    c_m = f_m / p_m,
    c_n = f_n / p_n,
    c_o = f_o / p_o,
    c_b = f_b / p_b,    (12)
where p_r, p_l, p_m, p_n, p_o, and p_b are defined as the numbers of pixels in the regions of the right eye, left eye, mouth, nose, other parts of the face, and background; c_r, c_l, c_m, c_n, c_o, and c_b are the visual attention degrees of these regions. Their values, output by our eye-tracking experiment, are reported in Table 1. Note that the degrees of visual attention for the face and facial features may vary across videos, according to the video content. However, the values in Table 1 are kept constant to simply predict the visual attention paid to the face and facial features, as it is hard to account for the attention variation caused by different video content.
Finally, assuming that the background weight is 1, the weight map of a video frame can be computed upon the results of Table 1 by

    w_n = 1           if n ∈ background,
    w_n = c_r / c_b   if n ∈ right eye,
    w_n = c_l / c_b   if n ∈ left eye,
    w_n = c_m / c_b   if n ∈ mouth,
    w_n = c_n / c_b   if n ∈ nose,
    w_n = c_o / c_b   if n ∈ others,    (13)

where w_n is the weight of the n-th pixel in the video frame. Given (13), we can obtain the weight map for each conversational video frame to be encoded, with the extracted face and facial features [12].
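Using the attention degrees of Table 1, the weight computation of (13) reduces to a lookup. A sketch (illustrative Python; the region label of each pixel is a hypothetical input from the face-alignment step, and the dictionary keys are our names):

```python
# Attention degrees (efp per pixel) as reported in Table 1.
ATTENTION = {"right_eye": 0.122, "left_eye": 0.108, "mouth": 0.116,
             "nose": 0.080, "face_other": 0.076, "background": 0.002}

def pixel_weight(region):
    """Eq. (13): weight of a pixel relative to a background weight of 1."""
    return ATTENTION[region] / ATTENTION["background"]
```

With these values the eye and mouth regions receive weights roughly 40–60 times that of the background, which is what steers the bit allocation below.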
4.2. Weight-based R-λ rate control scheme

In the conventional R-λ rate control scheme, bpp is the key term for allocating bits. In our scheme, rather than replacing bpp as in [12], a new term, bpw, is introduced to calculate bpp, so that bits are allocated in accordance with the weight map of our perceptual model. Before encoding a frame, we first initialize bpw in order to separate the target bits of the whole frame into the target bits of the ROIs (face and facial features) and non-ROIs (background):

    bpw = T / Σ_{n=1}^{N} w_n,    (14)

where T stands for the target bits of a frame. Recall that w_n denotes the weight of the n-th pixel; N is the number of pixels in this frame. Then, the target bits T′ for the ROIs and T″ for the non-ROIs satisfy

    T′ + T″ = T,
    T″ = (Σ_{n∈n″} w_n · bpw) / (Σ_{n∈n′} w_n · bpw) · T′,    (15)

where n′ denotes the indices of ROI pixels, whose weights are larger than 1, and n″ denotes the indices of non-ROI pixels, whose weights are equal to 1.
By solving (15), T′ and T″ are obtained:

    T′ = [Σ_{n∈n′} w_n / (Σ_{n∈n′} w_n + Σ_{n∈n″} w_n)] · T,    (16)

    T″ = [Σ_{n∈n″} w_n / (Σ_{n∈n′} w_n + Σ_{n∈n″} w_n)] · T.    (17)

Then, the target bits for the ROIs and non-ROIs can be reasonably arranged, according to the importance of these regions. Next, bpw_j for the j-th LCU can be calculated as

    bpw_j = T̂′ / Σ_{n∈n̂′} w_n   if j ∈ m′,
    bpw_j = T̂″ / Σ_{n∈n̂″} w_n   if j ∈ m″,    (18)

where m′ and m″ are the LCU indices of the ROIs and non-ROIs, respectively. Note that an LCU belongs to the ROIs if its average weight is larger than 1; otherwise, the LCU belongs to the non-ROIs. Besides, T̂′ and T̂″ denote the remaining target bits for the ROIs and non-ROIs, while n̂′ and n̂″ represent the pixel indices of the current and subsequent LCUs for the ROIs and non-ROIs.
Afterwards, T_j, the target bits of the j-th LCU, can be estimated via

    T_j = Σ_{n∈n_j} w_n · bpw_j,    (19)

where n_j denotes the pixel indices in the j-th LCU. It can be seen from (19) that an LCU with large bpw and w_n is assigned more target bits. Thus, the ROIs, and especially the more important ROIs (e.g., facial features), are emphasized with more target bits.
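The budget split of (14)–(17) and the per-LCU allocation of (19) can be sketched as follows (illustrative Python; the helper names are ours):

```python
def split_frame_budget(frame_bits, weights):
    """Eqs. (14)-(17): split the frame budget T between ROI pixels
    (weight > 1) and non-ROI pixels (weight == 1), in proportion to
    their total weights."""
    w_roi = sum(w for w in weights if w > 1)
    w_bg = sum(w for w in weights if w <= 1)
    t_roi = frame_bits * w_roi / (w_roi + w_bg)
    return t_roi, frame_bits - t_roi  # (T', T'')

def lcu_target_bits_bpw(lcu_weights, bpw):
    """Eq. (19): target bits of one LCU, given the bit-per-weight
    value of its class from Eq. (18)."""
    return sum(lcu_weights) * bpw
```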
Finally, given the newly estimated T_j of (19), bpp_j can be acquired with (6). Then, rate control can be achieved through (4) and (10), with bpp_j known for each LCU. In addition, we adjust the boundary settings of λ_j and QP_j based on the weights from our perceptual model, to give the ROIs more priority in bit allocation. Specifically, (8) is rewritten as follows:
    λ̃_j = max{ max( λ̃_{j−1}·2^(−1.0/3.0) / (Σ_{n∈n_j} w_n / N_j),
                     λ_P·2^(−2.0/3.0) / (Σ_{n∈n_j} w_n / N_j) ),
                min{ λ_j, min( λ̃_{j−1}·2^(1.0/3.0) · (Σ_{n∈n_j} w_n / N_j),
                               λ_P·2^(2.0/3.0) · (Σ_{n∈n_j} w_n / N_j) ) } },    (20)
and the QP boundary smoothing (11) is modified correspondingly:

    QP̃_j = max{ max( QP̃_{j−1} − Σ_{n∈n_j} w_n / N_j,
                      QP_P − (2/N_j)·Σ_{n∈n_j} w_n ),
                 min{ QP_j, min( QP̃_{j−1} + Σ_{n∈n_j} w_n / N_j,
                                 QP_P + (2/N_j)·Σ_{n∈n_j} w_n ) } }.    (21)

Table 1
The values of visual attention degrees of different regions.

           c_r     c_l     c_m     c_n     c_o     c_b
  efp/p    0.122   0.108   0.116   0.080   0.076   0.002
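The weighted boundary adjustment of (20) can be sketched as follows (illustrative Python, names ours; for an average weight of 1 it reduces to the unweighted smoothing of (8)):

```python
def smooth_lambda_weighted(lam_j, lam_prev_smoothed, lam_frame,
                           lcu_weights, n_pixels):
    """Eq. (20): the clipping range of Eq. (8), widened by the average
    pixel weight of the LCU, so that high-weight (ROI) LCUs may take
    smaller lambda values, i.e., receive more bits."""
    avg_w = sum(lcu_weights) / n_pixels
    lo = max(lam_frame * 2.0 ** (-2.0 / 3.0) / avg_w,
             lam_prev_smoothed * 2.0 ** (-1.0 / 3.0) / avg_w)
    hi = min(lam_frame * 2.0 ** (2.0 / 3.0) * avg_w,
             lam_prev_smoothed * 2.0 ** (1.0 / 3.0) * avg_w)
    return max(lo, min(lam_j, hi))
```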
As can be seen from (20) and (21), more variation is allowed for λ_j and QP_j, to improve the quality of ROI regions with more assigned bits. Consequently, the region
0 100 200 300 400 500 600
30
32
34
36
38
40
42
44
46
48
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
0100 200 300 400 500 600 700 800 900 1000 1100
25
30
35
40
45
50
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
0 200 400 600 800 1000 1200 1400 1600 1800 2000 2200
32
34
36
38
40
42
44
46
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
0200 400 600 800 1000 1200 1400 1600 1800 2000 2200
32
34
36
38
40
42
44
46
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
0 500 1000 1500 2000 2500 3000 3500 4000
34
36
38
40
42
44
46
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
0500 1000 1500 2000 2500 3000 3500 4000
36
38
40
42
44
46
48
50
52
Bit−rates (kbps)
Average Y−PSNR (dB)
Whole (Conventional R−λ)
Background (Conventional R−λ)
Face (Conventional R−λ)
Whole (Ours)
Background (Ours)
Face (Ours)
Fig. 5. Rate–distortion performance comparison over face, background, and whole regions between the conventional R-λand our schemes on compressing
six conversational video sequences.
S. Li et al. / Signal Processing: Image Communication 38 (2015) 127–140134
with larger weights has a broader boundary, resulting in
better visual quality in ROIs.
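As a concrete illustration, this weight-dependent clipping can be sketched in a few lines. The following is our own illustrative sketch, not the paper's code: the function name and the omission of the corresponding lower bound are assumptions.

```python
def clip_lcu_qp(qp, qp_prev, qp_pic, weights):
    """Cap an LCU's QP with bounds that widen with its average pixel weight.

    Larger weights permit more QP variation for important LCUs, so more
    bits can flow to ROIs. The symmetric lower bound is omitted here.
    """
    avg_w = sum(weights) / len(weights)   # (1/N_j) * sum_{n in A_j} w_n
    upper = min(qp_prev + avg_w,          # bound from the previous LCU's QP
                qp_pic + 2.0 * avg_w)     # bound from the picture-level QP
    return min(qp, upper)
```

With unit weights the bounds collapse to small fixed offsets, while a high-weight (facial) LCU is allowed to deviate further from its neighbors.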
[Fig. 6: six rate–distortion panels, each plotting average Y-PSNR (dB) against bit-rate (kbps) for the whole, background, and face regions.]
Fig. 6. Rate–distortion performance comparison over face, background, and whole regions between the perceptual URQ [12] and our schemes on compressing six conversational video sequences.

In general, we utilize the new term bpw to estimate the target bits for each LCU. Then, bpp is derived from bpw, followed by the λ and QP values. After encoding one LCU, its QP is output. In addition, the relevant parameters, such as α and β, need to be updated for the following LCU. In this way, the weight-based R-λ scheme
iterates to obtain QP values of each LCU until the last LCU
finishes encoding. The main difference between our scheme and the conventional R-λ scheme is that we exploit the pixel-wise weights from our perceptual model, together with bpw, to estimate the bpp of each LCU. The bpp is therefore adjusted according to the weights of the LCU: larger weights and bpw, which indicate more important regions and greater bit-rates, lead to larger bpp and thus probably better quality. This way, the ROIs, especially the more important ROIs, are allocated more bits to ensure better perceived quality.

[Fig. 7: six rate–distortion panels, each plotting average Y-PSNR (dB) against bit-rate (kbps) for the face, nose, eyes, and mouth regions.]
Fig. 7. Rate–distortion performance comparison over facial features between the conventional R-λ and our schemes on compressing six conversational video sequences.
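The per-LCU pipeline described above (bpw → target bits → bpp → λ → QP) can be sketched as follows. This is our own sketch, not the HM source: the function and variable names are hypothetical, while the initial α, β values and the λ-to-QP mapping follow the common R-λ settings reported in the JCTVC documents [6,44]. The post-encoding update of α and β from the actual bpp and λ is omitted.

```python
import math

def allocate_lcu(remaining_bits, lcu_weights, remaining_weight, n_pixels,
                 alpha=3.2003, beta=-1.367):
    """Sketch of bpw-based bit allocation and lambda/QP derivation for one LCU."""
    bpw = remaining_bits / remaining_weight       # bits per weight for the frame
    target_bits = bpw * sum(lcu_weights)          # bits proportional to LCU weight
    bpp = target_bits / n_pixels                  # convert back to bits per pixel
    lam = alpha * (bpp ** beta)                   # R-lambda model: lambda = alpha * bpp^beta
    qp = round(4.2005 * math.log(lam) + 13.7122)  # lambda-to-QP mapping
    return target_bits, lam, qp
```

A heavily weighted LCU thus receives a larger bpp, hence a smaller λ and QP, than an equally sized background LCU.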
5. Experimental results
[Fig. 8: six rate–distortion panels, each plotting average Y-PSNR (dB) against bit-rate (kbps) for the face, nose, eyes, and mouth regions.]
Fig. 8. Rate–distortion performance comparison over facial features between the perceptual URQ scheme [12] and our schemes on compressing six conversational video sequences.

Table 2
DMOS comparison of the conventional R-λ, perceptual URQ, and our schemes.

Sequence   Resolution   Bit-rate (kbps)   Conventional R-λ   Perceptual URQ   Ours
Akiyo      352×288      16                55.25              56.34            45.12
                        64                29.61              25.59            24.37
Foreman    352×288      32                70.65              72.64            66.58
                        128               34.25              36.90            29.86
Johnny     1280×720     64                60.78              59.53            54.54
                        256               33.89              29.48            25.69
Vidyo4     1280×720     64                67.16              68.85            61.67
                        256               37.21              34.29            27.56
Yan        1920×1080    128               70.64              67.26            65.99
                        512               36.35              32.48            29.64
Lee        1920×1080    128               56.40              53.40            47.70
                        512               42.06              36.70            29.26

Fig. 9. Visual quality comparison of randomly selected frames of Foreman (CIF resolution), Johnny (720p resolution), and Lee (1080p resolution). (a), (b), and (c) show the 56th decoded frames of Foreman compressed at 32 kbps; (d), (e), and (f) show the 23rd decoded frames of Johnny compressed at 64 kbps; and (g), (h), and (i) show the 41st decoded frames of Lee compressed at 128 kbps.

In this section, experimental results are presented to validate the proposed weight-based R-λ scheme for
perceptual coding of conversational videos on the HEVC platform. We used six test video sequences: two CIF conversational sequences, Akiyo and Foreman; two 720p conversational videos, Johnny and Vidyo4, from the HEVC test set; and two 1080p conversational sequences, Yan and Lee, from [12]. We utilized the HEVC test model (HM 16.0 software) with its default R-λ rate control scheme [6] as the reference scheme. Our weight-based R-λ rate control scheme was then embedded into HM 16.0 for comparison. Furthermore, the HEVC perceptual video coding work of [12], called perceptual URQ, was also included in the comparison. Note that Lee was captured in a dark room in order to validate the robustness of our scheme to poor illumination. The configuration file encoder_lowdelay_P_main.cfg was used, with all videos being 150 frames at 25 Hz.
5.1. Objective quality comparison
Figs. 5 and 6 show the rate–distortion performance of the conventional R-λ, perceptual URQ, and our schemes over the face, background, and whole regions. As can be seen from these figures, our scheme outperforms the conventional R-λ scheme on the HM 16.0 platform in terms of average Y-PSNR of the face regions, for all video sequences at various bit-rates. Moreover, the quality of the face regions under the perceptual URQ scheme is lower than that under our scheme. As a trade-off, the average Y-PSNR of the background is decreased in our scheme. However, owing to the properties of the HVS, the perceived video quality is increased, as verified in the next subsection.
Moreover, Figs. 7 and 8 further show the improvement within the face regions. One may observe from these figures that the rate–distortion performance of the facial features is significantly improved by our scheme at various bit-rates, for all CIF, 720p, and 1080p videos, over both the conventional R-λ and perceptual URQ schemes. Furthermore, the quality improvement within a face is much larger for the 720p and 1080p videos than for the CIF videos. This may be due to the fact that more bits can be allocated to ROIs in the 720p and 1080p videos.
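The region-wise Y-PSNR curves in these figures are computed over pixel masks (face, background, or the whole frame). A minimal helper, assuming 8-bit luma arrays and a boolean region mask (the function and argument names are ours, not the paper's), could look like:

```python
import numpy as np

def region_psnr(orig_y, recon_y, mask, peak=255.0):
    """Y-PSNR restricted to the pixels selected by a boolean region mask."""
    diff = orig_y.astype(np.float64) - recon_y.astype(np.float64)
    mse = np.mean(diff[mask] ** 2)   # MSE over the region only
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Calling it with a face mask, a background mask, and an all-true mask yields the three curves plotted per sequence.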
5.2. Subjective quality comparison
Since humans are the final receivers of videos, subjective evaluation is the most accurate and convincing assessment [48]. In this paper, we adopted the single stimulus continuous quality scale (SSCQS) procedure, specified in Rec. ITU-R BT.500, to rate subjective quality. The evaluation was divided into three sessions for the CIF, 720p, and 1080p videos, respectively. Note that the uncompressed reference and test video sequences in each session were displayed in random order. Before each session, the observers were required to view 5 other training videos (one per quality scale) to help them better understand the subjective quality assessment. 15 observers (5 females and 10 males), aged 19–34, were involved in this test; note that they are different from the subjects in the eye-tracking experiment. We used a 24" HP LS24B370 LCD monitor with its resolution set to 1920×1080 to display the videos. All videos were displayed at their original resolutions, to avoid the influence of scaling. The viewing distance was set to three to four times the video height for rational evaluation. The quality rating scales for the observers are excellent (100–81), good (80–61), fair (60–41), poor (40–21), and bad (20–1).
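For illustration, the five-grade scale above maps a numeric SSCQS score to its category as follows (a hypothetical helper, not part of the evaluation software):

```python
def rating_category(score):
    """Map an SSCQS score in [1, 100] to its five-grade quality label."""
    if not 1 <= score <= 100:
        raise ValueError("SSCQS scores range from 1 to 100")
    for label, low in (("excellent", 81), ("good", 61),
                       ("fair", 41), ("poor", 21), ("bad", 1)):
        if score >= low:
            return label
```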
After the subjective evaluation, difference mean opinion scores (DMOS) were computed to reveal the difference in subjective quality between the compressed and uncompressed videos; a smaller DMOS value corresponds to better subjective quality of the compressed video sequence. Table 2 compares the average DMOS values of all compressed video sequences. From this table, we can see that the DMOS values of our scheme are smaller than those of the perceptual URQ scheme, and much smaller than those of the conventional R-λ scheme, especially at high resolutions. Therefore, our scheme provides higher subjective video quality. It can further be seen from this table that our scheme performs better than the perceptual URQ scheme at low bit-rates (in comparison with the conventional R-λ scheme). Moreover, the improvement in subjective quality of our scheme over perceptual URQ implies the effectiveness of the learning strategy for allocating weights to face regions, since our scheme achieves better subjective quality while maintaining Y-PSNRs over whole frames comparable to the perceptual URQ scheme. Fig. 9 further compares the visual quality of our scheme and the conventional R-λ scheme.
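The DMOS computation described above amounts to averaging per-observer difference scores between the reference and the compressed test sequence. A simplified sketch (names are ours; the outlier screening prescribed by Rec. ITU-R BT.500 is omitted):

```python
def dmos(ref_scores, test_scores):
    """Average per-observer (reference - test) rating differences.

    Smaller DMOS means the compressed sequence is closer in perceived
    quality to the uncompressed reference.
    """
    diffs = [r - t for r, t in zip(ref_scores, test_scores)]
    return sum(diffs) / len(diffs)
```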
In summary, our subjective results, together with the previous objective results, illustrate that our scheme for conversational video coding in HEVC is significantly superior in perceived visual quality.
6. Conclusion
This paper has proposed a novel weight-based R-λ scheme for the rate control of conversational videos in HEVC, to improve the perceived visual quality. First, a perceptual model was established by learning from training videos with the eye fixation points recorded in our eye-tracking experiment, to reveal the importance of visual content in conversational video coding. Then, weight maps were generated for the video frames to be encoded. With such maps, a novel weight-based R-λ rate control scheme was proposed for HEVC, using bpw to take into account the visual importance of each pixel. Thus, in accordance with the HVS, the perceived visual quality is improved by our scheme, as more bits are assigned to the ROIs (faces), especially the more important ROIs (facial features). Finally, the experimental results verified such an improvement over several conversational video sequences on the HEVC platform (HM 16.0).
References
[1] G.J. Sullivan, J. Ohm, W. Han, T. Wiegand, Overview of the high efficiency video coding (HEVC) standard, IEEE Trans. Circuits Syst. Video Technol. 22 (12) (2012) 1649–1668.
[2] B.G. Haskell, Digital Video: An Introduction to MPEG-2, Springer
Science and Business Media, New York, USA, 1997.
[3] A. Vetro, H. Sun, Y. Wang, MPEG-4 rate control for multiple video objects, IEEE Trans. Circuits Syst. Video Technol. 9 (1) (1999) 186–199.
[4] G.J. Sullivan, T. Wiegand, K.-P. Lim, Text description of joint model reference encoding methods and decoding concealment methods, Document: JVT-N046, Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG.
[5] H. Choi, J. Yoo, J. Nam, D. Sim, I. Bajic, Pixel-wise unified rate-quantization model for multi-level rate control, IEEE J. Sel. Top. Signal Process. 7 (6) (2013) 1112–1123.
[6] B. Li, H. Li, L. Li, J. Zhang, Rate control by R-lambda model for HEVC,
Document: JCTVC-K0103, Joint Collaborative Team on Video Coding.
[7] G.J. Sullivan, T. Wiegand, Rate-distortion optimization for video
compression, IEEE Signal Process. Mag. 15 (6) (1998) 74–90.
[8] J. Lee, T. Ebrahimi, Perceptual video compression: a survey, IEEE J. Sel. Top. Signal Process. 6 (6) (2012) 684–697.
[9] X. Yang, W. Lin, Z. Lu, X. Lin, S. Rahardja, E. Ong, S. Yao, Rate control
for videophone using local perceptual cues, IEEE Trans. Circuits Syst.
Video Technol. 15 (4) (2005) 496–507.
[10] Y. Liu, Z.G. Li, Y.C. Soh, Region-of-interest based resource allocation for conversational video communication of H.264/AVC, IEEE Trans. Circuits Syst. Video Technol. 18 (1) (2008) 134–139.
[11] Z. Li, S. Qin, L. Itti, Visual attention guided bit allocation in video
compression, Image Vis. Comput. 29 (1) (2011) 1–14.
[12] M. Xu, X. Deng, S. Li, Z. Wang, Region-of-interest based conversa-
tional HEVC coding with hierarchical perception model of face, IEEE
J. Sel. Top. Signal Process 8 (3) (2014) 475–489.
[13] L. Xu, D. Zhao, X. Ji, L. Deng, S. Kwong, W. Gao, Window-level rate
control for smooth picture quality and smooth buffer occupancy,
IEEE Trans. Image Process. 20 (3) (2011) 723–734.
[14] L. Xu, S. Li, K.N. Ngan, L. Ma, Consistent visual quality control in
video coding, IEEE Trans. Circuits Syst. Video Technol. 23 (6) (2013)
975–989.
[15] A. Rehman, Z. Wang, SSIM-inspired perceptual video coding for
HEVC, in: IEEE International Conference on Multimedia and Expo
(ICME), 2012, pp. 497–502.
[16] Z. Wang, Q. Li, Information content weighting for perceptual image quality assessment, IEEE Trans. Image Process. 20 (5) (2011) 1185–1198.
[17] W.S. Geisler, J.S. Perry, A real-time foveated multi-resolution system
for low-bandwidth video communication, in: Proceedings of the
SPIE: The International Society for Optical Engineering, vol. 3299,
1998, pp. 294–305.
[18] M.G. Martini, C.T. Hewage, Flexible macroblock ordering for context-
aware ultrasound video transmission over mobile WIMAX, Int. J.
Telemed. Appl. 2010 (2010) 6.
[19] L. Itti, Automatic foveation for video compression using a neurobio-
logical model of visual attention, IEEE Trans. Image Process. 13 (10)
(2004) 1304–1318.
[20] M.-C. Chi, M.-J. Chen, C.-H. Yeh, J.-A. Jhu, Region-of-interest video coding based on rate and distortion variations for H.263+, Signal Process.: Image Commun. 23 (2) (2008) 127–142.
[21] M. Cerf, J. Harel, W. Einhäuser, C. Koch, Predicting human gaze using
low-level saliency combined with face detection, Adv. Neural Inf.
Process. Syst. 20 (2008) 241–248.
[22] N. Doulamis, A. Doulamis, D. Kalogeras, S. Kollias, Low bit-rate
coding of image sequences using adaptive regions of interest, IEEE
Trans. Circuits Syst. Video Technol. 8 (8) (1998) 928–934.
[23] D.M. Saxe, R.A. Foulds, Robust region of interest coding for improved
sign language telecommunication, IEEE Trans. Inf. Technol. Biomed.
6 (4) (2002) 310–316.
[24] Y. Sun, I. Ahmad, D. Li, Y.-Q. Zhang, Region-based rate control and bit
allocation for wireless video transmission, IEEE Trans. Multimed. 8
(1) (2006) 1–10.
[25] M.-C. Chi, C.-H. Yeh, M.-J. Chen, Robust region-of-interest determination based on user attention model through visual rhythm analysis, IEEE Trans. Circuits Syst. Video Technol. 19 (7) (2009) 1025–1038.
[26] A. Cavallaro, O. Steiger, T. Ebrahimi, Semantic video analysis for
adaptive content delivery and automatic description, IEEE Trans.
Circuits Syst. Video Technol. 15 (10) (2005) 1200–1209.
[27] G. Boccignone, A. Marcelli, P. Napoletano, G. di Fiore, G. Iacovoni,
S. Morsa, Bayesian integration of face and low-level cues for
foveated video coding, IEEE Trans. Circuits Syst. Video Technol. 18
(12) (2008) 1727–1740.
[28] L.S. Karlsson, M. Sjostrom, Improved ROI video coding using variable
Gaussian pre-filters and variance in intensity, in: IEEE International
Conference on Image Processing, 2005, ICIP 2005, vol. 2, 2005,
pp. 313–316.
[29] D. Chai, K.N. Ngan, Face segmentation using skin-color map in
videophone applications, IEEE Trans. Circuits Syst. Video Technol. 9
(4) (1999) 551–564.
[30] M. Wang, T. Zhang, C. Liu, S. Goto, Region-of-interest based dyna-
mical parameter allocation for H.264/AVC encoder, in: IEEE Picture
Coding Symposium, 2009, PCS 2009, 2009, pp. 1–4.
[31] Q. Chen, G. Zhai, X. Yang, W. Zhang, Application of scalable visual
sensitivity profile in image and video coding, in: IEEE International
Symposium on Circuits and Systems, 2008, ISCAS 2008, 2008,
pp. 268–271.
[32] B. Li, H. Li, L. Li, J. Zhang, λ domain based rate control for high efficiency video coding, IEEE Trans. Image Process. 23 (9) (2014) 3841–3854.
[33] D. Liu, X. Sun, F. Wu, S. Li, Y.-Q. Zhang, Image compression with
edge-based inpainting, IEEE Trans. Circuits Syst. Video Technol. 17
(10) (2007) 1273–1287.
[34] H. Xiong, Y. Xu, Y.F. Zheng, C.W. Chen, Priority belief propagation-based inpainting prediction with tensor voting projected structure in video compression, IEEE Trans. Circuits Syst. Video Technol. 21 (8) (2011) 1115–1129.
[35] L. Cheng, S. Vishwanathan, Learning to compress images and videos,
in: Proceedings of the 24th International Conference on Machine
Learning, ACM, 2007, pp. 161–168.
[36] X. He, M. Ji, H. Bao, A unified active and semi-supervised learning
framework for image compression, in: IEEE Conference on Compu-
ter Vision and Pattern Recognition, 2009, pp. 65–72.
[37] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, in: ACM Transactions on Graphics (TOG), vol. 23, ACM, 2004, pp. 689–694.
[38] E. Kavitha, M.A. Ahmed, A machine learning approach to image
compression, Int. J. Technol. Comput. Sci. Eng. 1 (2) (2014) 70–81.
[39] M. Xu, S. Li, J. Lu, W. Zhu, Compressibility constrained sparse representation with learnt dictionary for low bit-rate image compression, IEEE Trans. Circuits Syst. Video Technol. 24 (10) (2014) 1743–1757.
[40] Y. Sun, M. Xu, X. Tao, J. Lu, Online dictionary learning based intra-
frame video coding, Wirel. Pers. Commun. 74 (4) (2014) 1281–1295.
[41] S. Mallat, F. Falzon, Analysis of low bit rate image transform coding,
IEEE Trans. Signal Process. 46 (4) (1998) 1027–1042.
[42] M. Karczewicz, X. Wang, Intra frame rate control based on SATD, Document: JCTVC-M0257, Joint Collaborative Team on Video Coding.
[43] B. Li, H. Li, L. Li, Adaptive bit allocation for R-λ model rate control in HM, Document: JCTVC-M0036, Joint Collaborative Team on Video Coding.
[44] B. Li, D. Zhang, H. Li, J. Xu, QP determination by λ value, Document: JCTVC-I0426, Joint Collaborative Team on Video Coding.
[45] B. Li, L. Li, J. Zhang, J. Xu, H. Li, Encoding with fixed Lagrange multipliers, Document: JCTVC-I0242, Joint Collaborative Team on Video Coding.
[46] J.M. Saragih, S. Lucey, J.F. Cohn, Face alignment through subspace constrained mean-shifts, in: Proceedings of ICCV, 2009, pp. 1034–1041.
[47] S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685–1692.
[48] K. Seshadrinathan, R. Soundararajan, A.C. Bovik, L.K. Cormack, Study of subjective and objective quality assessment of video, IEEE Trans. Image Process. 19 (6) (2010) 1427–1441.